Set up K8s to use NVIDIA drivers

Setting up K8s to use NVIDIA

  • Setting up K8s to use NVIDIA
    • Prerequisites
    • Quick Start
      • Preparing your GPU Nodes
      • Enabling GPU Support in Kubernetes
      • Checks
      • Sample yaml file
      • Destroy
      • References

Prerequisites

The prerequisites for running the NVIDIA device plugin are:
  • NVIDIA drivers ~= 410.48
  • nvidia-docker version > 2.0 (see how to install it and its prerequisites)
  • Docker configured with nvidia as the default runtime
  • Kubernetes version >= 1.10, installed with kubeadm
  • After installing kubeadm, initialize the cluster with flannel as the network plugin, then remove the master taint so that pods can be scheduled on the master node:
    - $ sudo kubeadm init --pod-network-cidr=10.244.0.0/16
    - $ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml
    - $ kubectl taint nodes --all node-role.kubernetes.io/master-
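
The version requirements above can be checked with a small helper. This is a minimal sketch, not part of the original setup: version_ge and check_driver are hypothetical names, the sketch treats the listed versions as minimums, and it assumes a sort that supports -V (GNU coreutils does).

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes a sort that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_driver MIN: hypothetical helper that compares the installed NVIDIA
# driver version (as reported by nvidia-smi) against MIN.
check_driver() {
  installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  version_ge "$installed" "$1"
}
```

On a GPU node, `check_driver 410.48 && echo "driver OK"` would then report whether the installed driver meets the version listed above.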
    

Quick Start

Preparing your GPU Nodes

The following steps need to be executed on all your GPU nodes. This guide assumes that the NVIDIA drivers and nvidia-docker have been installed.
Note that you need to install the nvidia-docker2 package and not the nvidia-container-toolkit, because the new --gpus option has not reached Kubernetes yet. Example:
# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker
You need to make nvidia the default runtime on your node. To do so, edit the Docker daemon config file, which is usually located at /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
If the runtimes entry is not already present, head to the install page of nvidia-docker.

Note

Don't forget to restart Docker after changing daemon.json so that your changes take effect.
$ sudo systemctl restart docker
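
After the restart, it can help to confirm that the setting was picked up. The following is a rough sketch under stated assumptions: both function names are hypothetical, and the file check is a plain-text match rather than a JSON parser, so it assumes the key and its value sit on one line as in the example above.

```shell
# has_nvidia_default_runtime FILE: succeeds when FILE sets "default-runtime"
# to "nvidia". Plain-text match only; it does not validate the JSON.
has_nvidia_default_runtime() {
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# docker_default_runtime: asks the running Docker daemon which runtime it
# uses by default (requires access to the Docker daemon).
docker_default_runtime() {
  docker info --format '{{.DefaultRuntime}}'
}
```

For example, `has_nvidia_default_runtime /etc/docker/daemon.json && docker_default_runtime` should print nvidia once the daemon has been restarted with the new configuration.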

Enabling GPU Support in Kubernetes

Once you have enabled this option on all the GPU nodes you wish to use, you can enable GPU support in your cluster by deploying the following DaemonSet:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
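
Once the DaemonSet is running, each GPU node should advertise an nvidia.com/gpu resource. The sketch below is one way to look for it; list_node_gpus and has_gpu_node are hypothetical helper names, and the check assumes that nodes without GPUs show a non-numeric value in the GPU column.

```shell
# list_node_gpus: prints a NAME/GPU table for all nodes, where GPU is the
# allocatable nvidia.com/gpu count (requires kubectl access to the cluster).
list_node_gpus() {
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
}

# has_gpu_node: reads that table on stdin and succeeds if at least one node
# reports one or more GPUs. The header row is skipped.
has_gpu_node() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { found = 1 } END { exit !found }'
}
```

Running `list_node_gpus | has_gpu_node && echo "GPU nodes available"` gives a quick yes/no answer. It may take a short while after the DaemonSet starts before the resource appears.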

Checks

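Run the device plugin image directly with Docker to check that containers can reach the NVIDIA driver:
$ docker run --rm nvidia/k8s-device-plugin:1.0.0-beta nvidia-device-plugin
The output should look like this: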
(base) ubuntu@ip-172-31-42-125:~$ docker run --rm nvidia/k8s-device-plugin:1.0.0-beta nvidia-device-plugin
Unable to find image 'nvidia/k8s-device-plugin:1.0.0-beta' locally
1.0.0-beta: Pulling from nvidia/k8s-device-plugin
743f2d6c1f65: Pull complete
fcd797589536: Pull complete
Digest: sha256:f284efc70d5b4b4760cd7b60280e7e9370f64fca0b15f5e73d2742f4cfe7169f
Status: Downloaded newer image for nvidia/k8s-device-plugin:1.0.0-beta
2020/02/24 09:39:53 Loading NVML
2020/02/24 09:39:56 Fetching devices.
2020/02/24 09:39:56 Starting FS watcher.
2020/02/24 09:39:56 Failed to created FS watcher.
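NVML loading and devices being fetched without an error indicate that the container can see the NVIDIA driver. We are now ready to launch a pod and access GPUs from it.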

Sample yaml file

(base) ubuntu@ip-172-31-42-125:~$ cat test_gpu_pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.0-base
      command: ["sh", "-c", "tail -f /dev/null"]
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
Go ahead and create the pod, then run nvidia-smi inside it:
(base) ubuntu@ip-172-31-42-125:~$ kubectl create -f test_gpu_pod.yml
pod/gpu-pod created
(base) ubuntu@ip-172-31-42-125:~$ kubectl get pods
NAME      READY   STATUS    RESTARTS   AGE
gpu-pod   1/1     Running   0          6s
(base) ubuntu@ip-172-31-42-125:~$ kubectl exec -it gpu-pod /bin/bash
root@gpu-pod:/# nvidia-smi
Mon Feb 24 12:27:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@gpu-pod:/# cd
root@gpu-pod:~# exit
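
The same check can be run without an interactive shell. This is a minimal sketch: the three function names are hypothetical, the 120s timeout is an arbitrary choice, and kubectl wait needs a reasonably recent kubectl.

```shell
# wait_for_gpu_pod POD: blocks until POD is Ready or the timeout expires.
wait_for_gpu_pod() {
  kubectl wait --for=condition=Ready "pod/$1" --timeout=120s
}

# list_pod_gpus POD: lists the GPUs visible inside POD, one per line.
list_pod_gpus() {
  kubectl exec "$1" -- nvidia-smi -L
}

# count_gpus: reads nvidia-smi -L output on stdin and prints the number of GPUs.
count_gpus() {
  grep -c '^GPU [0-9]'
}
```

For the pod above, `wait_for_gpu_pod gpu-pod && list_pod_gpus gpu-pod | count_gpus` should print the number of GPUs requested in the pod's resource limits.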

Destroy:

To tear down the setup, reset the node:
$ sudo kubeadm reset

References:
