# Install Nvidia GPU on Proxmox K3s LXC
Guide for installing Nvidia drivers in a privileged Proxmox LXC to enable their use in K3s pods.

## Software
- Proxmox v8.2.2
- Debian LXC v12.7
- K3s v1.30.5

## Installing Nvidia Drivers on the Proxmox Host
Make note of the driver version you install, as you'll need to install the same version later inside the K3s LXC. Use the following instructions to install the Nvidia driver on your Proxmox host ([Proxmox official guide](https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE#Preparation)):
```
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf

apt update
apt install -y dkms libc6-dev proxmox-default-headers --no-install-recommends

wget -O NVIDIA-Linux-x86_64-550.120.run https://us.download.nvidia.com/XFree86/Linux-x86_64/550.120/NVIDIA-Linux-x86_64-550.120.run
chmod +x NVIDIA-Linux-x86_64-550.120.run
./NVIDIA-Linux-x86_64-550.120.run --no-nouveau-check --dkms
```

You must also add the following udev rules to create the Nvidia devices on the Proxmox host:
```
cat << EOF > /etc/udev/rules.d/70-nvidia.rules
KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"
EOF
```
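A reboot (next step) is the simplest way to make sure both the nouveau blacklist and the new rules take effect, but if you want to apply the udev rules immediately you can also reload them by hand; this is optional:
```
# reload and re-trigger the udev rules without rebooting
udevadm control --reload-rules
udevadm trigger
```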

Reboot your Proxmox host to load the Nvidia driver and create the udev devices. You can verify the drivers are working with `nvidia-smi`:
```
root@pve-media:~# nvidia-smi
Wed Oct 16 11:17:11 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8             16W /  210W |       0MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

## Configuring the LXC
Once the driver is loaded and the Nvidia devices exist, you need to obtain their major device numbers. Run the following on the Proxmox host to get yours; in my case they are 195 and 236:
```
root@pve-media:~# ls -la /dev/nvid*                                                                                                                                                                                       
crw-rw-rw- 1 root root 195,   0 Sep 27 19:40 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Sep 27 19:40 /dev/nvidiactl
crw-rw-rw- 1 root root 236,   0 Sep 27 19:40 /dev/nvidia-uvm
crw-rw-rw- 1 root root 236,   1 Sep 27 19:40 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root root     80 Sep 27 19:40 .
drwxr-xr-x 20 root root   5060 Sep 27 20:08 ..
cr--------  1 root root 239, 1 Sep 27 19:40 nvidia-cap1
cr--r--r--  1 root root 239, 2 Sep 27 19:40 nvidia-cap2
```
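If you'd rather pull the major numbers out programmatically than read them off the `ls` output, a small loop like this works (a convenience sketch, assuming the standard device names shown above):
```
# print each Nvidia device with its character-device major number
# stat prints the major in hex, so printf converts it back to decimal
for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools; do
  printf '%s major=%d\n' "$dev" "0x$(stat -c '%t' "$dev")"
done
```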

Next you must add the following to the LXC config, changing the device numbers as needed.

Edit: `/etc/pve/lxc/<lxc_id>.conf`
```
mp1: /usr/lib/modules,mp=/usr/lib/modules
lxc.cgroup2.devices.allow: c 195:* rw
lxc.cgroup2.devices.allow: c 236:* rw
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
```

These lines do the following for your LXC (in order):
- Mount the host's kernel modules (`/usr/lib/modules`) so the gpu-operator Helm chart on K3s can find them
- Create cgroup2 allowlist entries for the major device numbers of the Nvidia devices
- Bind-mount the Nvidia devices into the LXC
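After saving the config, restart the container from the Proxmox host so the new mount entries and device allowances take effect (replace `<lxc_id>` with your container ID):
```
pct stop <lxc_id>
pct start <lxc_id>
```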

### Installing Nvidia Driver on LXC
For the LXC, you are going to install the same Nvidia driver version, but this time with the `--no-kernel-module` option, since the LXC shares its kernel with the Proxmox host:
```
wget -O NVIDIA-Linux-x86_64-550.120.run https://us.download.nvidia.com/XFree86/Linux-x86_64/550.120/NVIDIA-Linux-x86_64-550.120.run
chmod +x NVIDIA-Linux-x86_64-550.120.run
./NVIDIA-Linux-x86_64-550.120.run --no-kernel-module
```

### Installing the Nvidia Container Toolkit
Next you need to install the Nvidia container toolkit. Start by running the following to add the repository to Apt ([Nvidia official guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)):
```
apt install -y gpg curl

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg --yes \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```

Then install the container toolkit:
```
apt update
apt install -y nvidia-container-runtime
```

Debian is not officially supported, so we have to create a symlink for `ldconfig` so that [nvidia-container-cli](https://github.com/NVIDIA/nvidia-container-toolkit/issues/147) can find it:
```
ln -s /sbin/ldconfig /sbin/ldconfig.real
```
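At this point you can optionally sanity-check that the container CLI sees the GPU from inside the LXC; if the symlink above is missing, this is one of the first things to fail:
```
# should list the driver version and your GPU
nvidia-container-cli info
```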

Reboot the LXC and verify the drivers are loaded and working with `nvidia-smi`:
```
root@k3s-media:~# nvidia-smi
Wed Oct 16 11:15:20 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8             16W /  210W |       0MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

## Configuring K3s
If you have done everything correctly, [K3s should automatically detect the Nvidia container runtime](https://docs.k3s.io/advanced#nvidia-container-runtime-support) when the service is started. You can verify this by running:
```
root@k3s-media:~# grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```
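If the `grep` comes back empty, restarting K3s usually fixes it, since the containerd config is regenerated and the runtime re-detected at service start (use `k3s-agent` instead of `k3s` if this node runs the agent service):
```
systemctl restart k3s
grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```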

You will then need to add a `RuntimeClass` definition to your cluster:
```
cat << EOF > nvidia-runtime-class.yaml
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

kubectl apply -f nvidia-runtime-class.yaml
```
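You can confirm the class exists with:
```
kubectl get runtimeclass nvidia
```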

### Installing the gpu-operator
First add the `nvidia` Helm repo:
```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```

Next install the `gpu-operator` Helm chart using the following values file tailored for K3s:
```
cat << EOF > gpu-operator-values.yaml
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
EOF

helm install gpu-operator nvidia/gpu-operator --namespace gpu-operator --create-namespace --values gpu-operator-values.yaml
```

This will create a number of pods in the `gpu-operator` namespace that build the Nvidia driver components. You will see some of these pods restarting; this is normal, as some pods are dependent on others completing first. Overall the build process should take a couple of minutes. If it's taking longer than 10 minutes, you likely have an issue and should look at the logs of the `gpu-operator-node-feature-discovery-worker` pods (this is how I figured out that the Proxmox host's kernel modules need to be mounted into the LXC, as the pod couldn't find them).
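To keep an eye on the rollout, watch the pods and pull logs from anything that looks stuck (the pod name below is a placeholder; substitute whatever `get pods` shows):
```
# watch the operator pods come up
kubectl -n gpu-operator get pods --watch

# inspect a stuck pod, e.g. one of the node-feature-discovery workers
kubectl -n gpu-operator logs <pod-name>
```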

You can verify that everything is working correctly by spinning up a pod that uses the GPU:
```
cat << EOF > gpu-benchmark-pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
EOF

kubectl apply -f gpu-benchmark-pod.yaml
```
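As an additional sanity check, you can confirm the operator has advertised the GPU as an allocatable resource on the node (exact output formatting varies by version):
```
kubectl describe node | grep -i nvidia.com/gpu
```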

GPU successfully identified and benchmarked:
```
root@k3s:~# k logs nbody-gpu-benchmark 
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance) 
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1080]
20480 bodies, total time for 10 iterations: 14.370 ms
= 291.883 billion interactions per second
= 5837.668 single-precision GFLOP/s at 20 flops per interaction
```

## Final Notes
When running pods that use your GPU, you will only be able to see the processes by running `nvidia-smi` on the Proxmox host. Running `nvidia-smi` inside the LXC will not show any GPU processes, even when they are running.
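For example, to keep an eye on pod GPU usage while a workload runs, do it from the Proxmox host:
```
# run on the Proxmox host, not inside the LXC
watch -n 2 nvidia-smi
```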

## References
- https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE#Preparation
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- https://github.com/UntouchedWagons/K3S-NVidia
- https://theorangeone.net/posts/lxc-nvidia-gpu-passthrough/
- https://forum.proxmox.com/threads/sharing-gpu-to-lxc-container-failed-to-initialize-nvml-unknown-error.98905/
- https://docs.k3s.io/advanced#nvidia-container-runtime-support