Commit ba5eae

2024-10-16 15:53:40 admin: initial final draft
kubernetes/use nvidia gpu on proxmox k3s lxc.md ..
@@ 83,7 83,7 @@
```
These lines perform the following for your LXC (in order):
- - Mount host kernel headers so gpu-operator helm chart on K3s can build the Nvidia drivers
+ - Mount host kernel headers so the gpu-operator Helm chart on K3s can build the Nvidia drivers
- Create cgroup2 allowlist entries for the major device numbers of the Nvidia devices
- Pass the Nvidia devices through to the LXC
@@ 142,9 142,117 @@
+-----------------------------------------------------------------------------------------+
```
+ ## Configuring K3s
+ If you have done everything correctly, [K3s should automatically detect the Nvidia container runtime](https://docs.k3s.io/advanced#nvidia-container-runtime-support) when the service is started. You can verify this by running:
+ ```
+ root@k3s-media:~# grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
+ [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
+ [plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
+ BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
+ ```
+
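+ If the entry is missing, restarting the K3s service once the Nvidia container toolkit is installed should regenerate this config, since the detection only happens when the service starts:
+ ```
+ systemctl restart k3s
+ grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
+ ```
+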
+ You will then need to add a `RuntimeClass` definition to your cluster:
+ ```
+ cat << EOF > nvidia-runtime-class.yaml
+ ---
+ apiVersion: node.k8s.io/v1
+ kind: RuntimeClass
+ metadata:
+   name: nvidia
+ handler: nvidia
+ EOF
+
+ kubectl apply -f nvidia-runtime-class.yaml
+ ```
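+
+ You can confirm the `RuntimeClass` was created:
+ ```
+ kubectl get runtimeclass nvidia
+ ```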
+
+ ### Installing the gpu-operator
+ First add the `nvidia` Helm repo:
+ ```
+ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
+ helm repo update
+ ```
+
+ Next, install the gpu-operator Helm chart using values tailored for K3s:
+ ```
+ cat << EOF > gpu-operator-values.yaml
+ toolkit:
+   env:
+   - name: CONTAINERD_CONFIG
+     value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
+   - name: CONTAINERD_SOCKET
+     value: /run/k3s/containerd/containerd.sock
+ EOF
+
+ helm install gpu-operator nvidia/gpu-operator --create-namespace --values gpu-operator-values.yaml
+ ```
+
+ This will create a number of pods in the `gpu-operator` namespace that build the Nvidia drivers. You will see some of these pods restarting; this is normal, as some pods depend on others completing first. Overall, the build process should take a couple of minutes. If it is taking longer than 10 minutes, you likely have an issue and should look at the logs of the `gpu-operator-node-feature-discovery-worker` pods (this is how I figured out that the Proxmox host kernel headers need to be mounted in the LXC, as the pod couldn't find the kernel modules).
+
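+ For example, you can watch the pods come up and pull logs from one of the node-feature-discovery worker pods (the pod name below is a placeholder; use a name from your own cluster):
+ ```
+ kubectl get pods -n gpu-operator --watch
+ kubectl logs -n gpu-operator <gpu-operator-node-feature-discovery-worker-pod>
+ ```
+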
+ You can verify that everything is working correctly by spinning up a pod that uses the GPU:
+ ```
+ cat << EOF > gpu-benchmark-pod.yaml
+ ---
+ apiVersion: v1
+ kind: Pod
+ metadata:
+   name: nbody-gpu-benchmark
+   namespace: default
+ spec:
+   restartPolicy: OnFailure
+   runtimeClassName: nvidia
+   containers:
+   - name: cuda-container
+     image: nvcr.io/nvidia/k8s/cuda-sample:nbody
+     args: ["nbody", "-gpu", "-benchmark"]
+     resources:
+       limits:
+         nvidia.com/gpu: 1
+     env:
+     - name: NVIDIA_VISIBLE_DEVICES
+       value: all
+     - name: NVIDIA_DRIVER_CAPABILITIES
+       value: all
+ EOF
+
+ kubectl apply -f gpu-benchmark-pod.yaml
+ ```
+
+ GPU successfully identified and benchmarked:
+ ```
+ root@k3s:~# k logs nbody-gpu-benchmark
+ Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
+ -fullscreen (run n-body simulation in fullscreen mode)
+ -fp64 (use double precision floating point values for simulation)
+ -hostmem (stores simulation data in host memory)
+ -benchmark (run benchmark to measure performance)
+ -numbodies=<N> (number of bodies (>= 1) to run in simulation)
+ -device=<d> (where d=0,1,2.... for the CUDA device to use)
+ -numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
+ -compare (compares simulation results running once on the default GPU and once on the CPU)
+ -cpu (run n-body simulation on the CPU)
+ -tipsy=<file.bin> (load a tipsy model file for simulation)
+
+ NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
+
+ > Windowed mode
+ > Simulation data stored in video memory
+ > Single precision floating point simulation
+ > 1 Devices used for simulation
+ GPU Device 0: "Pascal" with compute capability 6.1
+
+ > Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1080]
+ 20480 bodies, total time for 10 iterations: 14.370 ms
+ = 291.883 billion interactions per second
+ = 5837.668 single-precision GFLOP/s at 20 flops per interaction
+ ```
+
+ ## Final Notes
+ When running pods that utilize your GPU, you will only be able to see the GPU processes by running `nvidia-smi` on your Proxmox host. Running `nvidia-smi` inside the LXC will not show any GPU processes, even if they are running.
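+
+ For example, to see which processes are using the card (the query flags below are optional; plain `nvidia-smi` works too):
+ ```
+ # Run on the Proxmox host, not inside the LXC:
+ nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
+ ```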
+
# References
- https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE#Preparation
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- https://github.com/UntouchedWagons/K3S-NVidia
- https://theorangeone.net/posts/lxc-nvidia-gpu-passthrough/
- https://forum.proxmox.com/threads/sharing-gpu-to-lxc-container-failed-to-initialize-nvml-unknown-error.98905/
+ - https://docs.k3s.io/advanced#nvidia-container-runtime-support