
Nvidia workloads on Nomad

4 min read
Stephan Hochdörfer
Head of IT Business Operations

The Nomad NVIDIA device plugin allows you to use NVIDIA graphical processing units (GPUs) within your Nomad workloads.

Configure the Nvidia driver for Linux

To get started, ensure your Linux system is correctly set up to work with the Nvidia driver. I tried several tutorials, but this one was the only one that worked for me.

On our Ubuntu server, I installed the ubuntu-drivers tooling and then let it pick and install the appropriate Nvidia driver for the system using the following commands:

apt install ubuntu-drivers-common -y

ubuntu-drivers install
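
If you want to check which driver ubuntu-drivers selected for your card, you can list the detected hardware together with the recommended driver:

ubuntu-drivers devices

Depending on your system, a reboot may be required before the new driver is actually loaded.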

To enable general-purpose processing on a graphics processing unit (GPU), we also need to install the NVIDIA Compute Unified Device Architecture (CUDA) toolkit:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

dpkg -i cuda-keyring_1.1-1_all.deb

apt-get update

apt-get -y install cuda-toolkit-12-6
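
To confirm that the toolkit is in place, you can query the CUDA compiler version. Note that the toolkit installs under /usr/local/cuda-12.6, which is typically not on the PATH by default:

/usr/local/cuda-12.6/bin/nvcc --version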

To support running Docker container workloads on Nomad, we also need to install the NVIDIA Container Toolkit:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

apt-get update && apt-get install -y nvidia-container-toolkit

After installing the NVIDIA Container Toolkit, you must configure it to work with the Docker engine. Then, restart the Docker daemon to apply the changes:

nvidia-ctk runtime configure --runtime=docker

systemctl restart docker
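
As a quick sanity check, you can confirm that the nvidia runtime is now registered with the Docker engine:

docker info | grep -i runtimes

The output should list nvidia alongside the default runc runtime.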

To verify that the setup is working correctly, run the following test commands:

nvidia-smi

This should give you output similar to the following:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     Off |   00000000:81:00.0 Off |                    0 |
|  0%   22C    P8             11W /  150W |       4MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Next, verify that Docker can access the NVIDIA hardware by running the following command:

docker run --rm -it --gpus all nvcr.io/nvidia/pytorch:22.03-py3
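
This pulls a fairly large PyTorch image and drops you into an interactive shell inside the container. From there, a quick way to confirm that the GPU is actually visible to PyTorch is:

python -c "import torch; print(torch.cuda.is_available())"

If the setup is correct, this prints True.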

Configure the Nvidia device plugin for Nomad

With Docker configured to work with the Nvidia driver, you can already run Nvidia workloads on Nomad. However, to enable Nomad to recognize the Nvidia hardware, you'll need to install the Nvidia device plugin. This plugin allows Nomad to schedule Nvidia workloads on nodes with compatible hardware, ensuring optimal resource allocation.

Note that the Nvidia device plugin is not included in the standard Nomad distribution and requires manual installation. To install it, download the plugin from the plugin release page and copy it to the /var/nomad/plugins/ directory.
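
As a rough sketch, the download and installation can look like this; the version number used here (1.1.0) is an assumption, so check the release page for the current one:

curl -fsSL -o nomad-device-nvidia.zip https://releases.hashicorp.com/nomad-device-nvidia/1.1.0/nomad-device-nvidia_1.1.0_linux_amd64.zip

unzip nomad-device-nvidia.zip -d /var/nomad/plugins/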

To enable Nomad to use the Nvidia device plugin, you need to update the Nomad configuration. Start by specifying the plugins directory in the Nomad configuration file by adding the following line:

plugin_dir = "/var/nomad/plugins"

Next, you'll also need to add the plugin's configuration settings to the Nomad configuration file:

plugin "nomad-device-nvidia" {
config {
enabled = true
ignored_gpu_ids = []
fingerprint_period = "5m"
}
}

Restart your Nomad client, e.g. with systemctl restart nomad, to apply the configuration changes.

Once done, you can query the node status with the following command:

nomad node status 9b3dc768-c0e3-c062-b45e-e356980119e0

This should give you output similar to the following:

Device Resource Utilization
nvidia/gpu/NVIDIA A10[GPU-02ce5b56-7506-8d34-0ea6-f3084f4cf894] 525 / 23028 MiB
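
If you want to inspect the device attributes that the plugin fingerprints, such as the GPU model or memory size, the verbose node status is helpful; these attributes are also what the affinity rule shown later matches against:

nomad node status -verbose 9b3dc768-c0e3-c062-b45e-e356980119e0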

Schedule Nomad workloads

Once everything is working, we can schedule a Nomad workload on our GPU server.

With the Nvidia device plugin installed, you can now target specific nodes with GPU resources by including the nvidia/gpu device requirement in your task definitions:

resources {
  device "nvidia/gpu" {
    count = 1
  }
}

Alternatively, if your cluster includes multiple GPU servers, you can steer workloads toward a specific GPU model by adding an affinity on the device attributes, for example:

resources {
  device "nvidia/gpu" {
    count = 1

    affinity {
      attribute = "${device.model}"
      value     = "NVIDIA A10"
      weight    = 50
    }
  }
}
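
If a particular GPU model is a hard requirement rather than a preference, you can use a constraint block instead of the affinity, so the task is only placed on nodes exposing that exact model:

resources {
  device "nvidia/gpu" {
    count = 1

    constraint {
      attribute = "${device.model}"
      value     = "NVIDIA A10"
    }
  }
}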

For a complete example, refer to the following Nomad job definition:

job "ollama" {
datacenters = ["dc1"]
type = "service"

group "ollama" {
count = 1

network {
port "app" {
to = 11434
}
}

task "ollama-server" {
driver = "docker"

config {
image = "ollama/ollama:latest"
ports = ["app"]
force_pull = true
}

resources {
device "nvidia/gpu" {
count = 1
}
}

service {
name = "ollama-prod"
provider = "nomad"
port = "app"

check {
name = "alive"
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
}
}
}
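
Assuming the job definition is saved as ollama.nomad.hcl (the file name is just a placeholder), you can submit it and check that the Ollama API responds; replace <host> and <port> with the address and dynamic port of the allocation:

nomad job run ollama.nomad.hcl

nomad job status ollama

curl http://<host>:<port>/api/tags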