Docker with nvidia-container-toolkit does not run

Hello!

I am fairly new to NixOS, but I am loving the experience so far.

I have now run into a problem that was already described in an earlier post here.

Since the thread in question is over a year old, I didn’t want to “reawaken” it and decided to create a new one instead.

I am trying to create a new Docker container with the following docker-compose.yaml:

services:
  ray-worker:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ray-worker
    restart: unless-stopped
    volumes:
      - /opt/models:/models
    environment:
      - VLLM_HOST_IP=${workerIp}
      - NCCL_DEBUG=INFO
      - RAY_DEDUP_LOGS=0
      - NCCL_NET=Socket
      - NCCL_IB_DISABLE=0
    gpus: all
    network_mode: host
    ipc: host
    shm_size: '100gb'
    deploy:
      resources:
        reservations:
          devices:
          - driver: cdi
            capabilities: [gpu]
            device_ids:
            - nvidia.com/gpu=all

The container itself is created successfully, but when starting it I get the following error:

Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

However, running this command, found in the linked thread, works fine:

[user@lab-compute:~]$ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-e08e8553-34f5-08e9-ad9c-27dc0632d5bc)
GPU 1: Tesla T4 (UUID: GPU-5480d59b-6c81-06b6-2f44-51e1ed9b69cc)
GPU 2: Tesla T4 (UUID: GPU-5931eb7f-3cf0-4b12-78e0-12c8fa956123)
GPU 3: Tesla T4 (UUID: GPU-8d6128ad-66f3-0b9d-7668-883ef3cc4b95)

I have the following hardware configuration in my Nix file:

    virtualisation.docker = {
      enable = true;
      enableOnBoot = true;
    };
    hardware.graphics.enable = true;
    services.xserver.videoDrivers = [ "nvidia" ];
    hardware.nvidia = {
      modesetting.enable = true;
      nvidiaSettings = true;
      open = true;
      package = config.boot.kernelPackages.nvidiaPackages.production;
    };
    hardware.nvidia-container-toolkit.enable = true;
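
In case it helps with debugging: if the nvidia-ctk binary is on the PATH (it ships with nvidia-container-toolkit), it should show whether the CDI spec was generated, with entries like nvidia.com/gpu=all — which is what the working docker run test above would suggest:

nvidia-ctk cdi list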

With this I have reached a dead end. Have I overlooked something? Help would be greatly appreciated, since this has been driving me insane for the last three days.

What happens if you omit capabilities: [gpu] from your compose file?

          devices:
          - driver: cdi
-            capabilities: [gpu]
            device_ids:
            - nvidia.com/gpu=all

I don’t see capabilities declared in the prior posts.

Still failing:

validating /opt/ray-cluster/docker-compose.yaml: services.ray-worker.deploy.resources.reservations.devices.0 capabilities is required

Hi @moelrobi - this seems more like a Docker Compose configuration issue than a Nix/NixOS issue to me.

You are able to invoke nvidia-smi without explicitly defining the driver, so it seems the container is able to recognize the host GPUs.

In the course of your 3-day debugging session, did you try removing services.ray-worker.deploy.resources.reservations.devices[0].driver from your docker-compose.yaml file?
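
That is, something along these lines (an untested sketch, keeping the same device_ids as in your file but dropping the driver key):

    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              device_ids:
                - nvidia.com/gpu=all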

Something else to consider trying, if you haven’t already:

Try wrapping each string in the capabilities array in double quotes.

…devices[0].capabilities: ["gpu"] [1]

I thought Docker Compose would automatically coerce it into an array of strings, but based on your output it seems it was interpreting it as a literal string?

[1] https://docs.docker.com/reference/compose-file/deploy/#capabilities

Okay, I found it, and I am sorry for wasting your time.

services:
  ray-worker:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ray-worker
    restart: unless-stopped
    volumes:
      - /opt/models:/models
    environment:
      - VLLM_HOST_IP=${workerIp}
      - NCCL_DEBUG=INFO
      - RAY_DEDUP_LOGS=0
      - NCCL_NET=Socket
      - NCCL_IB_DISABLE=0
    network_mode: host
    ipc: host
    shm_size: '100gb'
    devices:
      - nvidia.com/gpu=all

You don’t need to define gpus: all; just defining the devices you want to attach to the container is enough.
Thank you, @malloc, for the suggestion; your input helped me think outside the box.
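
For anyone finding this later, a quick way to double-check that the container actually sees the GPUs (assuming nvidia-smi is available inside the image built by the Dockerfile):

docker compose up -d
docker compose exec ray-worker nvidia-smi -L

That should print the same four Tesla T4 lines as the docker run test earlier in the thread.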

Marking this as the solution :slight_smile: