Hello!
I am fairly new to NixOS, but I am loving the experience so far.
I have now run into a problem that was already described in an earlier post here.
Since that thread is over a year old, I didn’t want to “reawaken” it and decided to create a new one.
I am trying to create a new docker container with the following docker-compose.yaml:
services:
  ray-worker:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ray-worker
    restart: unless-stopped
    volumes:
      - /opt/models:/models
    environment:
      - VLLM_HOST_IP=${workerIp}
      - NCCL_DEBUG=INFO
      - RAY_DEDUP_LOGS=0
      - NCCL_NET=Socket
      - NCCL_IB_DISABLE=0
    gpus: all
    network_mode: host
    ipc: host
    shm_size: '100gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              capabilities: [gpu]
              device_ids:
                - nvidia.com/gpu=all
The container itself is created successfully, but when starting it I get the following error:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
But running this command found in the article works fine:
[user@lab-compute:~]$ docker run --rm -it --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-e08e8553-34f5-08e9-ad9c-27dc0632d5bc)
GPU 1: Tesla T4 (UUID: GPU-5480d59b-6c81-06b6-2f44-51e1ed9b69cc)
GPU 2: Tesla T4 (UUID: GPU-5931eb7f-3cf0-4b12-78e0-12c8fa956123)
GPU 3: Tesla T4 (UUID: GPU-8d6128ad-66f3-0b9d-7668-883ef3cc4b95)
I have the following hardware configuration in my nix file:
virtualisation.docker = {
  enable = true;
  enableOnBoot = true;
};

hardware.graphics.enable = true;
services.xserver.videoDrivers = [ "nvidia" ];

hardware.nvidia = {
  modesetting.enable = true;
  nvidiaSettings = true;
  open = true;
  package = config.boot.kernelPackages.nvidiaPackages.production;
};

hardware.nvidia-container-toolkit.enable = true;
With this I have reached a dead end. Have I overlooked something? Help would be greatly appreciated, since this has been driving me insane for the last three days.
ruffsl
July 31, 2025, 1:59pm
2
What happens if you omit capabilities: [gpu] from your compose file?
devices:
  - driver: cdi
    capabilities: [gpu]
    device_ids:
      - nvidia.com/gpu=all
I don’t see capabilities declared in prior posts:
Hello @Traktorbek!
You need to adapt your docker-compose file so that it uses the CDI driver, as documented in the Nixpkgs Reference Manual. In your case, it should be along the lines of:
pipeline:
  image: '${DOCKER_IMAGE_PIPELINE?Variable not set}:${TAG-latest}'
  build:
    target: development
    context: ./pipeline
    args:
      - PIPELINE_BASE_IMAGE=${PIPELINE_BASE_IMAGE}
  environment:
    - CONFIG_PATH=${PIPELINE_CONFIG_PATH}
  runtime: ${DOCKER_RUNTIME:-runc}
  restart: always
  deploy:
    …
I’ve been using this:
deploy:
  resources:
    reservations:
      devices:
        - driver: cdi
          device_ids:
            - nvidia.com/gpu=all
Container runs, but vulkan doesn’t work for me - not sure if cuda would work…
If you find out anything else, I’d be happy to hear about it… Trying to switch to NixOS as main OS and this is kinda stopping me
Still failing
validating /opt/ray-cluster/docker-compose.yaml: services.ray-worker.deploy.resources.reservations.devices.0 capabilities is required
malloc
July 31, 2025, 3:47pm
4
Hi @moelrobi - this seems more like a Docker Compose configuration issue than a Nix/NixOS issue to me.
You are able to invoke nvidia-smi without explicitly defining the driver, and it seems the container is able to recognize the host GPUs.
In the course of your 3-day debugging session, did you try removing services.ray-worker.deploy.resources.reservations.devices[0].driver from your docker-compose.yaml file?
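If not, the reservation would then look roughly like this (just a sketch of the idea, not something I have verified on your setup):

deploy:
  resources:
    reservations:
      devices:
        # driver key removed; capabilities and device IDs kept as before
        - capabilities: [gpu]
          device_ids:
            - nvidia.com/gpu=all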
malloc
July 31, 2025, 4:09pm
5
Something else to consider trying, if you haven’t already:
Try wrapping each string in the capabilities array with double quotes [1]:
…devices[0].capabilities: ["gpu"]
I thought Docker Compose would automatically infer/coerce it into an array of strings, but based on your output it seems it was being interpreted as a literal string?
[1] https://docs.docker.com/reference/compose-file/deploy/#capabilities
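So the devices entry would read something like this (again only a sketch, untested on your hardware):

deploy:
  resources:
    reservations:
      devices:
        - driver: cdi
          # each capability written as an explicitly quoted string
          capabilities: ["gpu"]
          device_ids:
            - nvidia.com/gpu=all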
Okay, I found it, and I am sorry for wasting time.
services:
  ray-worker:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ray-worker
    restart: unless-stopped
    volumes:
      - /opt/models:/models
    environment:
      - VLLM_HOST_IP=${workerIp}
      - NCCL_DEBUG=INFO
      - RAY_DEDUP_LOGS=0
      - NCCL_NET=Socket
      - NCCL_IB_DISABLE=0
    network_mode: host
    ipc: host
    shm_size: '100gb'
    devices:
      - nvidia.com/gpu=all
You don’t need to define gpus: all; just define the devices you want to attach to the container.
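In other words, the whole deploy/resources/reservations block can be dropped in favour of the short CDI device syntax. I have only tested the “all” variant; selecting individual GPUs by index is an assumption on my part:

devices:
  - nvidia.com/gpu=all   # attach all GPUs via CDI
  # presumably individual GPUs could also be picked by index, e.g.:
  # - nvidia.com/gpu=0
  # - nvidia.com/gpu=1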
Thank you, @malloc, for the suggestion; with your input I tried to think outside the box.
Marking this as the solution