How do I select which GPU to run a job on?


Question


In a multi-GPU computer, how do I designate which GPU a CUDA job should run on?

As an example, when installing CUDA I opted to install the NVIDIA_CUDA-<#.#>_Samples, then ran several instances of the nbody simulation, but they all ran on GPU 0; GPU 1 was completely idle (monitored using watch -n 1 nvidia-smi). Checking CUDA_VISIBLE_DEVICES using

echo $CUDA_VISIBLE_DEVICES

I found this was not set. I tried setting it using

CUDA_VISIBLE_DEVICES=1

and then ran nbody again, but it still went to GPU 0.

I looked at the related question, how to choose designated GPU to run CUDA program?, but the deviceQuery command is not in the CUDA 8.0 bin directory. In addition to CUDA_VISIBLE_DEVICES, I saw other posts refer to the environment variable CUDA_DEVICES, but it was not set and I did not find information on how to use it.

While not directly related to my question, using nbody -device=1 I was able to get the application to run on GPU 1, but nbody -numdevices=2 did not run on both GPUs 0 and 1.

I am testing this on a system running CentOS 6.8 with the bash shell, CUDA 8.0, two GTX 1080 GPUs, and NVIDIA driver 367.44.

I know that when writing CUDA code you can manage and control which CUDA resources to use, but how would I manage this from the command line when running a compiled CUDA executable?


Answer 1:


The problem was caused by not setting the CUDA_VISIBLE_DEVICES variable within the shell correctly: a bare assignment such as CUDA_VISIBLE_DEVICES=1 only creates a shell variable, which is not exported to child processes, so the CUDA runtime in the launched program never sees it.

To specify CUDA device 1 for example, you would set the CUDA_VISIBLE_DEVICES using

export CUDA_VISIBLE_DEVICES=1

or

CUDA_VISIBLE_DEVICES=1 ./cuda_executable

The former sets the variable for the remainder of the current shell session; the latter sets it only for that particular invocation of the executable.
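
As a quick sanity check, the difference between a bare assignment and an exported variable can be seen directly (a minimal sketch, assuming bash and that the nbody sample binary is in the current directory):

CUDA_VISIBLE_DEVICES=1          # shell variable only; not passed to child processes
./nbody                         # still sees both GPUs and runs on GPU 0

export CUDA_VISIBLE_DEVICES=1   # exported, so inherited by every subsequent command
./nbody                         # now only GPU 1 is visible (it appears as device 0)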

If you want to specify more than one device, use

export CUDA_VISIBLE_DEVICES=0,1

or

CUDA_VISIBLE_DEVICES=0,1 ./cuda_executable
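
To confirm which card an invocation actually lands on, you can watch nvidia-smi from a second terminal while the job runs (a small sketch, assuming the nbody sample binary from the question):

CUDA_VISIBLE_DEVICES=1 ./nbody &
watch -n 1 nvidia-smi           # the nbody process should appear on GPU 1 only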



Answer 2:


export NVIDIA_VISIBLE_DEVICES=gpu_id
export CUDA_VISIBLE_DEVICES=0

where gpu_id is the 0-based ID of the GPU made available to the guest system (e.g. to the docker container environment).


Note that this exposes only a single card to the guest system (with local ID zero), hence we hard-code CUDA_VISIBLE_DEVICES to zero. You can verify that a different card is selected by inspecting nvidia-smi's Bus-Id.
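
For example, the two variables can be passed when starting the container (a sketch assuming the NVIDIA container runtime, nvidia-docker2, is installed; the image tag and GPU index are only illustrative):

docker run --runtime=nvidia \
    -e NVIDIA_VISIBLE_DEVICES=1 \
    -e CUDA_VISIBLE_DEVICES=0 \
    nvidia/cuda:10.1-base nvidia-smi

Inside the container, nvidia-smi lists a single GPU; comparing its Bus-Id with the host's nvidia-smi output confirms which physical card was exposed.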

The accepted solution, which is based on CUDA_VISIBLE_DEVICES alone, does not hide the other, unavailable cards (they still appear in nvidia-smi, for example), and thus causes computation errors if you try to use them in your modeling algorithms. With this solution, the other cards are simply not visible to the guest system, but other users can still access them and share their compute power on an equal basis, just like with CPUs (verified).

This is also preferable to solutions using Kubernetes / OpenShift controllers (resources.limits.nvidia.com/gpu), which would impose a lock on the allocated card, removing it from the pool of available resources (so the number of containers with GPU access could not exceed the number of physical cards).

If you want GPU load balancing, make gpu_id random at each guest system start.
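
A minimal sketch of that idea (assuming bash on the host, where nvidia-smi -L prints one line per physical GPU):

NUM_GPUS=$(nvidia-smi -L | wc -l)                 # count the physical GPUs
export NVIDIA_VISIBLE_DEVICES=$((RANDOM % NUM_GPUS))
export CUDA_VISIBLE_DEVICES=0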

This has been tested under CUDA 8.0, 9.0, and 10.1, in Docker containers running Ubuntu 18.04, orchestrated by OpenShift 3.11.



Source: https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on
