SLURM job taking up entire node when using just one GPU

Submitted by 断了今生、忘了曾经 on 2019-11-30 19:16:05

Question


I am submitting multiple jobs to a SLURM queue. Each job uses 1 GPU, and we have 4 GPUs per node. However, once a job is running, it takes up the entire node, leaving the other 3 GPUs idle. Is there any way to avoid this, so that I can send multiple jobs to one node, each using one GPU?

My script looks like this:

#!/bin/bash
#SBATCH --gres=gpu:1           # request a single GPU
#SBATCH --ntasks-per-node=1    # one task on the node
#SBATCH -p ghp-queue           # partition to submit to
myprog.exe

Answer 1:


I was also unable to run multiple jobs on different GPUs. What helped was adding OverSubscribe=FORCE to the partition configuration in slurm.conf, like this:

PartitionName=compute Nodes=ALL ... OverSubscribe=FORCE
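For context, Slurm can only hand out GPUs one at a time if they are declared as generic resources (GRES). Below is a minimal sketch of the relevant configuration, assuming a hypothetical 4-GPU node named node01; the node name, CPU count, device paths, and partition name are placeholders, not from the question:

# slurm.conf (placeholder names; adjust to your cluster)
GresTypes=gpu
NodeName=node01 Gres=gpu:4 CPUs=32 State=UNKNOWN
PartitionName=compute Nodes=node01 Default=YES State=UP OverSubscribe=FORCE

# gres.conf on node01 (device paths are placeholders)
NodeName=node01 Name=gpu File=/dev/nvidia[0-3]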

After that, I was able to run four jobs with --gres=gpu:1, and each one took a different GPU (a fifth job is queued, as expected).
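As a usage sketch of that setup, the single-GPU jobs can be submitted in a loop with sbatch (myprog.exe is the binary from the question; the options mirror the script above):

# submit four single-GPU jobs; with OverSubscribe=FORCE they can share one node
for i in 1 2 3 4; do
    sbatch --gres=gpu:1 --ntasks-per-node=1 -p ghp-queue --wrap="./myprog.exe"
done

# check which GPU each job received
squeue -u $USER
scontrol show job <jobid> | grep -i gres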



Source: https://stackoverflow.com/questions/49405613/slurm-job-taking-up-entire-node-when-using-just-one-gpu
