slurm

Emulating SLURM on Ubuntu 16.04

99封情书 submitted on 2019-12-02 12:35:42
I want to emulate SLURM on Ubuntu 16.04. I don't need serious resource management; I just want to test some simple examples. I cannot install SLURM in the usual way, and I am wondering if there are other options. Other things I have tried: a Docker image. Unfortunately, docker pull agaveapi/slurm; docker run agaveapi/slurm gives me errors:

/usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c"
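For what it's worth, the message quoted above is supervisord warning that no configuration file was passed, and supplying one explicitly with -c is what it suggests. A hedged sketch of that idea follows; the supervisord binary and configuration paths inside the image are assumptions rather than anything the excerpt confirms, so inspecting the container interactively first is the safer step.

docker pull agaveapi/slurm
# look around inside the image to find where its supervisord config actually lives
docker run -it agaveapi/slurm /bin/bash
# then start supervisord with an explicit config file; -n keeps it in the foreground
docker run agaveapi/slurm /usr/bin/supervisord -c /etc/supervisord.conf -n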

SLURM: After allocating all GPUs no more cpu job can be submitted

≯℡__Kan透↙ submitted on 2019-12-02 00:08:41
We have just started using SLURM for managing our GPUs (currently just 2). We use Ubuntu 14.04 and slurm-llnl. I have configured gres.conf and srun works. The problem is that if I run two jobs with --gres=gpu:1, the two GPUs are successfully allocated and the jobs start running; I then expect to be able to run more jobs (in addition to the 2 GPU jobs) without --gres=gpu:1 (i.e. jobs that only use CPU and RAM), but it is not possible. The error message says that it could not allocate the required resources (even though there are 24 CPU cores). This is my gres.conf: Name=gpu Type=titanx File=/dev
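For context (an editorial note, not part of the question): symptoms like this are usually governed by the allocation policy in slurm.conf rather than by gres.conf. With the default select/linear plugin each job is handed whole nodes, so two GPU jobs can end up owning every CPU on the machine. Below is a hedged sketch of a configuration that makes cores and memory individually consumable; the node name, core count, memory and device paths are placeholders, not the asker's actual values.

# slurm.conf (placeholders): schedule individual cores and memory instead of whole nodes
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
NodeName=gpunode CPUs=24 RealMemory=64000 Gres=gpu:titanx:2 State=UNKNOWN
PartitionName=main Nodes=gpunode Default=YES MaxTime=INFINITE State=UP

# gres.conf on the node, one line per GPU device
Name=gpu Type=titanx File=/dev/nvidia0
Name=gpu Type=titanx File=/dev/nvidia1

With cores treated as consumable resources, CPU-only jobs can be accepted alongside the two --gres=gpu:1 jobs as long as enough cores and memory remain free.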

getExecutorMemoryStatus().size() not outputting correct num of executors

只谈情不闲聊 submitted on 2019-12-01 13:47:19
In short, I need the number of executors/workers in the Spark cluster, but using sc._jsc.sc().getExecutorMemoryStatus().size() gives me 1 when in fact there are 12 executors. In more detail, I'm trying to determine the number of executors and use that number as the number of partitions I ask Spark to distribute my RDD across. I do this to leverage the parallelism, as my initial data is just a range of numbers, but then every one of them gets processed in an rdd#foreach method. The process is both memory-wise and computationally heavy, so I want the range of numbers originally to reside in
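One common cause, though the excerpt above does not confirm it, is calling this right after the SparkContext is created, before the executors have registered, so only the driver's entry shows up. The following is a hedged PySpark sketch that polls until the reported count stabilises; the timeout, poll interval and partitioning are illustrative only.

import time
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("executor-count-sketch")
sc = SparkContext(conf=conf)

def live_executor_count(sc, timeout_s=60, poll_s=1):
    """Poll getExecutorMemoryStatus() until its size stops changing.

    The returned map also contains the driver, so the executor count is size - 1.
    """
    deadline = time.time() + timeout_s
    last = -1
    while time.time() < deadline:
        n = sc._jsc.sc().getExecutorMemoryStatus().size() - 1
        if n == last and n > 0:
            return n
        last = n
        time.sleep(poll_s)
    return max(last, 1)

# use the (eventually) registered executor count as the partition count
num_partitions = live_executor_count(sc)
rdd = sc.parallelize(range(1000), numSlices=num_partitions)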

How to import a local python module when using the sbatch command in SLURM

孤者浪人 submitted on 2019-12-01 08:15:29
I was using the cluster manager SLURM and I was running a submission script with sbatch (with a Python interpreter). The sbatch submission imported one of my modules, called main_nn.py. The module is located in the same place as my submission directory; however, Python fails to find it even though the file exists. I am having a hard time figuring out why this is happening. My Python file looks as follows:

#!/usr/bin/env python
#SBATCH --job-name=Python
print('hi')
import main_nn

However, the output of my slurm dump file is:

hi
Traceback (most recent call last):
File "/home/slurm/slurmd
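A plausible explanation, which the truncated traceback does not confirm, is that sbatch copies the submitted script into slurmd's spool directory and runs that copy, so Python's module search path points at the spool directory rather than at the directory containing main_nn.py. A hedged sketch of one way around this, using the SLURM_SUBMIT_DIR environment variable that SLURM sets for each job:

#!/usr/bin/env python
#SBATCH --job-name=Python

# sketch: the running copy of this script lives in slurmd's spool directory,
# so add the directory the job was submitted from back onto sys.path
import os
import sys
sys.path.append(os.environ.get('SLURM_SUBMIT_DIR', os.getcwd()))

print('hi')
import main_nn  # should now resolve if main_nn.py sits in the submission directory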

SLURM job taking up entire node when using just one GPU

落花浮王杯 submitted on 2019-12-01 00:17:59
I am submitting multiple jobs to a SLURM queue. Each job uses 1 GPU. We have 4 GPUs per node. However, once a job is running it takes up the entire node, leaving 3 GPUs idle. Is there any way to avoid this, so that I can send multiple jobs to one node, using one GPU each? My script looks like this:

#SLURM --gres=gpu:1
#SLURM --ntasks-per-node 1
#SLURM -p ghp-queue
myprog.exe

Answer 1: I was also unable to run multiple jobs on different GPUs. What helped was adding OverSubscribe=FORCE to the partition configuration in slurm.conf, like this:

PartitionName=compute Nodes=ALL ... OverSubscribe=FORCE

After
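Worth noting as an editorial observation about the excerpt, not as part of the original answer: sbatch only honours directive lines that begin with #SBATCH, so lines written as #SLURM are treated as ordinary comments and the job falls back to the partition defaults. Below is a hedged sketch of a per-GPU submission script with explicit, bounded resource requests; the CPU and memory numbers are placeholders.

#!/bin/bash
#SBATCH --job-name=one-gpu-job
#SBATCH --partition=ghp-queue
#SBATCH --gres=gpu:1          # request a single GPU out of the node's four
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4     # placeholder: leave cores free for the other GPU jobs
#SBATCH --mem=16G             # placeholder: likewise for memory

srun myprog.exe

Whether four such jobs can actually share one node also depends on the cluster scheduling consumable resources (or on the OverSubscribe setting mentioned in the answer).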

What does the --ntasks or -n option do in SLURM?

依然范特西╮ submitted on 2019-11-30 06:44:36
Question: I was using SLURM on a computing cluster and it had the --ntasks or -n option. I have obviously read the documentation for it (http://slurm.schedmd.com/sbatch.html): sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task
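To make the quoted wording concrete, here is a small editorial sketch, not taken from the original question: sbatch itself only reserves room for the requested number of tasks, and it is srun inside the batch script that actually launches them.

#!/bin/bash
#SBATCH --job-name=ntasks-demo
#SBATCH --ntasks=4            # the allocation reserves room for 4 tasks
#SBATCH --cpus-per-task=1     # each task gets one CPU

# srun launches one copy of the command per task, so this prints 4 hostnames
srun hostname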

Use slurm job id

僤鯓⒐⒋嵵緔 submitted on 2019-11-30 00:18:42
When I launch a computation on the cluster, I usually have a separate program doing the post-processing at the end:

sbatch simulation
sbatch --dependency=afterok:JOBIDHERE postprocessing

I want to avoid mistyping and have the correct job id inserted automatically. Any idea? Thanks.

Answer: You can do something like this:

RES=$(sbatch simulation) && sbatch --dependency=afterok:${RES##* } postprocessing

The RES variable will hold the output of the sbatch command, something like "Submitted batch job 102045". The construct ${RES##* } isolates the last word, in the current case the job id.
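A related variant, added here as an editorial note rather than as part of the quoted answer: if the installed sbatch supports the --parsable flag, it prints just the job id, so no word-splitting is needed.

# assumes sbatch accepts --parsable (prints only the job id on success)
JOBID=$(sbatch --parsable simulation)
sbatch --dependency=afterok:${JOBID} postprocessing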