slurm

Emulating SLURM on Ubuntu 16.04

99封情书 submitted on 2019-12-02 12:35:42
I want to emulate SLURM on Ubuntu 16.04. I don't need serious resource management; I just want to test some simple examples. I cannot install SLURM in the usual way, and I am wondering if there are other options. Other things I have tried: a Docker image. Unfortunately, docker pull agaveapi/slurm; docker run agaveapi/slurm gives me errors:

/usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c"
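For what it's worth, the message quoted above is supervisord warning that no configuration file was passed, and supplying one explicitly with -c is what it suggests. A hedged sketch of that idea follows; the supervisord binary and configuration paths inside the image are assumptions rather than anything the excerpt confirms, so inspecting the container interactively first is the safer step.

docker pull agaveapi/slurm
# look around inside the image to find where its supervisord config actually lives
docker run -it agaveapi/slurm /bin/bash
# then start supervisord with an explicit config file; -n keeps it in the foreground
docker run agaveapi/slurm /usr/bin/supervisord -c /etc/supervisord.conf -n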

SLURM: After allocating all GPUs no more cpu job can be submitted

≯℡__Kan透↙ submitted on 2019-12-02 00:08:41
We have just started using SLURM for managing our GPUs (currently just 2). We use Ubuntu 14.04 and slurm-llnl. I have configured gres.conf and srun works. The problem is that if I run two jobs with --gres=gpu:1, the two GPUs are successfully allocated and the jobs start running; I then expect to be able to run more jobs (in addition to the 2 GPU jobs) without --gres=gpu:1 (i.e. jobs that only use CPU and RAM), but it is not possible. The error message says that it could not allocate the required resources (even though there are 24 CPU cores). This is my gres.conf: Name=gpu Type=titanx File=/dev
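For context (an editorial note, not part of the question): symptoms like this are usually governed by the allocation policy in slurm.conf rather than by gres.conf. With the default select/linear plugin each job is handed whole nodes, so two GPU jobs can end up owning every CPU on the machine. Below is a hedged sketch of a configuration that makes cores and memory individually consumable; the node name, core count, memory and device paths are placeholders, not the asker's actual values.

# slurm.conf (placeholders): schedule individual cores and memory instead of whole nodes
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
NodeName=gpunode CPUs=24 RealMemory=64000 Gres=gpu:titanx:2 State=UNKNOWN
PartitionName=main Nodes=gpunode Default=YES MaxTime=INFINITE State=UP

# gres.conf on the node, one line per GPU device
Name=gpu Type=titanx File=/dev/nvidia0
Name=gpu Type=titanx File=/dev/nvidia1

With cores treated as consumable resources, CPU-only jobs can be accepted alongside the two --gres=gpu:1 jobs as long as enough cores and memory remain free.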

getExecutorMemoryStatus().size() not outputting correct num of executors

只谈情不闲聊 submitted on 2019-12-01 13:47:19
In short, I need the number of executors/workers in the Spark cluster, but using sc._jsc.sc().getExecutorMemoryStatus().size() gives me 1 when in fact there are 12 executors. In more detail, I'm trying to determine the number of executors and use that number as the number of partitions I ask Spark to distribute my RDD across. I do this to leverage the parallelism, as my initial data is just a range of numbers, but then every one of them gets processed in an rdd#foreach method. The process is both memory-wise and computationally heavy, so I want the range of numbers originally to reside in
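One common cause, though the excerpt above does not confirm it, is calling this right after the SparkContext is created, before the executors have registered, so only the driver's entry shows up. The following is a hedged PySpark sketch that polls until the reported count stabilises; the timeout, poll interval and partitioning are illustrative only.

import time
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("executor-count-sketch")
sc = SparkContext(conf=conf)

def live_executor_count(sc, timeout_s=60, poll_s=1):
    """Poll getExecutorMemoryStatus() until its size stops changing.

    The returned map also contains the driver, so the executor count is size - 1.
    """
    deadline = time.time() + timeout_s
    last = -1
    while time.time() < deadline:
        n = sc._jsc.sc().getExecutorMemoryStatus().size() - 1
        if n == last and n > 0:
            return n
        last = n
        time.sleep(poll_s)
    return max(last, 1)

# use the (eventually) registered executor count as the partition count
num_partitions = live_executor_count(sc)
rdd = sc.parallelize(range(1000), numSlices=num_partitions)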

How to import a local python module when using the sbatch command in SLURM

孤者浪人 submitted on 2019-12-01 08:15:29
I was using the cluster manager SLURM and I was running a submission script with sbatch (with a Python interpreter). The sbatch submission imported one of my modules, called main_nn.py. The module is located in the same place as my submission directory; however, Python fails to find it even though the file exists. I am having a hard time figuring out why this is happening. My Python file looks as follows:

#!/usr/bin/env python
#SBATCH --job-name=Python
print('hi')
import main_nn

However, the output of my slurm dump file is:

hi
Traceback (most recent call last):
File "/home/slurm/slurmd
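A plausible explanation, which the truncated traceback does not confirm, is that sbatch copies the submitted script into slurmd's spool directory and runs that copy, so Python's module search path points at the spool directory rather than at the directory containing main_nn.py. A hedged sketch of one way around this, using the SLURM_SUBMIT_DIR environment variable that SLURM sets for each job:

#!/usr/bin/env python
#SBATCH --job-name=Python

# sketch: the running copy of this script lives in slurmd's spool directory,
# so add the directory the job was submitted from back onto sys.path
import os
import sys
sys.path.append(os.environ.get('SLURM_SUBMIT_DIR', os.getcwd()))

print('hi')
import main_nn  # should now resolve if main_nn.py sits in the submission directory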

SLURM job taking up entire node when using just one GPU

落花浮王杯 submitted on 2019-12-01 00:17:59
I am submitting multiple jobs to a SLURM queue. Each job uses 1 GPU. We have 4 GPUs per node. However, once a job is running it takes up the entire node, leaving 3 GPUs idle. Is there any way to avoid this, so that I can send multiple jobs to one node, using one GPU each? My script looks like this:

#SLURM --gres=gpu:1
#SLURM --ntasks-per-node 1
#SLURM -p ghp-queue
myprog.exe

Answer 1: I was also unable to run multiple jobs on different GPUs. What helped was adding OverSubscribe=FORCE to the partition configuration in slurm.conf, like this:

PartitionName=compute Nodes=ALL ... OverSubscribe=FORCE

After
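Worth noting as an editorial observation about the excerpt, not as part of the original answer: sbatch only honours directive lines that begin with #SBATCH, so lines written as #SLURM are treated as ordinary comments and the job falls back to the partition defaults. Below is a hedged sketch of a per-GPU submission script with explicit, bounded resource requests; the CPU and memory numbers are placeholders.

#!/bin/bash
#SBATCH --job-name=one-gpu-job
#SBATCH --partition=ghp-queue
#SBATCH --gres=gpu:1          # request a single GPU out of the node's four
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4     # placeholder: leave cores free for the other GPU jobs
#SBATCH --mem=16G             # placeholder: likewise for memory

srun myprog.exe

Whether four such jobs can actually share one node also depends on the cluster scheduling consumable resources (or on the OverSubscribe setting mentioned in the answer).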

What does the --ntasks or -n option do in SLURM?

依然范特西╮ submitted on 2019-11-30 06:44:36
Question: I was using SLURM on a computing cluster and it had the --ntasks or -n option. I have obviously read the documentation for it (http://slurm.schedmd.com/sbatch.html): sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the Slurm controller that job steps run within the allocation will launch a maximum of number tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task
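To make the quoted wording concrete, here is a small editorial sketch, not taken from the original question: sbatch itself only reserves room for the requested number of tasks, and it is srun inside the batch script that actually launches them.

#!/bin/bash
#SBATCH --job-name=ntasks-demo
#SBATCH --ntasks=4            # the allocation reserves room for 4 tasks
#SBATCH --cpus-per-task=1     # each task gets one CPU

# srun launches one copy of the command per task, so this prints 4 hostnames
srun hostname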

Use slurm job id

僤鯓⒐⒋嵵緔 submitted on 2019-11-30 00:18:42
When I launch a computation on the cluster, I usually have a separate program doing the post-processing at the end:

sbatch simulation
sbatch --dependency=afterok:JOBIDHERE postprocessing

I want to avoid mistyping and have the correct job id inserted automatically. Any idea? Thanks.

Answer: You can do something like this:

RES=$(sbatch simulation) && sbatch --dependency=afterok:${RES##* } postprocessing

The RES variable will hold the output of the sbatch command, something like "Submitted batch job 102045". The construct ${RES##* } isolates the last word, in the current case the job id.
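A related variant, added here as an editorial note rather than as part of the quoted answer: if the installed sbatch supports the --parsable flag, it prints just the job id, so no word-splitting is needed.

# assumes sbatch accepts --parsable (prints only the job id on success)
JOBID=$(sbatch --parsable simulation)
sbatch --dependency=afterok:${JOBID} postprocessing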