slurm

How to get the ID of GPU allocated to a SLURM job on a multiple GPUs node?

Submitted by 妖精的绣舞 on 2019-12-10 16:56:16
Question: When I submit a SLURM job with the option --gres=gpu:1 to a node with two GPUs, how can I get the ID of the GPU that is allocated to the job? Is there an environment variable for this purpose? The GPUs I'm using are all NVIDIA GPUs. Thanks.

Answer 1: You can get the GPU ID with the environment variable CUDA_VISIBLE_DEVICES. This variable is a comma-separated list of the GPU IDs assigned to the job.

Answer 2: Slurm stores this information in an environment variable, SLURM_JOB_GPUS. One way to keep track of it is to print the relevant SLURM environment variables from within the job script.
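A minimal sketch of reading both variables from inside a job script (the echo lines are illustrative, and SLURM_JOB_GPUS may be unset on some cluster configurations):

#!/bin/bash
#SBATCH --gres=gpu:1

# Comma-separated list of GPU indices visible to this job, set by Slurm
echo "CUDA_VISIBLE_DEVICES = ${CUDA_VISIBLE_DEVICES}"

# GPU IDs as recorded by Slurm itself (not set on every installation)
echo "SLURM_JOB_GPUS = ${SLURM_JOB_GPUS:-unset}"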

difference between slurm sbatch -n and -c

Submitted by 纵然是瞬间 on 2019-12-10 11:06:51
Question: The cluster that I work with recently switched from SGE to SLURM. I was wondering what the difference is between the sbatch options --ntasks and --cpus-per-task. --ntasks seemed appropriate for some MPI jobs that I ran, but did not seem appropriate for some OpenMP jobs that I ran. For the OpenMP jobs, in my SLURM script I specified:

#SBATCH --ntasks=20

All the nodes in the partition are 20-core machines, so only one job should run per machine. However, multiple jobs were running simultaneously on each node.
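For reference, a sketch of the usual distinction (program names are placeholders): --ntasks requests separate tasks (processes, e.g. MPI ranks), which Slurm may spread across nodes, while --cpus-per-task requests CPUs for the threads of a single task (e.g. OpenMP):

# MPI-style: 20 tasks (processes), possibly spread over several nodes
#SBATCH --ntasks=20
srun ./my_mpi_program

# OpenMP-style: 1 task with 20 CPUs (threads) on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program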

SLURM sacct shows 'batch' and 'extern' job names

Submitted by 扶醉桌前 on 2019-12-10 10:17:26
Question: I have submitted a job to a SLURM queue; the job has run and completed. I then check the completed jobs using the sacct command, but looking at its output I notice additional entries that I did not expect:

JobID        JobName  State      NCPUS  Timelimit
5297048      test     COMPLETED      1   00:10:00
5297048.bat+ batch    COMPLETED      1
5297048.ext+ extern   COMPLETED      1

Can anyone explain what the 'batch' and 'extern' steps are and what their purpose is? Why does the extern step always complete, even when the job itself fails?
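As an aside, when only the job allocation itself is of interest, sacct can hide these per-step rows; a small sketch using sacct's -X (--allocations) flag:

# Show only the allocation line, without the batch/extern steps
sacct -X -j 5297048 --format=JobID,JobName,State,NCPUS,Timelimit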

Expand columns to see full jobname in Slurm

Submitted by 时间秒杀一切 on 2019-12-09 08:59:15
Question: Is it possible to expand the number of characters used in the JobName column of the sacct command in SLURM? For example, I currently have:

JobID        JobName     Elapsed    NCPUS    NTasks   State
------------ ----------- ---------- -------- -------- ----------
12345        lengthy_na+ 00:00:01          4        1 FAILED

and I would like:

JobID        JobName      Elapsed    NCPUS    NTasks   State
------------ ------------ ---------- -------- -------- ----------
12345        lengthy_name 00:00:01          4        1 FAILED

Answer 1: You should use the format option, with a width specifier after the column name, e.g.:

sacct --format="JobID,JobName%30,Elapsed,NCPUS,NTasks,State"
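The same format string can also be set once per shell session via the SACCT_FORMAT environment variable, which sacct reads (assuming a bash-like shell):

export SACCT_FORMAT="JobID,JobName%30,Elapsed,NCPUS,NTasks,State"
sacct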

Use Bash variable within SLURM sbatch script

Submitted by  ̄綄美尐妖づ on 2019-12-08 15:00:54
Question: I'm trying to obtain a value from another file and use this within a SLURM submission script. However, I get an error that the value is non-numerical; in other words, it is not being dereferenced. Here is the script:

#!/bin/bash
# This reads out the number of procs based on the decomposeParDict
numProcs=`awk '/numberOfSubdomains/ {print $2}' ./meshModel/decomposeParDict`
echo "NumProcs = $numProcs"

#SBATCH --job-name=SnappyHexMesh
#SBATCH --output=./logs/SnappyHexMesh.log
#
#SBATCH --ntasks=`
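This is documented sbatch behavior: #SBATCH lines are parsed by sbatch at submission time, before the shell runs anything, and sbatch stops scanning for them at the first non-comment line, so a shell variable can never be expanded inside an #SBATCH directive. A common workaround, sketched here with assumed file names, is to compute the value outside the script and pass it on the command line, where it overrides any in-script directive:

# hypothetical wrapper script, run instead of calling sbatch directly
numProcs=$(awk '/numberOfSubdomains/ {print $2}' ./meshModel/decomposeParDict)
sbatch --ntasks="$numProcs" snappyHexMesh.sbatch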

Slurm job, knowing what node it is on

Submitted by 懵懂的女人 on 2019-12-08 08:52:13
Question: Is there a way in bash/SLURM for the script to know which node it is running on? I sbatch a bash script called wrapCode.sh, and I am monitoring the script time as well as which node it is running on. I know how to monitor the script time, but is there a way to echo out at the end which node I was on? sstat does this, but I need to know what my job ID is, which the script also doesn't seem to know (or at least I haven't been able to find it).

Answer 1: A simple, yet effective, and often used way is to call hostname from within the job script; the job ID, for its part, is exported by Slurm as the SLURM_JOB_ID environment variable.
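A minimal sketch of the idea (the variables below are the standard ones Slurm exports to batch jobs):

#!/bin/bash
#SBATCH --job-name=wrapCode

echo "Job ID:  ${SLURM_JOB_ID}"
echo "Node(s): ${SLURM_JOB_NODELIST}"   # all nodes in the allocation
echo "This script runs on: $(hostname)"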

Specifying SLURM Resources When Executing Multiple Jobs in Parallel

Submitted by 。_饼干妹妹 on 2019-12-08 06:26:01
Question: According to the answers to "What does the --ntasks or -n tasks does in SLURM?", one can run multiple jobs in parallel via the ntasks parameter for sbatch, followed by srun. As a follow-up question: how would one specify the amount of memory needed when running jobs in parallel like this? If, say, 3 jobs run in parallel, each needing 8G of memory, would one specify 24G of memory in sbatch (i.e. the sum of the memory of all jobs), or not give memory parameters to sbatch and instead specify 8G in each srun call?
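One common pattern, sketched under the assumption of one CPU per task (the program names are placeholders): request the memory per CPU at the job level, so that each parallel step receives its own share:

#!/bin/bash
#SBATCH --ntasks=3
#SBATCH --mem-per-cpu=8G    # 8G for each of the three single-CPU tasks

# --exclusive keeps the steps from sharing resources
# (recent Slurm versions spell this --exact at the step level)
srun -n 1 --exclusive ./job1 &
srun -n 1 --exclusive ./job2 &
srun -n 1 --exclusive ./job3 &
wait    # do not exit until all background steps have finished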

what is the minimum number of computers for a slurm cluster

Submitted by 非 Y 不嫁゛ on 2019-12-08 00:25:20
Question: I would like to set up a SLURM cluster. How many machines do I need at minimum? Can I start with 2 machines (one being only a client, and one being both client and server)?

Answer 1: You can start using only one machine, but 2 machines is the most standard configuration: one machine as the controller and the other as the "worker" node. With this model you can add as many "worker" nodes to the cluster as you like. This way the server will not execute jobs and will not suffer interference from them.

Answer 2: As @Carles wrote, you can use only one computer if you want, running both the controller (slurmctld) and the worker daemon (slurmd) on the same machine.
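To make the single-machine case concrete, a heavily trimmed slurm.conf sketch (the hostname, CPU count, and partition name are assumptions, and a real file needs several more settings):

ClusterName=mycluster
SlurmctldHost=node01                  # the same host runs slurmctld and slurmd
NodeName=node01 CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=node01 Default=YES MaxTime=INFINITE State=UP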

Questions on alternative ways to run 4 parallel jobs

Submitted by 青春壹個敷衍的年華 on 2019-12-06 13:57:38
Question: Below are three different sbatch scripts that produce roughly similar results. (I show only the parts where the scripts differ; the ## prefix indicates the output obtained by submitting the scripts to sbatch.)

Script 0

#SBATCH -n 4
srun -l hostname -s

## ==> slurm-7613732.out <==
## 0: node-73
## 1: node-73
## 2: node-73
## 3: node-73

Script 1

#SBATCH -n 1
#SBATCH -a 1-4
srun hostname -s

## ==> slurm-7613733_1.out <==
## node-72
##
## ==> slurm-7613733_2.out <==
## node-73
##
## ==> slurm
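What the two surviving scripts show: Script 0 is a single job with four tasks, which Slurm here packed onto one node, while Script 1 is a job array of four independent single-task jobs, which can land on different nodes. A third common pattern, given purely as an illustration and not as the author's truncated Script 2, runs the four tasks as concurrent steps of one job:

#SBATCH -n 4

# four concurrent single-task steps inside one allocation
for i in 1 2 3 4; do
    srun -n 1 --exclusive hostname -s &
done
wait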
