slurm

Use slurm job id

Submitted by 我是研究僧i on 2019-12-03 18:39:53
Question: When I launch a computation on the cluster, I usually have a separate program doing the post-processing at the end:

    sbatch simulation
    sbatch --dependency=afterok:JOBIDHERE postprocessing

I want to avoid mistyping and have the correct job id inserted automatically. Any idea? Thanks

Answer 1: You can do something like this:

    RES=$(sbatch simulation) && sbatch --dependency=afterok:${RES##* } postprocessing

The RES variable will hold the output of the sbatch command, something like "Submitted batch job <jobid>"; the ${RES##* } expansion strips everything up to the last space, leaving only the job id.
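The parameter expansion is the only non-obvious part, and it can be checked without a cluster. A minimal sketch, with a hard-coded stand-in for the sbatch output:

```shell
# RES mimics what sbatch prints; ${RES##* } deletes the longest prefix ending
# in a space, leaving only the last word, i.e. the job id.
RES="Submitted batch job 123456"
JOBID=${RES##* }
echo "$JOBID"    # prints 123456
```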

How to monitor resources during slurm job?

Submitted by 喜欢而已 on 2019-12-03 15:46:35
Question: I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system, and I'm interested in plotting CPU and memory usage over time, i.e. while the job is running. I know about sacct and sstat, and I was thinking of including these commands in my submission script, e.g. something along the lines of:

    #!/bin/bash
    #SBATCH <options>

    # Run the actual job in the background
    srun my_program input.in output.out &

    # While loop that records resources
    JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
    FIRST=0
    # sleep time in seconds
    STIME=15
    while [ "
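The truncated loop above can be sketched end to end. The sampling command is stubbed here so the structure runs anywhere; in a real job script the stub would be replaced by something like `sstat -j "$SLURM_JOB_ID" --format=AveCPU,AveRSS --noheader` (the field names are an assumption to adapt to your cluster):

```shell
#!/bin/bash
# QUERY is a stub standing in for the sstat call so the loop itself is runnable.
QUERY() { echo "00:00:0$1 10${1}00K"; }

STIME=0                 # sleep time in seconds (15 in the question; 0 here)
LOG=usage.log
: > "$LOG"              # truncate/create the log file

# In a real script this would loop while sacct reports the job state RUNNING.
for i in 1 2 3; do
    echo "$(date +%s) $(QUERY "$i")" >> "$LOG"   # timestamp + resource sample
    sleep "$STIME"
done
```

Each line of usage.log is then one timestamped sample, ready for plotting.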

Running slurm script with multiple nodes, launch job steps with 1 task

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-03 12:38:50
Question: I am trying to launch a large number of job steps using a batch script. The different steps can be completely different programs, and each needs exactly one CPU. First I tried doing this using the --multi-prog argument to srun. Unfortunately, when using all CPUs assigned to my job in this manner, performance degrades massively; the run time increases to almost its serialized value. By undersubscribing I could ameliorate this a little. I couldn't find anything online regarding this problem, so I assumed it to be a configuration problem of the cluster I am using.
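A common workaround discussed for this kind of setup is to skip --multi-prog and instead launch each step as `srun --exclusive -N1 -n1 <program> &`, then `wait` for all of them, so Slurm packs the single-task steps onto the allocation (on recent Slurm releases the per-step flag is --exact rather than --exclusive). A sketch of the fan-out pattern, with run_step standing in for the srun call so the shell logic itself is runnable:

```shell
#!/bin/bash
# run_step stands in for:  srun --exclusive -N1 -n1 <program> ...
run_step() { echo "step $1 done"; }

for i in 1 2 3 4; do
    run_step "$i" &     # launch each single-task step in the background
done
wait                    # block until every step has finished
echo "all steps finished"
```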

Expand columns to see full jobname in Slurm

Submitted by 二次信任 on 2019-12-03 11:33:31
Question: Is it possible to expand the number of characters used in the JobName column of the sacct command in SLURM? For example, I currently have:

    JobID        JobName     Elapsed    NCPUS   NTasks  State
    ------------ ----------  ---------- ------- ------- ----------
    12345        lengthy_na+ 00:00:01   4       1       FAILED

and I would like:

    JobID        JobName      Elapsed    NCPUS   NTasks  State
    ------------ -----------  ---------- ------- ------- ----------
    12345        lengthy_name 00:00:01   4       1       FAILED

Answer 1: You should use the format option. With

    sacct --helpformat

you'll see the fields that can be shown. For instance,

    sacct --format="JobID,JobName%30"

will print the JobID along with the JobName in a 30-character-wide column.
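The '+' suffix in lengthy_na+ is sacct's truncation marker: the name is cut to the column width and flagged, and %30 widens the column. The effect can be mimicked with printf field widths (a sketch only; no cluster needed):

```shell
# sacct cuts long names to the default column width and flags them with '+';
# the %30 modifier widens the column. printf mimics the two renderings:
name="lengthy_name"
printf '%.10s+\n' "$name"    # narrow column: prints lengthy_na+
printf '%-30s|\n' "$name"    # JobName%30: full name, padded to 30 characters
```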

SLURM sbatch job array for the same script but with different input arguments run in parallel

Submitted by 怎甘沉沦 on 2019-12-03 08:31:31
Question: I have a problem where I need to launch the same script but with different input arguments. Say I have a script myscript.py -p <par_Val> -i <num_trial>, where I need to consider N different par_values (between x0 and x1) and M trials for each value of par_values. Each of the M trials almost reaches the time limit of the cluster I am working on (and I don't have privileges to change this). So in practice I need to run N x M independent jobs. Because each batch job has the same node/CPU configuration and invokes the same python script, except for changing the input parameters,
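One common approach to this situation (not necessarily the one the original thread settled on) is a single job array in which each task decodes its SLURM_ARRAY_TASK_ID into a (parameter, trial) pair. A sketch with hypothetical N, M, x0, x1 values; the task id is defaulted so the mapping also runs outside Slurm:

```shell
#!/bin/bash
# In a real submission script, for N=4 values and M=5 trials, you would add:
#   #SBATCH --array=0-19
# SLURM_ARRAY_TASK_ID is set by Slurm; defaulted here so the mapping is testable.
TASK_ID=${SLURM_ARRAY_TASK_ID:-7}

N=4              # hypothetical number of parameter values
M=5              # hypothetical trials per value
x0=0.1; x1=0.4   # hypothetical parameter range

p_idx=$(( TASK_ID / M ))   # index of the parameter value
trial=$(( TASK_ID % M ))   # trial number within that value
par=$(echo "$x0 $x1 $p_idx $N" | awk '{ print $1 + ($2 - $1) * $3 / ($4 - 1) }')

echo "python myscript.py -p $par -i $trial"
```

Submitting one array of N*M tasks keeps the queue tidy and lets the scheduler backfill the independent runs.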

How to find from where a job is submitted in SLURM?

Submitted by 梦想与她 on 2019-12-03 07:28:15
Question: I submitted several jobs via SLURM to our school's HPC cluster. Because the shell scripts all have the same name, the job names appear exactly the same. It looks like:

    [myUserName@rclogin06 ~]$ sacct -u myUserName
    JobID        JobName    Partition  Account     AllocCPUS  State      ExitCode
    ------------ ---------- ---------- ----------  ---------- ---------- --------
    12577766     run.sh     general    ourQueue_+  4          RUNNING    0:0
    12659777     run.sh     general    ourQueue_+  8          RUNNING    0:0
    12675983     run.sh     general    ourQueue_+  16         RUNNING    0:0

How can I know from which directory a job was submitted, so that I can differentiate the jobs?
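For jobs still known to the controller, `scontrol show job <jobid>` prints a WorkDir= line with the submit-time working directory. A sketch whose scontrol is stubbed with representative output (the directory shown is hypothetical) so the parsing runs without a cluster:

```shell
# Stub reproducing a fragment of `scontrol show job <jobid>` output; on a real
# cluster delete this function so the actual command runs. The path is made up.
scontrol() {
    printf 'JobId=12577766 JobName=run.sh\n   WorkDir=/home/myUserName/proj_A\n'
}

# Extract just the working directory from the verbose record:
scontrol show job 12577766 | sed -n 's/^ *WorkDir=//p'
```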

Installing/emulating SLURM on an Ubuntu 16.04 desktop: slurmd fails to start

Submitted by 风流意气都作罢 on 2019-12-03 07:07:17
Edit: What I am really looking for is a way to emulate SLURM: something interactive and reasonably user-friendly that I can install.

Original post: I want to test-drive some minimal examples with SLURM, and I am trying to install it all on a local machine running Ubuntu 16.04. I am following the most recent SLURM install guide I could find, and I got as far as starting slurmd with sudo /etc/init.d/slurmd start:

    [....] Starting slurmd (via systemctl): slurmd.service
    Job for slurmd.service failed because the control process exited with error code.
    See "systemctl status slurmd.service" and "journalctl
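Failures at this step frequently come from a slurm.conf whose node definition does not match the actual machine. A minimal single-node sketch (every value below is a placeholder: compare the CPU count against what `slurmd -C` reports, and note that key names vary between Slurm releases, e.g. ControlMachine became SlurmctldHost in later versions):

```
# /etc/slurm-llnl/slurm.conf (Ubuntu package path) -- minimal single-node sketch
ClusterName=localcluster
ControlMachine=localhost
NodeName=localhost CPUs=4 RealMemory=4000 State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
```

After editing the config, restart both daemons and check `systemctl status slurmd.service` and the journal for the precise error.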

Adding time to a running slurm job

Submitted by  ̄綄美尐妖づ on 2019-12-03 02:03:07
Question: I have a job running on a Linux machine managed by SLURM. Now that the job has been running for a few hours, I realize that I underestimated the time required for it to finish, and thus the value of the --time argument I specified is not enough. Is there a way to add time to an existing running job through SLURM?

Answer 1 (Carles Fenoy): Use the scontrol command to modify a job:

    scontrol update jobid=<job_id> TimeLimit=<new_timelimit>

This requires admin privileges on some machines.

Source: https://stackoverflow.com/questions/28413418/adding-time-to-a-running-slurm-job
