one-to-one dependency between two job arrays in SLURM

怎甘沉沦 submitted on 2019-12-04 15:02:00

Since version 16.05, Slurm has offered the option --dependency=aftercorr:job_id[:jobid...], which the documentation describes as follows:

A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully (ran to completion with an exit code of zero).

It does what you need.
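As a minimal sketch of how the two arrays could be submitted, assuming two hypothetical batch scripts first.sh and second.sh and an example index range of 0-9 (none of these names come from the question):

```shell
#!/usr/bin/env bash
# Sketch: chain two job arrays one-to-one with aftercorr.
# first.sh, second.sh and the 0-9 range are illustrative assumptions.

# "sbatch --parsable" prints "jobid" or "jobid;cluster"; keep only the job ID.
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

# Submit both arrays with the same index range, so that task i of the
# second array waits for task i of the first one.
submit_pair() {
    local range=$1 first_script=$2 second_script=$3
    local out first_id
    out=$(sbatch --parsable --array="$range" "$first_script") || return 1
    first_id=$(parse_job_id "$out")
    sbatch --array="$range" --dependency="aftercorr:${first_id}" "$second_script"
}

# Usage, on a cluster where sbatch is available:
#   submit_pair 0-9 first.sh second.sh
```

Both arrays need the same index range for the task-by-task pairing to make sense.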

This option, however, has the drawback you describe: jobs in the second array will keep waiting indefinitely if the corresponding job in the first array crashes. You have several courses of action, none of which is perfect:

  1. if job crashes can be detected from within the submission script, and crashes are random, you can simply requeue the job with scontrol requeue $SLURM_JOB_ID so that it runs again.

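  2. otherwise, you can add, at the end of the jobs in the second array, a piece of Bash code that checks whether any job from the first array is still in the queue and, if not, cancels all remaining jobs in the second array; something like this (untested, and assuming the first array's job name is events): [[ $(squeue --noheader --name events | wc -l) == 0 ]] && scancel $SLURM_ARRAY_JOB_ID (here $SLURM_ARRAY_JOB_ID designates the whole array, whereas $SLURM_JOB_ID would only designate the current task).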

  3. finally, a last option is to use a full-fledged workflow system. See this for a short introduction and pointers.
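Options 1 and 2 above could look roughly like the following untested sketch. The function names, the MAX_RETRIES cap, and the events job name are assumptions made for illustration. It also assumes that the cluster allows jobs to be requeued. SLURM_ARRAY_JOB_ID is used with scancel so that the whole second array is cancelled rather than a single task.

```shell
#!/usr/bin/env bash
# Sketch of options 1 and 2; helper names and MAX_RETRIES are hypothetical.

# Assumed cap on requeues, so that a deterministic crash cannot loop forever.
MAX_RETRIES=${MAX_RETRIES:-3}

# Option 1 (jobs of the first array): run the given command and requeue
# this job if it fails, as long as the retry cap has not been reached.
run_or_requeue() {
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ] && [ "${SLURM_RESTART_COUNT:-0}" -lt "$MAX_RETRIES" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    return "$status"
}

# Option 2 (end of the jobs of the second array): if no job with the given
# name is left in the queue, cancel what remains of the current array.
cancel_if_first_array_gone() {
    local first_name=$1
    if [ -z "$(squeue --noheader --name "$first_name")" ]; then
        scancel "$SLURM_ARRAY_JOB_ID"
    fi
}

# Usage inside the batch scripts (my_program is a placeholder):
#   run_or_requeue ./my_program "$SLURM_ARRAY_TASK_ID"   # first array
#   cancel_if_first_array_gone events                    # second array
```

The retry cap relies on SLURM_RESTART_COUNT, which Slurm sets for jobs that have been restarted or requeued; without such a cap, a job that always crashes would be requeued indefinitely.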
