Slurm: DependencyNeverSatisfied error even after crashed job is re-queued

谁说胖子不能爱 submitted on 2019-12-04 07:38:41

After I re-queue a cancelled job, its dependent job remains in the DependencyNeverSatisfied state, and nothing happens even after the job it depends on has completed. Is there any way to update the dependent job's state when a cancelled job is re-queued?

Yes, it's quite simple. Reset the dependency with scontrol.

scontrol update jobid=[dependent job id] dependency=after:[requeued job id]
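As a minimal sketch, the command above can be wrapped in a small shell function. The function name reset_dependency and the DRY_RUN variable are hypothetical, introduced only for illustration; by default the function just prints the scontrol command so it can be checked before it is actually run.

```shell
# Hypothetical helper: point a dependent job at a requeued job again.
# Usage: reset_dependency <dependent job id> <requeued job id> [dependency type]
# DRY_RUN is an illustrative variable: 1 (default) prints the command, 0 runs it.
reset_dependency() {
    dependent="$1"
    requeued="$2"
    dep_type="${3:-after}"   # same dependency type as in the command above
    if [ -z "$dependent" ] || [ -z "$requeued" ]; then
        echo "usage: reset_dependency <dependent job id> <requeued job id> [type]" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: only show what would be executed.
        echo scontrol update "jobid=${dependent}" "dependency=${dep_type}:${requeued}"
    else
        scontrol update "jobid=${dependent}" "dependency=${dep_type}:${requeued}"
    fi
}
```

For instance, reset_dependency 540913 540912 prints "scontrol update jobid=540913 dependency=after:540912"; setting DRY_RUN=0 would run that command instead of printing it.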

I've done this as an example with Slurm version 17.11:

$ sbatch --begin=now+60 --wrap="exit 1"                   
Submitted batch job 540912

$ sbatch --dependency=afterok:540912 --wrap=hostname 
Submitted batch job 540913

$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        540912     debug     wrap marshall PD       0:00      1 (BeginTime)
        540913     debug     wrap marshall PD       0:00      1 (Dependency)
$ scancel 540912
$ scontrol requeue 540912
$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        540912     debug     wrap marshall PD       0:00      1 (BeginTime)
        540913     debug     wrap marshall PD       0:00      1 (DependencyNeverSatisfied)

At this point, I've replicated your situation. Job 540912 has been requeued, and job 540913 has the reason "DependencyNeverSatisfied".
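On a busy queue, jobs stuck in this state can be picked out with a small filter rather than by eye. This is only a sketch: it assumes squeue's -h (no header) option and the format string "%i %r" (job ID and reason), and the function name never_satisfied_jobs is hypothetical.

```shell
# Hypothetical filter: read "<job id> <reason>" lines on stdin and print the
# IDs of jobs whose pending reason is DependencyNeverSatisfied.
never_satisfied_jobs() {
    awk '$2 == "DependencyNeverSatisfied" { print $1 }'
}

# Intended use on a cluster (assumes squeue is on PATH):
#   squeue -h -u "$USER" -o "%i %r" | never_satisfied_jobs
```

Keeping the filter separate from the squeue call means it can be tried on saved or hand-written input first.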

Now you can fix it with scontrol update:

$ scontrol update jobid=540913 dependency=after:540912
$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        540912     debug     wrap marshall PD       0:00      1 (BeginTime)
        540913     debug     wrap marshall PD       0:00      1 (Dependency)

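The state is fixed! Note that the dependency type is now after rather than the original afterok, so the dependent job is released once job 540912 has started, whether or not it succeeds. Once the job runs, the dependent job also runs: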

$ scontrol update jobid=540912 starttime=now
$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        540912     debug     wrap marshall CG       0:00      1 v1
        540913     debug     wrap marshall PD       0:00      1 (Dependency)
$ squeue 
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

squeue's output is empty because both jobs have already completed.

You can see the jobs after they've completed with sacct:

$ sacct -j 540912,540913
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
540912             wrap      debug       test          2     FAILED      1:0 
540912.batch      batch                  test          2     FAILED      1:0 
540912.exte+     extern                  test          2  COMPLETED      0:0 
540913             wrap      debug       test          2  COMPLETED      0:0 
540913.batch      batch                  test          2  COMPLETED      0:0 
540913.exte+     extern                  test          2  COMPLETED      0:0 
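
For scripting, the same check can be run against sacct's machine-readable output. A minimal sketch, assuming sacct's -n (no header), -P (pipe-delimited) and -o JobID,State options; the function name job_states is hypothetical.

```shell
# Hypothetical filter: read "JobID|State" lines on stdin and print only the
# top-level jobs, skipping job steps such as 540912.batch or 540912.extern.
job_states() {
    awk -F'|' '$1 !~ /\./ { print $1, $2 }'
}

# Intended use on a cluster (assumes sacct is on PATH and accounting is enabled):
#   sacct -n -P -o JobID,State -j 540912,540913 | job_states
```

Each output line is a job ID followed by its state, which makes it easy to confirm that the dependent job actually ran after the dependency was reset.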