Python: cluster jobs management

Submitted by 别来无恙 on 2021-01-28 02:25:12

Question


I am running Python scripts on a computing cluster (Slurm) in two sequential stages. I wrote two Python scripts, one for Stage 1 and another for Stage 2. Every morning I visually check whether all the Stage 1 jobs have completed, and only then do I start Stage 2.

Is there a more elegant/automated way to combine both stages and the job management in a single Python script? How can I tell whether a job has completed?

The workflow is similar to the following:

while not job_list.all_complete():
    for job in job_list:
        if job.empty():
            job.submit_stage1()

        if job.complete_stage1():
            job.submit_stage2()

    sleep(60)
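One way to answer "has the job completed?" in a loop like the one above is to query Slurm's accounting database with `sacct`. The sketch below is an assumption about how you might wire that up (the function names are mine, and it presumes `sacct` is on the path and you recorded the Slurm job IDs at submission time):

```python
import subprocess

def parse_state(sacct_output):
    """Extract the state token from sacct output, e.g. 'COMPLETED' or 'RUNNING'."""
    text = sacct_output.strip()
    if not text:
        return ""
    # States like 'CANCELLED by 1000' carry extra words; keep only the first token.
    return text.splitlines()[0].split()[0]

def job_state(job_id):
    """Ask Slurm accounting for the job's current state.

    -X limits output to the allocation itself (no per-step lines),
    --noheader drops the column header.
    """
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=State", "--noheader", "-X"],
        capture_output=True, text=True, check=True)
    return parse_state(out.stdout)

def is_complete(job_id):
    return job_state(job_id) == "COMPLETED"
```

A `job.complete_stage1()` method could then simply call `is_complete()` on the stored Stage 1 job ID.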

Answer 1:


You have several courses of action:

  • use the Slurm Python API to manage the jobs
  • use job dependencies (search for --dependency in the sbatch man page)
  • have the submission script for stage 1 submit the job for stage 2 when it finishes
  • use a workflow management system such as
    • Fireworks https://materialsproject.github.io/fireworks/
    • Bosco https://osg-bosco.github.io/docs/
    • Slurm pipelines https://github.com/acorg/slurm-pipeline
    • Luigi https://github.com/spotify/luigi
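The job-dependency option above can be driven from a single Python script via `subprocess`: submit Stage 1 with `sbatch --parsable` (which prints just the job ID), then submit Stage 2 with `--dependency=afterok:<jobid>` so Slurm itself holds it until Stage 1 exits successfully. A minimal sketch, assuming batch scripts named `stage1.sh` and `stage2.sh`:

```python
import subprocess

def sbatch_command(script, dependency=None):
    """Build the sbatch command line; --parsable makes sbatch print only the job ID."""
    cmd = ["sbatch", "--parsable"]
    if dependency is not None:
        # afterok: start only if the dependency job finished with exit code 0.
        cmd.append(f"--dependency=afterok:{dependency}")
    cmd.append(script)
    return cmd

def submit(script, dependency=None):
    """Submit a batch script and return its Slurm job ID as a string."""
    out = subprocess.run(sbatch_command(script, dependency),
                         capture_output=True, text=True, check=True)
    # --parsable output is "jobid" or "jobid;clustername".
    return out.stdout.strip().split(";")[0]
```

Usage would look like `stage1 = submit("stage1.sh")` followed by `submit("stage2.sh", dependency=stage1)`; no polling loop is needed because the scheduler enforces the ordering.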



Answer 2:


You haven't given much to go on for determining whether a job is finished, but a common way to solve this problem is to have each job create a sentinel file that you can look for, something like COMPLETE.

To do this you just add something like

# At the end of stage 1, write an empty sentinel file.
# Mode 'x' raises if the file already exists, so a job is
# never marked complete twice. Note the f-string, which
# interpolates job_num into the path.
job_num = 1234
open(f'/shared/file/system/or/server/JOB_{job_num}/COMPLETE', 'x').close()

And then you just poll every once in a while to see if you have a COMPLETE file for all of the jobs before starting stage 2.
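That polling step can be sketched as follows. This is an illustration only: the base path mirrors the placeholder path from the snippet above, and the function names are mine.

```python
import os
import time

def all_complete(job_nums, base="/shared/file/system/or/server"):
    """True once every job has written its COMPLETE sentinel file."""
    return all(
        os.path.exists(os.path.join(base, f"JOB_{n}", "COMPLETE"))
        for n in job_nums
    )

def wait_for_stage1(job_nums, poll_seconds=60):
    """Block until every Stage 1 job has left its sentinel, then return."""
    while not all_complete(job_nums):
        time.sleep(poll_seconds)
    # All sentinels present: it is now safe to submit Stage 2.
```

Because the sentinel files live on a shared filesystem, this check works from the login node without talking to the scheduler at all.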



Source: https://stackoverflow.com/questions/55404236/python-cluster-jobs-management
