How to monitor resources during slurm job?

喜欢而已 提交于 2019-12-03 15:46:35

Slurm offers a plugin to record a profile of a job (PCU usage, memory usage, even disk/net IO for some technologies) into a HDF5 file. The file contains a time series for each measure tracked, and you can choose the time resolution.

You can activate it with

#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>

See the documentation here.

To check that this plugin is installed, run

scontrol show config | grep AcctGatherProfileType

It should output AcctGatherProfileType = acct_gather_profile/hdf5.

As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:

pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt

This will give you CPU and memory usage per process.

As a final note, your job never terminates simply because it will terminate when the while loop terminates, and the while loop will terminate when the job terminates... The condition "$JobStatus" == "COMPLETED" will never be observed from within the script. When the job is completed, the script is killed.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!