PBS/TORQUE: how do I submit a parallel job on multiple nodes?

泄露秘密 submitted on 2019-12-11 03:13:54

Question


Right now I'm submitting jobs on a cluster with qsub, but they always seem to run on a single node. I currently run them with a script like this:

#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test

# $total must be set before this line; a shell assignment takes no spaces around "="
range_0_total=$(seq 0 $(expr $total - 1))

for i in $range_0_total
do
    $PATH_TO_JOB_EXEC/job_executable &
done
wait

I would be incredibly grateful if you could tell me if I'm doing something wrong, or if it's just that my test tasks are too small.


Answer 1:


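With your approach, the for loop needs to iterate over every entry in the file that $PBS_NODEFILE points to, and inside the loop you would run "ssh $i $PATH_TO_JOB_EXEC/job_executable &" so that each copy starts on the node named by that entry. As written, every background process is started by the job script itself, which runs only on the first node of the allocation.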
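The loop described above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes passwordless ssh between the allocated nodes and that the executable is visible at the same path on every node. The function name run_on_all_nodes and the REMOTE_SHELL override are hypothetical names introduced only for illustration. $PBS_NODEFILE and $PATH_TO_JOB_EXEC are the ones from the question.

```shell
#!/bin/bash
# Sketch: start one copy of a command per entry in a PBS node file.
# run_on_all_nodes and REMOTE_SHELL are hypothetical names, not part of PBS/TORQUE.

run_on_all_nodes() {
    local nodefile=$1; shift
    local host
    while read -r host; do
        # Skip blank lines in the node file
        [ -n "$host" ] || continue
        # ssh -n stops ssh from swallowing the rest of the node file on stdin.
        # REMOTE_SHELL can be set to another launcher (assumption, for dry runs).
        ${REMOTE_SHELL:-ssh -n} "$host" "$@" &
    done < "$nodefile"
    # Block until every remote copy has finished
    wait
}

# Only launch when running inside a PBS job, where $PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    run_on_all_nodes "$PBS_NODEFILE" "$PATH_TO_JOB_EXEC/job_executable"
fi
```

Note that the node file normally lists a host once per allocated slot, so a host that appears twice gets two copies. A dry run such as `REMOTE_SHELL=echo run_on_all_nodes "$PBS_NODEFILE" hostname` prints what would be launched without contacting any node.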

The other, easier way to do this would be to replace the for loop and wait with:

pbsdsh $PATH_TO_JOB_EXEC/job_executable

This runs a copy of your program on each core assigned to your job. If you need to change this behavior, take a look at the options described in the pbsdsh man page.
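Put together with the directives from the question, a job script built on pbsdsh might look like the sketch below. The launch_copies wrapper, its "slot"/"node" modes, and the PBSDSH override are hypothetical helpers added for illustration. The -u flag (run once per unique hostname in $PBS_NODEFILE) is taken from the TORQUE pbsdsh man page, so check the man page on your cluster before relying on it.

```shell
#!/bin/bash
#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test

# Sketch only. launch_copies and PBSDSH are hypothetical names, not PBS/TORQUE features.
launch_copies() {
    # $1 = "slot": one copy per allocated execution slot (pbsdsh default)
    # $1 = "node": one copy per unique host (assumes TORQUE's pbsdsh -u)
    local mode=$1; shift
    case $mode in
        slot) ${PBSDSH:-pbsdsh} "$@" ;;
        node) ${PBSDSH:-pbsdsh} -u "$@" ;;
        *)    echo "unknown mode: $mode" >&2; return 2 ;;
    esac
}

# Only launch when running inside a PBS job, where $PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    launch_copies slot "$PATH_TO_JOB_EXEC/job_executable"
fi
```

pbsdsh returns once all the copies it spawned have exited, so no explicit wait is needed here.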



Source: https://stackoverflow.com/questions/30881147/pbs-torque-how-do-i-submit-a-parallel-job-on-multiple-nodes
