Ensuring One Job Per Node on StarCluster / SunGridEngine (SGE)


Question


When submitting jobs with qsub on a StarCluster / SGE cluster, is there an easy way to ensure that each node receives at most one job at a time? I am having issues where multiple jobs end up on the same node, leading to out-of-memory (OOM) errors.

I tried using -l cpu=8, but I believe that checks only the total number of cores on the box, not the number of cores already in use.

I also tried -l slots=8, but then I get:

Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly.
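
For reference, that error points at SGE's parallel environment mechanism. A hedged sketch of what it is asking for, assuming a parallel environment named orte exists on the cluster (StarCluster's SGE plugin creates one by that name by default) and a hypothetical job script myjob.sh on an 8-core node:

qsub -pe orte 8 myjob.sh

Note that whether this actually confines the job to a single node depends on the PE's allocation_rule: $pe_slots forces all requested slots onto one host, while $round_robin spreads them across nodes.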

Answer 1:


In your config file (.starcluster/config), add this section:

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
slots_per_host = 1
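
If the cluster is already running, a hedged example of applying the plugin without rebuilding the cluster, assuming the plugin section is named sge as above and the cluster is named mycluster (a hypothetical name):

starcluster runplugin sge mycluster

Otherwise, the setting takes effect the next time the cluster is started.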



Answer 2:


This largely depends on how the cluster's resources are configured (memory limits, etc.). However, one thing to try is to request a large amount of memory for each job:

-l h_vmem=xxG

This has the side effect of excluding other jobs from running on a node, since most of that node's memory has already been requested by a previously scheduled job.

Just make sure the memory you request does not exceed the allowable limit for the node. You can check whether a request exceeds this limit by looking for errors in the output of qstat -j <jobid>.
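
As a concrete, hedged example, assume nodes with 16 GB of RAM and a job script named myjob.sh (both hypothetical); requesting most of a node's memory leaves too little for a second job to be scheduled alongside it:

qsub -l h_vmem=14G myjob.sh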




Answer 3:


I accomplished this by setting the number of slots on each of my nodes to 1 using:

qconf -aattr queue slots "[nodeXXX=1]" all.q
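
To apply the same setting to every node without listing each hostname by hand, a hedged sketch using qconf -sel, which prints the configured execution hosts; all.q is StarCluster's default queue name:

# set a per-host override of slots=1 in all.q for each execution host
for host in $(qconf -sel); do
    qconf -aattr queue slots "[$host=1]" all.q
done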



Source: https://stackoverflow.com/questions/25672896/ensuring-one-job-per-node-on-starcluster-sungridengine-sge
