torque

PBS/TORQUE: how do I submit a parallel job on multiple nodes?

Submitted by 泄露秘密 on 2019-12-11 03:13:54
Question: Right now I'm submitting jobs on a cluster with qsub, but they always seem to run on a single node. I currently run them like this:

#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test

range_0_total=$(seq 0 $(expr $total - 1))
for i in $range_0_total
do
    $PATH_TO_JOB_EXEC/job_executable &
done
wait

I would be incredibly grateful if you could tell me whether I'm doing something wrong, or whether my test tasks are just too small.

Answer 1: With your approach, you need to have your…
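The excerpt cuts off, but the usual diagnosis is that every command launched with & runs on the first allocated node; to fan the work out across the whole allocation you need a launcher. A hedged sketch of the same job script using pbsdsh (the executable path is the placeholder from the question, not a real path):

```shell
#PBS -l walltime=10
#PBS -l nodes=4:gpus=2
#PBS -r n
#PBS -N test

# pbsdsh runs one copy of the command on every execution slot
# assigned to the job, instead of forking all copies on node 1.
pbsdsh $PATH_TO_JOB_EXEC/job_executable
```

Alternatively, an MPI-aware executable started with mpirun will also be distributed across the nodes listed in $PBS_NODEFILE.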

Setup torque/moab cluster to use multiple cores per node with a single loop

Submitted by 允我心安 on 2019-12-10 10:46:47
Question: This is a follow-up to "How to set up doSNOW and SOCK cluster with Torque/MOAB scheduler?". I have a memory-limited script that uses only one foreach loop, but I'd like to get 2 iterations running on node1 and 2 iterations running on node2. The linked question lets you start a SOCK cluster with a worker on each node for the outer loop and then an MC cluster for the inner loop, but I don't think that makes use of the multiple cores on each node. I get the warning message: Warning message: closing unused…
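For the layout described (2 workers on each of 2 nodes), the Torque side of the request would look roughly like the sketch below; the R cluster construction itself is in the linked question, so this only shows the resource specification, as an assumption about the poster's setup:

```shell
# Request 2 nodes with 2 processors per node.
#PBS -l nodes=2:ppn=2

# $PBS_NODEFILE then lists each hostname once per granted core
# (node1, node1, node2, node2), which a SOCK cluster constructor
# can consume to place two workers on each node.
cat $PBS_NODEFILE
```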

How to wait for a torque job array to complete

Submitted by 六眼飞鱼酱① on 2019-12-08 03:12:23
Question: I have a script that splits a data structure into chunks. The chunks are processed using a Torque job array and then merged back into a single structure. The merge operation depends on the job array completing. How do I make the merge operation wait for the Torque job array to complete?

$ qsub --version
Version: 4.1.6

My script is as follows:

# Splits the data structure and processes the chunks
qsub -t 1-100 -l nodes=1:ppn=40,walltime=48:00:00,vmem=120G ./job.sh
# Merges the processed…
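One approach (the excerpt is truncated before any answer) is Torque's job-dependency mechanism: capture the array's job id and make the merge job depend on it. afterokarray is the array-aware form of afterok; exact syntax varies between Torque versions, so treat this as a sketch:

```shell
# Submit the array and capture its id (printed by qsub, e.g. "1234[].server")
ARRAY_ID=$(qsub -t 1-100 -l nodes=1:ppn=40,walltime=48:00:00,vmem=120G ./job.sh)

# The merge job is held until every array task has exited successfully.
# merge.sh is a placeholder name for the merge step.
qsub -W depend=afterokarray:$ARRAY_ID ./merge.sh
```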

can torque pbs output error messages to file in real time

Submitted by 假装没事ソ on 2019-12-08 02:17:14
Question: The errors and results are written into the *.err (PBS -e) and *.out (PBS -o) files only after the Torque PBS jobs have finished. Can Torque PBS write ERROR messages to *.err in real time while jobs are running? Can it write OUTPUT messages to *.out in real time while jobs are running? How do I configure pbs_server, or whatever else is needed? Thanks.

Answer 1: The way to do this is to set $spool_as_final_name true in the config file for the moms. This is located in /mom_priv/config. This is documented here.
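The setting the answer names is a one-line pbs_mom configuration entry. A sketch of the fragment, assuming a typical install layout (the exact mom home directory varies by site):

```shell
# mom_priv/config on each compute node:
# write stdout/stderr directly to the final .out/.err files
# instead of spooling them until the job ends.
$spool_as_final_name true
```

pbs_mom must be restarted on each node after the change for it to take effect.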

Exclude certain nodes when submitting jobs with qsub / torque?

Submitted by ∥☆過路亽.° on 2019-12-07 01:02:35
Question: When submitting batch jobs with qsub, is there a way to exclude a certain node (by hostname)? Something like:

# this is just a pseudo command:
qsub myscript.sh --exclude computer01

Answer 1: Depending on how many nodes you would like available, there are a couple of options. You could name the specific nodes that are acceptable:

qsub -l nodes=n006+n007

To exclude, say, one node out of a group, I would ask the administrator to assign a dummy property to all nodes but the one you want excluded…
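The property-based exclusion the answer starts to describe would look roughly like this; the property name "goodnode" and the node names are made up for illustration:

```shell
# As the administrator: tag every node except the one to exclude.
qmgr -c "set node n001 properties += goodnode"
qmgr -c "set node n002 properties += goodnode"
# ...repeat for each node other than computer01

# Users then request only tagged nodes, so computer01 is never chosen:
qsub -l nodes=1:goodnode myscript.sh
```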

Loading shared library in open-mpi/ mpi-run

Submitted by 心已入冬 on 2019-12-04 07:32:54
Question: I'm trying to run my program under the Torque scheduler using mpirun. Although my PBS file loads the library with export LD_LIBRARY_PATH=/path/to/library, I still get the error: "error while loading shared libraries: libarmadillo.so.3: cannot open shared object file: No such file or directory". I guess the problem is that LD_LIBRARY_PATH is not set on all the nodes. How can I make it work?

Answer: LD_LIBRARY_PATH is not exported automatically to MPI processes spawned by mpirun. You should use mpirun -x LD_LIBRARY_PATH ... to push the value of LD_LIBRARY_PATH. Also make sure that the specified path…
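A minimal invocation following the answer's advice; the library path and process count are placeholders, and -x is Open MPI's flag for exporting an environment variable to every spawned rank:

```shell
# Make the library visible locally, then push the variable to all ranks.
export LD_LIBRARY_PATH=/path/to/library:$LD_LIBRARY_PATH
mpirun -x LD_LIBRARY_PATH -np 4 ./my_program
```

The path must also exist (e.g. via a shared filesystem) on every node the job lands on; exporting the variable does not copy the library itself.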

How fast can one submit consecutive and independent jobs with qsub?

Submitted by *爱你&永不变心* on 2019-12-03 08:36:21
This question is related to "pbs job no output when busy", i.e. some of the jobs I submit produce no output when PBS/Torque is busy. I imagine it is busier when many jobs are submitted one after another, and as it happens, jobs submitted in this fashion often produce no output. Here is some code. Suppose I have a Python script called "x_analyse.py" that takes as input a file containing some data and analyses the data stored in the file:

./x_analyse.py data_1.pkl

Now, suppose I need to: (1) prepare N such data files: data_1.pkl, data_2.pkl, …
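A hedged sketch of the submission pattern being described: generate one small job script per data file, then submit them in a loop. The script and data-file names come from the question; the per-submission pause is an assumed mitigation for the "server too busy" symptom, not something the post confirms.

```shell
# Generate N job scripts, one per data file, then submit each.
N=5
mkdir -p jobs
for i in $(seq 1 "$N"); do
    cat > "jobs/job_$i.sh" <<EOF
#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=00:10:00
#PBS -N analyse_$i
cd \$PBS_O_WORKDIR
./x_analyse.py data_$i.pkl
EOF
    # qsub "jobs/job_$i.sh" && sleep 1   # on a real cluster: submit, then pause
done
ls jobs | wc -l
```

Throttling submissions (the sleep) gives pbs_server time to spool each job, which is one common workaround when rapid-fire qsub calls lose output.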

How to submit a job to a specific node in PBS

Submitted by 无人久伴 on 2019-12-03 07:03:30
Question: How do I send a job to a specific node in PBS/Torque? I think you must specify the node name after nodes:

#PBS -l nodes=abc

However, this doesn't seem to work and I'm not sure why. This question was asked here in "PBS and specify nodes to use". Here is my sample code:

#!/bin/bash
#PBS nodes=node9,ppn=1,

hostname
date
echo "This is a script"
sleep 20 # run for a while so I can look at the details
date

Also, how do I check which node the job is running on? I saw somewhere that $PBS_NODEFILE shows…
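For reference, the directive in the sample is malformed: it is missing the -l flag, and Torque separates a hostname from its ppn count with a colon, not a comma. A corrected sketch, with the in-job check the question asks about:

```shell
#!/bin/bash
# Pin the job to host node9 with one processor.
#PBS -l nodes=node9:ppn=1

# $PBS_NODEFILE lists the host(s) actually assigned to this job.
cat $PBS_NODEFILE
hostname
```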

How can I increase OpenFabrics memory limit for Torque jobs?

Submitted by 心不动则不痛 on 2019-12-02 07:20:27
Question: When I run an MPI job over InfiniBand, I get the following warning. We use the Torque manager.

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the…
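The excerpt ends before any answer, but this warning typically points to a low locked-memory (memlock) limit on the compute nodes. A common remedy, sketched here as an assumption about the environment rather than a confirmed fix, is to raise the limit and restart pbs_mom so jobs inherit it:

```shell
# /etc/security/limits.conf on each compute node:
* soft memlock unlimited
* hard memlock unlimited
```

Because Torque jobs inherit their limits from pbs_mom rather than from a login shell, pbs_mom itself must be restarted from an environment with the raised limit (or its init script adjusted to run ulimit -l unlimited first) for the change to reach MPI jobs.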