slurm

How to determine at which point in a Python script the step memory limit was exceeded in SLURM

Submitted by 强颜欢笑 on 2019-12-12 23:15:43

Question: I have a Python script that I am running on a SLURM cluster for multiple input files:

    #!/bin/bash
    #SBATCH -p standard
    #SBATCH -A overall
    #SBATCH --time=12:00:00
    #SBATCH --output=normalize_%A.out
    #SBATCH --error=normalize_%A.err
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=20
    #SBATCH --mem=240000

    HDF5_DIR=...
    OUTPUT_DIR=...
    NORM_SCRIPT=...

    norm_func () {
        local file=$1
        echo "$file"
        python $NORM_SCRIPT -data $file -path $OUTPUT_DIR
    }

    # Doing normalization in parallel
    for file in …
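One way to pinpoint which input pushed a step over its memory limit is to record peak memory per python call and to poll Slurm's own accounting; a minimal sketch, assuming GNU time is available as /usr/bin/time and reusing norm_func and the variables from the script above:

    # Hypothetical variant of norm_func: GNU time's -v report includes
    # "Maximum resident set size", so the log shows the peak RSS reached
    # while processing each individual file.
    norm_func () {
        local file=$1
        echo "$file"
        /usr/bin/time -v python "$NORM_SCRIPT" -data "$file" -path "$OUTPUT_DIR" \
            2>> "mem_${SLURM_JOB_ID}.log"
    }

    # While the job runs, step-level usage can be polled from a login node:
    #   sstat --format=JobID,MaxRSS,MaxVMSize -j <jobid>
    # and inspected afterwards with:
    #   sacct -j <jobid> --format=JobID,MaxRSS,State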

How does one make sure that a Python submission script in Slurm runs in the location from which the sbatch command was given?

Submitted by 可紊 on 2019-12-12 18:45:09

Question: I have a Python submission script that I run with sbatch under Slurm:

    sbatch batch.py

When I do this, things do not work properly because, I assume, the batch.py process does not inherit the right environment variables. Thus, instead of running batch.py from where the sbatch command was issued, it is run from somewhere else (/, I believe). I have managed to fix this by wrapping the Python script with a bash script:

    #!/usr/bin/env bash
    cd path/to/scripts
    python script.py

This temporary hack …
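Two common ways to pin the working directory without the wrapper, as a sketch (assuming a reasonably recent Slurm; older releases spell --chdir as --workdir):

    #!/usr/bin/env bash
    #SBATCH --chdir=/path/to/scripts   # start the job in an explicit directory

    # ...or rely on the variable Slurm exports with the submission directory:
    cd "$SLURM_SUBMIT_DIR"
    python script.py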

Is it possible to force SLURM to have access only to the job's running folder and not alter any other files?

Submitted by 折月煮酒 on 2019-12-12 10:00:28

Question: I observe that when I run a SLURM job, it can create files under other folder paths and also remove them. It seems dangerous that, via a SLURM job, one can access other folders/files and make changes to them.

    $ sbatch run.sh

run.sh:

    #!/bin/bash
    #SBATCH -o slurm.out    # STDOUT
    #SBATCH -e slurm.err    # STDERR
    echo hello > /home/avatar/completed.txt
    rm /home/avatar/completed.txt

[Q] Is it possible to force SLURM to only have access to its own running folder and not others?

Answer 1: File access is …
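A quick way to see what actually governs this, sketched below (nothing Slurm-specific is assumed beyond sbatch itself): the batch script runs under the submitting user's own UID, so ordinary filesystem permissions decide what it may touch.

    #!/bin/bash
    #SBATCH -o check.out
    id                           # same UID/GID as the user who ran sbatch
    touch /root/forbidden.txt    # fails with "Permission denied" -- regular
                                 # Unix permissions still apply inside the job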

slurm: jobs are pending even though resources are available

Submitted by 浪子不回头ぞ on 2019-12-12 04:35:31

Question: We have recently started to work with SLURM. We operate a cluster with a number of nodes with 4 GPUs each, and some nodes with only CPUs. We would like to start jobs using GPUs with higher priority. Therefore we have two partitions, albeit with overlapping node lists: the partition with GPUs, called 'batch', has a higher 'PriorityTier' value, while the partition without GPUs is called 'cpubatch'. The main reason for this construction is that we want to use the idle CPUs on the nodes with …
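When a job stays pending although resources look free, the scheduler records a reason that is worth reading before changing the configuration; a small sketch (the job id is a placeholder):

    squeue -t PENDING -o "%.10i %.9P %.20j %.6D %R"   # last column: pending reason or nodelist
    scontrol show job <jobid> | grep -i reason        # e.g. Priority, Resources, ...
    scontrol show partition                           # compare PriorityTier, Nodes, limits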

What is this `job_desc_msg_t` format that I need to submit jobs to SLURM via the Perl API?

Submitted by 孤街醉人 on 2019-12-11 19:23:50

Question: The Perl API for SLURM indicates that submitting a job through the API requires giving it a "job description" ($job_desc or $job_desc_msg), which has the structure job_desc_msg_t, but it doesn't say what job_desc_msg_t is. I found it in slurm.h starting at line 1160, so I'm guessing that I will need to pass in a hash with a similar structure. I plan to play with it and post an answer later today or tomorrow, once I've had a chance to try it out.

Answer 1: That's exactly what you must do …
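To see which keys such a hash can carry, the struct can be read straight out of the header; a sketch (the header path is an assumption and depends on how Slurm was installed, and the line number is the one the asker mentions):

    # Locate the definition (it ends with "} job_desc_msg_t;") ...
    grep -n 'job_desc_msg_t;' /usr/include/slurm/slurm.h
    # ... and read the members listed above it; they are the candidate hash keys.
    less +1160 /usr/include/slurm/slurm.h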

How to pass variables in an sbatch script for multiple job submissions

Submitted by 故事扮演 on 2019-12-11 17:24:08

Question: I need to submit multiple SLURM jobs at a time, but all of them need some common variables which I would like to be able to pass from the command line. What I have in mind is a command line input which looks like

    bash MasterScript.sh -variable1 var1 -variable2 var2

where MasterScript.sh would be

    sbatch JobSubmitter.sh -variable1in var1 -variable2in var2 -version 1
    sbatch JobSubmitter.sh -variable1in var1 -variable2in var2 -version 2
    sbatch JobSubmitter.sh …
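Both of the usual mechanisms fit that layout; a sketch (script and variable names are the hypothetical ones from the question, not anything Slurm defines):

    # Arguments placed after the batch script name reach it as $1, $2, ...
    # MasterScript.sh (simplified; real flag parsing would use getopts):
    var1=$2
    var2=$4
    for version in 1 2 3; do
        sbatch JobSubmitter.sh "$var1" "$var2" "$version"
    done

    # Alternatively, hand the values over as environment variables:
    #   sbatch --export=ALL,VARIABLE1="$var1",VARIABLE2="$var2",VERSION=1 JobSubmitter.sh
    # and read $VARIABLE1, $VARIABLE2, $VERSION inside JobSubmitter.sh.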

Why is there a sudden spike in memory usage when using multiprocessing.Process and shared memory?

Submitted by 蓝咒 on 2019-12-11 15:25:43

Question: I am running a Python (python3) script that starts many processes (e.g. 20-30 of them) at the same time through multiprocessing.Process, using the fork start method rather than spawn. I make sure all of these processes finish (.join()) and don't become zombies. However, even though I run the same code with the same random seed, my job crashes due to a huge spike in memory usage at completely random times (memory usage goes up to a random value between 30 GB and 200 GB from the requested 14 GB all of …
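One way to line the spike up with what the script was doing at that moment is a coarse memory sampler running alongside the workload; a minimal sketch (the python invocation and log name are placeholders):

    # Every 10 s, log the largest resident-set sizes among this user's
    # processes, so the spike's timestamp can be matched to the script's output.
    (
      while true; do
        echo "=== $(date '+%F %T') ==="
        ps -o pid,ppid,rss,comm -u "$USER" --sort=-rss | head -n 15
        sleep 10
      done
    ) >> "mem_trace_${SLURM_JOB_ID}.log" &
    SAMPLER_PID=$!

    python my_script.py        # placeholder for the actual workload

    kill "$SAMPLER_PID"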

Setting SGE for running an executable with different input files on different nodes

Submitted by 北慕城南 on 2019-12-11 09:33:08

Question: I used to work with a cluster using the SLURM scheduler, but now I am more or less forced to switch to an SGE-based cluster, and I'm trying to get the hang of it. What I was doing on the SLURM system involved running an executable on N input files, with a SLURM configuration file set up in this fashion:

slurmConf.conf (SLURM configuration file)

    0 /path/to/exec /path/to/input1
    1 /path/to/exec /path/to/input2
    2 /path/to/exec /path/to/input3
    3 /path/to/exec /path/to/input4
    4 /path/to/exec /path/to …
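The usual SGE counterpart of that per-rank configuration file is an array job in which each task picks its own input via SGE_TASK_ID; a sketch (file and job names are illustrative):

    #!/bin/bash
    #$ -N run_inputs
    #$ -t 1-4          # one array task per input file; SGE numbers tasks from 1
    #$ -cwd

    # Each task selects the input that matches its index.
    INPUT=/path/to/input${SGE_TASK_ID}
    /path/to/exec "$INPUT"

    # Submit with:  qsub job.sh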

SLURM: When we reboot the node, does jobID assignment start from 0?

Submitted by 喜你入骨 on 2019-12-11 04:53:15

Question: For example, sacct --start=1990-01-01 -A user returns a job table whose latest jobID is 136, but when I submit a new job with sbatch -A user -N1 run.sh, the submitted batch job gets ID 100, which is smaller than 136. And sacct -L -A user seems to return a list which ends with 100. So it looks as if newly submitted batch jobs overwrite previous jobs' information, which I don't want. [Q] When we reboot the node, does jobID assignment start from 0? If yes, what should I do to continue from the latest jobID …
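Where the counter comes from can be checked directly; a sketch (the grep pattern simply lists the relevant slurm.conf parameter names):

    # The controller keeps its job-id counter in StateSaveLocation; if that
    # state is lost or reset at reboot, numbering restarts at FirstJobId.
    scontrol show config | grep -Ei 'FirstJobId|MaxJobId|StateSaveLocation'

    # To resume above the last id already used, an administrator can set, e.g.,
    #   FirstJobId=137
    # in slurm.conf and restart slurmctld.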

Matlab: -maxNumCompThreads, hyperthreading, and parpool

Submitted by [亡魂溺海] on 2019-12-11 02:45:39

Question: I'm running Matlab R2014a on a node in a Linux cluster that has 20 cores and hyperthreading enabled. I know this has been discussed before, but I'm looking for some clarification. Here is my understanding of the threads-vs-cores issue in Matlab: Matlab has inherent multithreading capabilities and will utilize extra cores on a multicore machine. Matlab runs its threads in such a way that putting multiple Matlab threads on the same core (i.e. hyperthreading) isn't useful. So by default …
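A sketch of how the pieces are often combined in a batch script (the .m entry point name is a placeholder; whether the allocated cores should go to Matlab's built-in multithreading or to a parpool depends on the workload):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=20    # one CPU per physical core; hyperthreads are ignored

    # Give the allocated cores to Matlab's built-in multithreading ...
    matlab -nodisplay -r "maxNumCompThreads(str2double(getenv('SLURM_CPUS_PER_TASK'))); my_analysis; exit"

    # ... or keep the main process single-threaded and use the cores for parpool workers:
    #   matlab -nodisplay -singleCompThread -r "parpool(20); my_analysis; exit"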