How to determine at which point in python script step memory exceeded in SLURM

Submitted by 强颜欢笑 on 2019-12-12 23:15:43

Question


I have a Python script that I run on a SLURM cluster for multiple input files:

#!/bin/bash

#SBATCH -p standard
#SBATCH -A overall 
#SBATCH --time=12:00:00
#SBATCH --output=normalize_%A.out
#SBATCH --error=normalize_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=240000

HDF5_DIR=...
OUTPUT_DIR=...
NORM_SCRIPT=...

norm_func () {
  local file=$1
  echo "$file"
  python $NORM_SCRIPT -data $file -path $OUTPUT_DIR
}

# Doing normalization in parallel
for file in $HDF5_DIR/*; do norm_func "$file" & done
wait

The Python script loads a dataset (scRNA-seq), normalizes it, and saves the result as a .csv file. Some of its key lines are:

import csv

import h5py
import numpy as np

f = h5py.File(path_to_file, 'r')
# (reading rawcounts, split_code, cell_ids and gene_symbols from f is elided)
rawcounts = np.array(rawcounts)

unique_code = np.unique(split_code)
for code in unique_code:
    mask = np.equal(split_code, code)
    curr_counts = rawcounts[:, mask]

    # Actual TMM normalization (gmn is the script's own normalization module)
    mtx_norm = gmn.tmm_normalization(curr_counts)

    # Writing the results into a .csv file
    csv_path = path_to_save + "/" + file_name + "_" + str(code) + ".csv"
    with open(csv_path, 'w', encoding='utf8') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        # Header row: empty corner cell, then one column per cell id
        writer.writerow([""] + list(cell_ids))
        for idx, row in enumerate(mtx_norm):
            writer.writerow([gene_symbols[idx]] + list(row))

I keep getting a "step memory exceeded" error for datasets above 10 GB and I am not sure why. How can I change my .slurm script or Python code to reduce memory usage? And how can I identify what actually causes the memory problem, i.e. is there a particular way of debugging memory in this case? Any suggestions would be greatly appreciated.


Answer 1:


You can get more fine-grained information by using srun to launch the Python script:

srun python $NORM_SCRIPT -data $file -path $OUTPUT_DIR

Slurm will then create one 'step' per instance of your Python script and record information (errors, return codes, memory used, etc.) for each step independently in the accounting, which you can query with the sacct command.
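For example, once the job has finished, a query along these lines (the job ID 12345 is hypothetical; the columns are standard Slurm accounting fields) shows the peak memory of each step:

sacct -j 12345 --format=JobID,JobName,State,ExitCode,MaxRSS,MaxVMSize

The MaxRSS column then identifies which step, and therefore which input file, exceeded its memory allocation.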

If profiling has been configured by the administrators, you can use the --profile option to get a timeline of the memory usage of each step.
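As a sketch, assuming the administrators have enabled the acct_gather_profile/hdf5 plugin, you would request task-level sampling in the batch script and merge the resulting per-node profile files with sh5util afterwards (again with a hypothetical job ID):

# in the batch script: sample task-level statistics, including memory
#SBATCH --profile=task

# after the job completes: merge the per-node profile files into one HDF5 file
sh5util -j 12345 -o profile_12345.h5

The merged HDF5 file contains the sampled memory usage of every task over time, which you can inspect to see when and where memory grows.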

In your Python script you can use the memory_profiler module to get line-by-line feedback on the memory usage of your code.
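A minimal sketch of that approach, assuming memory_profiler is installed (pip install memory_profiler) and the normalization code is wrapped in a function, here called normalize() for illustration:

from memory_profiler import profile

@profile
def normalize(path_to_file, path_to_save):
    # the h5py loading, TMM normalization and csv writing shown in the question
    ...

Running the script as usual then prints, for each line of normalize, the total memory and the per-line increment, which pinpoints exactly where the allocation blows up.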



Source: https://stackoverflow.com/questions/52229942/how-to-determine-at-which-point-in-python-script-step-memory-exceeded-in-slurm
