Slurm: Use cores from multiple nodes for R parallelization

Posted by 别等时光非礼了梦想 on 2019-12-24 00:52:46

Question


I want to parallelize an R script on an HPC system with a Slurm scheduler.

SLURM is configured with SelectType: CR_Core_Memory.

Each compute node has 16 cores (32 threads).

I pass the R script to Slurm with the following template, using clustermq as the interface to Slurm.

#!/bin/sh
#SBATCH --job-name={{ job_name }}
#SBATCH --partition=normal
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }}
#SBATCH --mem-per-cpu={{ memory | 2048 }}
#SBATCH --cpus-per-task={{ n_cpus }}
#SBATCH --array=1-{{ n_jobs }}
#SBATCH --ntasks={{ n_tasks }}
#SBATCH --nodes={{ n_nodes }}

#ulimit -v $(( 1024 * {{ memory | 4096 }} ))
R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
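
For context, clustermq fills the {{ }} placeholders in this template when it submits workers. A minimal sketch of how such a template is typically registered on the R side follows; the file path here is an assumption, not taken from the question.

# Tell clustermq to use Slurm and point it at the template file above.
# The path is hypothetical -- use wherever the template is actually saved.
options(
    clustermq.scheduler = "slurm",
    clustermq.template  = "~/slurm_clustermq.tmpl"
)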

Within the R script I do "multicore" parallelization with 30 cores. I would like to use cores from multiple nodes to satisfy the requirement of 30 CPUs, e.g. 16 cores from node1 and 14 from node2.
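
The exact R code is not shown in the question; a minimal sketch of what such single-process multicore parallelization typically looks like (the function and data here are placeholders) is:

library(parallel)

# Placeholder workload; the real task is not shown in the question.
heavy_task <- function(i) {
    Sys.sleep(1)
    i^2
}

# Fork-based parallelism inside ONE R process: all 30 workers must live on the
# same node, which is the root of the problem described below.
results <- mclapply(1:100, heavy_task, mc.cores = 30)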

I tried using n_tasks = 2 and cpus-per-task=16. With this, the job gets assigned to two nodes. However, only one node does any computation (on 16 cores); the second node is assigned to the job but does nothing.

In this question, srun is used together with foreach and Slurm IDs to split parallelism across nodes. I use neither srun nor foreach. Is there a way to achieve what I want with SBATCH and multicore parallelism?

(I know that I could use SelectType=CR_CPU_Memory and have 32 threads available per node. However, the question is how to use cores/threads from multiple nodes in general, in order to scale up parallelism.)


Answer 1:


Summary from my comments:

The answer is that you cannot do this, because your task uses all of its CPUs from within a single R process. You are asking a single R process to parallelize a task across more CPUs than one physical machine has. A single R process cannot be split across multiple nodes: the nodes do not share memory, so you cannot combine CPUs from different nodes, at least not with a typical cluster architecture. It would only be possible with a distributed operating system such as DCOS.

In your case, the solution is to split your job up outside of those R processes: run 2 (or 3, or 4) separate R processes, each on its own node, and restrict each R process to the number of CPUs a single machine has.
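
A hedged sketch of that approach using clustermq itself: let Q() start one worker per node (each worker is its own R process, launched through the SBATCH template above), and cap the within-worker parallelism at what a single node actually has. The chunking scheme and the 16-core cap below are illustrative assumptions, not code from the question.

library(clustermq)

# Work done inside ONE worker process: parallelize only up to the core count
# of the node that worker landed on (16 physical cores in this setup).
process_chunk <- function(chunk, n_cores = 16) {
    parallel::mclapply(chunk, function(i) i^2, mc.cores = n_cores)
}

# Split the input into as many chunks as workers, then run 2 separate R
# processes (one per node) instead of one 30-core process.
chunks <- split(1:100, rep(1:2, length.out = 100))
results <- Q(process_chunk, chunk = chunks,
             n_jobs = 2,                     # two array jobs -> two nodes
             template = list(n_cpus = 16))   # fills {{ n_cpus }} in the template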



Source: https://stackoverflow.com/questions/54905099/slurm-use-cores-from-multiple-nodes-for-r-parallelization
