In general, R scripts can be run just like any other kind of program on an HPC (high-performance computing) system. However, there are a few peculiarities. This document compiles some helpful practices; it should be useful to people who are familiar with R but new to HPC, and vice versa.

Some of these instructions are specific to Compute Canada ca. 2022, and particularly to the Graham cluster. There is also useful information at the Digital Research Alliance of Canada wiki (also available in French).
I assume that you’re slightly familiar with HPC machinery (i.e. you’ve taken the Compute Canada orientation session and know how to use `sbatch`/`squeue`/etc. to work with the batch scheduler).

Below, “batch mode” means running R code from an R script rather than starting R and typing commands at the prompt (i.e. “interactive mode”); “on a worker” means running a batch-mode script via the SLURM scheduler (i.e. using `sbatch`) rather than in a terminal session on the head node. Commands to be run within R will use an `R>` prompt; those to be run in the shell will use `sh>`.
Given an R script stored in a `.R` file, there are a few ways to run it in batch mode:

- `r`: an improved batch-R version by Dirk Eddelbuettel. You can install it by installing the `littler` package from CRAN (see “Installing Packages” below) and running

  ```sh
  mkdir ~/bin
  cd ~/bin
  ln -s ~/R/x86_64-pc-linux-gnu-library/4.1/littler/bin/r
  ```

  in the shell. (You may need to adjust the path name for your R version.)
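  Once the symlink is in place (and assuming `~/bin` is on your `PATH`, which may require a fresh login), you can run a script, here a hypothetical `myscript.R`, with:

  ```sh
  sh> r myscript.R
  ```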
- `Rscript <filename>`: by default, output will be printed to the standard output (which will end up in your `.log` file). One thing to watch out for with `Rscript` is that it does not load the `methods` package by default, which may occasionally surprise you; if your script directly or indirectly uses stuff from `methods`, you need to load it explicitly with `library("methods")`.
- `R CMD BATCH <filename>`: this is similar, but automatically sends output to `<filename>.out`.
This StackOverflow question says that `r` > `Rscript` > `R CMD BATCH` (according to the author of `r`…).
R is often missing from the set of programs that is available by default on HPC systems. Most HPC systems use the `module` command to make different programs, and different versions of those programs, available for your use. If you try to run `R` in interactive mode on Graham, the system will pop up a long list of possible modules to load and ask you which one you want. The first choice (currently `r/4.1.2`) is the default, and generally the best option. Alternatively, you can run `module load r/4.1.2` at the shell prompt, or add `module load r/4.1.2` to your batch script (see the sketch below).
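Here is a minimal sketch of a batch script that loads the R module and runs a script on a worker; `myscript.R` and `myscript.sh` are hypothetical file names, and depending on your cluster setup you may also need options such as `--account`:

```sh
#!/bin/bash
#SBATCH --time=00:30:00        # walltime (HH:MM:SS)
#SBATCH --ntasks=1             # one single-threaded R process
#SBATCH --mem=4G               # total memory for the job
module load r/4.1.2            # make R available on the worker
Rscript myscript.R             # run the script in batch mode
```

Submit it with `sh> sbatch myscript.sh`.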
To tell R where to download packages from, set `options(repos = c(CRAN = "https://cloud.r-project.org"))` (this is a safe default value). The first time you install packages, R will ask whether to create a personal library directory in your home directory (something like `~/R/x86_64-pc-linux-gnu-library/<R-version>`). (If you are in batch mode you’ll get an error.) You can install packages by running `install.packages("<pkg>")` in an interactive R session, or by running an R script that defines and installs a long list of packages, e.g.

```r
pkgs <- c("broom", "mvtnorm", <...>)
install.packages(pkgs)
```

It’s generally OK to run short (<10 minutes) jobs like this interactively, on the head node. If you have a package tarball such as `mypkg_0.1.5.tar.gz`, use `install.packages("mypkg_0.1.5.tar.gz", repos = NULL)` from within R, or `R CMD INSTALL mypkg_0.1.5.tar.gz` from the shell, to install it.
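Since setting the repository by hand in every session is tedious, one option (a sketch, not part of the original instructions) is to put the `options()` call in your `~/.Rprofile`, which R reads at startup; this also avoids the interactive mirror prompt in batch mode:

```r
## contents of ~/.Rprofile: set a default CRAN mirror at startup
options(repos = c(CRAN = "https://cloud.r-project.org"))
```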
Job arrays are limited by the SLURM configuration parameter `MaxArraySize`. This means that even if you use steps in your array indices, e.g. `--array=0-20000:10000`, where the number of jobs is only 3, the job array will not run, because the maximum index (20000) is larger than `MaxArraySize`-1. Run something like `scontrol show config | grep -E 'MaxArraySize|MaxJobCount'` in the shell to determine the SLURM configuration (see SLURM job array support).
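For example, a small job array might look like the following sketch (`batch.R` is the script from the next section; times and memory are placeholders):

```sh
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=2G
#SBATCH --array=1-100          # maximum index must stay below MaxArraySize
module load r/4.1.2
## pass the array index to the R script as a command-line argument
Rscript batch.R "$SLURM_ARRAY_TASK_ID" hello
```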
You can use `commandArgs()` to read command-line arguments from within an R script. For example, if `batch.R` contains:

```r
cc <- commandArgs(trailingOnly = TRUE)  # drop R's own arguments
intarg <- as.integer(cc[1])             # first argument, converted to integer
chararg <- cc[2]                        # second argument, kept as character
cat(sprintf("int arg = %d, char arg = %s\n", intarg, chararg))
```
then running `Rscript batch.R 1234 hello` will produce

```
int arg = 1234, char arg = hello
```
(Note that all command-line arguments are passed as character strings, and must be converted to numeric as necessary.) If you want fancier argument processing than base R provides (e.g. default argument values, named rather than positional arguments), see this Stack Overflow question for some options; a rough base-R sketch follows.
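As one illustration of what “fancier” processing might look like in base R (a sketch only; `parse_args()` is a hypothetical helper, not a standard function), named `key=value` arguments with defaults could be handled like this:

```r
## parse key=value command-line arguments, falling back to defaults
parse_args <- function(defaults) {
  for (arg in commandArgs(trailingOnly = TRUE)) {
    kv <- strsplit(arg, "=", fixed = TRUE)[[1]]
    if (length(kv) == 2 && kv[1] %in% names(defaults)) {
      defaults[[kv[1]]] <- kv[2]
    }
  }
  defaults
}

args <- parse_args(list(n = "10", label = "none"))
cat(sprintf("n = %s, label = %s\n", args$n, args$label))
```

Saved as (say) `args.R`, running `Rscript args.R n=25` would print `n = 25, label = none`.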
While Compute Canada clusters are generally meant to be used in batch mode, it is sometimes convenient to do some development/debugging in short (<3 hour) interactive sessions. There is also an `rstudio-server-...` module for running RStudio interactively.

There is useful information about parallel computing in R at the Alliance wiki, although (1) some of its advice is idiosyncratic (e.g. rather than parallelizing within R over a `for` loop, they prefer job arrays), and (2) much of the document focuses on general performance tips for high-performance computing in R (using vectorization, packages for out-of-memory computation, etc.) that are not specific to running on HPC clusters.

Parallelization can happen at several levels. Implicit, low-level parallelization means multithreading: either multithreaded linear algebra (BLAS) or shared-memory parallelism (the `OpenMP` system) within C++ code (e.g. `glmmTMB`). (The `--cpus-per-task` SLURM argument is what matters for multithreaded jobs; it can be left at its default of 1 for single-threaded computing.)
`OpenMP` threading is usually controlled by setting a shell environment variable (`export OMP_NUM_THREADS=1`), but there may be controls within an R package as well. BLAS threading can be controlled with `RhpcBLASctl::blas_set_num_threads()` (see here).
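Putting these together, here is a sketch of how you might pin implicit multithreading to one thread per process from within R, so that it doesn’t oversubscribe the cores SLURM gave you (this should run before any threaded code):

```r
## equivalent to `export OMP_NUM_THREADS=1` in the shell, if set early enough
Sys.setenv(OMP_NUM_THREADS = "1")
## explicit controls from the RhpcBLASctl package
RhpcBLASctl::blas_set_num_threads(1)   # limit BLAS (linear algebra) threads
RhpcBLASctl::omp_set_num_threads(1)    # limit OpenMP threads
```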
Explicit parallelization typically uses the `Rmpi` package (see below) or the `parallel` package, often via higher-level interfaces: `foreach`, `doParallel`, `future`, `furrr`, …
One simple approach is to pick a number of worker processes `N`, define a virtual cluster with `N` cores within R (e.g. `parallel::makeCluster(N)`), and set `--ntasks=N` in your submission script (and let the scheduler pick the number of CPUs, nodes, etc.). Determining the number of nodes/cores/processes to request via SLURM will depend on which R package is used for parallelization. The `foreach` package supports both multi-core (multiple cores on a single node/computer) and multiprocessing (multiple processes within a single node, or across multiple nodes in a cluster) parallelization. There is an example of how to run both using `foreach`, including how to ensure that R and SLURM are communicating via the shell script, at https://docs.alliancecan.ca/wiki/R#Exploiting_parallelism_in_R
When running on an HPC system, you should not let the R package you are using detect and try to use the number of available cores; you should instead always specify the number to use explicitly (e.g. from SLURM’s environment variables, as in the sketch below).
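Here is a minimal sketch of that idea, assuming the job was submitted with `--ntasks=N`; `SLURM_NTASKS` is set by SLURM in the job’s environment:

```r
## size the cluster from the SLURM allocation rather than from
## parallel::detectCores(), which sees every core on the node
nworkers <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "1"))
cl <- parallel::makeCluster(nworkers)
res <- parallel::parLapply(cl, 1:100, function(i) i^2)
parallel::stopCluster(cl)
```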
When setting SLURM `#SBATCH` arguments, here are some helpful notes:

- A *task* in SLURM is a process; a process uses one CPU core if it is single-threaded.
- How tasks are allocated across cores and nodes can be specified using the arguments `--nodes`, `--ntasks`, and `--ntasks-per-node` (`--cpus-per-task` is specific to multi-threading). Some helpful task allocation examples: https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmFAQ.html#q05-how-do-i-create-a-parallel-environment
- The task allocation you choose will affect job scheduling. Requesting multiple tasks without specifying the number of nodes (if you don’t require all tasks to be on the same node) puts fewer constraints on the system. Requesting a full node (`--nodes=1 --ntasks-per-node=32` on the Graham cluster) has a scheduling advantage, but can be seen as abuse if it is not actually required: https://docs.alliancecan.ca/wiki/Job_scheduling_policies#Whole_nodes_versus_cores
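To make the last point concrete, here is a sketch of the two styles of request (times, memory, and the hypothetical `parallel_script.R` are placeholders):

```sh
#!/bin/bash
## flexible request: 16 single-threaded tasks, placed anywhere in the cluster
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00
## whole-node alternative on Graham (schedules faster, but only use it
## if you really need the full node):
##   #SBATCH --nodes=1
##   #SBATCH --ntasks-per-node=32
module load r/4.1.2
Rscript parallel_script.R
```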