In general, R scripts can be run just like any other kind of program on an HPC (high-performance computing) system. However, there are a few peculiarities. This document compiles some helpful practices; it should be useful to people who are familiar with R but new to HPC, and vice versa.

Some of these instructions will be specific to Compute Canada ca. 2022, and particularly to the Graham cluster.

There is also useful information at the Digital Research Alliance of Canada wiki (also available in French).

I assume that you’re slightly familiar with HPC machinery (i.e. you’ve taken the Compute Canada orientation session and know how to use sbatch/squeue/etc. to work with the batch scheduler).

Below, “batch mode” means running R code from an R script rather than starting R and typing commands at the prompt (i.e. “interactive mode”); “on a worker” means running a batch-mode script via the SLURM scheduler (i.e. using sbatch) rather than in a terminal session on the head node. Commands to be run within R will use an R> prompt; those to be run in the shell will use sh>.

running scripts in batch mode

Given an R script stored in a .R file, there are a few ways to run it in batch mode:
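The main options are Rscript and R CMD BATCH (the script name myscript.R is a placeholder):

```shell
# print output to the terminal (or to a redirected file)
Rscript myscript.R

# write output to myscript.Rout instead of the terminal
R CMD BATCH --no-save myscript.R

# old-school: feed the script to R on standard input
R --no-save < myscript.R > myscript.out
```

Rscript is usually the most convenient choice for SLURM jobs, since its output goes to the job’s standard output file.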

loading modules

R is often missing from the set of programs that is available by default on HPC systems. Most HPC systems use the module command to make different programs, and different versions of those programs, available for your use.
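For example (the version number is illustrative; use module spider to see what your cluster actually provides):

```shell
sh> module spider r        # list available R versions
sh> module load r/4.2.1    # load a specific version
sh> which Rscript          # confirm R is now on your PATH
```

Module loads are per-session, so any sbatch script that runs R needs its own module load line.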

installing packages
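Packages generally have to be installed from a login node, since compute nodes have no internet access; they go into a personal library in your home directory, which workers can see. A minimal sketch (the package name and repository are illustrative):

```r
## run in an interactive R session on a login node,
## after loading the R module; installs into your
## personal library (under ~/R/) by default
install.packages("glmmTMB", repos = "https://cloud.r-project.org")
```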

running R jobs via job array
SLURM job arrays run many copies of the same script, distinguished by an index; the easiest way to get that index into R is via command-line arguments, which can be read with commandArgs(). For example, given a script batch.R containing:

cc <- commandArgs(trailingOnly = TRUE)  # only the user-supplied arguments
intarg <- as.integer(cc[1])             # arguments arrive as character strings
chararg <- cc[2]
cat(sprintf("int arg = %d, char arg = %s\n", intarg, chararg))

then running Rscript batch.R 1234 hello will produce

int arg = 1234, char arg = hello

(note that all command-line arguments are passed as character strings and must be converted to numeric as necessary). If you want fancier argument processing than base R provides (e.g. default argument values, named rather than positional arguments), see this Stack Overflow question for some options.
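Within a job array, SLURM sets the environment variable SLURM_ARRAY_TASK_ID for each array element, which can be passed to the script as its integer argument. A sketch of a submission script (resource requests and module version are illustrative):

```shell
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --array=1-10          # run 10 copies, with task IDs 1..10

module load r/4.2.1           # version is illustrative
Rscript batch.R $SLURM_ARRAY_TASK_ID hello
```

Submit with sh> sbatch array.sh; each array element gets its own output file.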

general performance tips

interactive sessions

While Compute Canada clusters are generally meant to be used in batch mode, it is sometimes convenient to do some development/debugging in short (<3 hour) interactive sessions.
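One way to get such a session is salloc, which gives you a shell on a worker node (the account name is a placeholder for your own allocation; resource values are illustrative):

```shell
# request a 3-hour interactive session on a compute node
sh> salloc --time=3:00:00 --ntasks=1 --mem-per-cpu=4G --account=def-someuser

# once the session starts, on the worker node:
sh> module load r/4.2.1    # version is illustrative
sh> R                      # interactive R, running on the worker
```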

in RStudio (or Jupyter notebook)

  • log in with your Compute Canada username/password
  • click on the ‘softwares’ icon (left margin), load the rstudio-server-... module
  • an RStudio icon will appear – click it!
  • this session does not have internet access, but it does see all of the files in your user space (including packages that you have installed locally)
  • You can run Jupyter notebooks, etc. too (I don’t know if there is a way to run a Jupyter notebook with a Python kernel …)
  • it might make sense to ‘reserve’ your session in advance (so you don’t have to wait a few minutes for it to start up), not yet sure how to do that …

questions for SHARCnet folks

levels of parallelization

parallelization and SLURM

Determining the number of nodes/cores/processes to request from SLURM will depend on which R package is used for parallelization. The foreach package supports both multicore parallelization (multiple cores on a single node/computer) and multiprocessing (multiple worker processes within a single node or across multiple nodes in a cluster). In either case, the number of workers used by R should come from what SLURM allocated, so that R and SLURM agree; the usual way to ensure this is to read SLURM’s environment variables rather than hard-coding a number.
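A minimal single-node sketch using foreach with the doParallel backend, taking the worker count from SLURM’s environment (SLURM_CPUS_PER_TASK is standard SLURM; the computation is a placeholder):

```r
library(foreach)
library(doParallel)

## use the cores SLURM allocated to this job, falling back to 1
## (e.g. submitted with #SBATCH --cpus-per-task=8)
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

cl <- makeCluster(ncores)   # PSOCK cluster on this node
registerDoParallel(cl)

## placeholder computation: square roots in parallel
res <- foreach(i = 1:100, .combine = c) %dopar% sqrt(i)

stopCluster(cl)
```

Spanning multiple nodes requires a different backend (e.g. a PSOCK cluster built from the SLURM node list, or doMPI) and a matching --ntasks request.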

When using HPC, you should not let the R package you are using detect and try to use the number of available cores (e.g. via parallel::detectCores(), which reports the cores physically present on the node, not the cores allocated to your job); you should instead always specify the number to use.

When setting SLURM #SBATCH arguments, here are some helpful notes:

  • A task in SLURM is a process; a process uses one CPU core if it is single-threaded.
  • How tasks are allocated across cores and nodes can be specified using the arguments --nodes, --ntasks, and --ntasks-per-node (--cpus-per-task is specific to multi-threading).
  • The task allocation you choose will affect job scheduling. Requesting multiple tasks without specifying the number of nodes (if you don’t require all tasks to be on the same node) puts fewer constraints on the scheduler. Requesting a full node (--nodes=1 --ntasks-per-node=32) on the Graham cluster has a scheduling advantage, but can be seen as abuse if you don’t actually need a whole node.
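Some task-allocation sketches (the resource values are illustrative, not recommendations; each stanza is a separate alternative):

```shell
## 16 single-threaded tasks, placed anywhere on the cluster
#SBATCH --ntasks=16

## 16 single-threaded tasks, all on one node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16

## one multi-threaded task (e.g. doParallel on a single node) using 8 cores
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
```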