https://github.com/bencardoen/singularity_slurm_cuda
Example on how to get started with Singularity and CUDA on a SLURM cluster
README.md
A quick example on how to get up and running with singularity on a cluster with CUDA
Note: if you copy and paste these examples, at a minimum verify that you know what they do. They are listed only as examples, without any warranty; you should know if and how they apply to your use case and cluster.
See slides.md for a slide deck, and a PDF version made with HackMD/Reveal.js.
Required
- HPC cluster account
- You know your account/group info
- You've configured ssh key access
- Basic Linux CLI interaction
You do not need Singularity on your own machine, though for more advanced use cases you probably will want to.
If you do not have Linux to work with Singularity on your home machine, try a VM using VirtualBox or similar software, or WSL2.
Walkthrough
Login to the cluster
```bash
ssh you@cluster.country
```
Get the image
We'll use a TensorFlow image from NVidia. We'll assume for now that there is a temporary directory on a fast local disk at $SLURM_TMPDIR. This may not be the case on your cluster, so adjust as needed. If you don't set the Singularity cache and temporary directory variables, singularity will write to $HOME, which you never want.
```bash
module load singularity
if [[ "$SLURM_TMPDIR" ]]; then export STMP=$SLURM_TMPDIR; else export STMP="/scratch/$USER"; fi
```
This ensures that on a compute node you use its fast local storage; otherwise you fall back to scratch space.
```bash
mkdir -p $STMP/singularity/{cache,tmp}
export SINGULARITY_TMPDIR="$STMP/singularity/tmp"
export SINGULARITY_CACHEDIR="$STMP/singularity/cache"
cd $SINGULARITY_TMPDIR
```
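Because a misconfigured location silently fills `$HOME`, it can help to check the chosen directory before pulling. The following is a minimal sketch, not part of the original repository: the helper names `pick_stmp` and `is_under_home` are hypothetical, and `/scratch/$USER` is the same fallback assumed above.

```bash
#!/usr/bin/env bash
# Sketch: choose a staging directory and check it is not inside $HOME.
# pick_stmp and is_under_home are illustrative helper names (assumptions).

# Print the node-local temp dir when inside a SLURM job, else scratch space.
pick_stmp() {
    if [[ -n "${SLURM_TMPDIR:-}" ]]; then
        echo "$SLURM_TMPDIR"
    else
        echo "/scratch/${USER:-$(id -un)}"
    fi
}

# Succeed (exit 0) if path $1 is $HOME itself or lies below it.
is_under_home() {
    local path="${1%/}" home="${HOME%/}"
    [[ "$path" == "$home" || "$path" == "$home"/* ]]
}

STMP="$(pick_stmp)"
if is_under_home "$STMP"; then
    echo "WARNING: $STMP is inside \$HOME; pick another location" >&2
else
    echo "Singularity cache and tmp will live under $STMP/singularity"
fi
```

If the warning appears, point `STMP` somewhere else before exporting the two `SINGULARITY_*` variables.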
Now pull (~ download) the image. This is a docker image, so Singularity will convert it on the fly.
```bash
singularity pull tensorflow-19.11-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:19.11-tf1-py3
```
The pull can take around 20 minutes, depending on network, disk, and so on.
Pull is too slow ...
In that case, run the pull command locally, and copy the resulting image to the cluster.
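A minimal sketch of that workflow follows. The pull and transfer lines are left as comments because the host and target path are placeholders. The runnable part shows one way to confirm with `sha256sum` that a copy is identical to the original, demonstrated on a dummy file; `same_checksum` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# On your own machine (placeholders, adjust before use):
#   singularity pull tensorflow-19.11-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:19.11-tf1-py3
#   rsync -avP tensorflow-19.11-tf1-py3.sif you@cluster.country:/scratch/you/

# Succeed if both files have the same SHA-256 digest.
same_checksum() {
    local a b
    a="$(sha256sum < "$1")" || return 1
    b="$(sha256sum < "$2")" || return 1
    [[ "$a" == "$b" ]]
}

# Demonstration on a dummy file standing in for the image.
workdir="$(mktemp -d)"
head -c 1024 /dev/urandom > "$workdir/image.sif"
cp "$workdir/image.sif" "$workdir/copy.sif"

if same_checksum "$workdir/image.sif" "$workdir/copy.sif"; then
    echo "copy verified"
else
    echo "checksum mismatch, transfer again" >&2
fi
```

For a real transfer, run `sha256sum` on the image on both machines and compare the two digests by eye or in a script.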
Store the image where compute nodes can access it
For example:
```
cp tensorflow-19.11-tf1-py3.sif /scratch/$USER
# or
cp tensorflow-19.11-tf1-py3.sif /project/$USER
```
Filesystems on clusters usually specialize in one of two distinct use cases: fast and temporary, or slow and permanent. Your cluster documentation will tell you which is which.
Get an interactive node
```bash
salloc --time=3:0:0 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --account=<YOURGROUP> --gres=gpu:1
```
After getting the node
```bash
# Make sure environment is clean
module purge
module load singularity
module load cuda

if [[ "$SLURM_TMPDIR" ]]; then export STMP=$SLURM_TMPDIR; else export STMP="/scratch/$USER"; fi
mkdir -p $STMP/singularity/{cache,tmp}
export SINGULARITY_TMPDIR="$STMP/singularity/tmp"
export SINGULARITY_CACHEDIR="$STMP/singularity/cache"
cd $SINGULARITY_TMPDIR

cp /scratch/$USER/tensorflow-19.11-tf1-py3.sif . # Change if needed

singularity shell --nv tensorflow-19.11-tf1-py3.sif
# Now you can execute code inside the container
Singularity> python
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
```
This should print a lot of information on the CUDA version, GPU type, etc., and evaluate to True.
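The same check can also be run without an interactive shell, which is convenient in scripts. This is a sketch, assuming the image sits in the current directory; `check_gpu` is a hypothetical wrapper, and it only reports a problem when `singularity` is not on the `PATH`.

```bash
#!/usr/bin/env bash
# Sketch: run the TensorFlow GPU check in one command via `singularity exec`.
# check_gpu is an illustrative name, not something the image provides.

check_gpu() {
    local image="${1:-tensorflow-19.11-tf1-py3.sif}"
    if ! command -v singularity >/dev/null 2>&1; then
        echo "singularity not found; did you run 'module load singularity'?" >&2
        return 127
    fi
    # --nv exposes the host NVIDIA driver and GPUs inside the container.
    singularity exec --nv "$image" \
        python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'
}

# Report the outcome without aborting the calling script.
if check_gpu tensorflow-19.11-tf1-py3.sif; then
    echo "GPU check ran"
else
    echo "GPU check could not run (exit code $?)" >&2
fi
```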
SBATCH mode
Check singularitysbatch.sh as an example. Make sure you modify the account, email, and image location entries.
sbatch singularitysbatch.sh
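For orientation, a batch script for this workflow might look like the sketch below. It is not the script shipped in the repository: the resource values mirror the `salloc` call above, and the account, email, image path, and `$HOME/train.py` are placeholders to replace. The snippet writes the script to a temporary file and syntax-checks it with `bash -n`, so typos surface before submission.

```bash
#!/usr/bin/env bash
# Sketch: generate a SLURM batch script and check its syntax before submitting.
# Values in angle brackets and train.py are placeholders (assumptions).

job_script="$(mktemp)"

cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --account=<YOURGROUP>
#SBATCH --time=3:0:0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G
#SBATCH --gres=gpu:1
#SBATCH --mail-user=<YOUR_EMAIL>
#SBATCH --mail-type=END,FAIL

module purge
module load singularity
module load cuda

# Stage the image on node-local storage, falling back to scratch.
STMP="${SLURM_TMPDIR:-/scratch/$USER}"
mkdir -p "$STMP/singularity"/{cache,tmp}
export SINGULARITY_TMPDIR="$STMP/singularity/tmp"
export SINGULARITY_CACHEDIR="$STMP/singularity/cache"
cd "$SINGULARITY_TMPDIR"
cp "/scratch/$USER/tensorflow-19.11-tf1-py3.sif" .

# Run a (hypothetical) training script inside the container with GPU access.
singularity exec --nv tensorflow-19.11-tf1-py3.sif python "$HOME/train.py"
EOF

# bash -n parses the script without executing anything.
if bash -n "$job_script"; then
    echo "syntax OK: submit with 'sbatch $job_script'"
fi
```

The `#SBATCH` lines are ordinary comments to bash, which is why the syntax check works on any machine, even one without SLURM installed.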
Notes
Creating your own images
You can create your own images in 2 x 2 ways:
- local vs remote
- definition file or stateful
Local v remote
For most non-trivial images you will need sudo rights on the machine where you build the Singularity image. If you do not have them on your current machine, fear not, you have these options:
- Sylabs.io Remote Builder
- Azure
- AWS
- Run a VM in Virtualbox
- On Windows, use WSL2, VM, ...
- Integrate with a pipeline using automated testing, e.g. CircleCI
When in doubt, go with the first option: all you need is your definition file. The builder will even check its syntax, which won't be the case if you build it yourself.
Building an image shouldn't take longer than ~ 30 minutes, well within the free tier of cloud providers.
Definition v stateful
A definition file is a pristine, human-readable recipe: someone who wants to know what the image contains or how it is built only needs to read that file. Sometimes you may need to 'edit' the image, that is, convert it to a writable folder, open a shell, modify it, and rebuild. In 99.99% of cases, however, a definition file is the way to go. Editing an image is an option when a change isn't working through the definition file: you figure out interactively which commands are needed, and if they work, you add them to the definition file and rebuild the image. The Singularity docs detail precisely how to achieve either case.
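The 'edit' route can be sketched as the three steps below. The directory and image names are hypothetical, and the commands are only printed unless `DRY_RUN=0` is set, since building usually needs root and a working Singularity install.

```bash
#!/usr/bin/env bash
# Sketch of the stateful workflow: unpack, modify interactively, repack.
# sandbox/, myimage.sif and edited.sif are placeholder names (assumptions).

DRY_RUN="${DRY_RUN:-1}"

# Print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [[ "$DRY_RUN" == "0" ]]; then
        "$@"
    fi
}

# 1. Convert the image into a writable directory tree.
run sudo singularity build --sandbox sandbox/ myimage.sif
# 2. Open a shell in it and try out changes by hand.
run sudo singularity shell --writable sandbox/
# 3. Repack the modified tree into a new read-only image.
run sudo singularity build edited.sif sandbox/
```

Once the commands tried in step 2 work, copy them into the `%post` section of the definition file and rebuild from that, so the recipe stays the single source of truth.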
Recipe
Create this file, e.g. recipe.def
```toml
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:21.12-py3

%post
    echo "Hi" # Add post install instructions you need to customize

%labels
    Version v0.0.1

%help
    This is a demo container used to illustrate a def file.
```
Build it:
```bash
singularity build myimage.sif recipe.def
```
Accessing data
singularity shell --nv -B <somedir>:<mountpoint> tensorflow-19.11-tf1-py3.sif
Now `<somedir>` on the host is available inside the container at `<mountpoint>`.
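As a concrete sketch, suppose a dataset under `/project/$USER/data` (a made-up path) should appear as `/data` in the container. The hypothetical helper `bind_arg` checks that the host directory exists before building the `-B` argument, and the script only prints the resulting command rather than running it.

```bash
#!/usr/bin/env bash
# Sketch: build a -B argument after checking the host directory exists.
# bind_arg and both paths are illustrative assumptions.

bind_arg() {
    local host="$1" mount="$2"
    if [[ ! -d "$host" ]]; then
        echo "host directory not found: $host" >&2
        return 1
    fi
    echo "$host:$mount"
}

data_dir="${DATA_DIR:-/project/${USER:-$(id -un)}/data}"   # placeholder location
if spec="$(bind_arg "$data_dir" /data)"; then
    echo "singularity shell --nv -B $spec tensorflow-19.11-tf1-py3.sif"
fi
```

Inside the container, files from the host directory are then read and written under `/data`.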
Extra resources
Compute Canada Wiki on Singularity
But I want PyTorch
singularity pull image.sif docker://nvcr.io/nvidia/pytorch:21.12-py3
More tags at [NVidia NVCR](https://catalog.ngc.nvidia.com/containers)
Owner
- Name: Ben Cardoen
- Login: bencardoen
- Kind: user
- Location: Vancouver
- Company: https://github.com/sfu-mial
- Twitter: BenCardoen
- Repositories: 29
- Profile: https://github.com/bencardoen
PhD Student Computing Science @sfu-mial Simon Fraser University