https://github.com/bencardoen/singularity_slurm_cuda

Example on how to get started with Singularity and CUDA on a SLURM cluster


Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords

cuda nvidia singularity-container slurm-cluster tensorflow
Last synced: 5 months ago

Repository

Example on how to get started with Singularity and CUDA on a SLURM cluster

Basic Info
  • Host: GitHub
  • Owner: bencardoen
  • License: agpl-3.0
  • Language: Shell
  • Default Branch: main
  • Homepage:
  • Size: 214 KB
Statistics
  • Stars: 6
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
cuda nvidia singularity-container slurm-cluster tensorflow
Created about 4 years ago · Last pushed almost 3 years ago
Metadata Files
  • Readme
  • License

README.md

A quick example on how to get up and running with singularity on a cluster with CUDA

Note: if you copy-paste these examples, at a minimum verify that you know what they do. They are listed only as examples, without any warranty; you should know if and how they apply to your use case and cluster.

See slides.md for a slide deck, and a PDF version made with HackMD/Reveal.js.

Required

  • HPC cluster account
    • You know your account/group info
    • You've configured ssh key access
  • Basic Linux CLI interaction

You do not need Singularity on your own machine, though for more advanced use cases you probably will want it.

If you do not have Linux to work with Singularity on your home machine, try a VM using VirtualBox or similar software, or WSL2.
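If you do go the VM or WSL2 route, a quick way to confirm a working local installation before building anything is a sketch like the following (the version string and the test image are just illustrative):

```bash
# Confirm Singularity is installed and on your PATH
singularity --version

# Run a tiny public container as a sanity check
singularity exec docker://alpine:latest cat /etc/os-release
```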

Walkthrough

Login to the cluster

```bash
ssh you@cluster.country
```

Get the image

We'll use a tensorflow image from NVidia. We'll assume for now there's a temporary directory on a fast local disk at $SLURM_TMPDIR. This may not be the case, so please adjust to your setting. If you don't set these variables, singularity will write to $HOME, which you never want.

```bash
module load singularity
if [[ "$SLURM_TMPDIR" ]]; then export STMP=$SLURM_TMPDIR; else export STMP="/scratch/$USER"; fi
```

This ensures that, if you're on a compute node, you use its fast storage; if not, you use scratch space.

```bash
mkdir -p $STMP/singularity/{cache,tmp}
export SINGULARITY_TMPDIR="$STMP/singularity/tmp"
export SINGULARITY_CACHEDIR="$STMP/singularity/cache"
cd $SINGULARITY_TMPDIR
```

Now pull (~ download) the image. This is a Docker image, so Singularity will convert it on the fly.

```bash
singularity pull tensorflow-19.11-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:19.11-tf1-py3
```

The pull can take ~20 minutes or more depending on network, disk, ... .
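Once the pull completes, it can help to confirm the image is intact before copying it around; a minimal check using Singularity's `inspect` and `exec` subcommands (exact output depends on your Singularity version):

```bash
# Show the image's metadata (labels, build date) to confirm it converted cleanly
singularity inspect tensorflow-19.11-tf1-py3.sif

# Run a trivial command inside the container as a smoke test
singularity exec tensorflow-19.11-tf1-py3.sif python --version
```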

Pull is too slow ...

In that case, run the pull command locally, and copy the resulting image to the cluster.
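A minimal sketch of that workflow, assuming you have Singularity installed locally and that `/scratch` on the cluster is a valid destination (adjust paths and hostnames to your site):

```bash
# On your own machine: pull and convert the image locally
singularity pull tensorflow-19.11-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:19.11-tf1-py3

# Copy the resulting .sif to the cluster; spell out your cluster username in the path,
# since $USER would expand to your *local* username here
rsync --progress tensorflow-19.11-tf1-py3.sif you@cluster.country:/scratch/you/
```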

Store the image where compute nodes can access it

For example:

```bash
cp tensorflow-19.11-tf1-py3.sif /scratch/$USER
# or
cp tensorflow-19.11-tf1-py3.sif /project/$USER
```

Filesystems on clusters are usually specialized for two orthogonal use cases: fast and temporary, versus slow and permanent. Your cluster documentation will tell you which is which.

Get an interactive node

```bash
salloc --time=3:0:0 --ntasks=1 --cpus-per-task=4 --mem-per-cpu=4G --account=<YOURGROUP> --gres=gpu:1
```

After getting the node:

```bash
# Make sure environment is clean
module purge

module load singularity
module load cuda

if [[ "$SLURM_TMPDIR" ]]; then export STMP=$SLURM_TMPDIR; else export STMP="/scratch/$USER"; fi
mkdir -p $STMP/singularity/{cache,tmp}
export SINGULARITY_TMPDIR="$STMP/singularity/tmp"
export SINGULARITY_CACHEDIR="$STMP/singularity/cache"
cd $SINGULARITY_TMPDIR

cp /scratch/$USER/tensorflow-19.11-tf1-py3.sif . # Change if needed

singularity shell --nv tensorflow-19.11-tf1-py3.sif
```

Now you can execute code inside the container:

```
Singularity> python
>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
```

This should print a lot of info on CUDA version, GPU type, etc., and evaluate to True.

SBATCH mode

Check singularity_sbatch.sh as an example. Make sure you modify the account, email, and image location entries.

```bash
sbatch singularity_sbatch.sh
```
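The script in the repository is the authoritative version; as an illustration, a batch script along these lines should work (the account, email, resource requests, and image path below are placeholders you must replace):

```bash
#!/bin/bash
#SBATCH --account=<YOURGROUP>          # placeholder: your allocation/group
#SBATCH --time=3:0:0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G
#SBATCH --gres=gpu:1
#SBATCH --mail-user=<you@example.com>  # placeholder: your email
#SBATCH --mail-type=END,FAIL

module purge
module load singularity
module load cuda

IMAGE=/scratch/$USER/tensorflow-19.11-tf1-py3.sif  # placeholder: your image location

# Run a non-interactive command inside the container with GPU support
singularity exec --nv $IMAGE python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
```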

Notes

Creating your own images

You can create your own images in 2 x 2 ways:
  • local vs remote
  • definition file or stateful

Local v remote

For most non-trivial images you will need sudo rights on the machine where you run the Singularity build. If you do not have that on your current machine, fear not: you have these options:

When in doubt, go with the first option: all you need is your definition file, and the builder will even do syntax checking, which won't be the case if you build yourself.

Building an image shouldn't take longer than ~ 30 minutes, well within the free tier of cloud providers.
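For the remote-build route, a minimal sketch using Singularity's remote builder (this assumes a free Sylabs cloud account and an access token; check your installed version's docs for the exact flags):

```bash
# Authenticate against the Sylabs cloud (paste the token when prompted)
singularity remote login

# Build the image on the remote builder from your local definition file,
# then download the resulting .sif
singularity build --remote myimage.sif recipe.def
```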

Definition v stateful

A definition file is a pristine recipe that is interpretable: someone who wants to know what the image contains or how it is built only needs to read that file. Sometimes you may need to 'edit' the image, that is, convert the image to a writable folder, open a shell, modify it, and rebuild. In 99.99% of all cases, however, a definition file is the way to go. Editing an image is an option if you want to figure out how to improve it in a way that isn't working via the definition file; in other words, you figure out interactively which commands are needed, then rebuild the image. If it works, add your commands to the definition file. The Singularity docs detail precisely how to achieve either case.
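For reference, the interactive ('stateful') route looks roughly like this; `myimage.sif` and the sandbox directory name are placeholders, and the Singularity documentation is the authority on the exact flags:

```bash
# Unpack the image into a writable sandbox directory
singularity build --sandbox myimage_sandbox/ myimage.sif

# Open a writable shell, experiment, and note which commands you needed
sudo singularity shell --writable myimage_sandbox/

# Once those commands are in your definition file, rebuild the pristine image
sudo singularity build myimage.sif recipe.def
```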

Recipe

Create this file, e.g. recipe.def

```toml
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:21.12-py3

%post
    echo "Hi" # Add post install instructions you need to customize

%labels
    Version v0.0.1

%help
    This is a demo container used to illustrate a def file.
```

Build it:

```bash
singularity build myimage.sif recipe.def
```

Accessing data

```bash
singularity shell --nv -B <somedir>:<mountpoint> tensorflow-19.11-tf1-py3.sif
```

Now `<somedir>` will appear inside the container as `<mountpoint>`.
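For instance, to make your scratch data visible inside the container (the `/scratch/$USER` source and `/data` mount point here are just illustrative choices):

```bash
# Bind-mount your scratch directory to /data inside the container
singularity shell --nv -B /scratch/$USER:/data tensorflow-19.11-tf1-py3.sif
# Inside the container, your files are now under /data
```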

Extra resources

Compute Canada Wiki on Singularity

Singularity documentation

Sylabs cloud builder

But I want PyTorch

```bash
singularity pull image.sif docker://nvcr.io/nvidia/pytorch:21.12-py3
```

More tags at [NVidia NVCR](https://catalog.ngc.nvidia.com/containers).
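Analogous to the TensorFlow check above, you can verify GPU access from the PyTorch image; a minimal sketch, run on a GPU node, with the image name matching whatever you pulled:

```bash
# Quick GPU smoke test inside the PyTorch container
singularity exec --nv image.sif python -c "import torch; print(torch.cuda.is_available())"
```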

Owner

  • Name: Ben Cardoen
  • Login: bencardoen
  • Kind: user
  • Location: Vancouver
  • Company: https://github.com/sfu-mial

PhD Student Computing Science @sfu-mial Simon Fraser University

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 16
  • Total Committers: 1
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • bencardoen (2****n): 16 commits

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels