https://github.com/biocomputingup/elastic-slurm-in-openstack

Configure an elastic Slurm cluster on OpenStack cloud


Basic Info
  • Host: GitHub
  • Owner: BioComputingUP
  • Language: Shell
  • Default Branch: main
  • Size: 69.3 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago

README.md

Slurm cluster in OpenStack cloud

These Ansible playbooks create and manage a dynamically allocated (elastic) Slurm cluster in an OpenStack cloud. The cluster is based on CentOS 8 (Rocky 8) and OpenHPC 2.x. Slurm configurations are based on the work contained in Jetstream_Cluster.

This repo is based on the project slurm-cluster-in-openstack, adapted for use with the CloudVeneto OpenStack cloud.

Run the following Ansible playbooks on your local PC, not on a Virtual Machine in the OpenStack cloud. Ensure your local machine is set up with the necessary keys and access credentials for the target OpenStack environment. Once the playbooks are executed, you'll have access to your personal and private elastic Slurm cluster in the cloud.

Prerequisites

Install Ansible

Run the `install_ansible.sh` script:

```bash
bash ./install_ansible.sh
```
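After the install step, it can help to confirm that the tools used in the rest of this README are on your `PATH`. The following is only a sketch: the function name `missing_tools` is hypothetical and not part of this repo, and the list of tools you pass to it is up to you.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): print every command from the
# argument list that cannot be found on PATH, one per line.
missing_tools() {
  local tool missing=()
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing+=("$tool")
  done
  # Print nothing when all tools are present.
  if [ "${#missing[@]}" -gt 0 ]; then
    printf '%s\n' "${missing[@]}"
  fi
}
```

For example, `missing_tools ansible-playbook ssh wget` prints nothing when all three commands are available.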

Configure CloudVeneto gateway (Gate) for SSH access

For this you need a CloudVeneto account and access to the Gate machine (`cv_user` and `cv_pass`):

```bash
# generate a new key pair locally (preferably with a passphrase); skip and adapt if you already have a key pair
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_vm

# copy the public key to the Gate machine (it will ask for cv_pass)
cat ~/.ssh/id_ed25519_vm.pub | \
  ssh cv_user@gate.cloudveneto.it 'cat >id_ed25519_vm.pub && \
    mkdir -p .ssh && \
    chmod 700 .ssh && \
    mv id_ed25519_vm.pub .ssh/id_ed25519_vm.pub && \
    cat .ssh/id_ed25519_vm.pub >>.ssh/authorized_keys'

# copy the private key to the Gate machine (it will ask for cv_pass)
cat ~/.ssh/id_ed25519_vm | \
  ssh cv_user@gate.cloudveneto.it \
    'cat >.ssh/id_ed25519_vm && chmod 600 .ssh/id_ed25519_vm'

# connect to the Gate machine (it will ask for the SSH key passphrase, if used)
ssh -i ~/.ssh/id_ed25519_vm cv_user@gate.cloudveneto.it
```

If you also have the credentials and IP of a VM running in the cloud (`vm_user`, `vm_pass`, `vm_ip`), you can import the key pair to it:

```bash

# copy the public key from the Gate machine to the VM (it will ask for vm_pass)
cat ~/.ssh/id_ed25519_vm.pub | \
  ssh vm_user@vm_ip 'cat >.ssh/id_ed25519_vm.pub && \
    cat .ssh/id_ed25519_vm.pub >>.ssh/authorized_keys'

# test the connection to the VM from the Gate machine (it will ask for the SSH key passphrase, if used)
ssh -i ~/.ssh/id_ed25519_vm vm_user@vm_ip exit
```

Accessing a VM from your local machine requires proxying the SSH connection through the CloudVeneto Gate. You can do this with the following SSH command:

```bash

# (optionally) add the key to ssh-agent (it may ask for the SSH key passphrase)
ssh-add ~/.ssh/id_ed25519_vm

# connect to the VM via the proxy
ssh -i ~/.ssh/id_ed25519_vm \
  -o StrictHostKeyChecking=accept-new \
  -o ProxyCommand="ssh -i ~/.ssh/id_ed25519_vm \
    -W %h:%p cv_user@gate.cloudveneto.it" \
  vm_user@vm_ip
```

You can simplify the SSH connection to the VM by configuring your SSH config file:

```bash

# update ssh config with proxy and headnode
cat <<EOF | tee -a ~/.ssh/config

Host cvgate
    HostName gate.cloudveneto.it
    User cv_user
    IdentityFile ~/.ssh/id_ed25519_vm

Host vm
    HostName vm_ip
    User vm_user
    IdentityFile ~/.ssh/id_ed25519_vm
    UserKnownHostsFile /dev/null
    StrictHostKeyChecking=accept-new
    ProxyJump cvgate
EOF
```

Test the connection:

```bash

# connect to the VM
ssh vm

# copy files to and from the VM with scp
scp localdir/file vm:remotedir/
scp vm:remotedir/file localdir/

# or rsync
rsync -ahv localdir/ vm:remotedir/
rsync -ahv vm:remotedir/ localdir/
```
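Because `tee -a` appends on every run, repeating the config step would duplicate the `Host` entries. A possible way around this is sketched below; `add_ssh_host` is a hypothetical helper that is not part of this repo, and it only checks for an exact `Host <alias>` line before appending.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): append a Host block to an SSH
# config file only if no exact "Host <alias>" line exists there yet.
# Usage: add_ssh_host <config_file> <alias> <hostname> <user> <identity_file> [proxy_alias]
add_ssh_host() {
  local config=$1 host_alias=$2 hostname=$3 user=$4 identity=$5 proxy=${6:-}
  touch "$config"
  # Nothing to do when the alias is already defined.
  if grep -qxF "Host $host_alias" "$config"; then
    return 0
  fi
  {
    printf '\nHost %s\n' "$host_alias"
    printf '    HostName %s\n' "$hostname"
    printf '    User %s\n' "$user"
    printf '    IdentityFile %s\n' "$identity"
    # ProxyJump is only written when a proxy alias is given.
    if [ -n "$proxy" ]; then
      printf '    ProxyJump %s\n' "$proxy"
    fi
  } >> "$config"
}
```

A call such as `add_ssh_host ~/.ssh/config cvgate gate.cloudveneto.it cv_user '~/.ssh/id_ed25519_vm'` can then be repeated safely. The sketch leaves out the `UserKnownHostsFile` and `StrictHostKeyChecking` options used above.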

Deploy Slurm Cluster

Download latest Rocky Linux 8 image

```bash
wget https://dl.rockylinux.org/pub/rocky/8/images/x86_64/Rocky-8-GenericCloud-Base.latest.x86_64.qcow2

# no need to upload it to OpenStack, Ansible will do it
# openstack image create --disk-format qcow2 --container-format bare --file Rocky-8-GenericCloud-Base.latest.x86_64.qcow2 rocky-8
```

Configure cluster

Copy vars/main.yml.example to vars/main.yml and adjust to your needs.

Copy clouds.yaml.example to clouds.yaml and adjust with OpenStack credentials.
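The two copy steps can be scripted so that an already edited file is never overwritten. This is a minimal sketch; `init_config` is a hypothetical helper name, not something shipped with the playbooks.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): copy <file>.example to <file>
# unless <file> already exists, so local edits are preserved.
init_config() {
  local target=$1
  if [ -e "$target" ]; then
    echo "keeping existing $target"
  else
    cp "$target.example" "$target" && echo "created $target"
  fi
}
```

Run from the repository root, `init_config vars/main.yml` and `init_config clouds.yaml` would cover both files; you still need to edit them afterwards.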

Deployment

Deployment is done in four steps:

1. Create the head node
2. Provision the head node
3. Create and provision the compute node
4. Create the compute node image

Create the head node

```bash
ansible-playbook create_headnode.yml
```

Provision the head node

```bash
ansible-playbook provision_headnode.yml
```

Create and provision the compute node

```bash
ansible-playbook create_compute_node.yml
```

Create compute node image

```bash
ansible-playbook create_compute_image.yml
```

All-in-one deployment

```bash
time ( \
  ansible-playbook create_headnode.yml && \
  ansible-playbook provision_headnode.yml && \
  ansible-playbook create_compute_node.yml && \
  ansible-playbook create_compute_image.yml && \
  echo "Deployment completed" || echo "Deployment failed" )
```

or fancy with notifications:

```bash
/bin/time -f "\n### overall time: \n### wall clock: %E" /bin/bash -c '\
  /bin/time -f "\n### timing \"%C ...\"\n### wall clock: %E" ansible-playbook create_headnode.yml && \
  /bin/time -f "\n### timing \"%C ...\"\n### wall clock: %E" ansible-playbook provision_headnode.yml && \
  /bin/time -f "\n### timing \"%C ...\"\n### wall clock: %E" ansible-playbook create_compute_node.yml && \
  /bin/time -f "\n### timing \"%C ...\"\n### wall clock: %E" ansible-playbook create_compute_image.yml && \
  echo "Deployment completed" | tee /dev/tty | notify-send -t 0 "$(</dev/stdin)" || \
  echo "Deployment failed" | tee /dev/tty | notify-send -t 0 "$(</dev/stdin)"'
```
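If you want to know which of the four steps failed, a small wrapper function is one option. The sketch below is an illustration only: `deploy_cluster` and the `PLAYBOOK_CMD` variable are assumptions introduced here, not part of the repo. `PLAYBOOK_CMD` defaults to `ansible-playbook` and exists so the runner can be swapped for a dry run.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this repo): run the four playbooks in
# order and stop at the first failure, naming the playbook that failed.
# PLAYBOOK_CMD is an assumed override; it defaults to ansible-playbook.
deploy_cluster() {
  local runner=${PLAYBOOK_CMD:-ansible-playbook} playbook
  for playbook in create_headnode.yml provision_headnode.yml \
                  create_compute_node.yml create_compute_image.yml; do
    echo ">>> $playbook"
    if ! "$runner" "$playbook"; then
      echo "Deployment failed at $playbook" >&2
      return 1
    fi
  done
  echo "Deployment completed"
}
```

Calling `deploy_cluster` runs the real deployment, while `PLAYBOOK_CMD=echo deploy_cluster` only prints the playbook names in order.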

Cleanup

Delete all cloud resources with:

```bash
ansible-playbook destroy_cluster.yml
```

Usage examples

Connect to the head node

Connect to the head node of your Slurm cluster via SSH using the CloudVeneto proxy machine. Replace the SSH key with your proxy or OpenStack cloud key, cv_user with your proxy machine username and headnode_ip with the private IP of the head node.

```bash
ssh -i ~/.ssh/id_ed25519_vm \
  -o StrictHostKeyChecking=accept-new \
  -o ProxyCommand="ssh -i ~/.ssh/id_ed25519_vm \
    -W %h:%p cv_user@gate.cloudveneto.it" \
  rocky@headnode_ip
```

Check Slurm status

Show Slurm nodes and partitions info:

```bash
sinfo
```

Show jobs and scheduling info:

```bash
squeue -al
```

Continuously monitor Slurm queues and job status:

```bash
watch -d "\
sinfo -N -S '-P' -o '%8N %9P %.5T %.13C %.8O %.8e %.6m %.8d %.6w %.8f %20E'|cut -c-\$COLUMNS; echo; echo; \
squeue --format='%12i %10j %6u %8N %4P %4C %7m %8M %10T %16R %o'|cut -c-\$COLUMNS; echo; echo; \
sacct -X -a --format JobID,User,JobName,Partition,AllocCPUS,State,ExitCode,End,ElapsedRaw|tail|tac|grep -v 'JobID\|^---'|awk 'BEGIN{print \" JobID User JobName Partition AllocCPUS State ExitCode End ElapsedRaw\n------------ --------- ---------- ---------- ---------- ---------- -------- ------------------- ----------\"}{print}'|cut -c-\$COLUMNS"
```
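On an elastic cluster it is often enough to see how many nodes are in each state. A rough sketch of that, using a hypothetical `count_node_states` function that reads `sinfo -N -h -o '%N %T'` output from stdin (node name and state, one pair per line):

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): summarize node states from
# "sinfo -N -h -o '%N %T'" output read on stdin.
count_node_states() {
  # A node may be listed once per partition, so count each node name only once.
  awk '!seen[$1]++ { count[$2]++ }
       END { for (state in count) print state, count[state] }' | LC_ALL=C sort
}
```

Usage on the head node would be `sinfo -N -h -o '%N %T' | count_node_states`. In Slurm, a state with a `~` suffix such as `idle~` marks a node that is currently powered off.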

Submit test jobs

Run a quick test job:

```bash
sbatch --wrap 'sleep 10'
```

Submit a (stupid) CPU-intensive task with two threads in parallel:

```bash

# create work folder
mkdir slurm-test && cd slurm-test

# create simple.sh worker script
cat <<'EOF' | tee simple.sh
#!/bin/bash
#SBATCH -J simplejob
#SBATCH -o "%x"."%A"."%a".out
#SBATCH -e "%x"."%A"."%a".err
#SBATCH --mail-type=ALL

echo -e "$(date)\tStarting job $SLURM_JOB_ID:$SLURM_ARRAY_TASK_ID on $SLURMD_NODENAME ..."
if [ -n "$1" ]; then
  rnd=$1
else
  rnd=$(shuf -i 5-30 -n 1)
fi
echo "working for $rnd s ..."
yes > /dev/null &
ypid=$!
yes > /dev/null &
ypid2=$!
sleep $rnd
echo "killing job $ypid ..."
{ kill $ypid && wait $ypid; } 2>/dev/null
echo "killing job $ypid2 ..."
{ kill $ypid2 && wait $ypid2; } 2>/dev/null
echo "all done, exiting with 0"
ex=$?
echo -e "$(date)\tJob $SLURM_JOB_ID:$SLURM_ARRAY_TASK_ID ended with $ex"
exit $ex
EOF

# submit as an array job allocating 2 CPUs per job (max runtime of 1 min; max 1G memory per job)
rm -f *.{err,out}; sbatch -n2 -a 1-5 --time 1 --mem=1G simple.sh

# these longer jobs should time out
rm -f *.{err,out}; sbatch -n2 -a 1-5 --time 1 --mem=1G simple.sh 120
```

Allocate an interactive session on a compute node (type `exit` to return to the head node):

```bash
salloc --time 1-0
```
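To check whether the array tasks behaved as expected (for example, that the 120 s variant really timed out), the per-task states reported by `sacct` can be filtered. The sketch below assumes input in the `JobID|State` form produced by `sacct -X -n -P --format=JobID,State`; `failed_tasks` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): read "JobID|State" lines on
# stdin and print every task whose state is not COMPLETED.
failed_tasks() {
  awk -F'|' '$2 != "COMPLETED" { print $1, $2 }'
}
```

One way to use it is to capture the job ID at submission with `jobid=$(sbatch --parsable -n2 -a 1-5 --time 1 --mem=1G simple.sh 120)` and later run `sacct -X -n -P -j "$jobid" --format=JobID,State | failed_tasks`. Tasks that are still pending or running are listed too, since the filter only excludes `COMPLETED`.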

Owner

  • Name: BioComputing Group, University of Padova
  • Login: BioComputingUP
  • Kind: organization
  • Email: biocomp@bio.unipd.it
  • Location: Italy
