https://github.com/bencardoen/slurmmonitor.jl
A monitor for SLURM HPC schedulers that notifies on preset conditions (downtime, latency) and connects to Slack (optional)
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ✓ DOI references (found 6 DOI reference(s) in README)
- ✓ Academic publication links (links to: zenodo.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity (13.9%) to scientific vocabulary)
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
SlurmMonitor
SlurmMonitor monitors SLURM-based HPC clusters for status, records the data over time, and, if configured, can act on predefined conditions.
Linking to Slack
**You need admin rights to do this, and do not create public endpoints without realizing what they can do.**
- Log in to Slack
- Settings and Admin
- "Manage Apps"
- "Build"
- Create a new app
- Activate a new webhook
- This generates an endpoint of the form "https://hooks.slack.com/services/XXX/YYY/zzz"
- Save it in a file 'endpoint.txt' (see the sketch after this list)
- Pass the location of this file to monitor.jl (see below)
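The endpoint file is just the webhook URL on a single line. A minimal sketch of creating it from Julia; the URL shown is the placeholder from above, not a real endpoint:

```julia
# endpoint.txt must contain a single line: the webhook URL
write("endpoint.txt", "https://hooks.slack.com/services/XXX/YYY/zzz\n")
```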
Test whether the link works, where `$URL` holds your webhook endpoint:

```bash
curl -X POST -H 'Content-type: application/json' --data '{"text":"Hello, World!"}' $URL
```
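If you prefer testing from Julia instead of curl, a minimal sketch assuming the HTTP.jl package is installed (it is not a dependency listed by this README):

```julia
using HTTP
# Read the webhook URL saved earlier, stripping any trailing newline
url = String(strip(read("endpoint.txt", String)))
# Slack incoming webhooks answer HTTP 200 with body "ok" on success
resp = HTTP.post(url, ["Content-Type" => "application/json"], """{"text":"Hello, World!"}""")
println(resp.status)
```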
Installation
Install the monitor on a login node; this assumes your HPC admins are OK with you doing so.

```bash
git clone <thisrepo>
cd SlurmMonitor.jl
```
Then start Julia:

```bash
julia
```

```julia
julia> using Pkg; Pkg.add(path=".")
```
or

```bash
julia
```

then

```julia
julia> using Pkg; Pkg.activate(".") # Activate the environment in the current directory, optional
julia> using Pkg; Pkg.add(url="<thisrepo>")
```
Test the integration with Slack:

```bash
julia --project=. # assuming you're in the cloned directory
```

Then:

```julia
using SlurmMonitor
endpoint = readendpoint("endpoint.txt")
posttoslack("42 is the answer", endpoint)
```
That either posts the message or tells you why it couldn't. Make sure the URL has the form /services/.../.../...; see the Slack app configuration page for how to fix it if it is invalid.
Usage
The monitor polls at interval `i` (seconds), repeating `r` times, with minimum acceptable latency `l`, and saves to output directory `o`. Triggers (a node going down, latency spikes) send optional messages to the Slack endpoint `e`. It needs an endpoint file (one line) containing the endpoint (see earlier). You'd run this within a tmux/screen session to keep it in the background (a tmux sketch follows at the end of this section).
Example
Every minute, for 1e4 minutes, run the monitor and message the Solar Slack endpoint if issues arise.

```bash
julia --project=. src/monitor.jl -i 60 -r 10000 -o . -e endpoint_solar.txt -l 40
```
This will save a CSV file, updated every `i` seconds for `r` iterations, where one line represents the state of each node in the cluster, recording total/free CPU/RAM/GPU and node status (IDLE, ALLOC, ...).
On specified conditions (e.g. a node transitioning IDLE->DOWN), it will send messages to the linked Slackbot, configured with the right endpoint.
If a node is not responsive over the network, a similar trigger fires. Define the minimum average latency you consider not-reachable with the `-l` flag.
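To keep the monitor alive after you log out, a minimal sketch using tmux (the session name `slurmmon` is arbitrary):

```bash
# Start the monitor in a detached tmux session; reattach later with: tmux attach -t slurmmon
tmux new-session -d -s slurmmon \
  'julia --project=. src/monitor.jl -i 60 -r 10000 -o . -e endpoint_solar.txt -l 40'
```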
Output
Output is saved to observed_state.csv.
Do not move the CSV file; it is continuously read from and written to.
See src/SlurmMonitor.jl, e.g. summarizestate($DATAFRAME, $ENDPOINT).
```julia
using Pkg
Pkg.activate(".")
using DataFrames
using CSV
using SlurmMonitor # provides readendpoint, summarizestate, plotstats
df = CSV.read("where.csv", DataFrame)
endpoint = readendpoint("whereendpointis.txt")
summarizestate(df, endpoint) ## Sends to slack
plotstats(df) ## Plots in svg
```
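Beyond the built-in helpers you can query the recorded state directly. A minimal sketch counting recorded rows per node status; the column name `:status` is an assumption for illustration, so check the actual header of observed_state.csv:

```julia
using CSV, DataFrames
df = CSV.read("observed_state.csv", DataFrame)
# Count rows per node status (IDLE, ALLOC, DOWN, ...); :status is an assumed column name
combine(groupby(df, :status), nrow => :count)
```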
Dependencies
- Julia https://julialang.org/downloads/
- Requires a link to a Slackbot
- Requires SLURM and its command-line tools (sinfo, scontrol) to be installed (see the check below)
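You can verify from Julia that the SLURM tools are on your PATH before starting; a minimal standalone check, not part of the package:

```julia
# Sys.which returns the full path of an executable, or nothing if it is not on PATH
for tool in ("sinfo", "scontrol")
    path = Sys.which(tool)
    isnothing(path) ? @warn("$tool not found on PATH") : @info("$tool found at $path")
end
```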
Warning
If you run this on a cluster, make sure you're authorized to do so. Calling scontrol and sinfo issues RPC calls that place a non-trivial load on the scheduler: if the cluster has 1000s of nodes and you set the interval to 1 s, that means on the order of 2000 RPC calls per second.
Note that it takes several seconds, if not more, for a node to change state anyway.
Do not do this unless you're a cluster admin.
Sane intervals are ~60-120 seconds or more; a back-of-the-envelope sketch follows.
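To get a feel for the load, a rough calculation; the two-commands-per-poll figure is an assumption for illustration, not a count taken from the source:

```julia
# Rough estimate of scheduler RPC load for a given polling interval
interval_s     = 60   # polling interval in seconds (-i flag)
calls_per_poll = 2    # assumption: one sinfo + one scontrol call per poll
calls_per_day  = calls_per_poll * (24 * 3600 ÷ interval_s)
println(calls_per_day) # 2880 calls/day at a 60 s interval
```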
Extra functionality
- Triggers can be anything, currently node state and latency are used
- Disk usage, NVIDIA drivers, etc. are all implemented but not active (they can trigger an SSH lockout)
- Contact me if you need those active
Troubleshooting
Times seem wrong
Times are recorded in UTC. If you want this handled differently, it's not hard; I'd happily accept a properly documented PR. In the meantime you can convert on the analysis side (see the sketch below).
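A minimal sketch of converting a recorded UTC timestamp to local time, assuming the third-party TimeZones.jl package is installed; the timestamp and zone are example values:

```julia
using Dates, TimeZones
# Interpret a recorded timestamp as UTC, then convert to a local zone
t_utc   = ZonedDateTime(DateTime("2022-09-22T12:00:00"), tz"UTC")
t_local = astimezone(t_utc, tz"America/Vancouver")
```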
Cite
If you find this useful, please cite
```bibtex
@software{ben_cardoen_2022_7106106,
  author    = {Ben Cardoen},
  title     = {{SlurmMonitor.jl: A Slurm monitoring tool that notifies slack on adverse SLURM HPC state changes and records temporal statistics on utilization.}},
  month     = sep,
  year      = 2022,
  note      = {https://github.com/bencardoen/SlurmMonitor.jl},
  publisher = {Zenodo},
  version   = {0.1.0},
  doi       = {10.5281/zenodo.7106106},
  url       = {https://doi.org/10.5281/zenodo.7106106}
}
```
Owner
- Name: Ben Cardoen
- Login: bencardoen
- Kind: user
- Location: Vancouver
- Company: https://github.com/sfu-mial
- Twitter: BenCardoen
- Repositories: 29
- Profile: https://github.com/bencardoen
PhD Student Computing Science @sfu-mial Simon Fraser University
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 4
- Total pull requests: 0
- Average time to close issues: 2 months
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bencardoen (4)