https://github.com/bethgelab/slurm-monitoring-public
Monitor your high performance infrastructure configured over slurm using TIG stack
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.0%) to scientific vocabulary
Keywords
grafana
influxdb
python
slurm
telegraf
Last synced: 7 months ago
Repository
Monitor your high performance infrastructure configured over slurm using TIG stack
Basic Info
Statistics
- Stars: 12
- Watchers: 9
- Forks: 5
- Open Issues: 0
- Releases: 0
Topics
grafana
influxdb
python
slurm
telegraf
Created about 4 years ago · Last pushed almost 4 years ago
https://github.com/bethgelab/slurm-monitoring-public/blob/main/
## Monitoring SLURM with TIG stack

[Python](https://www.python.org/) [MIT License](https://lbesson.mit-license.org/)

## TIG Stack:

TIG stands for Telegraf, InfluxDB, and Grafana. The components work together to take data from the HPC cluster, clean it, check its quality, and push it to the InfluxDB database. Grafana is used for near real-time display of the collected data.

Telegraf: As per the official documentation, Telegraf is a server-based agent for collecting and sending metrics and events from databases, systems, and IoT sensors. Telegraf is written in Go and offers numerous plugins. One plugin used in this project is the `inputs.exec` plugin, which writes data out to InfluxDB through the Telegraf line protocol. More information on the ways data can be sent can be looked up here: https://docs.influxdata.com/telegraf/v1.21/data_formats/input/influx/

InfluxDB: A time-series database designed to collect data in a variety of formats and to scale to larger data sets. It works nicely in conjunction with Telegraf.

Grafana: An open-source dashboarding platform that can handle everything from querying to displaying and alerting on data stored anywhere. Dashboards are adaptable and shareable. Using Grafana has the advantage of making the data available to everybody within an organization, so data-driven apps and processes can be built on top of it.
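To make the `inputs.exec` + line protocol idea concrete, here is a minimal, hypothetical sketch (not one of this repository's actual scripts): it counts SLURM jobs per state with `squeue` and prints a single line-protocol record that Telegraf could forward to InfluxDB. The measurement name `slurm_jobs`, the `cluster` tag, and the field names are made up for the example.

```bash
#!/bin/bash
# Hypothetical parser sketch: count running and pending jobs and emit
# one InfluxDB line protocol record on stdout.
running=$(squeue -h -t RUNNING | wc -l)
pending=$(squeue -h -t PENDING | wc -l)

# Line protocol: <measurement>,<tags> <fields> [timestamp]
echo "slurm_jobs,cluster=mycluster running=${running}i,pending=${pending}i"
```

When such a script is referenced from an `[[inputs.exec]]` block with `data_format = "influx"`, Telegraf parses the printed record and writes it to the configured bucket.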
## Our Architecture:

## Setting up:

1. Clone our repository into your environment:

   ```
   cd /home/
   git clone https://github.com/bethgelab/slurm-monitoring-public.git
   ```

2. Initialization of the environments:

   - At source: The project repository `slurm-monitoring-public` contains an `.env.template`. Use it to set up the required environment variables: create a new file called `.env` (`touch .env`) and fill in the values in accordance with `.env.template`.

     ```
     #!/bin/bash
     eval $(ssh-agent)
     # Example: ~/.ssh/id_for_headnode
     export KEY=/your/absolute/location/id
     # Example: ~/.ssh/secrets/.myscrt -> points to the secret of the key
     export SECRET_LOCATION=/your/absolute/location/.mysecret
     cat $SECRET_LOCATION | SSH_ASKPASS=/bin/cat setsid -w ssh-add $KEY
     ```

     If the private key is protected by a secret (passphrase), that too must be specified. Note: this configuration has not been tested against a key without a secret. As above, set all the required items.

   - At scripts: Within the `scripts` folder there is another `.env.template`; use it to set up the environment required for the parsing scripts. Create a new file called `.env` within the `scripts` folder and fill in the values based on the `.env.template` file.

     ```
     #!/bin/bash
     export USER=ironman
     export ACCOUNT=marvel
     export PATHS="/your/qb/location/$ACCOUNT|/your/qb/$ACCOUNT"
     export QBPATHS="/your/qb/work|/your/qb/home|/your/qb"
     export HEAD=
     ```

     As above, specify the username for the head node, the account the user belongs to and has permission for, the PATHS and QBPATHS locations, and the head node IP (`HEAD`).

   - At influx: Within the `influx` directory there is an `.env` file. The important information within it must be filled in.

     ```
     # InfluxDB options
     # Can be setup/upgrade
     DOCKER_INFLUXDB_INIT_MODE=setup
     DOCKER_INFLUXDB_INIT_USERNAME=influx
     DOCKER_INFLUXDB_INIT_PASSWORD=influx
     DOCKER_INFLUXDB_INIT_ORG=organization
     DOCKER_INFLUXDB_INIT_BUCKET=bucketname
     # If you are importing the grafana_dashboard.json file, make sure to change all Flux queries to point to this bucket name
     # DOCKER_INFLUXDB_INIT_RETENTION
     # 32 alphanumeric character token. If not specified, it will be auto-generated.
     DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=admintoken
     ```

     The mode should be `setup`; the username and password should be set individually; InfluxDB requires an organization; the bucket is where all the measurements will be written; and the admin token is either auto-generated if not specified, or a 32-character token can be provided.

   - At grafana: There is an `.env` within the `grafana` folder; use it to set up the username and password for the application.

     ```
     # Grafana options
     GF_SECURITY_ADMIN_USER=grafana
     GF_SECURITY_ADMIN_PASSWORD=grafana
     GF_INSTALL_PLUGINS=
     ```

3. Starting the influx and grafana docker containers:

   ```
   # If make is not installed, install it with: sudo apt install make
   # (this is also mentioned below during the installation and setup of telegraf)
   make up-influx   # Starts the docker container for influx using the above envs
   make up-grafana  # Starts the docker container for grafana using the above envs
   ```

   This will start the influx and grafana applications as docker containers. Please look into `docker-compose.yml` for the ports being exposed for each of these services. To verify that they are working, open the URLs below:

   ```
   http://<host>:8085/ -> Opens the InfluxDB UI. Use the username and password from the environment configuration in influx/.env.
   http://<host>:3001  -> Opens the Grafana interface. Use the username and password configured in grafana/.env.
   ```
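   The same check can also be done from the command line. Here is a small optional sketch, assuming the ports shown above and the standard health endpoints of InfluxDB 2.x (`/health`) and Grafana (`/api/health`); adjust the host and ports to your setup.

   ```bash
   #!/bin/bash
   # Quick reachability check for the two containers (assumes the ports above).
   HOST=localhost   # replace with your VM's IP or hostname

   # InfluxDB 2.x exposes /health and returns a JSON status ("pass" when healthy).
   curl -s "http://${HOST}:8085/health"

   # Grafana exposes /api/health for the same purpose.
   curl -s "http://${HOST}:3001/api/health"
   ```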
4. There is one very important step to do, which is to connect the Grafana data source to InfluxDB. Open Grafana as described above, go to data sources, create a new data source, and choose InfluxDB. Note that the setup requires you to enter all the information configured in `influx/.env` into the application. If you cannot connect, it is most likely due to authorization: in that case, add a custom header named `Authorization` with the value `Token <your-token>`. If the error persists, please look through the Telegraf discussion forum or GitHub issues for Flux connectivity problems.

5. Once step 3 and step 4 are successful, let us set up Telegraf on the VM itself:

   a. Installation of Telegraf should be done first. Use this documentation and perform the required operations: https://docs.influxdata.com/telegraf/v1.21/introduction/installation/

   b. Once (a) is done, Telegraf is able to collect the various metrics.

   c. Install make with `sudo apt install make`.

   d. Once the repository is cloned, the initial environment is set as per step 2, and steps 3 and 4 are completed, go to `telegraf/telegraf.conf` and change the InfluxDB configuration shown below to your details; the comments describe the information to be filled in. Fill in all the information and save the `telegraf.conf` file:

   ```
   [[outputs.influxdb_v2]]
     # The URLs of the InfluxDB cluster nodes.
     #
     # Multiple URLs can be specified for a single cluster, only ONE of the
     # urls will be written to each interval.
     # urls exp: http://127.0.0.1:8086
     # Place the IP-Address:PORT
     urls = ["http://127.0.0.1:8086"]

     # Token for authentication. (Created influx or generated token)
     token = ""

     # Organization is the name of the organization you wish to write to; must exist.
     organization = ""

     # Destination bucket to write into.
     bucket = ""

     insecure_skip_verify = true
   ```

   e. Once a-d are all done, you can start Telegraf collecting metrics with `make start-collection`. To understand the command, look at the `start-collecting.sh` file.

6. To check the output of this, run `tail -f nohup.out`.

7. Log in to Grafana and import the already stored dashboard JSON file at `grafana/slurm_dashboard.json`.

8. Now you should be seeing the collected metrics.

## Adding new parsers

To add new parsers and any logic to collect information from SLURM, you can add it in the `scripts/slurm` folder and point to it from `main.sh`. Sometimes it may not be possible to collect many metrics simultaneously; in that case, add a new `inputs.exec` plugin entry in `telegraf/telegraf.conf` pointing to a new `file.sh`, thus allowing the collection of the parsed data. Note that every parser must ultimately send out its data in the line protocol format (https://docs.influxdata.com/telegraf/v1.21/data_formats/input/influx/) only.

```
# Exec plugin example:
[[inputs.exec]]
  ## Commands array
  commands = [ "scripts/main2.sh" ]
  timeout = "10s"
  interval = "30s"
  data_format = "influx"
```

## Maintainers:

[Tübingen AI Center](https://tuebingen.ai/)

## Acknowledgements:

1. https://github.com/slaclab/slurm-telegraf.git
2. Nicolas Chan. 2019. A Resource Utilization Analytics Platform Using Grafana and Telegraf for the Savio Supercluster. In Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning) (PEARC '19). Association for Computing Machinery, New York, NY, USA, Article 31, 1-6. DOI: https://doi.org/10.1145/3332186.3333053

## Our Grafana Dashboard:

[Grafana Dashboard](docs/slurm_dash1.png)
Owner
- Name: Bethge Lab
- Login: bethgelab
- Kind: organization
- Location: Tübingen
- Website: http://bethgelab.org
- Repositories: 23
- Profile: https://github.com/bethgelab
Perceiving Neural Networks
GitHub Events
Total
- Watch event: 4
- Fork event: 1
Last Year
- Watch event: 4
- Fork event: 1