https://github.com/ceems-dev/ceems

A Prometheus exporter and a REST API server to export metrics of compute units of resource managers like SLURM, Openstack, k8s, etc.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.7%) to scientific vocabulary

Keywords

cloud containers dashboards ebpf emissions energy-monitor grafana green-computing hpc json-api kubernetes metrics-server metrics-visualization monitoring observability openstack performance-monitoring prometheus prometheus-exporter slurm

Keywords from Contributors

sequencing genomics interactive projection archival optim embedded autograding hacking shellcodes
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 42
  • Watchers: 4
  • Forks: 4
  • Open Issues: 5
  • Releases: 32
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Security

README.md

Compute Energy & Emissions Monitoring Stack (CEEMS)

Compute Energy & Emissions Monitoring Stack (CEEMS) (pronounced as kiːms) contains a Prometheus exporter that exports metrics of compute units and a REST API server that serves the metadata and aggregated metrics of each compute unit. Optionally, it includes a TSDB load balancer that supports basic access control on the TSDB so that one user cannot access another user's metrics.

"Compute Unit" in the current context has a wider scope. It can be a batch job in HPC, a VM in the cloud, a pod in k8s, etc. The main objective of the repository is to quantify the energy consumed by each "compute unit" and to estimate its emissions. The repository itself does not provide any frontend apps to show dashboards; it is meant to be used along with Grafana and Prometheus to show statistics to users.

Although CEEMS was born out of a need to monitor the energy and carbon footprint of compute workloads, it supports monitoring performance metrics as well. In addition, it leverages the eBPF framework to monitor IO and network metrics in a resource-manager-agnostic way. It also supports eBPF-based, zero-instrumentation continuous profiling of compute units.

🎯 Features

  • Monitors energy, performance, IO and network metrics for different types of resource managers (SLURM, Openstack, k8s)
  • Supports different energy sources like RAPL, HWMON, Cray's PM Counters and BMC via IPMI or Redfish
  • Supports NVIDIA (MIG, time sharing, MPS and vGPU) and AMD GPUs (partitions such as CPX, QPX, TPX, DPX)
  • Supports zero-instrumentation, eBPF-based continuous profiling using Grafana Pyroscope as backend
  • Realtime access to metrics via Grafana dashboards or a simple CLI tool
  • Multi-tenancy and access control to Prometheus and Pyroscope datasources in Grafana
  • Stores aggregated metrics in a separate DB that can be retained for a long time
  • CEEMS apps are capability aware

⚙️ Install CEEMS

[!WARNING] DO NOT USE pre-release versions as the API has changed quite a lot between the pre-release and stable versions.

Installation instructions for CEEMS components can be found in the docs.

📽️ Demo

Access Demo

Openstack and SLURM have been deployed on a small cloud instance and monitored using CEEMS. As neither RAPL nor IPMI readings are available on cloud instances, energy consumption is estimated by assuming a Thermal Design Power (TDP) value and current usage of the instance. Several dashboards have been created in Grafana for visualizing metrics which are listed below.

  • [Overall usage of cluster](https://ceems-demo.myaddr.tools/d/adrenju36n2tcb/cluster-status?orgId=1&from=now-24h&to=now&var-job=openstack&var-host=$__all&var-provider=rte&var-country_code=FR&refresh=15m)
  • [Usage of different Projects/Accounts by SLURM and Openstack](https://ceems-demo.myaddr.tools/d/cdreu45pp9erkd/user-and-project-stats?orgId=1&from=now-90d&to=now&refresh=15m)
  • [Usage of Openstack resources by a given user and project](https://ceems-demo.myaddr.tools/d/be5x3it7gpx4wf/openstack-instance-summary?orgId=1&from=now-90d&to=now&var-user=gazoo&var-account=cornerstone&refresh=15m)
  • [Usage of SLURM resources by a given user and project](https://ceems-demo.myaddr.tools/d/fdsm8aom8hqf4fewfwe3123dascdsc/slurm-job-summary?orgId=1&from=now-90d&to=now&var-user=wilma&var-account=bedrock&refresh=15m)

[!WARNING] All the dashboards provided in the demo instance are only meant to be for demonstrative purposes. They should not be used in production without properly protecting datasources.

Visualizing metrics with Grafana

Grafana can be used for visualization of metrics and below are some of the screenshots of dashboards.

Time series compute unit CPU metrics

Time series compute unit GPU metrics

List of compute units of user with aggregate metrics

Aggregate usage metrics of a user

Aggregate usage metrics of a project

Energy usage breakdown between project members

Usage metrics via CLI tool

CEEMS ships a CLI tool for presenting usage metrics to end users in deployments where using Grafana is not possible or is impractical.

cacct --starttime="2025-01-01" --endtime="2025-03-22"
┌─────────┬─────────┬──────────┬──────────┬──────────┬─────────────┬──────────────────────────────────────┬──────────┬──────────┬─────────────┬──────────────────────────────────────┐
│ JOB ID  │ ACCOUNT │ ELAPSED  │ CPU      │ CPU MEM. │ HOST        │ HOST EMISSIONS(GMS)                  │ GPU      │ GPU MEM. │ GPU         │ GPU EMISSIONS(GMS)                   │
│         │         │          │ USAGE(%) │ USAGE(%) │ ENERGY(KWH) │ EMAPS_TOTAL │ OWID_TOTAL │ RTE_TOTAL │ USAGE(%) │ USAGE(%) │ ENERGY(KWH) │ EMAPS_TOTAL │ OWID_TOTAL │ RTE_TOTAL │
├─────────┼─────────┼──────────┼──────────┼──────────┼─────────────┼─────────────┼────────────┼───────────┼──────────┼──────────┼─────────────┼─────────────┼────────────┼───────────┤
│ 106     │ bedrock │ 00:10:05 │ 99.32    │ 3.39     │ 0.053818    │ 4.725182    │ 5.648855   │ 3.860008  │          │          │             │             │            │           │
│ 108     │ bedrock │ 00:10:04 │ 99.60    │ 2.51     │ 0.055842    │ 5.091815    │ 5.840380   │ 4.197307  │          │          │             │             │            │           │
│ 118     │ bedrock │ 00:10:03 │ 99.65    │ 1.17     │ 0.061474    │ 4.450334    │ 6.512757   │ 3.683035  │          │          │             │             │            │           │
│ 131     │ bedrock │ 00:10:04 │ 99.71    │ 2.15     │ 0.055742    │ 1.835111    │ 5.562944   │ 1.245254  │          │          │             │             │            │           │
│ 134     │ bedrock │ 00:20:12 │ 0.53     │ 0.73     │ 0.004463    │ 0.030868    │ 0.100538   │ 0.021321  │          │          │             │             │            │           │
│ 138     │ bedrock │ 00:10:00 │ 99.61    │ 1.17     │ 0.056302    │ 2.595522    │ 5.570695   │ 1.837668  │          │          │             │             │            │           │
│ 150     │ bedrock │ 00:20:11 │ 0.54     │ 0.74     │ 0.003862    │ 0.076767    │ 0.086878   │ 0.058934  │          │          │             │             │            │           │
│ 154     │ bedrock │ 00:10:19 │ 99.48    │ 2.86     │ 0.055671    │ 4.906742    │ 6.610783   │ 4.127894  │          │          │             │             │            │           │
│ 162     │ bedrock │ 00:10:22 │ 96.51    │ 3.66     │ 0.055507    │ 3.274911    │ 4.711376   │ 2.497813  │          │          │             │             │            │           │
│ 163     │ bedrock │ 00:10:28 │ 99.71    │ 3.03     │ 0.051746    │ 3.673949    │ 4.392128   │ 2.780309  │          │          │             │             │            │           │
│ 169     │ bedrock │ 00:10:19 │ 99.71    │ 1.17     │             │             │            │           │          │          │             │             │            │           │
│ 181     │ bedrock │ 00:20:14 │ 0.56     │ 0.74     │ 0.001518    │ 0.115373    │ 0.085070   │ 0.081976  │ 36.31    │ 38.11    │ 0.184776    │ 14.042940   │ 10.354560  │ 9.977878  │
│ 183     │ bedrock │ 00:10:09 │ 99.68    │ 1.17     │ 0.049606    │ 3.676648    │ 2.779826   │ 2.926728  │ 37.87    │ 37.97    │ 0.187746    │ 13.919683   │ 10.521023  │ 11.077016 │
│ 229     │ bedrock │ 00:10:21 │ 99.57    │ 1.99     │ 0.048258    │ 1.930318    │ 2.704308   │ 1.109933  │ 38.71    │ 37.36    │ 0.197287    │ 7.891462    │ 11.055660  │ 4.537591  │
│ 232     │ bedrock │ 00:10:24 │ 99.63    │ 1.17     │ 0.050244    │ 1.385482    │ 2.815615   │ 0.954640  │ 31.90    │ 35.88    │ 0.131236    │ 3.618456    │ 7.354267   │ 2.493479  │
│ 269     │ bedrock │ 00:10:01 │ 99.69    │ 1.17     │ 0.048866    │ 2.738386    │ 2.123290   │ 22.18     │ 24.35    │ 0.026367 │ 1.477547    │ 1.141505    │            │           │
│ 274     │ bedrock │ 00:10:16 │ 97.72    │ 3.49     │ 0.054060    │ 3.029430    │ 2.324568   │           │          │          │             │             │            │           │
├─────────┼─────────┴──────────┴──────────┴──────────┴─────────────┴─────────────┴────────────┴───────────┴──────────┴──────────┴─────────────┴─────────────┴────────────┴───────────┤
│ Summary │                                                                                                                                                                          │
├─────────┼─────────┬──────────┬──────────┬──────────┬─────────────┬─────────────┬────────────┬───────────┬──────────┬──────────┬─────────────┬─────────────┬────────────┬───────────┤
│ 20      │ bedrock │ 03:23:27 │ 69.84    │ 1.73     │ 0.706980    │ 37.769023   │ 59.189969  │ 33.830679 │ 35.74    │ 35.32    │ 0.727410    │ 39.472541   │ 40.763058  │ 29.227470 │
└─────────┴─────────┴──────────┴──────────┴──────────┴─────────────┴─────────────┴────────────┴───────────┴──────────┴──────────┴─────────────┴─────────────┴────────────┴───────────┘

⚡️ Talks and Demos

🤝 Adopters

  • CEEMS currently runs on the Jean Zay HPC platform, which has a daily churn of around 25k jobs, with a scrape interval of 10s.

👍 Contributing

We welcome contributions to this project. We hope to see it grow and become a useful tool for people who are interested in the energy and carbon footprint of their workloads. A comprehensive guide can be found in CONTRIBUTING.md.

Please feel free to open issues and/or discussions for any potential ideas for improvement.

🙏 Acknowledgements

  • The Grid5000 platform, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations, has been extensively used in the development of CEEMS.

  • The demo instance has been deployed on the CROCC which was kindly sponsored by ISDM MESO in Montpellier, France.

⭐️ Project assistance

If you want to say thank you and/or support active development of CEEMS:

Owner

  • Name: CEEMS Project
  • Login: ceems-dev
  • Kind: organization
  • Location: France

Development of CEEMS and its related components to measure performance, energy and emissions of compute workloads of SLURM, Openstack and Kubernetes

GitHub Events

Total
  • Create event: 13
  • Issues event: 2
  • Release event: 2
  • Watch event: 1
  • Delete event: 9
  • Push event: 39
  • Pull request event: 24
Last Year
  • Create event: 13
  • Issues event: 2
  • Release event: 2
  • Watch event: 1
  • Delete event: 9
  • Push event: 39
  • Pull request event: 24

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 798
  • Total Committers: 5
  • Avg Commits per committer: 159.6
  • Development Distribution Score (DDS): 0.281
Past Year
  • Commits: 379
  • Committers: 5
  • Avg Commits per committer: 75.8
  • Development Distribution Score (DDS): 0.533
Top Committers
Name Email Commits
Mahendra Paipuri m****i@g****m 574
dependabot[bot] 4****] 130
CEEMS Bot b****t@c****m 91
Nacereddine Laddaoui l****r@g****m 2
wtripp180901 7****1 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 17
  • Average time to close issues: about 19 hours
  • Average time to close pull requests: about 23 hours
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 10
Past Year
  • Issues: 3
  • Pull requests: 17
  • Average time to close issues: about 19 hours
  • Average time to close pull requests: about 23 hours
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 10
Top Authors
Issue Authors
  • mahendrapaipuri (2)
  • vurmil (1)
Pull Request Authors
  • dependabot[bot] (10)
  • mahendrapaipuri (7)
Top Labels
Issue Labels
enhancement (2) priority:high (1) priority:medium (1)
Pull Request Labels
dependencies (10) go (7) enhancement (4) javascript (3) bug (2) breaking (1) maintenance (1) ci (1)

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 31
proxy.golang.org: github.com/ceems-dev/ceems
  • Versions: 31
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.3%
Average: 5.5%
Dependent repos count: 5.6%
Last synced: 6 months ago

Dependencies

go.mod go
  • github.com/alecthomas/kingpin/v2 v2.3.2
  • github.com/alecthomas/units v0.0.0-20211218093645-b94a6e3cc137
  • github.com/beorn7/perks v1.0.1
  • github.com/cespare/xxhash/v2 v2.2.0
  • github.com/containerd/cgroups/v3 v3.0.2
  • github.com/coreos/go-systemd/v22 v22.5.0
  • github.com/docker/go-units v0.4.0
  • github.com/go-kit/log v0.2.1
  • github.com/go-logfmt/logfmt v0.5.1
  • github.com/godbus/dbus/v5 v5.1.0
  • github.com/golang/protobuf v1.5.3
  • github.com/jpillora/backoff v1.0.0
  • github.com/kr/text v0.2.0
  • github.com/matttproud/golang_protobuf_extensions/v2 v2.0.0
  • github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f
  • github.com/opencontainers/runtime-spec v1.0.2
  • github.com/prometheus/client_golang v1.17.0
  • github.com/prometheus/client_model v0.5.0
  • github.com/prometheus/common v0.45.0
  • github.com/prometheus/exporter-toolkit v0.10.0
  • github.com/prometheus/procfs v0.12.0
  • github.com/stretchr/testify v1.8.4
  • github.com/xhit/go-str2duration/v2 v2.1.0
  • golang.org/x/crypto v0.14.0
  • golang.org/x/net v0.17.0
  • golang.org/x/oauth2 v0.12.0
  • golang.org/x/sync v0.3.0
  • golang.org/x/sys v0.13.0
  • golang.org/x/text v0.13.0
  • google.golang.org/appengine v1.6.7
  • google.golang.org/protobuf v1.31.0
  • gopkg.in/yaml.v2 v2.4.0
go.sum go
  • github.com/alecthomas/kingpin/v2 v2.3.2
  • github.com/alecthomas/units v0.0.0-20211218093645-b94a6e3cc137
  • github.com/beorn7/perks v1.0.1
  • github.com/cespare/xxhash/v2 v2.2.0
  • github.com/containerd/cgroups/v3 v3.0.2
  • github.com/coreos/go-systemd/v22 v22.5.0
  • github.com/creack/pty v1.1.9
  • github.com/davecgh/go-spew v1.1.0
  • github.com/davecgh/go-spew v1.1.1
  • github.com/docker/go-units v0.4.0
  • github.com/go-kit/log v0.2.1
  • github.com/go-logfmt/logfmt v0.5.1
  • github.com/godbus/dbus/v5 v5.0.4
  • github.com/godbus/dbus/v5 v5.1.0
  • github.com/golang/protobuf v1.3.1
  • github.com/golang/protobuf v1.5.0
  • github.com/golang/protobuf v1.5.3
  • github.com/google/go-cmp v0.5.5
  • github.com/google/go-cmp v0.5.9
  • github.com/jpillora/backoff v1.0.0
  • github.com/kr/pretty v0.3.1
  • github.com/kr/text v0.2.0
  • github.com/matttproud/golang_protobuf_extensions/v2 v2.0.0
  • github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f
  • github.com/opencontainers/runtime-spec v1.0.2
  • github.com/pmezard/go-difflib v1.0.0
  • github.com/prometheus/client_golang v1.17.0
  • github.com/prometheus/client_model v0.5.0
  • github.com/prometheus/common v0.45.0
  • github.com/prometheus/exporter-toolkit v0.10.0
  • github.com/prometheus/procfs v0.12.0
  • github.com/rogpeppe/go-internal v1.10.0
  • github.com/stretchr/objx v0.1.0
  • github.com/stretchr/testify v1.4.0
  • github.com/stretchr/testify v1.8.4
  • github.com/xhit/go-str2duration/v2 v2.1.0
  • golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2
  • golang.org/x/crypto v0.14.0
  • golang.org/x/net v0.0.0-20190603091049-60506f45cf65
  • golang.org/x/net v0.17.0
  • golang.org/x/oauth2 v0.12.0
  • golang.org/x/sync v0.3.0
  • golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a
  • golang.org/x/sys v0.13.0
  • golang.org/x/text v0.3.0
  • golang.org/x/text v0.3.2
  • golang.org/x/text v0.13.0
  • golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e
  • golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543
  • google.golang.org/appengine v1.6.7
  • google.golang.org/protobuf v1.26.0-rc.1
  • google.golang.org/protobuf v1.26.0
  • google.golang.org/protobuf v1.31.0
  • gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405
  • gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c
  • gopkg.in/yaml.v2 v2.2.2
  • gopkg.in/yaml.v2 v2.4.0
  • gopkg.in/yaml.v3 v3.0.1
Dockerfile docker
  • quay.io/prometheus/busybox-${OS}-${ARCH} latest build
.github/workflows/ci.yml actions
.github/workflows/release.yml actions
  • actions/checkout v3 composite
  • actions/setup-go v3 composite
.github/workflows/step_build.yml actions
  • actions/checkout v3 composite
  • actions/setup-go v3 composite
  • actions/upload-artifact v3 composite
.github/workflows/step_tests-e2e.yml actions
  • actions/checkout v3 composite
  • actions/setup-go v3 composite
.github/workflows/step_tests-lint.yml actions
  • actions/checkout v3 composite
  • actions/setup-go v3 composite
  • golangci/golangci-lint-action v3 composite
.github/workflows/step_tests-unit.yml actions
  • actions/checkout v3 composite
  • actions/setup-go v3 composite