https://github.com/ai4os/ai4-accounting

Accounting of the computing resources consumed by supported VOs.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Accounting of the computing resources consumed by supported VOs.

Basic Info

Host: GitHub
Owner: ai4os
License: apache-2.0
Language: Python
Default Branch: master
Homepage:
Size: 107 KB

Statistics

Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 1

Created almost 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme License

Accounting for the AI4OS project resources

This repo collects statistics based on periodic (every 6 hours) snapshots taken from the Nomad cluster.

Usage

To use this, first create a suitable virtual environment.

```bash
python -m venv --system-site-packages myenv
source myenv/bin/activate
pip install -r requirements.txt
deactivate
```

To take a snapshot of the cluster:

```bash
bash take_snapshot.sh
```

(make sure to adapt the paths in the bash script)

You can generate stats for the accounting reports with the intended start and end dates (both included):

```bash
python usage_stats.py --ini-date 2024-09-01 --end-date 2025-02-28
```

And you will get the reports for both namespaces:

```bash
AI4EOSC accounting for the period 2023-09-01:2023-12-31
┌────────────────┬──────────┐
│ cpu_num hours  │ 3408     │
│ gpu_num hours  │ 408      │
│ memoryMB hours │ 8280000  │
│ diskMB hours   │ 10713600 │
└────────────────┴──────────┘
IMAGINE accounting for the period 2023-09-01:2023-12-31
┌────────────────┬────────┐
│ cpu_num hours  │ 336    │
│ gpu_num hours  │ 0      │
│ memoryMB hours │ 768000 │
│ diskMB hours   │ 727200 │
└────────────────┴────────┘
```

You can generate a daily summary of the logs, along with aggregation statistics per namespace/user, and then visualize interactive plots of the historical usage:

```bash
python summarize.py
python interactive_plot.py
```

In addition, we keep a JSON database of users that can be updated using:

```bash
python update-user-db.py
```

This will add users with currently running deployments to the database, if not already present.

You can merge the summary user stats with the user database using:

```bash
python merge-userdb-stats.py
```

This will create a file summaries/***-users-agg-merged.csv.

Implementation notes

Different approaches for accounting

Three approaches were considered to keep the accounting:

1. Taking daily/hourly snapshots of the cluster state.
2. Using PAPI to save the relevant information about the job at delete time.
3. Adding an additional task to the Nomad job (with poststop lifecycle), so that information is saved at delete time.

After considering the following pros/cons of each approach, we settled on approach (1) as the preferred solution.

- (1) is able to account for jobs that are running but not yet deleted. Otherwise, with (2, 3), one might end up accounting in one period for the resources consumed in the previous period.
- (1, 3) are able to account for jobs that have been deleted directly by admins in the cluster, not through the API.
- (1) splits logs into several files, which are easier to process in chunks.
- (1, 2) save results in a clean JSON file, while (3) possibly relies on the Consul KV store to save job information as a long string (ugly!) somewhere in Consul.
- (2, 3) generate fewer logs, since in (1) the same long-lived jobs will appear in different snapshots. This can possibly be mitigated by consolidating logs.
- (1) is independent of PAPI, so there is less code clutter.
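The log consolidation mentioned above could look roughly like the following sketch. It assumes each snapshot is a list of job records keyed by a hypothetical `job_id` field; the actual snapshot schema used in this repo may differ.

```python
from datetime import datetime


def consolidate(snapshots):
    """Collapse repeated sightings of a job into one record per job.

    `snapshots` maps a snapshot datetime to a list of job dicts.
    The `job_id` key is a hypothetical field name used for illustration.
    """
    merged = {}
    for taken_at in sorted(snapshots):
        for job in snapshots[taken_at]:
            record = merged.setdefault(
                job["job_id"],
                {**job, "first_seen": taken_at, "last_seen": taken_at},
            )
            # Keep the most recent view of the job, but remember when it first appeared.
            record.update(job)
            record["last_seen"] = taken_at
    return list(merged.values())


snapshots = {
    datetime(2024, 1, 1, 0): [{"job_id": "a", "cpu_num": 2}],
    datetime(2024, 1, 1, 6): [
        {"job_id": "a", "cpu_num": 2},
        {"job_id": "b", "cpu_num": 4},
    ],
}
consolidated = consolidate(snapshots)
```

Here job `a` appears in both snapshots but ends up as a single record spanning both snapshot times. Note that this kind of merge discards intermediate status changes, so it only suits fields that stay constant over a job's lifetime.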

Nomad permanently sends dead jobs to garbage after job_gc_threshold (4h), so snapshots must be taken at least once every 4h to also account for jobs deleted between snapshots. Accounting is performed up to microsecond precision.
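As an illustration of accounting from exact datetimes, here is a minimal sketch that sums resource-hours over a period with both dates included. The record fields `start` and `end` and the function name `resource_hours` are assumptions for this example, not the names used in usage_stats.py.

```python
from datetime import date, datetime, time, timedelta


def resource_hours(deployments, ini_date, end_date, resource="cpu_num"):
    """Sum resource * hours for deployments overlapping [ini_date, end_date].

    Both dates are included, so the window ends at midnight after end_date.
    """
    window_start = datetime.combine(ini_date, time.min)
    window_end = datetime.combine(end_date + timedelta(days=1), time.min)
    total = 0.0
    for dep in deployments:
        # Clip each deployment's lifetime to the accounting window.
        start = max(dep["start"], window_start)
        end = min(dep["end"], window_end)
        if end > start:
            hours = (end - start) / timedelta(hours=1)
            total += dep[resource] * hours
    return total


deployments = [
    # Starts before the window: only the part inside the window counts.
    {"start": datetime(2023, 8, 31, 12), "end": datetime(2023, 9, 1, 12), "cpu_num": 2},
    # Entirely inside the window.
    {"start": datetime(2023, 9, 2, 0), "end": datetime(2023, 9, 2, 6), "cpu_num": 4},
]
cpu_hours = resource_hours(deployments, date(2023, 9, 1), date(2023, 9, 30))
```

Clipping to the window is what prevents resources consumed in a previous period from being counted in the current one. Since `datetime` arithmetic carries microseconds, the result keeps that precision.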

`usage_stats` vs `summarize`

Both (1) usage_stats.py and (2) summarize.py provide summaries of VO usage, but (1) is more precise because:

- (2) averages the usage as a mean of the 6 daily snapshots, not taking into account the exact start/end datetimes of each deployment like (1) does.
- If we missed snapshots during a complete day (even if the cluster was still working), (2) will make it look as if that day didn't consume resources, while (1) correctly accounts for it.
- To convert back from resource/day (2) to resource/hour (1), you have to estimate how many hours on average the cluster has been running per day (which is less than 24 h because of the takedowns). So simply multiplying (2) by 24 tends to overestimate the real numbers provided by (1). This effect can be observed by taking a small window around a cluster takedown, e.g. 2024-12-02.
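The overestimation can be illustrated with a small worked example. The numbers below are entirely hypothetical (a 4-CPU deployment and a 6-hour takedown on an arbitrary day) and are not taken from the cluster logs.

```python
from datetime import datetime, timedelta

# Hypothetical day with a takedown: the deployment (4 CPUs) runs 00:00-09:00,
# the cluster is down 09:00-15:00, and the deployment runs again 15:00-24:00.
day = datetime(2024, 1, 1)
running_intervals = [
    (day, day + timedelta(hours=9)),
    (day + timedelta(hours=15), day + timedelta(hours=24)),
]
cpu_num = 4

# (1) exact accounting: integrate over the real start/end datetimes.
exact_cpu_hours = sum(
    cpu_num * (end - start) / timedelta(hours=1) for start, end in running_intervals
)

# (2) snapshot-based: mean over the snapshots that were actually taken.
# Snapshots are only recorded while the cluster is up, so each one sees 4 CPUs.
snapshot_cpus = [4, 4, 4]
daily_mean = sum(snapshot_cpus) / len(snapshot_cpus)
estimated_cpu_hours = daily_mean * 24
```

In this scenario the exact figure is 72 CPU-hours, while the daily mean multiplied by 24 gives 96, because the snapshots cannot see the 6 hours of downtime.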

`summarize` implementation

Each row in the summarize dataframe is a deployment status at a given snapshot time. An alternative that would create smaller dataframes is to merge all the info and have one row per deployment. Those rows would have an initial_date and a final_date, and we could potentially regenerate the time series by filtering by dates.

The problem is that those dates are not unique, because a single deployment can cycle through the same status more than once (e.g. queued --> running --> dead --> running --> dead). So there is not a unique initial_date. Having rows with a deployment status at a given snapshot time therefore better reflects this behaviour.
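A small sketch of why a single initial_date does not exist, assuming hypothetical column names (`snapshot`, `deployment`, `status`) for the per-snapshot rows:

```python
from collections import defaultdict

# One row per (deployment, snapshot). Column names here are hypothetical.
rows = [
    {"snapshot": "2024-01-01T00", "deployment": "d1", "status": "queued"},
    {"snapshot": "2024-01-01T06", "deployment": "d1", "status": "running"},
    {"snapshot": "2024-01-01T12", "deployment": "d1", "status": "dead"},
    {"snapshot": "2024-01-01T18", "deployment": "d1", "status": "running"},
    {"snapshot": "2024-01-02T00", "deployment": "d1", "status": "dead"},
]

# Record every snapshot at which a deployment *enters* a given status.
entries = defaultdict(list)
previous = {}
for row in sorted(rows, key=lambda r: r["snapshot"]):
    dep, status = row["deployment"], row["status"]
    if previous.get(dep) != status:
        entries[(dep, status)].append(row["snapshot"])
    previous[dep] = status

running_starts = entries[("d1", "running")]
```

Deployment `d1` enters the running status twice, so a one-row-per-deployment layout would have to pick one of the two dates or store a list of intervals.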

Known issues

- Due to some side issues, CPU frequency is not very reliable around Sep 2023.
- Due to a code bug, deployments of some tools were not tracked:
  - CVAT: not tracked in period [2024/11/13-2025/05/23]
  - AI4Life loader: not tracked in period [2025/01/29-2025/05/23]
  - LLM: not tracked in period [2025/03/03-2025/05/23]
  - DevEnv: not tracked in period [2025/04/04-2025/05/23]
  - NVFlare: not tracked in period [2025/04/07-2025/05/23]

Owner

Name: AI4OS
Login: ai4os
Kind: organization
Email: ai4eosc-po@listas.csic.es

Website: http://ai4eosc.eu
Twitter: AI4EOSC
Repositories: 1
Profile: https://github.com/ai4os

AI4OS is the software powering the AI4EOSC platform

GitHub Events

Total

Push event: 8

Last Year

Push event: 8

Dependencies

requirements.txt pypi

pandas ==1.4.1
python-nomad ==2.0.0
rich >=13.5.2
typer >=0.7.0
