https://github.com/converged-computing/mummi-experiments

Experiments to test the Mummi Operator and new designs

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (12.7%) to scientific vocabulary

Last synced: 9 months ago

Repository

Experiments to test the Mummi Operator and new designs

Basic Info

Host: GitHub
Owner: converged-computing
License: mit
Language: Python
Default Branch: main
Size: 6.3 GB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License

README.md

Mummi Experiments

These Mummi Experiments will use the Mummi Operator and later derivatives to run Mummi.

test-january-2025: testing Mummi on 6 nodes, a base setup for GPU/CPU nodes.
test-february-2025: testing Mummi via the state machine operator

Experiment

High Level

Mummi is a workflow that represents a state machine, and it warrants features that traditional HPC does not easily support (e.g., elasticity).
We will increasingly need to be aware of the cost and utilization of our resources (whether cloud or HPC), so we want to run a workflow like Mummi in an optimized way.
The simplest unit to compare is the job, which has a clear definition both on HPC and in Kubernetes.
We could compare the performance of a single component (e.g., GROMACS), but arguably that benchmarks the node, network, etc. rather than the workflow. It's an interesting question but a different one.
The end-to-end total time for N work is likely what we want to compare across cases.
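
One way to make the end-to-end comparison concrete is to compute it from per-job timestamps. The sketch below is illustrative only: the `JobRecord` fields, the job names, and the sample numbers are all assumptions (nothing here comes from the experiments), and it presumes each platform can export submit, start, and end times for every job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    """One finished job. Field names are hypothetical, not a Mummi schema."""
    name: str
    submit: float  # seconds since an arbitrary shared epoch
    start: float
    end: float


def end_to_end_seconds(jobs):
    """Wall time from the first submission to the last completion."""
    if not jobs:
        raise ValueError("need at least one job record")
    return max(j.end for j in jobs) - min(j.submit for j in jobs)


def total_job_seconds(jobs):
    """Sum of individual run times (start to end), ignoring overlap."""
    return sum(j.end - j.start for j in jobs)


def total_wait_seconds(jobs):
    """Sum of time spent between submission and start (orchestration side)."""
    return sum(j.start - j.submit for j in jobs)


# Made-up records for illustration only.
jobs = [
    JobRecord("job-a", submit=0.0, start=5.0, end=65.0),
    JobRecord("job-b", submit=10.0, start=70.0, end=190.0),
    JobRecord("job-c", submit=10.0, start=12.0, end=100.0),
]

print("end-to-end:", end_to_end_seconds(jobs))
print("sum of run times:", total_job_seconds(jobs))
print("sum of waits:", total_wait_seconds(jobs))
```

Keeping run time and wait time separate is a deliberate choice here: the first mostly reflects the application and hardware, while the second is closer to what the orchestrator contributes.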

Questions we are interested in

Can we define the quantity of manual intervention required?
How much of an allocation (on HPC) does a run of Mummi burn?
- CPU/GPU hours
Something to do with the reproducibility / resiliency of the workflow
- Inject faults into the workflow and see if it can recover
- Delete a node (simulating hardware failure) and see what happens
- Inject some probability of failure into the application
- Valid for workflows in general, but not Mummi's case
How long does it take us to move from one platform to another?
How do we compare orchestration between HPC and cloud environments? (Not the performance of the apps but of the orchestration, e.g., the time between events?)
- There are well-defined measures (makespan / critical path: the amount of time all components of the workflow need, and under what circumstances they run better)
- "Excess" -- utilization efficiency
What is the marginal benefit of adding cloud features?
- We can start with traditional Mummi, add the state machine and refactored ML runner, then elasticity (3 stages).
- Measure total times for running each component. We can compare the total time of the MLserver running to the time of each job.
- Measure excess: the number of MLserver simulations generated that aren't used.
- We should be able to measure the decrease in excess and the improvement (or not) in total wall time of each component.
Something with simulation using the state machine operator?
- Implement a state machine library and have Flux as a backend
- We'd be able to measure behavior on HPC vs. cloud
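
Several of the questions above (allocation burn in CPU/GPU hours, excess, and the change in wall time between stages) reduce to simple arithmetic once a run is summarized. The following is a minimal sketch under stated assumptions: the `StageRun` fields, the per-node core and GPU counts, and every number in the example are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageRun:
    """Summary of one run of one stage. All fields are hypothetical."""
    label: str
    nodes: int
    cores_per_node: int
    gpus_per_node: int
    wall_seconds: float
    generated: int  # simulations generated by the MLserver
    used: int       # simulations that were actually used

    @property
    def cpu_hours(self) -> float:
        return self.nodes * self.cores_per_node * self.wall_seconds / 3600.0

    @property
    def gpu_hours(self) -> float:
        return self.nodes * self.gpus_per_node * self.wall_seconds / 3600.0

    @property
    def excess(self) -> int:
        # Excess as described above: generated but never used.
        return self.generated - self.used


def compare(baseline: StageRun, candidate: StageRun) -> dict:
    """Differences (candidate minus baseline); negative means a reduction."""
    return {
        "excess_delta": candidate.excess - baseline.excess,
        "wall_seconds_delta": candidate.wall_seconds - baseline.wall_seconds,
        "gpu_hours_delta": candidate.gpu_hours - baseline.gpu_hours,
    }


# Placeholder values for illustration only.
traditional = StageRun("traditional", nodes=6, cores_per_node=32,
                       gpus_per_node=4, wall_seconds=7200.0,
                       generated=100, used=80)
state_machine = StageRun("state-machine", nodes=6, cores_per_node=32,
                         gpus_per_node=4, wall_seconds=5400.0,
                         generated=90, used=80)

print(compare(traditional, state_machine))
```

The same `compare` call could be repeated for each of the three stages against the previous one to express a marginal benefit, assuming the runs do a comparable amount of work.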

License

HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.

See LICENSE, COPYRIGHT, and NOTICE for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE-842614

Owner

Name: Converged Computing
Login: converged-computing
Kind: organization

Website: https://converged-computing.org
Repositories: 84
Profile: https://github.com/converged-computing

The best of cloud and high performance computing: technology and community combined.

GitHub Events

Total

Release event: 1
Push event: 7
Public event: 2
Pull request event: 2
Create event: 1

Last Year

Release event: 1
Push event: 7
Public event: 2
Pull request event: 2
Create event: 1
