https://github.com/converged-computing/mummi-experiments

Experiments to test the Mummi Operator and new designs

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: converged-computing
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 6.3 GB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

Mummi Experiments

These Mummi Experiments will use the Mummi Operator and later derivatives to run Mummi.

Experiment

High Level

  • Mummi is a workflow structured as a state machine, and it needs features that traditional HPC does not easily support (e.g., elasticity).
  • We will increasingly need to be aware of the cost and utilization of our resources (whether cloud or HPC), so we want to run a workflow like Mummi in an optimized way.
  • The simplest unit to compare is the job, which has a clear definition both on HPC and in Kubernetes.
  • We could compare the performance of a single component (e.g., gromacs), but arguably that benchmarks the node, network, etc. It's an interesting question, but a different one.
  • The end-to-end total time for N units of work is likely what we want to compare across cases.
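
As a rough illustration of using the job as the unit of comparison, the sketch below computes the end-to-end time and the CPU/GPU hours for a set of job records. The `JobRecord` schema, the job names, and the numbers are hypothetical and are not taken from the Mummi Operator; any scheduler that reports start time, end time, and resource counts per job could fill in the same fields.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One job as reported by a scheduler (hypothetical schema)."""
    name: str
    start: float  # seconds since the start of the experiment
    end: float    # seconds since the start of the experiment
    cores: int = 1
    gpus: int = 0


def end_to_end_seconds(jobs):
    """Wall time from the first job start to the last job end."""
    return max(j.end for j in jobs) - min(j.start for j in jobs)


def resource_hours(jobs):
    """Return (cpu_hours, gpu_hours): resources held times duration, summed."""
    cpu = sum(j.cores * (j.end - j.start) for j in jobs) / 3600.0
    gpu = sum(j.gpus * (j.end - j.start) for j in jobs) / 3600.0
    return cpu, gpu


# Made-up example values, for illustration only.
jobs = [
    JobRecord("setup", start=0, end=1800, cores=4),
    JobRecord("gromacs", start=1800, end=9000, cores=8, gpus=1),
    JobRecord("mlserver", start=0, end=9000, cores=2),
]
total = end_to_end_seconds(jobs)
cpu_hours, gpu_hours = resource_hours(jobs)
print(f"end-to-end: {total:.0f} s, CPU hours: {cpu_hours:.1f}, GPU hours: {gpu_hours:.1f}")
```

Because the same three fields exist for a batch job on HPC and for a Job in Kubernetes, the same functions could in principle be applied to records from either environment.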

Questions we are interested in

  • Can we define the quantity of manual intervention required?
  • How much of an allocation (on HPC) does a run of Mummi burn?
    • CPU/GPU hours
  • Something to do with reproducibility / resiliency of the workflow
    • Inject faults into the workflow and see whether it can recover
    • Delete a node and see what happens (simulating hardware failure)
    • Inject some probability of failure into the application
    • Valid for workflows in general, but not Mummi's case
  • How long does it take us to move from one platform to another?
  • How do we compare orchestration between HPC and cloud environments? (Not the performance of the apps but of the orchestration: time between things? events?)
    • There are well-defined measures (makespan, critical path): the amount of time all components of the workflow need, and under what circumstances they run better
    • "Excess": utilization efficiency
  • What is the marginal benefit of adding cloud features?
    • We can start with traditional Mummi, add the state machine and refactored ML runner, then elasticity (3 stages).
    • Measure the total time for running each component. We can compare the total time the MLserver runs to the time of each job.
    • Measure excess: the number of MLserver simulations generated that aren't used.
    • We should be able to measure the decrease in excess and the improvement (or not) in total wall time of each component.
  • Something with simulation using the state machine operator?
    • Implement a state machine library and have Flux as a backend
    • We'd be able to measure behavior on HPC vs. cloud
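
Two of the measures named in the questions above, the critical path and excess, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the task names, durations, and dependency graph are made up, and expressing excess as a fraction of generated simulations (rather than a raw count) is a normalization chosen here for comparability. The critical path is the longest duration-weighted chain of dependent tasks, which bounds the makespan from below no matter how many resources are available.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (length, path) of the longest duration-weighted path in a DAG.

    durations: task name -> seconds
    deps: task name -> iterable of tasks that must finish first
    """
    graph = {task: set(deps.get(task, ())) for task in durations}
    finish = {}  # earliest finish time of each task with unlimited resources
    prev = {}    # predecessor that determines each task's start
    for task in TopologicalSorter(graph).static_order():
        best = max(graph[task], key=lambda p: finish[p], default=None)
        prev[task] = best
        finish[task] = durations[task] + (finish[best] if best is not None else 0.0)
    last = max(finish, key=finish.get)
    length = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = prev[last]
    return length, path[::-1]


def excess_fraction(generated, used):
    """Fraction of generated simulations that were never used."""
    if generated == 0:
        return 0.0
    return (generated - used) / generated


# Hypothetical workflow: names and durations are illustrative only.
durations = {"mlserver": 600, "setup": 1800, "gromacs": 7200,
             "analysis": 900, "report": 300}
deps = {"setup": ["mlserver"], "gromacs": ["setup"],
        "analysis": ["gromacs"], "report": ["setup"]}
length, path = critical_path(durations, deps)
print(length, path)
print(excess_fraction(generated=100, used=80))
```

Comparing the measured makespan against the critical path length gives one view of orchestration overhead, and tracking the excess fraction across the three stages would show whether the state machine and elasticity reduce wasted simulations.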

License

HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.

See LICENSE, COPYRIGHT, and NOTICE for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE-842614

Owner

  • Name: Converged Computing
  • Login: converged-computing
  • Kind: organization

The best of cloud and high performance computing: technology and community combined.

GitHub Events

Total
  • Release event: 1
  • Push event: 7
  • Public event: 2
  • Pull request event: 2
  • Create event: 1
Last Year
  • Release event: 1
  • Push event: 7
  • Public event: 2
  • Pull request event: 2
  • Create event: 1