Melissa

Melissa: coordinating large-scale ensemble runs for deep learning and sensitivity analyses - Published in JOSS (2023)

https://gitlab.inria.fr/melissa/melissa

Science Score: 89.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    13 of 33 committers (39.4%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

Modeling&Simulation high performance computing (HPC)

Keywords from Contributors

cryptocurrencies
Last synced: 10 months ago

Repository

Melissa is a file-avoiding, fault-tolerant, and elastic framework designed for large-scale sensitivity analysis and large-scale deep surrogate training on supercomputers.

Basic Info
  • Host: gitlab.inria.fr
  • Owner: melissa
  • License: bsd-3-clause
  • Default Branch: develop
Statistics
  • Stars: 2
  • Forks: 3
  • Open Issues:
  • Releases: 0
Topics
Modeling&Simulation high performance computing (HPC)
Created over 3 years ago

https://gitlab.inria.fr/melissa/melissa/blob/develop/

## Melissa

[![DOI](https://joss.theoj.org/papers/10.21105/joss.05291/status.svg)](https://doi.org/10.21105/joss.05291)

### Summary 

Melissa is a file-avoiding, fault-tolerant, and elastic framework designed for _large-scale sensitivity analysis_ and _large-scale deep surrogate training_ on supercomputers. Some of its largest studies have utilized up to 30,000 cores to run 80,000 parallel simulations while avoiding up to 288 TB of intermediate data storage (see [@ribes2022]).

![Melissa architecture](docs/assets/melissa-architecture.png)

Traditional sensitivity analysis and deep surrogate training involve running multiple simulation instances with different input parameters, storing the results on disk, and later retrieving them to train a neural network or compute required statistics. However, the storage demands can quickly become overwhelming, leading to long read times and inefficient data processing. To mitigate this, researchers often reduce study sizes by running lower-resolution simulations or down-sampling output data in space and time.

### How it works

Melissa (as shown in the architecture figure above) overcomes storage limitations by eliminating intermediate file storage and processing data in transit, which enables large-scale data processing:

- **Sensitivity Analysis Server:** Melissa uses iterative statistical algorithms and an asynchronous client-server model for data transfer. Instead of storing simulation outputs on disk, it transmits them via NxM communication patterns to a parallelized server. The server updates the statistics as data arrives, so no disk storage is required, and full-scale studies can compute statistics for every mesh element and time step. Melissa supports various statistical measures (_e.g._ mean, variance, skewness, kurtosis, and Sobol indices) and can be extended with new algorithms.

- **Deep Learning Server:** Following a similar approach, client simulations send data in a round-robin manner to a parallelized, multithreaded server. The server manages a buffer for training batches, ensuring efficient memory use. Once the buffer reaches a predefined safety watermark, selected samples form training batches for distributed training on GPUs or CPUs. Memory is managed dynamically by selecting and evicting samples based on predefined policies, enabling both online and pseudo-offline training by adjusting the buffer size, watermark, and selection/eviction strategies.
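To illustrate what an iterative statistical algorithm means for the sensitivity analysis server, here is a minimal sketch of a running mean and variance (Welford's method) kept per mesh element. It is not Melissa's implementation: the class `RunningFieldStats` and its methods are hypothetical names chosen for this example, and the fields are plain Python lists rather than distributed arrays.

```python
from typing import List


class RunningFieldStats:
    """Running mean and variance per field element (Welford's method).

    Hypothetical illustration, not part of Melissa.
    """

    def __init__(self, size: int) -> None:
        self.size = size
        self.count = 0
        self.mean = [0.0] * size
        self._m2 = [0.0] * size  # running sum of squared deviations

    def update(self, sample: List[float]) -> None:
        """Fold one simulation output into the statistics; the sample can then be dropped."""
        if len(sample) != self.size:
            raise ValueError("sample size does not match field size")
        self.count += 1
        for i, x in enumerate(sample):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self._m2[i] += delta * (x - self.mean[i])

    def variance(self) -> List[float]:
        """Unbiased sample variance per element."""
        if self.count < 2:
            raise ValueError("variance needs at least two samples")
        return [m2 / (self.count - 1) for m2 in self._m2]


# Three "simulation outputs" for a field with three mesh elements.
stats = RunningFieldStats(size=3)
for output in ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]):
    stats.update(output)

print(stats.mean)
print(stats.variance())
```

Each output is used once and never stored, so memory use depends on the field size and not on the number of simulations.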

![Overview of Melissa's deep learning framework](docs/assets/melissa-dl.png)
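The buffer behaviour described for the deep learning server can be sketched as follows. This is an assumption-laden toy, not Melissa's buffer: `WatermarkBuffer`, its parameters, and the chosen policies (uniform random selection without removal, oldest-first eviction when full) are hypothetical and stand in for whatever policies a study configures.

```python
import random
from collections import deque
from typing import Any, Deque, List, Optional


class WatermarkBuffer:
    """Bounded sample buffer that only serves batches above a watermark.

    Hypothetical illustration, not part of Melissa.
    """

    def __init__(self, capacity: int, watermark: int, seed: int = 0) -> None:
        if not 0 < watermark <= capacity:
            raise ValueError("need 0 < watermark <= capacity")
        self.capacity = capacity
        self.watermark = watermark
        self._samples: Deque[Any] = deque()
        self._rng = random.Random(seed)

    def put(self, sample: Any) -> None:
        """Store a sample, evicting the oldest one if the buffer is full."""
        if len(self._samples) == self.capacity:
            self._samples.popleft()
        self._samples.append(sample)

    def get_batch(self, batch_size: int) -> Optional[List[Any]]:
        """Return a random batch, or None while the buffer is below the watermark."""
        if len(self._samples) < self.watermark:
            return None
        k = min(batch_size, len(self._samples))
        # Selected samples stay in the buffer, so they can be reused later.
        return self._rng.sample(list(self._samples), k)


buffer = WatermarkBuffer(capacity=4, watermark=3)
buffer.put("s0")
buffer.put("s1")
early = buffer.get_batch(2)   # below the watermark, so no batch yet
buffer.put("s2")
batch = buffer.get_batch(2)   # two samples drawn from s0, s1, s2
for name in ("s3", "s4"):
    buffer.put(name)          # the second put evicts s0
```

In this sketch, a small capacity with a low watermark behaves like online training, while a capacity large enough to hold every sample approaches offline training.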

Both sensitivity analysis and deep surrogate training in Melissa depend on three key components:

1. **Melissa Client:** This is the parallel numerical simulation code, adapted to function as a client. Each client runs independently and sends mid-simulation output to the server whenever `melissa_send()` is called.

2. **Melissa Server:** A parallel executable responsible for computing statistics or training a Neural Network (more details [here](docs/melissa-server.md)). It updates statistics and generates training batches upon receiving new data from any connected client.

3. **Melissa Launcher:** A front-end Python script that orchestrates the execution of the study (more details [here](docs/melissa-launcher.md)). It automates large-scale job scheduling in `OpenMPI` and integrates with cluster schedulers like `slurm` and `OAR`, handling job submission, monitoring, and fault tolerance.
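The launcher's fault-tolerance role can be pictured as a supervised retry loop. The sketch below is only a conceptual model under strong assumptions: jobs are in-process Python callables that return `True` on success, whereas the real launcher submits and monitors jobs through a cluster scheduler. The names `run_with_restarts` and `make_flaky` are hypothetical.

```python
from typing import Callable, Dict


def run_with_restarts(
    jobs: Dict[str, Callable[[], bool]], max_restarts: int
) -> Dict[str, int]:
    """Run each job until it succeeds; return the number of attempts per job.

    Hypothetical illustration, not the Melissa launcher.
    """
    attempts: Dict[str, int] = {}
    for name, job in jobs.items():
        tries = 0
        while True:
            tries += 1
            if job():
                break
            if tries > max_restarts:
                raise RuntimeError(f"{name} failed after {tries} attempts")
        attempts[name] = tries
    return attempts


def make_flaky(failures: int) -> Callable[[], bool]:
    """Build a fake job that fails a fixed number of times, then succeeds."""
    remaining = [failures]

    def job() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return job


attempts = run_with_restarts(
    {"sim-0": make_flaky(0), "sim-1": make_flaky(2)}, max_restarts=3
)
print(attempts)
```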

### User interface

To run an analysis with Melissa, users need to follow these steps:

1. **Instrument the Simulation Code:** Modify the simulation to use the Melissa API with three main calls (`init`, `send`, and `finalize`) so it functions as a Melissa client ([details here](docs/use-case/instrument-solver.md)).  

2. **Configure the Analysis:** Define how simulation parameters are sampled, select statistical computations, or specify the Neural Network architecture and training settings ([details here](docs/use-case/configuration-file.md)).  

3. **Launch the Analysis:** Run the Melissa launcher via the terminal or the supercomputer's front-end ([quick start guide](docs/first-dl-study.md)). Melissa handles resource allocation, execution monitoring, and automatic restarts for failed components.  
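As a rough picture of step 1, the sketch below shows where the three calls would sit in a time-stepping solver. It does not use the Melissa API: `RecordingClient` is a hypothetical stand-in whose `init`, `send`, and `finalize` methods only record what they receive, and their signatures are assumptions made for this example. The linked instrumentation guide describes the actual calls.

```python
from typing import List, Tuple


class RecordingClient:
    """Hypothetical stand-in for a client API; it only records what it is given."""

    def __init__(self) -> None:
        self.active = False
        self.sent: List[Tuple[str, List[float]]] = []

    def init(self, field_name: str, size: int) -> None:
        self.active = True

    def send(self, field_name: str, values: List[float]) -> None:
        if not self.active:
            raise RuntimeError("send() called outside init()/finalize()")
        self.sent.append((field_name, list(values)))

    def finalize(self) -> None:
        self.active = False


def run_solver(client: RecordingClient, initial: List[float], steps: int,
               alpha: float = 0.25) -> List[float]:
    """Toy 1D explicit diffusion solver with fixed boundary values."""
    if len(initial) < 3:
        raise ValueError("need at least three grid points")
    u = list(initial)
    client.init("temperature", len(u))          # once, before the time loop
    for _ in range(steps):
        interior = [
            u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
            for i in range(1, len(u) - 1)
        ]
        u = [u[0]] + interior + [u[-1]]
        client.send("temperature", u)           # every step, instead of writing a file
    client.finalize()                           # once, after the time loop
    return u


client = RecordingClient()
final = run_solver(client, [0.0, 0.0, 4.0, 0.0, 0.0], steps=2)
print(len(client.sent), final)
```

The solver's numerics are untouched; the only change is that each step's field is handed to the client instead of being written to disk.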

Melissa's API currently supports C, Fortran, and Python solvers but can be extended to other languages by following the approach in the [API folder](https://gitlab.inria.fr/melissa/melissa/-/tree/develop/api).

### List of publications

* **MelissaDL x Breed: Towards Data-Efficient On-line Supervised Training of Multi-parametric Surrogates with Active Learning.** Sofya Dymchenko, Abhishek Purandare, Bruno Raffin [https://hal-lara.archives-ouvertes.fr/NUMPEX/hal-04712480v1](https://hal-lara.archives-ouvertes.fr/NUMPEX/hal-04712480v1)

* **Melissa: coordinating large-scale ensemble runs for deep learning and sensitivity analyses.** Marc Schouler, Robert Alexander Caulk, Lucas Meyer, Théophile Terraz, Christoph Conrads, Sebastian Friedemann, Achal Agarwal, Juan Manuel Baldonado, Bartłomiej Pogodziński, Anna Sekuła, et al. [https://inria.hal.science/hal-04145897](https://inria.hal.science/hal-04145897)

* **Melissa: Large Scale In Transit Sensitivity Analysis Avoiding Intermediate Files.** Théophile Terraz, Alejandro Ribes, Yvan Fournier, Bertrand Iooss, Bruno Raffin. The International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing), Nov 2017, Denver, United States. pp.1 - 14. [PDF](https://hal.inria.fr/hal-01607479/file/main-Sobol-SC-2017-HALVERSION.pdf)

* **The Challenges of In Situ Analysis for Multiple Simulations.** Alejandro Ribés, Bruno Raffin. ISAV 2020 - In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, Nov 2020, Atlanta, United States. pp.1-6. [https://hal.inria.fr/hal-02968789](https://hal.inria.fr/hal-02968789)

Owner

  • Name: melissa
  • Login: melissa
  • Kind: organization

This group gathers the solutions based on the Melissa architecture for on-line processing of data produced by large-scale ensemble runs (sensitivity analysis, data assimilation, etc.).

JOSS Publication

Melissa: coordinating large-scale ensemble runs for deep learning and sensitivity analyses
Published
June 16, 2023
Volume 8, Issue 86, Page 5291
Authors
Marc Schouler ORCID
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
Robert Alexander Caulk ORCID
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
Lucas Meyer ORCID
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France, Industrial AI Laboratory SINCLAIR, EDF Lab Paris-Saclay, France
Théophile Terraz
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
Christoph Conrads
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
Sebastian Friedemann
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
Achal Agarwal ORCID
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
Juan Manuel Baldonado
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
Bartłomiej Pogodziński
Institute of Bioorganic Chemistry Polish Academy of Sciences, Poznań Supercomputing and Networking Center
Anna Sekuła ORCID
Institute of Bioorganic Chemistry Polish Academy of Sciences, Poznań Supercomputing and Networking Center
Alejandro Ribes
Industrial AI Laboratory SINCLAIR, EDF Lab Paris-Saclay, France
Bruno Raffin
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, France
Editor
Patrick Diehl ORCID
Tags
supercomputing sensitivity analysis deep learning distributed systems orchestration

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 2,627
  • Total Committers: 33
  • Avg Commits per committer: 79.606
  • Development Distribution Score (DDS): 0.787
Past Year
  • Commits: 360
  • Committers: 2
  • Avg Commits per committer: 180.0
  • Development Distribution Score (DDS): 0.003
Top Committers
Name Email Commits
Marc Schouler m****r@i****r 559
Terraz Theophile t****z@i****r 547
rcaulk r****k@i****r 377
Abhishek Purandare a****e@i****r 359
Christoph Conrads c****s@i****r 196
robcaulk r****k@g****m 139
Bartlomiej Pogodzinski b****i@m****l 86
Lucas Meyer l****r@i****r 66
RAFFIN Bruno b****n@i****r 60
Anna s****a@g****m 59
Sebastian Friedemann s****n@g****t 58
Achal Agarwal a****1@g****m 19
sfriedem s****n@i****e 17
xy124 q****d@g****t 17
Achal Agarwal a****l@i****r 11
Bartłomiej Pogodziński b****i@g****m 10
friedems s****n@u****r 7
Anthony Geay a****y@e****r 6
Robert Caulk r****k@f****n 6
jbaldona u****o@j****r 5
tterraz t****z@t****r 5
sfriedem s****n@i****r 4
Marc Schouler s****c@g****m 3
jbaldona u****o@j****r 2
Adrien Faure a****e@p****m 1
DYMCHENKO Sofya s****o@i****r 1
Juan Manuel Baldonado j****o@f****r 1
Juan Manuel Baldonado j****o@g****r 1
Juan Manuel Baldonado j****o@M****l 1
Juan Manuel Baldonado j****o@e****r 1
and 3 more...

Packages

  • Total packages: 3
  • Total downloads: unknown
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 4
  • Total maintainers: 4
spack.io: melissa

Melissa is a file-avoiding, adaptive, fault-tolerant and elastic framework, to run large-scale sensitivity analysis on supercomputers.

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Average: 28.6%
Dependent packages count: 57.3%
Maintainers (2)
Last synced: 10 months ago
spack.io: melissa-api

Melissa is a file-avoiding, adaptive, fault-tolerant and elastic framework, to run large-scale sensitivity analysis or deep-surrogate training on supercomputers. This package builds the API used when instrumenting the clients.

  • Versions: 0
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Average: 29.0%
Dependent packages count: 57.9%
Maintainers (3)
Last synced: 10 months ago
spack.io: py-melissa-core

Melissa is a file-avoiding, adaptive, fault-tolerant and elastic framework, to run large-scale sensitivity analysis or deep-surrogate training on supercomputers. This package builds the launcher and server modules.

  • Versions: 0
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Average: 29.0%
Dependent packages count: 58.0%
Maintainers (2)
Last synced: 10 months ago