Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    3 of 12 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: mlgym
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Size: 189 MB
Statistics
  • Stars: 20
  • Watchers: 5
  • Forks: 5
  • Open Issues: 54
  • Releases: 70
Created over 5 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md


MLgym: a feature-rich deep learning framework providing full reproducibility of experiments.


Reproducibility is a recurring issue in deep learning research: models are often implemented in Jupyter notebooks, or entire training and evaluation pipelines are implemented from scratch with every new project. The lack of standardization and the repetitive boilerplate code of experimental setups impede reproducibility.

MLgym aims to increase reproducibility by separating the experimental setup from the code and providing the entire infrastructure for model training, model evaluation, experiment logging, checkpointing, and experiment analysis.

Specifically, MLgym provides an extensible set of machine learning components (e.g., trainer, evaluator, loss functions, etc.). The framework instantiates these components dynamically as specified and parameterized within a configuration file (see here for an example configuration) describing the entire experiment setup (i.e., training and evaluation pipeline). The separation of experimental setup and code maximizes the replicability and interpretability of ML experiments. The machine learning components cut down the implementation effort significantly and let you focus solely on your ideas.
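To make this concrete, here is a minimal sketch of such a registry-driven instantiation pattern; the class and method names (ComponentRegistry, register, build) are hypothetical and not MLgym's actual API:

```python
# Illustrative sketch only; MLgym's real registry and component names differ.
from typing import Any, Callable, Dict

class ComponentRegistry:
    def __init__(self) -> None:
        self._constructors: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, constructor: Callable[..., Any]) -> None:
        # Map a config-level name to the callable that builds the component.
        self._constructors[name] = constructor

    def build(self, name: str, **params: Any) -> Any:
        # Instantiate the component dynamically from config parameters.
        return self._constructors[name](**params)

# Hypothetical usage: name and parameters would come from the YAML setup.
registry = ComponentRegistry()
registry.register("sgd_optimizer", lambda lr: {"type": "sgd", "lr": lr})
optimizer = registry.build("sgd_optimizer", lr=0.01)
```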

Additionally, MLgym provides the following key features:

  • Component registry to register custom components and their dependencies.

  • Warm starts, allowing training to resume after a crash

  • Customizable checkpointing strategies

  • MLboard webservice for experiment tracking / analysis (live and offline) by subscribing to the websocket logging environment

  • Large-scale, multi-GPU training supporting grid search, cross-validation, and nested cross-validation

  • Distributed logging via websockets and event sourcing, allowing location-independent logging and full replicability

  • Definition of training and evaluation pipeline in a configuration file, achieving separation of experiment setup and code.

Please note that this code should currently be treated as experimental and is not production-ready.

Install

There are two options to install MLgym. The easiest is to install the framework from the pip repository:

```bash
pip install mlgym
```

For the latest version, you can install it directly from source by cd'ing into the root folder and then running:

```bash
pip install src/
```

Usage

We provide an easy-to-use example that lets you run an MLgym experiment setup.

Before running the experiments, we need to set up the MLboard logging environment, i.e., the websocket service and the RESTful webservice. MLgym logs the training/evaluation progress and evaluation results via the websocket API, allowing the MLboard frontend to receive live updates. The RESTful webservice provides endpoints to receive checkpoints and experiment setups. For a full specification of both APIs, see here.
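For a feel of what subscribing to the logging environment looks like, here is a minimal consumer sketch using the third-party websockets package (not an MLgym dependency); the actual service's protocol and message framing may differ:

```python
# Illustrative websocket consumer; assumes a plain-websocket endpoint on
# port 5002. MLgym's actual service may use different framing/protocol.
import asyncio
import websockets  # third-party package, not part of MLgym's dependencies

async def tail_events(uri: str = "ws://127.0.0.1:5002") -> None:
    async with websockets.connect(uri) as ws:
        async for message in ws:  # each message is one logged event
            print(message)

if __name__ == "__main__":
    asyncio.run(tail_events())
```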

We start the websocket service on port 5002 and the RESTful webservice on port 5001 (matching the commands below); feel free to choose different ports if desired. Similarly, we specify the folder event_storage as the local event storage folder. Note that to access the websocket service from a different origin, we need to specify the CORS allowed origins. In this example, we only use the websocket service locally from 127.0.0.1:8080 via the MLboard frontend.

```sh
ml_board_ws_endpoint --host 127.0.0.1 --port 5002 --event_storage_path event_storage --cors_allowed_origins http://127.0.0.1:8080 http://127.0.0.1:5002

ml_board_rest_endpoint --port 5001 --event_storage_path event_storage
```

Next, we run the experiment setup. We cd into the example folder and run run.py, passing the path to the respective run config via the config_path parameter.

The run_config.yml file contains all the parameters required by MLgym to configure itself for the run.

A preview of the YAML file is given below:

```yaml
run_configuration:
  type: train # train, warm_start
  config:
    num_epochs: 50 # Number of epochs
    num_batches_per_epoch: 100
    gs_config_path: ./gs_config.yml

environment:
  type: multiprocessing # multiprocessing, main_process, accelerate
  config:
    process_count: 3 # Max. number of processes running at a time
    computation_device_ids: [0] # Indices of GPUs to distribute the GS over

logging:
  websocket_logging_servers: # List of websocket logging servers, e.g., http://127.0.0.1:9090, http://127.0.0.1:8080
    - http://127.0.0.1:5002
  gs_rest_api_endpoint: http://127.0.0.1:5001 # Endpoint for the grid search API, e.g., http://127.0.0.1:8080
```

Under the run_configuration header, the parameter type tells MLgym which type of job to execute: train starts fresh training of a model, while warm_start resumes training from a specific epoch. num_epochs limits the maximum number of epochs to train a model, num_batches_per_epoch limits the maximum number of batches per epoch, and gs_config_path gives the path to the grid search config file.

In the environment header, the parameter type specifies whether training runs on the GPU only (accelerate), on the CPU (main_process), or on both (multiprocessing). process_count specifies the number of experiments that run in parallel, and computation_device_ids lists the indices of the GPUs to distribute the grid search over.

The logging header contains the parameters for setting up the logging adapter in MLgym. websocket_logging_servers lists the addresses of the websocket logging servers, and gs_rest_api_endpoint gives the address of the RESTful API used for HTTP communication in MLgym.
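Since pyyaml is among MLgym's dependencies, the structure of this file can be inspected with plain YAML loading; the snippet below is only illustrative and not MLgym's internal configuration loader:

```python
# Illustrative only: parse run_config.yml and read a few fields shown above.
import yaml  # pyyaml, already listed in MLgym's dependencies

with open("run_config.yml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print(cfg["run_configuration"]["type"])                        # e.g., "train"
print(cfg["run_configuration"]["config"]["num_epochs"])        # e.g., 50
print(cfg["environment"]["config"]["computation_device_ids"])  # e.g., [0]
print(cfg["logging"]["gs_rest_api_endpoint"])                  # REST endpoint
```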

If the model performance does not improve substantially over time, the checkpointing strategy defined in gs_config.yml will stop training early.
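As an illustration of how such a strategy behaves, here is a generic early-stopping sketch; the function, its parameters (patience, min_delta), and the stopping rule are assumptions, not the strategy actually defined in gs_config.yml:

```python
# Generic early-stopping sketch (illustrative; not MLgym's implementation).
def should_stop(history: list[float], patience: int = 3, min_delta: float = 1e-4) -> bool:
    """Stop when the monitored score has not improved by min_delta
    for `patience` consecutive epochs (higher score = better)."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    recent_best = max(history[-patience:])
    return recent_best < best_before + min_delta

# Example: F1 scores stagnate after epoch 3, so training would stop.
print(should_stop([0.41, 0.55, 0.56, 0.56, 0.56, 0.56]))  # True
```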

```sh
cd mlgym/example/grid_search_example

python run.py --config_path run_config.yml
```

To visualize the live updates, we run the MLboard frontend. We specify the server host and port that deliver the frontend, as well as the endpoints of the REST webservice and the websocket service. The parameter run_id refers to the experiment run that we want to analyze and will differ in your case. Each experiment run is stored in a separate folder within the event_storage path; the folder names are the respective experiment run ids.

```sh
ml_board --ml_board_host 127.0.0.1 --ml_board_port 8080 --rest_endpoint http://127.0.0.1:5001 --ws_endpoint http://127.0.0.1:5002 --run_id 2022-11-06--17-59-10
```

The script returns the parameterized URL pointing to the respective experiment run:

```
====> ACCESS MLBOARD VIA http://127.0.0.1:8080?rest_endpoint=http%3A//127.0.0.1%3A5001&ws_endpoint=http%3A//127.0.0.1%3A5002&run_id=2022-11-06--17-59-10
```
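For reference, the query string in this URL is ordinary URL encoding, and the same address can be rebuilt with Python's standard library:

```python
from urllib.parse import urlencode, quote

# Rebuild the parameterized MLboard URL from its parts (stdlib only).
params = {
    "rest_endpoint": "http://127.0.0.1:5001",
    "ws_endpoint": "http://127.0.0.1:5002",
    "run_id": "2022-11-06--17-59-10",
}
# safe="/" and quote (rather than quote_plus) leave the slashes unescaped,
# matching the URL printed by the script above.
url = "http://127.0.0.1:8080?" + urlencode(params, safe="/", quote_via=quote)
print(url)
```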

Note that the Flask webservice delivers the compiled React files statically, which is why changes to the frontend code are not automatically reflected. As a workaround, you can start the MLboard React app directly via yarn and open the URL with the respective URL search params in the browser:

```sh
cd mlgym/src/ml_board/frontend/dashboard

yarn start
```

The MLboard frontend is still under development, and not all features have been implemented yet. For a brief overview of its current state, check out the MLboard section. Generally, it is possible to analyze the log files directly in the event storage. All messages are logged as specified within the websocket API.

To see the messages live, cd into the event storage directory and tail the event_storage.log file:

```sh
cd event_storage/2022-11-06--17-59-10/
tail -f event_storage.log
```
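To post-process the log rather than just tail it, something like the following sketch can be used, assuming (this is an assumption, not documented above) that each line of event_storage.log is a self-contained JSON event:

```python
# Illustrative log reader; one-JSON-event-per-line is an assumption about
# event_storage.log's format, not a documented guarantee.
import json

events = []
with open("event_storage/2022-11-06--17-59-10/event_storage.log", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            events.append(json.loads(line))

print(f"replayed {len(events)} events")
```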

MLboard

Since the MLboard frontend is still under heavy development, we would like to give you an overview of its current state and a sneak peek at what is coming in the foreseeable future.

In the GIF below, we showcase the experiment run of the getting-started tutorial above, which is a sweep over different learning rates. We replaced the experiment run id from a previous run with the current one in the browser and clicked the throughput board menu button to check that we are successfully connected and receiving messages. We then headed over to the analysis board by clicking the analysis board menu button, which provides a live feed of the metric and loss developments via line charts. In the filter form at the bottom, we specified a regular expression filtering for the F1 score metric and the cross-entropy loss, to limit the scope to the most relevant information.

Over the course of the training, the line charts are populated with the respective scores for each epoch. The legend in the charts refers to the experiment ids. From an analysis point of view, we see that experiments 0, 1, and 2 fail to converge and are stopped after three epochs due to the early-stopping criterion specified within the configuration file. In contrast, experiments 3, 4, and 5 learn the task, anecdotally illustrating the significance of the learning-rate choice.

The technical progress of the experiment run can be tracked from the flowboard, as presented below. The flowboard summarizes the state of all experiments within a table, including the job status (init, running, done) and the starting and finishing times. The overall progress is tracked via the column epoch_progress. Within an epoch, we differentiate between two model states, training and evaluating, as tracked by the column model_status; the current split and its progress are captured by current_split and batch_progress, respectively. The device column indicates the computation device the model sits on.

Future work

We have collected a lot of exciting ideas for the frontend in our backlog.

Analogously to the idea of experiment reproducibility, we are thinking about aiming in the direction of analysis reproducibility. To this end, we consider defining the entire analysis setup within the text input. The analysis setup could then be exported or version-controlled and shared with fellow researchers.

Additionally, we are considering the following features/ideas for the flowboard page:

  • Right now, the flowboard only shows the epoch and batch progress. Additional information, such as the experiment's hyperparameter combination and the current metric/loss scores, would be beneficial.
  • Clicking on one of the experiments should visualize the experiment's entire training and evaluation pipeline configuration (e.g., in a pop-up window)
  • A download/export functionality for the trained model
  • Higher level functionality for model selection, e.g., by defining the selection strategy within the text input.

Similarly, for the analysis board we consider:

  • Visualizing the influence of the hyperparameter choice on the metrics and losses
  • Model selection routines

Implementing these features will require some time. On the plus side, we already collect and store all the necessary information on the client-side, ... so stay tuned! ;-)

Copyright

Copyright (c) 2023

For license see: https://github.com/mlgym/mlgym/blob/master/LICENSE

Owner

  • Name: mlgym
  • Login: mlgym
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lübbering"
  given-names: "Max"
title: "MLgym, a Deep Learning Framework for Reproducible Machine Learning Research"
version: 0.0.51
date-released: 2021-12-22
url: "https://github.com/le1nux/mlgym"

GitHub Events

Total
  • Watch event: 10
Last Year
  • Watch event: 10

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 638
  • Total Committers: 12
  • Avg Commits per committer: 53.167
  • Development Distribution Score (DDS): 0.527
Top Committers
Name Email Commits
Max Luebbering l****x@u****m 302
Max Luebbering m****g@g****m 184
zengxyu s****u@o****m 134
SofiaTraba s****m@g****m 5
SofiaTraba 4****a@u****m 4
OsamaMSoliman O****n@u****e 2
Maren Pielka M****a@i****e 2
Lorenz Wickert L****t@g****e 1
PriyaTomar p****5@g****m 1
lwickertfhg 7****g@u****m 1
zeng 1****y 1
Maren Pielka m****n@m****e 1

Issues and Pull Requests

Last synced: about 2 years ago

All Time
  • Total issues: 51
  • Total pull requests: 115
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 21 days
  • Total issue authors: 4
  • Total pull request authors: 8
  • Average comments per issue: 0.31
  • Average comments per pull request: 0.4
  • Merged pull requests: 96
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 47
  • Pull requests: 89
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 9 days
  • Issue authors: 4
  • Pull request authors: 5
  • Average comments per issue: 0.28
  • Average comments per pull request: 0.49
  • Merged pull requests: 78
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • OsamaMSoliman (15)
  • le1nux (13)
  • vijulshah (10)
  • moinam (9)
Pull Request Authors
  • moinam (47)
  • le1nux (31)
  • OsamaMSoliman (18)
  • vijulshah (17)
  • dan-hoven (1)
  • PriyaTomar (1)
  • zengxyu (1)
  • sofiatraba (1)
Top Labels
Issue Labels
enhancement (14) backend (12) frontend (10) bug (8) refactoring (5) MLboard (3) test_coverage (1) wontfix (1)
Pull Request Labels
enhancement (10) refactoring (5) backend (4) frontend (4) bug (2) MLboard (2) documentation (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 436 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 71
  • Total maintainers: 1
pypi.org: mlgym

MLgym, a python framework for distributed and reproducible machine learning model training in research.

  • Versions: 71
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 436 Last month
Rankings
Dependent packages count: 10.1%
Forks count: 14.3%
Stargazers count: 16.6%
Average: 18.8%
Dependent repos count: 21.6%
Downloads: 31.6%
Maintainers (1)
Last synced: 7 months ago

Dependencies

src/setup.py pypi
  • dashifyML *
  • datastack *
  • h5py *
  • pytest *
  • pytest-cov *
  • pyyaml *
  • scikit-learn *
  • scipy *
  • torch *
  • tqdm *