ritme

Target-driven optimization of feature representation and model selection for next-generation sequencing data

https://github.com/adamovanja/ritme

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Target-driven optimization of feature representation and model selection for next-generation sequencing data

Basic Info
  • Host: GitHub
  • Owner: adamovanja
  • License: bsd-3-clause
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 6.49 MB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 21
Created about 3 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

ritme


An optimized framework for finding the best feature representation and model class of next-generation sequencing data in relation to a target of interest.

If you use this software, please cite it using the metadata from CITATION.cff.

Setup

ritme is available as a conda package on anaconda.org. To install it, run the following command:

```shell
conda install -c adamova -c qiime2 -c conda-forge -c bioconda -c pytorch ritme
```

Usage

ritme provides three main functions: prepare your data, find the best model configuration (feature representation + model class) for the specified target, and evaluate the best model configuration on a test set. All of them can be run from the CLI or via the Python API. To see the arguments needed for each function, run `ritme <function-name> --help` or have a look at the examples in the notebook `experiments/ritme_example_usage.ipynb`.

| ritme function | Description |
|------------------------|----------------------------------------------------------------------------------|
| `split-train-test` | Split the dataset into train-test (with grouping option) |
| `find-best-model-config` | Find the best model configuration (incl. feature representation and model class) |
| `evaluate-tuned-models` | Evaluate the best model configuration on the complete train and a left-out test set |

Finding the best model configuration

The configuration of the optimization is defined in a JSON file. To define a suitable configuration for your use case, please find the description of each variable in config/run_config.json here:

| Parameter | Description |
|-----------------------|------------------------------------------------------------------------------------|
| `experiment_tag` | Name of the experiment. All outputs will be stored in the folder `<--path-store-model-logs>/<experiment_tag>`, with `--path-store-model-logs` being directly provided as a parameter input to the `find-best-model-config` method. |
| `fully_reproducible` | Setting this to `false` ensures efficient and fast execution of trials with aggressive early stopping of bad trials. If set to `true`, the trials are executed in a deterministic order at the cost of efficiency and potentially performance (see section [Note on reproducibility](#note-on-reproducibility)). |
| `group_by_column` | Column name to group train-test splits by (e.g. a unique `host_id`), ensuring that rows with the same group value are not spread across multiple splits. If set to `null`, no grouping is performed and a default random train-test split is used. |
| `target` | Column name of the target variable in the metadata. |
| `ls_model_types` | List of model types to explore sequentially; options include "linreg", "trac", "xgb", "nn_reg", "nn_class", "nn_corn" and "rf". |
| `num_trials` | Total number of trials to run per model type: the larger this value, the more of the complete search space can be explored. |
| `max_cuncurrent_trials` | Maximal number of concurrent trials to run. |
| `seed_data` | Seed for data-related random operations. |
| `seed_model` | Seed for model-related random operations. |
| `tracking_uri` | Which platform to use for experiment tracking: either "wandb" for WandB or "mlruns" for MLflow. See [model tracking](#model-tracking) for set-up instructions and [model training evaluation](#model-training-evaluation) for tips on how to evaluate the training procedure in each platform. |
| `model_hyperparameters` | Optional: for each model type, the range of model and feature engineering hyperparameters to consider can be defined here. Note: in case this key is not provided, the default ranges are used as defined in `model_space/static_searchspace.py`. You can find an example of a configuration file with all hyperparameters defined in `ritme/config/run_config_w_hparams.json`. |
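Putting the parameters above together, a minimal run configuration might look like the sketch below. All values are illustrative: `age` and `host_id` are placeholders for columns in your own metadata, and the model list and trial counts should be adapted to your use case.

```json
{
  "experiment_tag": "my_first_run",
  "fully_reproducible": false,
  "group_by_column": "host_id",
  "target": "age",
  "ls_model_types": ["linreg", "xgb", "rf"],
  "num_trials": 50,
  "max_cuncurrent_trials": 4,
  "seed_data": 12,
  "seed_model": 12,
  "tracking_uri": "mlruns"
}
```

Omitting `model_hyperparameters`, as here, means the default hyperparameter ranges are used.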

If you want to parallelize the training of different model types, we recommend training each model type in a separate experiment [1]. If you decide to run several model types in one experiment, be aware that they are trained sequentially, so the experiment will take longer to finish.

Once you have trained some models, you can check the progress of the trained models in the tracking software you selected (see sections on model tracking and model training evaluation).

[1] Fun fact: one experiment consists of multiple trials.

Model tracking

In the run configuration file you can choose to track your trials with MLflow (tracking_uri=="mlruns") or with WandB (tracking_uri=="wandb").

Choice between WandB & MLflow

To choose which tracking set-up works best for you, it is best to review the respective services: WandB & MLflow. In our experience, when working on a HPC cluster with limited outgoing network traffic, MLflow works better than WandB.

Independent of your choice, ritme is set up such that no sample-specific information is stored remotely. Any sample-specific information is stored only on your local machine. As for aggregate metrics, WandB stores these on their servers while MLflow stores them locally.

Set-up WandB with ritme

In case of using WandB, you need to store your WANDB_API_KEY & WANDB_ENTITY as environment variables in .env. Make sure to exclude this file from version control (add it to .gitignore)!

The WANDB_ENTITY is the project name you would like to store the results in. For more information on this parameter see the official webpage for initializing WandB here.
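A minimal `.env` file could look like this; both values are placeholders that you must replace with your own API key and entity name:

```
WANDB_API_KEY=your-api-key
WANDB_ENTITY=your-entity-name
```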

Also, if you are running WandB from an HPC, you might need to set the proxy URLs to your respective URLs by exporting these variables:

```shell
export HTTPS_PROXY=http://proxy.example.com:8080
export HTTP_PROXY=http://proxy.example.com:8080
```

For a template on how to evaluate your models see the section on model training evaluation.

Set-up MLflow with ritme

In case of using MLflow, you can view your models with `mlflow ui` from within the path where the logs were saved (which is printed when running `find_best_model_config` as "You can view the model logs by launching MLflow UI from within folder: "). This is rather slow when many trials or experiments have been launched; in that case, viewing the logs via the Python API is better suited. For more information, check out the official MLflow documentation.

For a template on how to evaluate your models see the section on model training evaluation.

Model training evaluation

We provide example templates to help you evaluate your ritme models for both supported tracking services:

* for WandB visit this report - simply copy the template and update the run set at the end of the report to your own experiment.

Note on reproducibility

When you enable "fully_reproducible": true in your experiment configuration, all runs on identical hardware will produce fully reproducible results, albeit with a potential impact on efficiency and performance. This guarantee becomes particularly relevant when executing a large number of trials in parallel. (For small-scale experiments — e.g. with 2 trials — you will often observe identical results even with "fully_reproducible": false.)
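The role of the two seeds can be illustrated with a small sketch. This is a generic illustration of seeded randomness, not ritme's internal implementation: a seed such as `seed_data` pins down the data-related random operations, so the same configuration always draws the same train-test split on one machine.

```python
import random

def train_test_split_indices(n_samples, test_fraction, seed_data):
    # Seeded shuffle: the same seed_data always yields the same split.
    rng = random.Random(seed_data)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    # Return (train indices, test indices).
    return indices[n_test:], indices[:n_test]

split_a = train_test_split_indices(100, 0.2, seed_data=12)
split_b = train_test_split_indices(100, 0.2, seed_data=12)
assert split_a == split_b  # identical seed, identical split
```

Note that seeding alone does not guarantee bit-identical results when many trials run in parallel, which is why the separate `fully_reproducible` flag exists.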

Contact

In case of questions or comments feel free to raise an issue in this repository.

License

If you use this software, please cite it using the metadata from CITATION.cff.

ritme is released under a BSD-3-Clause license. See LICENSE for more details.

Owner

  • Name: Anja Adamov
  • Login: adamovanja
  • Kind: user
  • Company: ETH Zürich

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Adamov
    given-names: Anja
    orcid: https://orcid.org/0000-0002-7506-1583
title: "ritme"
version: 1.0.0
doi: 10.5281/zenodo.14149081
date-released: 2024-11-13
url: "https://github.com/adamovanja/ritme"

GitHub Events

Total
  • Create event: 31
  • Issues event: 6
  • Release event: 12
  • Watch event: 2
  • Delete event: 18
  • Public event: 1
  • Push event: 80
  • Pull request event: 38
Last Year
  • Create event: 31
  • Issues event: 6
  • Release event: 12
  • Watch event: 2
  • Delete event: 18
  • Public event: 1
  • Push event: 80
  • Pull request event: 38

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 6
  • Total pull requests: 18
  • Average time to close issues: N/A
  • Average time to close pull requests: 4 days
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 13
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 6
  • Pull requests: 18
  • Average time to close issues: N/A
  • Average time to close pull requests: 4 days
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 13
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • adamovanja (6)
Pull Request Authors
  • adamovanja (22)
Top Labels
Issue Labels
enhancement (5) good first issue (2) documentation (1)
Pull Request Labels

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • chartboost/ruff-action v1 composite
  • codecov/codecov-action v3 composite
  • conda-incubator/setup-miniconda v2 composite
setup.py pypi