galaxy-ml

Make machine learning simpler with Galaxy

https://github.com/goeckslab/galaxy-ml

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Committers with academic emails
    2 of 8 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.3%) to scientific vocabulary

Keywords from Contributors

sequences dna usegalaxy genomics workflow-engine ngs bioinformatics interactive optimizing-compiler clade
Last synced: 7 months ago

Repository

Basic Info
Statistics
  • Stars: 11
  • Watchers: 2
  • Forks: 7
  • Open Issues: 7
  • Releases: 6
Created almost 7 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

Galaxy-ML

Galaxy-ML is a web-based framework for building end-to-end machine learning pipelines, with special support for biomedical data. Cutting-edge machine learning libraries are combined under a unified scikit-learn API to provide thousands of different pipelines suitable for various needs. Packaged as Galaxy tools, Galaxy-ML provides scalable, reproducible and transparent machine learning computations.

Key features

  • simple web UI
  • little or no coding required
  • fast model deployment and model selection, specialized in hyperparameter tuning using GridSearchCV
  • high level of parallel and automated computation

Supported modules

A typical machine learning pipeline is composed of a main estimator/model and optional preprocessing component(s).

Model
  • scikit-learn
    • sklearn.ensemble
    • sklearn.linear_model
    • sklearn.naive_bayes
    • sklearn.neighbors
    • sklearn.svm
    • sklearn.tree
  • xgboost
    • XGBClassifier
    • XGBRegressor
  • mlxtend
    • StackingCVClassifier
    • StackingClassifier
    • StackingCVRegressor
    • StackingRegressor
  • Keras (deep learning models are re-implemented to fully support scikit-learn APIs; parameters, including layer sub-parameters, can be swapped or searched, and callbacks are supported)
    • KerasGClassifier
    • KerasGRegressor
    • KerasGBatchClassifier (works best with online data generators for images, genomic sequences and so on)
  • BinarizeTargetClassifier/BinarizeTargetRegressor
  • IRAPSClassifier

Preprocessor
  • scikit-learn
    • sklearn.preprocessing
    • sklearn.feature_selection
    • sklearn.decomposition
    • sklearn.kernel_approximation
    • sklearn.cluster
  • imbalanced-learn
    • imblearn.under_sampling
    • imblearn.over_sampling
    • imblearn.combine
  • skrebate
    • ReliefF
    • SURF
    • SURFstar
    • MultiSURF
    • MultiSURFstar
  • TDMScaler
  • DyRFE/DyRFECV
  • Z_RandomOverSampler
  • GenomeOneHotEncoder
  • ProteinOneHotEncoder
  • FastaDNABatchGenerator
  • FastaRNABatchGenerator
  • FastaProteinBatchGenerator
  • GenomicIntervalBatchGenerator
  • GenomicVariantBatchGenerator
  • ImageDataFrameBatchGenerator
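
The pipeline structure described at the start of this section (optional preprocessing components followed by a main estimator) can be sketched with plain scikit-learn. This is a minimal illustration rather than Galaxy-ML's own code: the synthetic dataset, the step names and the particular choice of scaler, selector and model are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with your own X, y).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Optional preprocessing steps followed by the main estimator.
pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),                # from sklearn.preprocessing
        ("selector", SelectKBest(f_classif, k=10)),  # from sklearn.feature_selection
        ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)

# 5-fold cross-validated accuracy for the whole pipeline.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f}")
```

Because the preprocessing steps live inside the pipeline, they are refit on each training fold, which avoids leaking information from the held-out fold into scaling or feature selection.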

Installation

APIs for models, preprocessors and utils implemented in Galaxy-ML can be installed separately.

Installing using anaconda (recommended)

conda install -c bioconda -c conda-forge Galaxy-ML

Installing using pip

pip install -U Galaxy-ML

Installing from source

python setup.py install

Using source code in place

pip install -e .

To install Galaxy-ML tools in Galaxy, please refer to https://galaxyproject.org/admin/tools/add-tool-from-toolshed-tutorial/.
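
After installing the Python package, a quick check along the following lines can confirm that it is visible to the current interpreter. This is a sketch under two assumptions: that the import name is `galaxy_ml` (as used in the README examples) and that the distribution is registered as `Galaxy-ML` (as in the pip command above). The helper name `galaxy_ml_status` is hypothetical.

```python
import importlib.util
from importlib import metadata


def galaxy_ml_status(module_name="galaxy_ml", dist_name="Galaxy-ML"):
    """Return (importable, version) for the installed package, if any."""
    # find_spec returns None when a top-level module cannot be located.
    importable = importlib.util.find_spec(module_name) is not None
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


importable, version = galaxy_ml_status()
print(f"galaxy_ml importable: {importable}, installed version: {version}")
```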

Running the tests

Before running the tests, run the following commands:

conda create --name galaxy_ml python=3.9
conda activate galaxy_ml
pip install -e .
pip install nose nose-htmloutput pytest
cd galaxy_ml

To run all tests and generate an HTML report:

nosetests ./tests --with-html --html-file=./report.html

To run tests in a specific file (e.g., test_keras_galaxy.py) and generate an HTML report:

nosetests ./tests/test_keras_galaxy.py --with-html --html-file=./report.html

To run a specific test in a specific file (e.g., the test_multi_dimensional_output test in test_keras_galaxy.py) and generate an HTML report:

```
nosetests ./tests/test_keras_galaxy.py:test_multi_dimensional_output --with-html --html-file=./report.html
```

Examples for using Galaxy-ML custom models

```
# handle imports
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV
from galaxy_ml.keras_galaxy_models import KerasGClassifier

# build a DNN classifier
model = Sequential()
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))
config = model.get_config()

classifier = KerasGClassifier(config, random_state=42)

# clone a classifier
clf = clone(classifier)

# get parameters
params = clf.get_params()

# set parameters
new_params = dict(
    epochs=60,
    lr=0.01,
    layers_1_Dense__config__kernel_initializer__config__seed=999,
    layers_0_Dense__config__kernel_initializer__config__seed=999
)
clf.set_params(**new_params)

# model evaluation using GridSearchCV
# (X and y are your feature matrix and binary labels)
grid = GridSearchCV(clf, param_grid={}, scoring='roc_auc', cv=5, n_jobs=2)
grid.fit(X, y)
```
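
The example above passes an empty parameter grid, so GridSearchCV only cross-validates the classifier's current settings, and it assumes `X` and `y` already exist. The sketch below shows what a non-empty grid might look like on synthetic data. It uses a plain scikit-learn `LogisticRegression` as a stand-in estimator so that it runs without Keras. The dataset and the candidate values for `C` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data standing in for X and y.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Stand-in estimator (an assumption); any scikit-learn compatible
# classifier exposes its tunable settings the same way via get_params().
clf = LogisticRegression(max_iter=1000)

# Each key is an estimator parameter; each value lists candidates to try.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(clf, param_grid=param_grid, scoring="roc_auc", cv=5)
grid.fit(X, y)

print("best params:", grid.best_params_)
print(f"best ROC AUC: {grid.best_score_:.3f}")
```

For a Galaxy-ML Keras model, the grid keys would presumably be names returned by `clf.get_params()`, such as `epochs` or `lr` from the example above.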

Example for using Galaxy-ML to persist a sklearn/keras model

```
from galaxy_ml.model_persist import (dump_model_to_h5,
                                     load_model_from_h5)

# dump model to hdf5 (save_path is the destination file path)
dump_model_to_h5(model, save_path, store_hyperparameter=True)

# load model from hdf5 (path_to_hdf5 is the path to a saved file)
model = load_model_from_h5(path_to_hdf5)
```

Performance comparison

Galaxy-ML's HDF5 saving utils perform faster than cPickle for large, array-rich models.

```
Loading model using pickle...
(1.2471628189086914 s)

Dumping model using pickle...
(3.6942389011383057 s)
File size: 930712861

Dumping model to hdf5...
(3.006715774536133 s)
File size: 930729696

Loading model from hdf5...
(0.6420958042144775 s)

Pipeline(memory=None,
         steps=[('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1, n_neighbors=100, p=2,
                                      weights='uniform'))],
         verbose=False)
```
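
Timings like those above could be collected with a small harness along these lines. This sketch covers only the pickle side and uses the standard library alone. The helper name `time_pickle_round_trip` and the list-of-floats payload are hypothetical stand-ins, not the benchmark actually used for the numbers above. The HDF5 side would presumably wrap the dump and load functions from the persistence example in the same timing pattern.

```python
import os
import pickle
import tempfile
import time


def time_pickle_round_trip(obj):
    """Time pickle dump/load of obj and report the file size in bytes."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "model.pkl")

        start = time.perf_counter()
        with open(path, "wb") as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
        dump_seconds = time.perf_counter() - start

        size_bytes = os.path.getsize(path)

        start = time.perf_counter()
        with open(path, "rb") as handle:
            restored = pickle.load(handle)
        load_seconds = time.perf_counter() - start

    return {
        "dump_seconds": dump_seconds,
        "load_seconds": load_seconds,
        "size_bytes": size_bytes,
        "round_trip_ok": restored == obj,
    }


# Hypothetical payload standing in for a fitted, array-rich model.
payload = {"weights": [float(i) for i in range(100_000)]}
result = time_pickle_round_trip(payload)
print(result)
```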

Publication

Gu Q, Kumar A, Bray S, Creason A, Khanteymoori A, Jalili V, et al. (2021) Galaxy-ML: An accessible, reproducible, and scalable machine learning toolkit for biomedicine. PLoS Comput Biol 17(6): e1009014. https://doi.org/10.1371/journal.pcbi.1009014

Owner

  • Name: goeckslab
  • Login: goeckslab
  • Kind: organization

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 609
  • Total Committers: 8
  • Avg Commits per committer: 76.125
  • Development Distribution Score (DDS): 0.082
Past Year
  • Commits: 30
  • Committers: 1
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Qiang Gu g****1@g****m 559
Qiang Gu 3****u 26
Kaivan Kamali k****2@p****u 14
Anup Kumar a****z@g****m 6
Marcel Bargull m****l@u****u 1
kxk302 k****2@g****m 1
dependabot[bot] 4****] 1
Björn Grüning b****n@g****u 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 9
  • Total pull requests: 62
  • Average time to close issues: 5 days
  • Average time to close pull requests: 10 days
  • Total issue authors: 3
  • Total pull request authors: 6
  • Average comments per issue: 0.78
  • Average comments per pull request: 0.48
  • Merged pull requests: 54
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • qiagu (6)
  • kxk302 (2)
  • anuprulez (1)
Pull Request Authors
  • qiagu (50)
  • dependabot[bot] (4)
  • kxk302 (4)
  • qchiujunhao (3)
  • mbargull (1)
  • bgruening (1)
Top Labels
Issue Labels
  • tools / enhancement (2)
  • tool / usage tips (1)
Pull Request Labels
  • dependencies (4)
  • api / enhancement (1)
  • tools / enhancement (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 26 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 18
  • Total maintainers: 1
pypi.org: galaxy-ml

Galaxy Machine Learning Library

  • Versions: 18
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 26 Last month
Rankings
Dependent packages count: 10.0%
Forks count: 12.5%
Average: 16.7%
Stargazers count: 17.1%
Dependent repos count: 21.7%
Downloads: 22.1%
Maintainers (1)
Last synced: 8 months ago

Dependencies

requirements.txt pypi
  • asteval >=0.9.14
  • bleach >=3.3.0
  • cython >=0.29.11
  • h5py >=3.1
  • imbalanced-learn >=0.8.0,<0.9
  • joblib >=0.13.2,<1.0
  • matplotlib >=3.1.1
  • mlxtend >=0.17,<0.18
  • numpy >=1.18.0,<1.21
  • pandas >=1.0,<1.3
  • plotly >=4.10.0,<5.0
  • pyfaidx *
  • pytabix *
  • scikit-learn >=0.24,<0.25
  • scikit-optimize >=0.9
  • scipy >=1.3.1
  • six <=1.15.0
  • skrebate >=0.60,<0.70
  • tensorflow >=2.5.0,<2.6
  • xgboost >=1.3,<1.4
.github/workflows/ci.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/setup-python v1 composite
  • actions/upload-artifact v2 composite
  • galaxyproject/planemo-ci-action v1 composite
  • peter-evans/create-or-update-comment v1 composite
  • postgres 11 docker
.github/workflows/pr.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/setup-python v1 composite
  • actions/upload-artifact v2 composite
  • galaxyproject/planemo-ci-action v1 composite
  • postgres 11 docker
.github/workflows/slash.yaml actions
  • peter-evans/slash-command-dispatch v2 composite
Dockerfile docker
  • python 3.9-slim build
setup.py pypi