https://github.com/cbg-ethz/pybda

:computer::computer::computer: A commandline tool for analysis of big biological data sets for distributed HPC clusters.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
✓ .zenodo.json file: found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 3 of 4 committers (75.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (16.0%) to scientific vocabulary

Keywords

apache-spark big-data machine-learning python snakemake

Last synced: 5 months ago

Repository


Basic Info

Host: GitHub
Owner: cbg-ethz
License: gpl-3.0
Language: Python
Default Branch: master
Homepage: https://pybda.rtfd.io
Size: 362 MB

Statistics

Stars: 9
Watchers: 2
Forks: 3
Open Issues: 7
Releases: 4

Topics

apache-spark big-data machine-learning python snakemake

Created over 7 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Code of conduct

PyBDA

A commandline tool for analysis of big biological data sets for distributed HPC clusters.

About

PyBDA is a Python library and command-line tool for big data analytics and machine learning that scales to big, high-dimensional data sets.

To scale to big data sets, PyBDA uses Apache Spark's DataFrame API, which automatically distributes data across the nodes of a high-performance cluster and runs expensive machine learning tasks in parallel. For scheduling, PyBDA uses Snakemake to execute pipelines of jobs automatically. In particular, PyBDA first builds a DAG (directed acyclic graph) of the methods/jobs you want to execute in succession (e.g. dimensionality reduction followed by clustering) and then computes every method by traversing the DAG. When a job completes successfully, PyBDA writes results and plots and creates statistics. If a job fails, PyBDA reports where and which method failed (owing to Snakemake's scheduling), so that the same pipeline can be resumed from the point where it last failed.
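The build-then-traverse idea described above can be illustrated with a minimal sketch. It is not PyBDA's or Snakemake's actual implementation: the job names, the `PIPELINE` mapping, and the `run_pipeline` helper are hypothetical and exist only for illustration. It uses the standard-library `graphlib` module (Python 3.9+) to order jobs by their dependencies, and it skips jobs that already finished so a failed run can be resumed.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
# The names mirror the example in the text; they are not PyBDA identifiers.
PIPELINE = {
    "dimension_reduction": set(),
    "clustering": {"dimension_reduction"},
}


def run_pipeline(dag, run_job, done=None):
    """Run jobs in dependency order, skipping those already in `done`.

    Returns (done, failed): the set of finished jobs and the name of the
    first job that raised an exception, or None if every job succeeded.
    """
    done = set() if done is None else set(done)
    for job in TopologicalSorter(dag).static_order():
        if job in done:
            continue  # finished in an earlier run, so it is not recomputed
        try:
            run_job(job)
        except Exception:
            return done, job  # report where the pipeline stopped
        done.add(job)
    return done, None


if __name__ == "__main__":
    attempts = {"clustering": 0}

    def flaky(job):
        # Simulate the clustering job failing on its first attempt only.
        if job == "clustering":
            attempts[job] += 1
            if attempts[job] == 1:
                raise RuntimeError("simulated failure")
        print("ran", job)

    done, failed = run_pipeline(PIPELINE, flaky)
    print("failed at:", failed)
    done, failed = run_pipeline(PIPELINE, flaky, done)  # resume the run
    print("failed at:", failed)
```

In this sketch the second call only reruns the job that failed, which is the behaviour the paragraph attributes to Snakemake's scheduling.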

For instance, to first reduce your data set to a lower-dimensional space, then cluster it with several numbers of cluster centers, and finally fit a random forest, you would first specify a config file similar to this:

```bash
$ cat data/pybda-usecase.config

spark: spark-submit
infile: data/singlecellimagingdata.tsv
predict: data/singlecellimagingdata.tsv
outfolder: data/results
meta: data/metacolumns.tsv
features: data/featurecolumns.tsv
dimensionreduction: pca
ncomponents: 5
clustering: kmeans
ncenters: 50, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200
regression: forest
family: binomial
response: isinfected
sparkparams:
  - "--driver-memory=3G"
  - "--executor-memory=6G"
debug: true
```

Executing PyBDA, and calling the methods above, is then as easy as this:

```bash
$ pybda run data/pybda-usecase.config local
```

Installation

I recommend installing PyBDA from Bioconda:

```bash
$ conda install -c bioconda pybda
```

You can, however, also install directly from PyPI:

```bash
$ pip install pybda
```

Otherwise, you can download the latest release and install it.

Documentation

Check out the documentation at https://pybda.readthedocs.io/. It walks you through:

- the installation process,
- setting up Apache Spark,
- using pybda.

Author

Simon Dirmeier <simon.dirmeier@bsse.ethz.ch>

Owner

Name: Computational Biology Group (CBG)
Login: cbg-ethz
Kind: organization
Location: Basel, Switzerland

Website: https://www.bsse.ethz.ch/cbg
Twitter: cbg_ethz
Repositories: 91
Profile: https://github.com/cbg-ethz

Beerenwinkel Lab at ETH Zurich

GitHub Events

Total

Delete event: 1
Issue comment event: 1
Pull request event: 2
Create event: 1

Last Year

Delete event: 1
Issue comment event: 1
Pull request event: 2
Create event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 985
Total Committers: 4
Avg Commits per committer: 246.25
Development Distribution Score (DDS): 0.21

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
dirmeier	s**r@b**h	778
Simon Dirmeier	s**r@w**e	205
Simon Dirmeier	s**i@l**h	1
Simon Dirmeier	s**i@l**h	1
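The All Time figures above can be reproduced from the commit counts in the table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, so treat it as an assumption that happens to match the reported value.

```python
# Commit counts per committer, copied from the "Top Committers" table.
commits = [778, 205, 1, 1]

total = sum(commits)            # total commits
average = total / len(commits)  # average commits per committer

# Assumed definition: DDS = 1 - (commits by the top committer / all commits).
dds = 1 - max(commits) / total

print(total, average, round(dds, 2))  # prints: 985 246.25 0.21
```

A score near 0 means a single committer accounts for almost all commits, which is consistent with the table.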

Committer Domains (Top 20 + Academic)

lo-login-01.leonhard.ethz.ch: 1
lo-login-02.leonhard.ethz.ch: 1
bsse.ethz.ch: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 18
Total pull requests: 2
Average time to close issues: 11 days
Average time to close pull requests: almost 2 years
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.17
Average comments per pull request: 0.5
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 2

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 1


Top Authors

Issue Authors

dirmeier (18)

Pull Request Authors

dependabot[bot] (4)

Top Labels

Issue Labels

enhancement (7)
bug (1)

Pull Request Labels

dependencies (4)

Packages

Total packages: 1
Total downloads:
- pypi 26 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 6
Total maintainers: 1

pypi.org: pybda

Analysis of big biological data sets for distributed HPC clusters.

Homepage: https://github.com/cbg-ethz/pybda
Documentation: https://pybda.readthedocs.io/
License: GPLv3
Latest release: 0.1.0 (published over 6 years ago)

Versions: 6
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 26 Last month

Rankings

Dependent packages count: 10.1%

Forks count: 15.3%

Stargazers count: 17.7%

Dependent repos count: 21.6%

Average: 26.5%

Downloads: 68.0%

Maintainers (1)

dirmeier

Last synced: 6 months ago

Dependencies

docs/requirements.txt pypi

nbsphinx *
sphinx *
sphinx_fontawesome *
sphinxcontrib-fulltoc *

setup.py pypi

click >=6.7
joypy >=0.1.9
matplotlib >=2.2.3
numpy >=1.15.0
pandas >=0.23.3
pyspark ==2.4.0
scipy >=1.0.0
seaborn >=0.9.0
snakemake >=5.7.1
sparkhpc >=0.3.post4
