https://github.com/aehrc/variantspark

machine learning for genomic variants

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
✓
Committers with academic emails
5 of 11 committers (45.5%) from academic institutions
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (16.8%) to scientific vocabulary

Keywords

association-studies aws bioinformatics databricks emr genome gwas notebook random-forest variant-spark variantspark vcf

Keywords from Contributors

standards interactive projection archival sequences autograding hacking shellcodes modular network-simulation

Last synced: 5 months ago · JSON representation

Repository

machine learning for genomic variants

Basic Info

Host: GitHub
Owner: aehrc
License: other
Language: JavaScript
Default Branch: master
Homepage: http://bioinformatics.csiro.au/variantspark
Size: 75.1 MB

Statistics

Stars: 146
Watchers: 18
Forks: 45
Open Issues: 33
Releases: 9

Topics

association-studies aws bioinformatics databricks emr genome gwas notebook random-forest variant-spark variantspark vcf

Created about 9 years ago · Last pushed 6 months ago

Metadata Files

Readme Contributing License

Variant Spark

variant-spark is a scalable toolkit for genome-wide association studies optimized for GWAS-like datasets.

Machine learning methods and, in particular, random forests (RFs) are promising alternatives to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures to rank SNPs according to their predictive power. Although there are several existing random forest implementations available, some even parallel or distributed such as Random Jungle, ranger, or SparkML, most of them are not optimized to deal with GWAS datasets, which usually come with thousands of samples and millions of variables.

variant-spark currently provides the basic functionality of building a random forest model and estimating variable importance with the mean decrease gini method. The tool can operate on VCF and CSV files. Future extensions will include support for other importance measures, variable selection methods, and data formats.

variant-spark utilizes a novel approach of building random forests from data in transposed representation, which allows it to efficiently deal with even extremely wide GWAS datasets. Moreover, since the most common genomics variant calls file format, i.e. VCF, which uses the transposed representation, variant-spark can work directly with the VCF data, without the costly pre-processing required by other tools.

variant-spark is built on top of Apache Spark – a modern distributed framework for big data processing, which gives variant-spark the ability to scale horizontally on both bespoke cluster and public clouds.

The potential users include:

Medical researchers seeking to perform GWAS-like analysis on large cohort data of genome-wide sequencing data or imputed SNP array data.
Medical researchers or clinicians seeking to perform clustering on genomic profiles to stratify large-cohort genomic data.
General researchers with classification or clustering needs of datasets with millions of features.

Community

Please feel free to add issues and/or upvote issues you care about. Also, join the Gitter chat. We also started ReadTheDocs and there is always this repo's issues page for you to add requests. Thanks for your support.

Learn More

To learn more watch this video from HUGO Conference 2020.

Building

variant-spark requires java jdk 1.8+ and maven 3+

In order to build the binaries use:

mvn clean install

For Python variant-spark requires Python 3.6+ with pip. The other packages required for development are listed in dev/dev-requirements.txt and can be installed with:

pip install -r dev/dev-requirements.txt

or with:

./dev/py-setup.sh

The complete build including all checks can be run with:

./dev/build.sh

Running

variant-spark requires an existing spark 3.1+ installation (either a local one or a cluster one).

To run variant-spark use:

./variant-spark [(--spark|--local) <spark-options>* --] [<command>] <command-options>*

To obtain the list of the available commands use:

./variant-spark -h

To obtain help for a specific command (for example importance) use:

./variant-spark importance -h

You can use --spark marker before the command to pass spark-submit options to variant-spark. The list of spark options needs to be terminated with --, e.g:

./variant-spark --spark --master yarn-client --num-executors 32 -- importance ....

Please, note that --spark needs to be the first argument of variant-spark

You can also run variant-spark in the --local mode. In this mode, variant-spark will ignore any Hadoop or Spark configuration files and run in the local mode for both Hadoop and Spark. In particular, in this mode, all file paths are interpreted as local file system paths. Also, any parameters passed after --local and before -- are ignored. For example:

./bin/variant-spark --local -- importance  -if data/chr22_1000.vcf -ff data/chr22-labels.csv -fc 22_16051249 -v -rn 500 -rbs 20 -ro

Note:

The difference between running in --local mode and in --spark with local master is that in the latter case, Spark uses the Hadoop filesystem configuration and the input files need to be copied to this filesystem (e.g. HDFS) Also, the output will be written to the location determined by the Hadoop filesystem settings. In particular paths without schema e.g. 'output.csv' will be resolved with the Hadoop default filesystem (usually HDFS) To change this behavior you can set the default filesystem in the command line using spark.hadoop.fs.default.name option. For example to use local filesystem as the default use:

./bin/variant-spark --spark ... --conf "spark.hadoop.fs.default.name=file:///" ... -- importance  ... -of output.csv

You can also use the full URI with the schema to address any filesystem for both input and output files e.g.:

./bin/variant-spark --spark ... --conf "spark.hadoop.fs.default.name=file:///" ... -- importance  -if hdfs:///user/data/input.csv ... -of output.csv

Running examples

There are multiple methods for running variant-spark examples

Manual Examples

variant-spark comes with a few example scripts in the scripts directory that demonstrate how to run its commands on sample data.

There are a few small data sets in the data directory suitable for running on a single machine. For example:

./examples/command-line/local_run-importance-ch22.sh

runs variable importance command on a small sample of the chromosome 22 VCF file (from 1000 Genomes Project)

The full-size examples require a cluster environment (the scripts are configured to work with Spark on YARN).

The data required for the examples can be obtained from the data folder https://github.com/aehrc/VariantSpark/tree/master/data

This repository uses the git Large File Support extension, which needs to be installed first (see: https://git-lfs.github.com/)

Clone the variant-spark-data repository and then install the test data into your Hadoop filesystem using:

./install-data

By default, the sample data will installed into the variant-spark-data\input sub-directory of your HDFS home directory.

You can choose a different location by setting the VS_DATA_DIR environment variable.

After the test data has been successfully copied to HDFS you can run examples scripts, e.g.:

./examples/command-line/yarn_run-importance-ch22.sh

Note: if you installed the data to a non-default location the VS_DATA_DIR needs to be set accordingly when running the examples

VariantSpark on the cloud

VariantSpark can easily be used in AWS and Azure. For more examples and information, check the cloud folder. For a quick start, check the few pointers below.

AWS Marketplace

VariantSpark is now available on AWS Marketplace. Please read the Guidlines for specifications and step-by-step instructions.

Azure Databricks

VariantSpark can be easily deployed in Azure Databricks through the button below. Please read the VariantSpark Azure manual for specifications and step-by-step instructions.

Contributions

JsonRfAnalyser

JsonRfAnalyser is a Python program that looks into the JSON RandomForest model and lists variables on each tree and branch. Please read README to see the complete list of functionalities.

WebVisualiser

rfview.html is a web program (run locally on your machine) where you can upload the JSON model produced by variant-spark and it visualizes trees in the model. You can identify which tree to be visualized. Node color and node labels could be set to different parameters such as the number of samples in the node or the node impurity. It uses vis.js for tree Visualisation.

Owner

Name: The Australian e-Health Research Centre
Login: aehrc
Kind: organization

Website: https://aehrc.com
Twitter: ehealthresearch
Repositories: 101
Profile: https://github.com/aehrc

The Australian e-Health Research Centre (AEHRC) is CSIRO’s digital health research program.

GitHub Events

Total

Watch event: 5
Push event: 10
Pull request event: 2
Pull request review event: 2
Pull request review comment event: 1
Create event: 2

Last Year

Watch event: 5
Push event: 10
Pull request event: 2
Pull request review event: 2
Pull request review comment event: 1
Create event: 2

Committers

Last synced: 9 months ago

All Time

Total Commits: 519
Total Committers: 11
Avg Commits per committer: 47.182
Development Distribution Score (DDS): 0.333

Past Year

Commits: 1
Committers: 1
Avg Commits per committer: 1.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Piotr Szul	p**l@c**u	346
Lynn Langit	l**t@l**m	90
Maciej Golebiewski	m**i@c**u	25
Roc Reguant	r**t@g**m	18
Denis C. Bauer	D**r@C**u	10
Xu, Qinying (H&B, Herston)	Q**u@c**u	9
Brendan Hosking	b**g@c**u	7
Yatish0833	y**3@g**m	5
plyte	b**x@y**m	4
ArashBayatDev	3****v	3
dependabot[bot]	4****]	2

Committer Domains (Top 20 + Academic)

csiro.au: 5

Issues and Pull Requests

Last synced: 5 months ago

All Time

Total issues: 86
Total pull requests: 55
Average time to close issues: almost 3 years
Average time to close pull requests: 11 days
Total issue authors: 23
Total pull request authors: 7
Average comments per issue: 1.15
Average comments per pull request: 0.15
Merged pull requests: 37
Bot issues: 0
Bot pull requests: 11

Past Year

Issues: 0
Pull requests: 6
Average time to close issues: N/A
Average time to close pull requests: about 19 hours
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.17
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

piotrszul (35)
ArashBayatDev (17)
BauerLab (6)
natwine (3)
BMJHayward (3)
surak (2)
cmorris2945 (2)
rocreguant (2)
schonbej (2)
danking (1)
vishaln79 (1)
sshanshans (1)
fifdick (1)
amnonbleich (1)
TaniaCuppens (1)

Pull Request Authors

piotrszul (23)
dependabot[bot] (11)
ChristinaXu2017 (7)
rocreguant (7)
ArashBayatDev (3)
NickEdwards7502 (2)
BMJHayward (2)

Top Labels

Issue Labels

enhancement (9) techdebt (6) help wanted (2) critical enhancement (1) bug (1)

Pull Request Labels

dependencies (13) python (10) java (5) enhancement (2)

Packages

Total packages: 3
Total downloads:
- pypi 32 last-month

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 1
(may contain duplicates)
Total versions: 46
Total maintainers: 3

pypi.org: variant-spark

VariantSpark Python API

Homepage: https://bioinformatics.csiro.au/variantspark
Documentation: http://variantspark.readthedocs.io/en/latest
License: CSIRO Non-Commercial Source Code Licence Agreement v1.0
Latest release: 0.5.2
published over 3 years ago

Versions: 36
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 32 Last month

Rankings

Forks count: 6.1%

Stargazers count: 6.2%

Dependent packages count: 7.3%

Average: 16.8%

Dependent repos count: 22.1%

Downloads: 42.1%

Maintainers (3)

bhosking piotrszul rocreguant

Last synced: 6 months ago

repo1.maven.org: au.csiro.aehrc.variant-spark:variant-spark_2.12

Genomic variants interpretation toolkit

Homepage: http://bioinformatics.csiro.au/variantspark
Documentation: https://appdoc.app/artifact/au.csiro.aehrc.variant-spark/variant-spark_2.12/
License: CSIRO Non-Commercial Source Code Licence Agreement v1.0
Latest release: 0.5.2
published over 3 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Forks count: 17.0%

Stargazers count: 18.7%

Average: 29.1%

Dependent repos count: 32.0%

Dependent packages count: 48.9%

Last synced: 6 months ago

repo1.maven.org: au.csiro.aehrc.variant-spark:variant-spark_2.11

Genomic variants interpretation toolkit

Homepage: http://bioinformatics.csiro.au/variantspark
Documentation: https://appdoc.app/artifact/au.csiro.aehrc.variant-spark/variant-spark_2.11/
License: CSIRO Non-Commercial Source Code Licence Agreement v1.0
Latest release: 0.4.0
published over 3 years ago

Versions: 7
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Forks count: 17.0%

Stargazers count: 18.7%

Average: 29.2%

Dependent repos count: 32.0%

Dependent packages count: 48.9%

Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/setup-java v1 composite
actions/setup-python v2 composite

.github/workflows/publish-release.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/create-release v1 composite
actions/setup-java v1 composite
actions/setup-python v2 composite
s4u/maven-settings-action v2.6.0 composite

pom.xml maven

au.csiro.aehrc.third.hail-is:hail_2.12_3.1 0.2.74-SNAPSHOT provided
org.apache.spark:spark-core_2.12 3.1.2 provided
org.apache.spark:spark-mllib_2.12 3.1.2 provided
org.ow2.asm:asm 5.1 provided
org.ow2.asm:asm-analysis 5.1 provided
org.ow2.asm:asm-util 5.1 provided
org.scala-lang:scala-library 2.12.14 provided
args4j:args4j 2.0.29
com.github.samtools:htsjdk 2.21.0
com.github.tototoshi:scala-csv_2.12 1.3.8
it.unimi.dsi:dsiutils 2.3.3
it.unimi.dsi:fastutil 7.0.8
joda-time:joda-time 2.7
org.json4s:json4s-ext_2.12 3.5.3
org.scala-graph:graph-core_2.12 1.12.3
au.csiro.aehrc.third.hail-is:hail_2.12_3.1 0.2.74-SNAPSHOT test
junit:junit 4.13.1 test
org.apache.commons:commons-csv 1.9.0 test
org.easymock:easymock 3.5.1 test

cloud/aws-emr/python/setup.py pypi

Click *
PyYAML *
awscli >=1.11
jsonmerge *
pystache *

dev/dev-requirements.txt pypi

Jinja2 ==3.0.3 development
Sphinx ==3.3.1 development
hail ==0.2.74 development
nbsphinx ==0.8.0 development
numpy ==1.21.2 development
pandas ==1.1.4 development
patsy ==0.5.2 development
pylint ==2.6.0 development
pyspark ==3.1.3 development
pytest ==6.2.2 development
scipy ==1.6.3 development
seaborn ==0.11.2 development
sphinx-rtd-theme ==0.5.0 development
statsmodels ==0.13.2 development
twine ==3.2.0 development
typedecorator ==0.0.5 development

dev/rtd-requirements.txt pypi

hail ==0.2.74 development
nbsphinx * development
pandas ==1.1.4 development
typedecorator >=0.0.5 development

python/requirements.txt pypi

Jinja2 ==3.0.3
hail ==0.2.74
numpy ==1.21.2
pandas ==1.1.4
patsy ==0.5.2
pyspark ==3.1.3
scipy ==1.6.3
seaborn ==0.11.2
statsmodels ==0.13.2
typedecorator ==0.0.5

python/setup.py pypi

typedecorator ==0.0.5

https://github.com/aehrc/variantspark

Science Score: 36.0%

Keywords

Keywords from Contributors

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

Variant Spark

Community

Learn More

Building

Running

Running examples

Manual Examples

VariantSpark on the cloud

AWS Marketplace

Azure Databricks

Contributions

JsonRfAnalyser

WebVisualiser

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

pypi.org: variant-spark

Rankings

Maintainers (3)

repo1.maven.org: au.csiro.aehrc.variant-spark:variant-spark_2.12

Rankings

repo1.maven.org: au.csiro.aehrc.variant-spark:variant-spark_2.11

Rankings

Dependencies