https://github.com/aehrc/variantspark_gigascience

Scripts for the GigaScience publication

Basic Info

Host: GitHub
Owner: aehrc
License: other
Language: HTML
Default Branch: master
Size: 47.1 MB

Statistics

Stars: 1
Watchers: 8
Forks: 0
Open Issues: 0
Releases: 0

Created over 6 years ago · Last pushed about 6 years ago

VariantSpark_Gigascience

Creating an EMR cluster

We use the CreateCluster.sh bash script to create an EMR cluster with aws-cli. The generated cluster has the characteristics listed below. You are responsible for terminating your cluster once your processing is finished. We have not yet created a CloudFormation template for this cluster.

Use Spot pricing
Use Uniform Instance
Without EC2 keypair
Master EC2 instance type: r4.2xlarge 8 vCPUs 61GB RAM
Core EC2 instance type: r4.4xlarge 16 vCPUs 122GB RAM
VariantSpark installed (through Bootstrap)

If you would like to change the above configuration (e.g. to use OnDemand pricing or SpotFleet), you may create this cluster, clone it in the AWS console, and change the parameters there.

Parameters: You should edit the script and change these parameters manually:

ClusterName: Name of the cluster (e.g. C64 or C512)
InstanceCount: Number of core instances (e.g. 4 or 32)
LogURI: Path to the S3 folder (or bucket) to store EMR logs
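
The example values above appear to pair up: 4 or 32 core instances of r4.4xlarge (16 vCPUs each) give 64 or 512 core vCPUs, which matches the names C64 and C512. The following minimal sketch makes that arithmetic explicit. The naming rule is an inference from the examples, not something CreateCluster.sh enforces, and the function names are hypothetical.

```python
# vCPUs per core node, taken from the r4.4xlarge line of the cluster description.
CORE_VCPUS_PER_INSTANCE = 16


def core_vcpus(instance_count):
    """Total vCPUs across all core instances (the master node is not counted)."""
    return instance_count * CORE_VCPUS_PER_INSTANCE


def suggested_cluster_name(instance_count):
    """Assumed naming rule: 'C' followed by the total number of core vCPUs."""
    return f"C{core_vcpus(instance_count)}"


if __name__ == "__main__":
    for count in (4, 32):
        print(count, "core instances ->", suggested_cluster_name(count))
```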

Important

You should install and configure awscli v2 on your machine (https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html). The above bash script uses aws2 (awscli v2) to create the EMR cluster. The aws v1 command throws an error on BidPrice.

VSdata: Synthetic genotype and phenotype generated by VariantSpark simulation module

We used SimulateData.sh to create synthetic genotypes and phenotypes with the VariantSpark gen-features and gen-labels commands.

We considered different numbers of samples and SNPs in the datasets.

Samples: 1,000 - 10,000 - 100,000 (1K, 10K, 100K)
SNPs: 100,000 - 1,000,000 - 10,000,000 - 100,000,000 (100K, 1M, 10M, 100M)

We skip the dataset with 100K samples and 100M SNPs because its size would be extremely large.
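
As an illustration, the resulting set of datasets can be enumerated as below. This is a sketch of the grid described above, not an excerpt from SimulateData.sh; the constant and function names are hypothetical.

```python
from itertools import product

SAMPLES = [1_000, 10_000, 100_000]
SNPS = [100_000, 1_000_000, 10_000_000, 100_000_000]

# The largest combination is skipped, as stated in the text.
EXCLUDED = {(100_000, 100_000_000)}


def dataset_grid():
    """Return every (samples, SNPs) combination that is actually generated."""
    return [(s, v) for s, v in product(SAMPLES, SNPS) if (s, v) not in EXCLUDED]


if __name__ == "__main__":
    for samples, snps in dataset_grid():
        cells = samples * snps
        print(f"{samples:>7,} samples x {snps:>11,} SNPs = {cells:,} genotype cells")
```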

Unfortunately, gen-features and gen-labels do not work in the latest version of VariantSpark. Follow the instructions here to create an EC2 instance with a suitable version of VariantSpark installed; use an r4.16xlarge instance and allocate a 1000GB EBS volume. Then run this script on that EC2 instance.

Parameters: You should modify the script and manually change this parameter:

S3: Path to the S3 folder (or bucket) to store results.

Genotype data are generated in parquet format using the VariantSpark gen-features command. Genotypes are drawn at random from a uniform distribution, so the values 0, 1 and 2 (representing the 0/0, 0/1 and 1/1 genotypes respectively) are equally likely.

SNP id: v0 ... vn
Sample id: s0 ... sm
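
The sampling scheme can be sketched in a few lines of Python. This is only an illustration of uniform genotype sampling with the id naming shown above; it does not reproduce gen-features, and the row-per-SNP layout, the seed and the function name are assumptions.

```python
import random


def simulate_genotypes(n_snps, n_samples, seed=0):
    """Draw each genotype uniformly from {0, 1, 2}.

    Assumed layout: one row per SNP (v0, v1, ...) and one column per sample
    (s0, s1, ...). The layout gen-features actually writes may differ.
    """
    rng = random.Random(seed)
    header = [""] + [f"s{j}" for j in range(n_samples)]
    rows = [
        [f"v{i}"] + [rng.choice((0, 1, 2)) for _ in range(n_samples)]
        for i in range(n_snps)
    ]
    return header, rows


if __name__ == "__main__":
    header, rows = simulate_genotypes(n_snps=3, n_samples=5)
    print(",".join(header))
    for row in rows:
        print(",".join(str(x) for x in row))
```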

VariantSpark importance analysis (RandomForest training) is slow when using parquet format because of a parallelisation problem. We have therefore converted the parquet files to csv.bz2 files (comma separated and bzip2 compressed). See the description below for details.

Phenotype data are generated in csv format using the VariantSpark gen-labels command. The phenotype is simulated from 5 randomly selected SNPs (truth SNPs) with equal weight (=1) and a noise variable with mean=0.5 (-gm) and standard deviation=0.5 (-gs).

The phenotype file columns are:

"": Sample name
"label": Binary phenotype (0 or 1)
"pheno": Continuous phenotype
The next 5 columns contain the genotypes of the truth SNPs for all samples; each column name is the SNP id

The script also stores all simulation logs in the S3 path.
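
To show how this column layout can be consumed, here is a small reader built on the standard csv module. The example rows, SNP ids and values are made up for illustration and do not come from the generated datasets; the function name is hypothetical.

```python
import csv
import io

# Made-up example in the column layout described above (not real output).
EXAMPLE = (
    '"","label","pheno","v3","v17","v42","v58","v99"\n'
    '"s0",1,2.73,1,0,2,0,1\n'
    '"s1",0,0.41,0,0,0,1,0\n'
)


def read_phenotypes(text):
    """Parse phenotype csv text into (truth SNP ids, {sample: record})."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    truth_snps = header[3:]  # columns after "", "label", "pheno"
    records = {}
    for row in reader:
        records[row[0]] = {
            "label": int(row[1]),
            "pheno": float(row[2]),
            "truth": dict(zip(truth_snps, (int(g) for g in row[3:]))),
        }
    return truth_snps, records


if __name__ == "__main__":
    snps, records = read_phenotypes(EXAMPLE)
    print(snps)
    print(records["s0"])
```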

VSdata: convert parquet file to csv.bz2

Here we explain how to convert synthetic genotype data (generated by VariantSpark gen-features) from parquet to csv.bz2 files.

The reason for this conversion is a parallelisation problem with parquet data that prevents VariantSpark from using all the computational resources of the cluster.

Genotype Conversion:

To convert a parquet file to a csv file, we first use parquet-tools to extract the raw data.

    git clone https://github.com/apache/parquet-mr
    cd parquet-mr
    mvn package -pl parquet-tools -am -Plocal

This requires the thrift compiler (installation instructions are in the parquet-mr readme).

Then we use a C++ program, convert.cpp, to form the final csv files and compress them to csv.bz2 files with lbzip2, which can be installed with the package manager.

Compile convert.cpp with

    g++ -O3 -o convert convert.cpp

A bash script to handle the actual conversion process can be found at convert.sh. Some variables (namely BUCKETPREFIX, BASENAME, PARQUETDIR and PARQUETTOOLS) may need to be changed, depending on the parquet directory structure and parquet-tools version. The environment must have read/write permissions to the S3 location containing the parquet data.

Run the script as ./convert.sh samplenum rownum. It requires at least 18 threads to run efficiently (multiple instances can safely be run in parallel). It takes around 7 hours to completely process a 10,000 x 100,000,000 parquet directory, with run time being approximately linear in the total number of cells. The output is placed in the same location as the target parquet directory, with “.parquet” replaced with “.csv.bz2”.
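
Assuming the linear scaling holds, the figures above give a rough way to budget conversion time for the other datasets and to predict the output name. This is a back-of-the-envelope sketch rather than a measured benchmark, and the path in the usage example is hypothetical.

```python
# Reference point from the text: 10,000 x 100,000,000 cells take about 7 hours.
REFERENCE_CELLS = 10_000 * 100_000_000
REFERENCE_HOURS = 7.0


def estimated_hours(n_samples, n_snps):
    """Linear extrapolation of conversion time from the reference dataset."""
    return REFERENCE_HOURS * (n_samples * n_snps) / REFERENCE_CELLS


def output_name(parquet_path):
    """Replace the trailing '.parquet' with '.csv.bz2', as described above."""
    suffix = ".parquet"
    if not parquet_path.endswith(suffix):
        raise ValueError(f"expected a path ending in {suffix!r}: {parquet_path}")
    return parquet_path[: -len(suffix)] + ".csv.bz2"


if __name__ == "__main__":
    print(f"{estimated_hours(10_000, 1_000_000):.2f} hours")
    print(output_name("my-prefix/example.parquet"))  # hypothetical path
```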

Note that the genotype datasets produced on 20/1/20 use “s0”, “s1”, etc. instead of “s_0”, “s_1”. To use them with their phenotype files, follow the instructions below:

In the phenotype files (.pheno.csv), samples are encoded as “s_0”, “s_1”, … However, in the csv.bz2 genotype files samples are encoded as “s0”, “s1”, … Because the genotype files are very large and reprocessing them takes a long time, we instead remove “_” from the sample ids in the phenotype files and save the result as .pheno.no_.csv.

The script to fix phenotype file: fixPheno.sh
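
For readers who prefer Python, the same fix can be sketched as follows. This is not a transcription of fixPheno.sh; it assumes the sample id is the first column of the phenotype file and leaves the header row untouched. The function names are hypothetical.

```python
import csv
import io


def strip_underscores_from_ids(pheno_csv_text):
    """Remove '_' from the sample id (first column) of every data row."""
    reader = csv.reader(io.StringIO(pheno_csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # header row is copied unchanged
    for row in reader:
        row[0] = row[0].replace("_", "")
        writer.writerow(row)
    return out.getvalue()


def fixed_name(pheno_path):
    """Map 'x.pheno.csv' to 'x.pheno.no_.csv', the naming used in the text."""
    suffix = ".pheno.csv"
    if not pheno_path.endswith(suffix):
        raise ValueError(f"expected a path ending in {suffix!r}: {pheno_path}")
    return pheno_path[: -len(suffix)] + ".pheno.no_.csv"


if __name__ == "__main__":
    text = '"","label","pheno"\n"s_0",1,0.5\n"s_12",0,1.5\n'
    print(strip_underscores_from_ids(text), end="")
```

Note that csv.writer only quotes fields when needed, so the quoting in the output may differ from the input even though the values are the same.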

VSdata: Process data with VariantSpark

ProcessData.sh runs VariantSpark with different parameters on the datasets generated above. It consists of 7 sets of experiments, E0 to E6.

This script adds jobs as steps to an existing cluster. See above to learn how to create a cluster.

Parameters: You should modify the script and manually change these parameters:

S3: Path to the S3 folder (or bucket) where the dataset is stored (leave it as it is if you wish to process the dataset we have generated)
S3R: Path to the S3 folder (or bucket) to store results
clusterID: The cluster id ("j-XXXXXXXXXXXXX") you get when you create the cluster. You can also find the cluster ID in the AWS EMR console

Default parameters and dataset:

numTree=1000 (number of trees)
mtryFraction="0.1" (mtry as fraction of number of SNPs)
maxDepth=15 (max depth of tree)
minSample=50 (min number of samples in node to be divided)
batchSize=100 (number of trees to grow in parallel)
numVariant=1,000,000 (dataset number of SNPs)
numSample=10,000 (dataset number of samples)

Experiments:

E0: Run VariantSpark with unlimited maxDepth and minSample=0 (split nodes until they are pure)
E1: Vary maxDepth [3 5 7 9 11 13 15 20 25 100] - (minSample=0)
E2: Vary minSample [5 10 50 100 500 1000] - (unlimited maxDepth)
E3: Vary number of trees [100 200 400 800 1600]
E4: Vary batchSize [10 50 100 500 1000]
E5: Vary mtry [10000 5000 1000 500 100 50 10]. Note that mtry=10,000 is equivalent to mtryFraction=0.1 for this dataset with 1,000,000 SNPs and mtry=1000 is the default mtry (sqrt(1,000,000))
E6: Vary dataset [all generated datasets]. For the 2 largest generated datasets (100K samples, 10M SNPs) and (10K samples, 100M SNPs) we set the number of trees to 100 due to excessive computational time.
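
The single-parameter sweeps E0 to E5 can be written out as an explicit list of run configurations, which is handy for checking how many steps will be submitted. This sketch only mirrors the lists above and does not reproduce ProcessData.sh. Representing "unlimited" maxDepth as None and dropping mtryFraction when mtry is set are assumptions. E6 is left out because it varies the dataset rather than a parameter.

```python
DEFAULTS = {
    "numTree": 1000,
    "mtryFraction": 0.1,
    "maxDepth": 15,
    "minSample": 50,
    "batchSize": 100,
    "numVariant": 1_000_000,
    "numSample": 10_000,
}

# experiment -> (swept parameter, values, fixed overrides); None means unlimited.
SWEEPS = {
    "E1": ("maxDepth", [3, 5, 7, 9, 11, 13, 15, 20, 25, 100], {"minSample": 0}),
    "E2": ("minSample", [5, 10, 50, 100, 500, 1000], {"maxDepth": None}),
    "E3": ("numTree", [100, 200, 400, 800, 1600], {}),
    "E4": ("batchSize", [10, 50, 100, 500, 1000], {}),
    "E5": ("mtry", [10000, 5000, 1000, 500, 100, 50, 10], {}),
}


def experiment_runs():
    """Return a list of (experiment name, configuration dict) pairs."""
    runs = [("E0", {**DEFAULTS, "maxDepth": None, "minSample": 0})]
    for name, (param, values, overrides) in SWEEPS.items():
        for value in values:
            config = {**DEFAULTS, **overrides, param: value}
            if param == "mtry":
                # Assumption: an explicit mtry replaces mtryFraction.
                config.pop("mtryFraction")
            runs.append((name, config))
    return runs


if __name__ == "__main__":
    for name, config in experiment_runs():
        print(name, config)
```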

Get RandomForest Statistics

GetRFstat.py takes the RF model in JSON format (generated by VariantSpark) and prints some statistics about the model, including:

Average depth of a tree (the deepest branch of each tree)
Average depth of a branch (all branches of all trees)
Average number of nodes per tree (excluding leaf nodes)
Average number of leaf nodes per tree
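
The four statistics can be computed with one recursive pass per tree. The sketch below is not GetRFstat.py and does not use the VariantSpark JSON schema. It assumes a hypothetical representation in which an internal node is a dict with "left" and "right" children, a leaf is a dict without them, and the root is at depth 0.

```python
def walk(node, depth=0):
    """Return (list of leaf depths, number of internal nodes) for a subtree."""
    if "left" not in node:  # hypothetical schema: leaves have no children
        return [depth], 0
    left_depths, left_internal = walk(node["left"], depth + 1)
    right_depths, right_internal = walk(node["right"], depth + 1)
    return left_depths + right_depths, 1 + left_internal + right_internal


def forest_stats(trees):
    """Compute the four averages listed above for a list of trees."""
    per_tree = [walk(tree) for tree in trees]
    all_depths = [d for depths, _ in per_tree for d in depths]
    n = len(trees)
    return {
        "avg_tree_depth": sum(max(depths) for depths, _ in per_tree) / n,
        "avg_branch_depth": sum(all_depths) / len(all_depths),
        "avg_internal_nodes": sum(internal for _, internal in per_tree) / n,
        "avg_leaf_nodes": sum(len(depths) for depths, _ in per_tree) / n,
    }


if __name__ == "__main__":
    leaf = {}
    tree_a = {"left": leaf, "right": {"left": leaf, "right": leaf}}
    tree_b = {"left": leaf, "right": leaf}
    print(forest_stats([tree_a, tree_b]))
```

Very deep trees (for example those grown with unlimited maxDepth) may need an iterative traversal instead, since Python limits recursion depth.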

VSdata: Aggregate results

Owner

Name: The Australian e-Health Research Centre
Login: aehrc
Kind: organization

Website: https://aehrc.com
Twitter: ehealthresearch
Repositories: 101
Profile: https://github.com/aehrc

The Australian e-Health Research Centre (AEHRC) is CSIRO’s digital health research program.
