https://github.com/aehrc/variantspark_gigascience
Scripts for the GigaScience publication
Basic Info
- Host: GitHub
- Owner: aehrc
- License: other
- Language: HTML
- Default Branch: master
- Size: 47.1 MB
Statistics
- Stars: 1
- Watchers: 8
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
VariantSpark_Gigascience
Creating an EMR cluster
We use the CreateCluster.sh bash script to create an EMR cluster using aws-cli. You are responsible for terminating your cluster once you finish your process. We have not yet created a CloudFormation template for this cluster. The generated cluster has the following characteristics:
- Use Spot pricing
- Use Uniform Instance
- Without EC2 keypair
- Master EC2 instance type: r4.2xlarge 8 vCPUs 61GB RAM
- Core EC2 instance type: r4.4xlarge 16 vCPUs 122GB RAM
- VariantSpark installed (through Bootstrap)
If you would like to change the above configuration (e.g. to use On-Demand pricing or Spot Fleet), you may create this cluster, clone it in the AWS console, and change the parameters there.
Parameters: You should modify the script and manually change these parameters:
- ClusterName: name of the cluster (e.g. C64 or C512)
- InstanceCount: number of core instances (e.g. 4 or 32)
- LogURI: Path to S3 folder (or bucket) to store EMR logs
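The example values above (C64 with 4 core instances, C512 with 32) are consistent with the name encoding the total number of core vCPUs at 16 vCPUs per r4.4xlarge node. The README does not state this convention, so the sketch below treats it as an assumption; `cluster_name_for` is a hypothetical helper, not part of CreateCluster.sh.

```python
# vCPUs per core node (r4.4xlarge), taken from the cluster description above.
CORE_VCPUS_PER_INSTANCE = 16


def cluster_name_for(instance_count: int) -> str:
    """Return a name such as 'C64' for a given number of core instances.

    Assumption: the 'C<number>' naming reflects the total core vCPUs.
    """
    if instance_count < 1:
        raise ValueError("InstanceCount must be at least 1")
    return f"C{instance_count * CORE_VCPUS_PER_INSTANCE}"


if __name__ == "__main__":
    for count in (4, 32):
        print(count, "core instances ->", cluster_name_for(count))
```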
Important
You should install and configure awscli v2 on your machine (https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html). The above bash script uses aws2 (awscli v2) to create the EMR cluster. The aws v1 command throws an error on BidPrice.
VSdata: Synthetic genotype and phenotype generated by VariantSpark simulation module
We used SimulateData.sh to create synthetic genotypes and phenotypes using the VariantSpark gen-features and gen-labels commands.
We considered different numbers of samples and SNPs in the datasets.
- Samples: 1,000 - 10,000 - 100,000 (1K, 10K, 100K)
- SNPs: 100,000 - 1,000,000 - 10,000,000 - 100,000,000 (100K, 1M, 10M, 100M)
We skip the dataset with 100K samples and 100M SNPs because its size was extremely large.
Unfortunately, gen-features and gen-labels do not work in the latest version of VariantSpark. Follow the instructions here to create an EC2 instance with a suitable version of VariantSpark installed; use an r4.16xlarge instance and allocate a 1000GB EBS volume. Then run this script on that EC2 instance.
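As a rough illustration, the dataset grid described above can be enumerated as follows. This is only a sketch of the combinations listed in this README; the function names are hypothetical and SimulateData.sh may organise the loop differently.

```python
from itertools import product

SAMPLES = [1_000, 10_000, 100_000]
SNPS = [100_000, 1_000_000, 10_000_000, 100_000_000]
# The 100K samples x 100M SNPs combination is not generated.
SKIPPED = {(100_000, 100_000_000)}


def dataset_grid():
    """Return every (samples, snps) pair that is generated."""
    return [(s, v) for s, v in product(SAMPLES, SNPS) if (s, v) not in SKIPPED]


def short(n: int) -> str:
    """Format a count the way the README abbreviates it (1K, 10M, ...)."""
    for unit, label in ((1_000_000, "M"), (1_000, "K")):
        if n >= unit and n % unit == 0:
            return f"{n // unit}{label}"
    return str(n)


if __name__ == "__main__":
    for samples, snps in dataset_grid():
        cells = samples * snps
        print(f"{short(samples)} samples x {short(snps)} SNPs = {cells:,} genotypes")
```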
Parameters: You should modify the script and manually change this parameter:
- S3: Path to S3 folder (or bucket) to store results.
Genotype data are generated in parquet format using the VariantSpark gen-features command. Genotypes are drawn at random from a uniform distribution, so each value is equally likely (0, 1 and 2 represent the 0/0, 0/1 and 1/1 genotypes respectively).
- SNP id: v0 ... vn
- Sample id: s0 ... sm
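The following minimal sketch shows what uniformly drawn genotypes with these ids look like. It is an illustration in plain Python, not the VariantSpark gen-features implementation; the function name, the seed, and the in-memory dictionary layout are assumptions made for the example.

```python
import random


def simulate_genotypes(n_snps: int, n_samples: int, seed: int = 0):
    """Draw each genotype uniformly from {0, 1, 2}.

    Returns (sample_ids, rows), where rows maps a SNP id to one value per sample.
    """
    rng = random.Random(seed)
    sample_ids = [f"s{j}" for j in range(n_samples)]
    rows = {
        f"v{i}": [rng.choice((0, 1, 2)) for _ in range(n_samples)]
        for i in range(n_snps)
    }
    return sample_ids, rows


if __name__ == "__main__":
    samples, genotypes = simulate_genotypes(n_snps=3, n_samples=5)
    print("," + ",".join(samples))
    for snp_id, values in genotypes.items():
        print(snp_id + "," + ",".join(map(str, values)))
```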
VariantSpark importance analysis (RandomForest training) is slow when using parquet format (due to a parallelisation problem). We have converted the parquet files to csv.bz2 files (comma separated and bzip2 compressed). See the description below for details.
Phenotype data are generated using the VariantSpark gen-labels command in csv format. The phenotype is simulated based on 5 randomly selected SNPs (truth SNPs) with equal weight (=1) and a noise variable with mean=0.5 (-gm) and standard deviation=0.5 (-gs).
The phenotype file columns are:
- "": Sample name
- "label": Binary phenotype (0 or 1)
- "pheno": Continuous phenotype
- The next 5 columns include genotypes of the truth SNPs for all samples where the column name is the SNP id
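A rough sketch of such a phenotype simulation is shown below. It is not the gen-labels implementation: the additive combination of the truth SNPs and the median cut-off used to derive the binary label are assumptions made for illustration, since this README does not describe how VariantSpark computes either column.

```python
import random
import statistics


def simulate_phenotype(genotypes, n_truth=5, noise_mean=0.5, noise_sd=0.5, seed=0):
    """genotypes maps a SNP id to a list of 0/1/2 values, one per sample.

    Returns (truth_snps, pheno, label).
    """
    rng = random.Random(seed)
    truth = rng.sample(sorted(genotypes), n_truth)  # randomly selected truth SNPs
    n_samples = len(next(iter(genotypes.values())))
    # Assumption: additive effect, each truth SNP has weight 1, plus Gaussian noise.
    pheno = [
        sum(genotypes[snp][j] for snp in truth) + rng.gauss(noise_mean, noise_sd)
        for j in range(n_samples)
    ]
    # Assumption: the binary label splits the continuous phenotype at its median.
    cutoff = statistics.median(pheno)
    label = [1 if value > cutoff else 0 for value in pheno]
    return truth, pheno, label


if __name__ == "__main__":
    toy = {f"v{i}": [(i + j) % 3 for j in range(10)] for i in range(8)}
    truth, pheno, label = simulate_phenotype(toy)
    print("truth SNPs:", truth)
    for j, (p, l) in enumerate(zip(pheno, label)):
        print(f"s{j},{l},{p:.3f}")
```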
The script also stores all simulation logs in the S3 path.
VSdata: convert parquet file to csv.bz2
Here we explain how to convert the synthetic genotype data (generated by VariantSpark gen-features) from parquet to csv.bz2 files.
The reason for this conversion is a problem in the parallelisation of parquet data, which prevents the computational resources of the cluster from being fully utilised.
Genotype Conversion:
To convert parquet file to csv file, we first use parquet-tools to extract the raw data.
sh
git clone https://github.com/apache/parquet-mr
cd parquet-mr
mvn package -pl parquet-tools -am -Plocal
This requires the thrift compiler (installation instructions in the parquet-mr readme)
Then we use a C++ program, convert.cpp, to build the final csv files, and compress them to csv.bz2 with lbzip2, which can be installed with the package manager.
Compile convert.cpp with
sh
g++ -O3 -o convert convert.cpp
A bash script to handle the actual conversion process can be found at convert.sh. Some variables (namely BUCKETPREFIX, BASENAME, PARQUETDIR and PARQUETTOOLS) may need to be changed, depending on the parquet directory structure and parquet-tools version. The environment must have read/write permissions to the S3 location containing the parquet data.
The script is invoked as ./convert.sh samplenum rownum and requires at least 18 threads to run efficiently (multiple instances can safely be run in parallel). It takes around 7 hours to completely process a 10,000 x 100,000,000 parquet directory, and the run time is approximately linear in the total number of cells. The output is placed in the same location as the target parquet directory, with “.parquet” replaced by “.csv.bz2”.
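Taking the linear scaling at face value, a back-of-the-envelope estimate for the other datasets can be sketched as below. The 7-hour reference point comes from the text above; everything else, including the function names and the example S3 path, is hypothetical, and real run times will depend on the instance and on S3 throughput.

```python
# Reference point from the README: 10,000 x 100,000,000 cells take about 7 hours.
REFERENCE_CELLS = 10_000 * 100_000_000
REFERENCE_HOURS = 7.0


def estimated_hours(n_samples: int, n_snps: int) -> float:
    """Scale the reference run time linearly with the number of cells."""
    return REFERENCE_HOURS * (n_samples * n_snps) / REFERENCE_CELLS


def output_location(parquet_path: str) -> str:
    """Replace a trailing '.parquet' with '.csv.bz2', as the script's output naming does."""
    path = parquet_path.rstrip("/")
    if not path.endswith(".parquet"):
        raise ValueError("expected a path ending in .parquet")
    return path[: -len(".parquet")] + ".csv.bz2"


if __name__ == "__main__":
    print(f"{estimated_hours(10_000, 1_000_000):.2f} hours")
    # Hypothetical bucket and file name, for illustration only.
    print(output_location("s3://example-bucket/genotypes.parquet"))
```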
Note that the datasets produced on 20/1/20 use “s0”, “s1” etc. instead of “s_0”, “s_1”. To use them with their phenotype files, follow the instructions below:
In the phenotype files (.pheno.csv), samples are encoded as “s_0”, “s_1”, … However, in the csv.bz2 genotype files, samples are encoded as “s0”, “s1”, … Since the genotype files are very large and take a long time to process, we instead remove “_” from the sample ids in the phenotype files and save the result as (.pheno.no_.csv).
The script to fix phenotype file: fixPheno.sh
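The renaming itself is simple: strip “_” from the sample id column and leave every other column untouched. The sketch below shows the idea in Python; it is not the content of fixPheno.sh, and the function name and in-memory example are hypothetical.

```python
import csv
import io


def strip_underscores_from_sample_ids(src, dst):
    """Copy a phenotype csv, removing '_' from the first (sample id) column only."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # header row is copied unchanged
    for row in reader:
        if row:  # skip blank lines
            row[0] = row[0].replace("_", "")
            writer.writerow(row)


if __name__ == "__main__":
    original = io.StringIO('"",label,pheno\ns_0,1,2.5\ns_1,0,0.3\n')
    fixed = io.StringIO()
    strip_underscores_from_sample_ids(original, fixed)
    print(fixed.getvalue(), end="")
```

For real files, the same function can be called with two handles opened with `newline=""`, writing to a name ending in .pheno.no_.csv.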
VSdata: Process data with VariantSpark
ProcessData.sh runs VariantSpark with different parameters on the datasets generated above. It consists of 7 sets of experiments, E0 to E6.
This script adds jobs as steps to an existing cluster. See above to learn how to create a cluster.
Parameters: You should modify the script and manually change these parameters:
- S3: Path to the S3 folder (or bucket) where the dataset is stored (leave it as it is if you wish to process the datasets we have generated)
- S3R: Path to the S3 folder (or bucket) to store results
- clusterID: the cluster id ("j-XXXXXXXXXXXXX") you get when you create the cluster. You can also find the cluster ID in the AWS EMR console
Default parameters and dataset:
- numTree=1000 (number of trees)
- mtryFraction="0.1" (mtry as fraction of number of SNPs)
- maxDepth=15 (max depth of tree)
- minSample=50 (min number of samples in node to be divided)
- batchSize=100 (number of trees to grow in parallel)
- numVariant=1,000,000 (dataset number of SNPs)
- numSample=10,000 (dataset number of samples)
Experiments:
- E0: run VariantSpark with unlimited maxDepth and minSample=0 (split nodes until they are pure)
- E1: Vary maxDepth [3 5 7 9 11 13 15 20 25 100] - (minSample=0)
- E2: Vary minSample [5 10 50 100 500 1000] - (unlimited maxDepth)
- E3: Vary number of Trees [100 200 400 800 1600]
- E4: Vary batchSize [10 50 100 500 1000]
- E5: Vary mtry [10000 5000 1000 500 100 50 10]. Note that for this dataset with 1,000,000 SNPs, mtry=10,000 corresponds to a fraction of 0.01 of the SNPs (mtryFraction=0.1 would correspond to mtry=100,000), and mtry=1000 is the default mtry (sqrt(1,000,000))
- E6: Vary dataset [all generated datasets]. For the 2 largest generated datasets (100K samples, 10M SNPs) and (10K samples, 100M SNPs), we set the number of trees to 100 due to excessive computational time.
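The parameter sweeps E0 to E5 can be pictured as a list of configurations derived from the defaults. The sketch below only enumerates them; it does not submit anything to EMR. Representing "unlimited" as `None` and passing mtry as a separate key are assumptions made for the example, since ProcessData.sh may encode these differently.

```python
DEFAULTS = {
    "numTree": 1000,
    "mtryFraction": 0.1,
    "maxDepth": 15,
    "minSample": 50,
    "batchSize": 100,
}


def build_runs():
    """Return (experiment, config) pairs for E0 to E5 on the default dataset."""
    runs = []

    def add(experiment, **overrides):
        config = dict(DEFAULTS)
        config.update(overrides)
        runs.append((experiment, config))

    add("E0", maxDepth=None, minSample=0)  # None stands for "unlimited" (assumption)
    for depth in (3, 5, 7, 9, 11, 13, 15, 20, 25, 100):
        add("E1", maxDepth=depth, minSample=0)
    for min_sample in (5, 10, 50, 100, 500, 1000):
        add("E2", minSample=min_sample, maxDepth=None)
    for trees in (100, 200, 400, 800, 1600):
        add("E3", numTree=trees)
    for batch in (10, 50, 100, 500, 1000):
        add("E4", batchSize=batch)
    for mtry in (10000, 5000, 1000, 500, 100, 50, 10):
        add("E5", mtry=mtry, mtryFraction=None)  # explicit mtry replaces the fraction
    return runs


if __name__ == "__main__":
    runs = build_runs()
    print(len(runs), "runs in total")
    for experiment, config in runs[:3]:
        print(experiment, config)
```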
Get RandomForest Statistics
GetRFstat.py takes the RF model in JSON format (generated by VariantSpark) and prints some statistics about the model, including:
- Average depth of a tree (the deepest branch of each tree)
- Average depth of a branch (all branches of all trees)
- Average number of nodes per tree (excluding leaf nodes)
- Average number of leaf nodes per tree
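To make these four statistics concrete, the sketch below computes them for a toy forest. The JSON layout used here (nested objects with optional "left" and "right" children) is a hypothetical schema chosen for illustration; it is not the VariantSpark model format, and this is not the code of GetRFstat.py. Depth is counted as the number of splits between the root and a leaf.

```python
import json


def tree_summary(root):
    """Return (leaf_depths, internal_node_count) for one tree."""
    leaf_depths, internal = [], 0
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        children = [node[k] for k in ("left", "right") if node.get(k) is not None]
        if children:
            internal += 1
            stack.extend((child, depth + 1) for child in children)
        else:
            leaf_depths.append(depth)
    return leaf_depths, internal


def forest_stats(trees):
    """Compute the four averages listed above for a list of trees."""
    summaries = [tree_summary(tree) for tree in trees]
    n_trees = len(summaries)
    all_depths = [d for depths, _ in summaries for d in depths]
    return {
        "avg_tree_depth": sum(max(depths) for depths, _ in summaries) / n_trees,
        "avg_branch_depth": sum(all_depths) / len(all_depths),
        "avg_internal_nodes": sum(internal for _, internal in summaries) / n_trees,
        "avg_leaf_nodes": sum(len(depths) for depths, _ in summaries) / n_trees,
    }


if __name__ == "__main__":
    toy_model = json.loads('[{"left": {}, "right": {"left": {}, "right": {}}}, {}]')
    for name, value in forest_stats(toy_model).items():
        print(f"{name}: {value}")
```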
VSdata: Aggregate results
Owner
- Name: The Australian e-Health Research Centre
- Login: aehrc
- Kind: organization
- Website: https://aehrc.com
- Twitter: ehealthresearch
- Repositories: 101
- Profile: https://github.com/aehrc
The Australian e-Health Research Centre (AEHRC) is CSIRO’s digital health research program.