hpc-gvcw

Automated Rice Variant calling workflow for HPC, Cloud and Desktop systems.

https://github.com/ibexcluster/hpc-gvcw

Repository

Basic Info

Host: GitHub
Owner: IBEXCluster
License: gpl-3.0
Language: Shell
Default Branch: main
Size: 93.8 KB

Statistics

Stars: 12
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 2

Created almost 5 years ago · Last pushed about 2 years ago

HPC-GVCW

Principal investigators (PI)

Prof. Rod A. Wing,
Director, Center for Desert Agriculture,
Professor, Biological and Environmental Science and Engineering,
4700 King Abdullah University of Science and Technology,
Thuwal 23955-6900,
Kingdom of Saudi Arabia

For Pipeline support

Authors:

Nagarajan Kathiresan {nagarajan.kathiresan@kaust.edu.sa}
Yong Zhou {yong.zhou@kaust.edu.sa}
Zhichao Yu {2023317110021@webmail.hzau.edu.cn}
Luis F. Rivera Serna {luis.riveraserna@kaust.edu.sa}
Manjula Thimma {manjula.thimma@kaust.edu.sa}
Keerthana Manickam {keerthana9811@gmail.com}
Rod A Wing {rwing@ag.arizona.edu, rod.wing@kaust.edu.sa}

Publication:

DOI: https://doi.org/10.1186/s12915-024-01820-5
PDF available here: https://link.springer.com/content/pdf/10.1186/s12915-024-01820-5.pdf

Computational systems

About Shaheen

The system has 6,174 dual-socket compute nodes based on 16-core Intel Haswell processors running at 2.3 GHz. Each node has 128 GB of DDR4 memory running at 2300 MHz. Overall, the system has a total of 197,568 processor cores and 790 TB of aggregate memory. More information is available at https://www.hpc.kaust.edu.sa/content/shaheen-ii

About Ibex cluster

Ibex is a heterogeneous group of nodes, a mix of AMD and Intel CPUs and Nvidia GPUs with different architectures, which gives users a variety of options to work with. Overall, Ibex is made up of 488+ nodes that together form a heterogeneous cluster, and the workload is managed by the SLURM scheduler. More information is available at https://www.hpc.kaust.edu.sa/ibex

Workflow for Rice Variant Calling

Required Software

The following software versions were used and tested with HPC-GVCW:
1. bwa 0.7.17
2. samtools 1.8
3. gatk 4.1.6.0
4. tabix 0.2.6
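
Before launching the workflow it can help to confirm that the four tools are reachable on the current node. The following is a minimal sketch, not part of the repository: the function name `missing_tools` is hypothetical, and it only checks that each executable is on `PATH`, not that the version matches the list above.

```shell
#!/bin/sh
# Print the names of tools that are NOT found on PATH (empty line if none).
# Assumption: the executables are named bwa, samtools, gatk and tabix.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            # Append with a single space separator.
            missing="${missing}${missing:+ }${tool}"
        fi
    done
    printf '%s\n' "$missing"
}

missing_tools bwa samtools gatk tabix
```

On clusters that use environment modules, an empty result usually means the relevant modules are already loaded.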

Phase #1 - Data pre-processing
The objective of this phase is to obtain clean data from the collected rice genome samples. This includes (a) genome alignment using the BWA-MEM algorithm and (b) fixing mate information, marking duplicates, and assigning read groups for the same set of genomes using the Genome Analysis Toolkit (GATK).
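
The steps of this phase can be sketched as a dry run that prints one command per step instead of executing it. This is an illustration under stated assumptions, not the scripts shipped in the repository: the function name `phase1_commands`, all file names, the thread count, the read-group values, and the exact order of the GATK steps are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch of Phase #1: prints the commands, does not run them.
# Arguments: sample name, reference FASTA, read 1, read 2, thread count.
phase1_commands() {
    sample=$1; ref=$2; r1=$3; r2=$4; threads=$5
    # (a) Align paired-end reads with BWA-MEM and coordinate-sort the result.
    printf 'bwa mem -t %s %s %s %s | samtools sort -@ %s -o %s.sorted.bam -\n' \
        "$threads" "$ref" "$r1" "$r2" "$threads" "$sample"
    # (b) Fix mate information, mark duplicates, then assign read groups.
    printf 'gatk FixMateInformation -I %s.sorted.bam -O %s.fixmate.bam\n' \
        "$sample" "$sample"
    printf 'gatk MarkDuplicates -I %s.fixmate.bam -O %s.dedup.bam -M %s.metrics.txt\n' \
        "$sample" "$sample" "$sample"
    # Read-group values below (lib1, ILLUMINA, unit1) are placeholders.
    printf 'gatk AddOrReplaceReadGroups -I %s.dedup.bam -O %s.rg.bam --RGID %s --RGLB lib1 --RGPL ILLUMINA --RGPU unit1 --RGSM %s\n' \
        "$sample" "$sample" "$sample" "$sample"
}

phase1_commands sampleA ref.fa sampleA_R1.fq.gz sampleA_R2.fq.gz 16
```

Printing the commands first makes it easy to review them, or to write them into one batch job per sample, before any compute time is spent.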

Phase #2 - Variant discovery
The objective of this phase is to call variants per sample and generate gVCF files. Two major steps are required. First, the multiple sorted input files are merged into a single BAM file, and the merged BAM is (re)sorted using SAMtools. Second, SNPs and INDELs are called simultaneously via local de novo assembly of haplotypes in an active region using the GATK tool "HaplotypeCaller". At the end of this phase, each sample has a gVCF output of SNPs and INDELs.
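
The merge, sort, and calling steps can be sketched per sample as follows. As before, this is a dry run that only prints commands; the function name `phase2_commands` and every file name are hypothetical, and the indexing step is an assumption added because HaplotypeCaller expects an indexed BAM.

```shell
#!/bin/sh
# Dry-run sketch of Phase #2: prints the commands, does not run them.
# Arguments: sample name, reference FASTA, then one or more sorted BAM files.
phase2_commands() {
    sample=$1; ref=$2; shift 2
    # Step 1: merge the sorted inputs into one BAM, then (re)sort and index it.
    printf 'samtools merge %s.merged.bam %s\n' "$sample" "$*"
    printf 'samtools sort -o %s.merged.sorted.bam %s.merged.bam\n' "$sample" "$sample"
    printf 'samtools index %s.merged.sorted.bam\n' "$sample"
    # Step 2: call SNPs and INDELs together and emit a per-sample gVCF.
    printf 'gatk HaplotypeCaller -R %s -I %s.merged.sorted.bam -O %s.g.vcf.gz -ERC GVCF\n' \
        "$ref" "$sample" "$sample"
}

phase2_commands sampleA ref.fa sampleA_lane1.bam sampleA_lane2.bam
```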

Phase #3 - Callset refinement
In this phase, we combine all gVCF files from HaplotypeCaller and perform joint genotyping across all samples. This phase is extremely demanding because (i) the samples processed across the cluster nodes in phase #1 and phase #2 are combined into a single file (using GATK CombineGVCFs) before multi-sample joint genotyping is performed (using GATK GenotypeGVCFs), and (ii) the CombineGVCFs and GenotypeGVCFs steps each execute on a single core.
Because these GATK steps run sequentially, assembling genotypes across multiple samples into a single file takes an extremely long time and requires a large amount of memory when no data parallelization is applied. To address these limitations, the latest version of GATK offers a variant intervals feature in the CombineGVCFs and GenotypeGVCFs calls, which allows the work to be split into independent chunks that run in parallel.
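
Interval-based data parallelization can be sketched as two small helpers: one that cuts a chromosome into fixed-size windows, and one that prints the CombineGVCFs and GenotypeGVCFs commands for a single window. This is a sketch under assumptions, not the repository's implementation: the function names, the window size, the toy chromosome length, and the output file names are all hypothetical.

```shell
#!/bin/sh
# Print 1-based inclusive intervals "chrom:start-end" covering 1..length.
make_intervals() {
    chrom=$1; length=$2; size=$3
    start=1
    while [ "$start" -le "$length" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$length" ]; then end=$length; fi
        printf '%s:%s-%s\n' "$chrom" "$start" "$end"
        start=$((end + 1))
    done
}

# Print the two GATK commands for one interval (dry run, nothing is executed).
# Arguments: reference FASTA, interval, then the per-sample gVCF files.
phase3_commands() {
    ref=$1; interval=$2; shift 2
    tag=$(printf '%s' "$interval" | tr ':-' '__')   # file-name-safe label
    vargs=""
    for g in "$@"; do vargs="${vargs} -V ${g}"; done
    printf 'gatk CombineGVCFs -R %s%s -L %s -O combined.%s.g.vcf.gz\n' \
        "$ref" "$vargs" "$interval" "$tag"
    printf 'gatk GenotypeGVCFs -R %s -V combined.%s.g.vcf.gz -L %s -O genotyped.%s.vcf.gz\n' \
        "$ref" "$tag" "$interval" "$tag"
}

# Toy example: a 25 bp "chromosome" cut into 10 bp windows.
make_intervals Chr1 25 10 | while read -r iv; do
    phase3_commands ref.fa "$iv" sampleA.g.vcf.gz sampleB.g.vcf.gz
done
```

Each printed pair of commands is independent of the others, so on a SLURM cluster the pairs could be submitted as separate jobs or as elements of a job array.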

Phase #4 - Variant tables
In this phase, genotype quality is improved by applying variant filters, and the variants in each independent GenotypeGVCFs chunk are separated into SNPs and INDELs. Once all chunks of filtered SNPs and INDELs have been generated, the partial chunks can be combined into a single file using GatherVcfs; it is recommended to assemble them per chromosome. The chromosome-based SNPs and INDELs are then converted into variant tables.
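
The filtering, gathering, and table steps can be sketched as a dry run in the same style. The function names, file names, and table columns are hypothetical, and the filter expressions are placeholder thresholds chosen for illustration; they are an assumption, not the values used by HPC-GVCW.

```shell
#!/bin/sh
# Print the select + filter commands for one chunk and one variant type.
# Arguments: reference FASTA, chunk basename, type (SNP or INDEL), filter rule.
filter_commands() {
    ref=$1; chunk=$2; type=$3; rule=$4
    printf 'gatk SelectVariants -R %s -V %s.vcf.gz --select-type-to-include %s -O %s.%s.vcf.gz\n' \
        "$ref" "$chunk" "$type" "$chunk" "$type"
    printf 'gatk VariantFiltration -R %s -V %s.%s.vcf.gz --filter-expression "%s" --filter-name hard_filter -O %s.%s.filtered.vcf.gz\n' \
        "$ref" "$chunk" "$type" "$rule" "$chunk" "$type"
}

# Print the gather, index and table commands for one chromosome and type.
# Arguments: chromosome, type, then the filtered chunk VCFs in genomic order.
gather_commands() {
    chrom=$1; type=$2; shift 2
    iargs=""
    for v in "$@"; do iargs="${iargs} -I ${v}"; done
    printf 'gatk GatherVcfs%s -O %s.%s.vcf.gz\n' "$iargs" "$chrom" "$type"
    printf 'tabix -p vcf %s.%s.vcf.gz\n' "$chrom" "$type"
    printf 'gatk VariantsToTable -V %s.%s.vcf.gz -F CHROM -F POS -F REF -F ALT -F QUAL -O %s.%s.table\n' \
        "$chrom" "$type" "$chrom" "$type"
}

# Placeholder thresholds, for illustration only.
filter_commands ref.fa genotyped.Chr1_1_10 SNP 'QD < 2.0 || FS > 60.0 || MQ < 40.0'
filter_commands ref.fa genotyped.Chr1_1_10 INDEL 'QD < 2.0 || FS > 200.0'
gather_commands Chr1 SNP genotyped.Chr1_1_10.SNP.filtered.vcf.gz genotyped.Chr1_11_20.SNP.filtered.vcf.gz
```

GatherVcfs expects its inputs in genomic order, which is why the chunk files are passed in the same order in which the intervals were generated.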

Summary of workflow steps across multiple phases

The table below summarizes the bioinformatics tools used at each stage of the workflow, together with the optimal number of CPUs, the data parallelization method, and the input/output file formats for each step.

Owner

Login: IBEXCluster
Kind: user
Company: KAUST

Website: https://www.hpc.kaust.edu.sa/ibex
Repositories: 16
Profile: https://github.com/IBEXCluster

Nagarajan Kathiresan, Computational Scientist, KAUST Supercomputing Core Lab, KAUST, KSA

Citation (CITATION.cff)

Nagarajan Kathiresan; Yong Zhou; Zhichao Yu; Luis F. Rivera; Manjula Thimma; Keerthana Manickam; Rod A. Wing
