germline-shortv

https://github.com/sydney-informatics-hub/germline-shortv

Last synced: 10 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: Sydney-Informatics-Hub
License: gpl-3.0
Language: Shell
Default Branch: master
Size: 220 KB

Statistics

Stars: 2
Watchers: 4
Forks: 3
Open Issues: 5
Releases: 1

Created almost 6 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

Germline-ShortV

⚠️ Notice of workflow deprecation

This workflow is no longer actively supported or maintained. While you are welcome to use the existing code, please note that no further updates, bug fixes, or support will be provided.

For questions or alternative recommendations for University of Sydney staff and students, please get in touch with sih_info@sydney.edu.au. You can find alternatives at WorkflowHub

Thank you for your understanding!

Description

This pipeline is an implementation of the BROAD's Best Practice Workflow for Germline short variant discovery (SNPS + Indels). This implementation is optimised for the National Compute Infrastucture Gadi HPC, utilising scatter-gather parallelism and the nci.parallel utility to enable use of multiple nodes with high CPU or memory efficiency. Scatter-gather parallelism also enables checkpointing and semi-automated re-running of failed tasks.

This workflow requires sample BAM files, which can be generated using the Fastq-to-BAM pipeline. Optimisations for scalability and parallelization have been performed on the human reference genome GRCh38/hg38 + ALT contigs. Germline-ShortV can be applied to other model and non-model organisms (including non-diploid organisms), with some modifications as described below.

The primary steps to this pipeline are:

HaplotypeCaller
GenomicsDBImport
GenotypeGVCFs
Variant Quality Score Recalibration

Most jobs follow a typical pattern which is:

Creating an inputs file using <job>_make_input.sh <path/to/cohort.config>
Adjusting compute resources and submitting your job by qsub <job>_run_parallel.pbs. Benchmarking metrics are available on this page as a guide for compute resources required for your dataset. This runs <job>.sh in parallel for the inputs file created (1 line = 1 task). Default parameters are typically used otherwise you can modify command specific parameters in <job>.sh.
Performing a check using <job>_check.sh on the job by checking for expected output, and/or, checking for error messages in log files. Inputs will be written for failed tasks for job re-submission with <job>_missing_run_parallel.pbs

Human datasets: GRCh38/hg38 + ALT contigs reference

This pipeline has been optimised for BAMs mapped to the GRCh38/hg38 + ALT contigs reference genome. Scatter-gather parallelism has been designed to operate over 3,200 evenly sized genomic intervals (~1Mb in size) across your sample cohort. For each of the primary steps, this means:

HaplotypeCaller:
- The number of scatter tasks = N samples x 3,200 genomic intervals
- 3,200 VCFs are gathered per sample to create a sample.g.vcf.gz
GenomicsDBImport:
- The number of scatter tasks = 3,200 genomic intervals
- Each output is used as input for GenotypeGVCFs
GenotypeGVCFs:
- The number of scatter tasks = 3,200 genomic intervals
- 3,200 VCFs are gathered into a single, co-ordinate sorted cohort.sorted.vcf.gz
Variant Quality Score Recalibration
- Is relatively quick and scattering is not required.
- Inputs are cohort.g.vcf.gz and the final output is written to cohort.recalibrated.vcf.gz

The 3,200 genomic intervals have been ordered from longest to shortest task duration for job maximum efficiency. Some genomic intervals are excluded - these typically include repetitive regions which can significantly impede on compute performance.

Excluded sites

Excluded sites are listed in the Delly group's svrepeattelomere_centromere.bed file. This is included in the References dataset. The BED file contains:

telemeres
centromeres
chrUn (unplaced)
chrUnXXXXXdecoy (decoy)
chrNXXXXXrandom (unlocalized)
chrEBV

Cancer studies

A <cohort>.config file containing both tumour and normal samples can be used to call germline variants on the normal samples only. The make input files will ignore writing inputs for tumour samples. Tumour samples are specified in LabSampleID column the <cohort>.config file if they end in:

-T.* (e.g. Sample1-T, Sample1-T1, Sample1-T100). This is used to indicate different tumour samples belonging to Sample1.
-P.* (e.g. Sample2-P, Sample2-P1, Sample2-P100). This can be used to specify primary tumour samples belonging to Sample2.
-M.* (e.g. Sample3-M, Sample3-M1, Sample3-MCL1). This can be used to specify metastatic tumour samples belonging to Sample3.

A normal sample with LabSampleID as Sample1, Sample1-B, Sample1-N, will be considered "normal" and be included in this germline variant calling pipeline.

Workflow Diagram

User guide

The following will perform germline short variant calling for samples present in <cohort>.config. The scripts use relative paths and the Germline-ShortV is your working directory. Adjust compute resources requested in the .pbs files using the guide provided in each of the PBS job scripts.

This guide is intended for use of the scripts in this repository. For information on GATK's Best Practice Workflow for Germline short variant discovery please see their website.

At minimum, you will need a <cohort>.config and your current working directory should contain:

Final_bams directory with your <sample>.final.bam and <labsampleid>.final.bai files within.
References directory from Fastq-to-BAM

See set up for more details or if you are not using the GRCh38/hg38 + ALT contigs reference genome.

Clone this repository: git clone https://github.com/Sydney-Informatics-Hub/Germline-ShortV.git cd Germline-ShortV

HaplotypeCaller

Run HaplotypeCaller by creating task inputs, adjusting compute resources and submitting the PBS script: sh gatk4_hc_make_input.sh /path/to/cohort.config qsub gatk4_hc_run_parallel.pbs
Check HaplotypeCaller job. The script checks that all sample interval .vcf and .vcf.idx files exist, and for any error files in Logs/GATK4_HC_error_capture. Any failed tasks will be written to Inputs/gatk4_hc_missing.inputs. If there are failed tasks, investigate cause of errors using sample interval log files in Logs/GATK4_HC ``` sh gatk4hccheck.sh /path/to/cohort.config

Only run the job below if there were tasks that failed

qsub gatk4hcmissingrunparallel.pbs 3. Merge HaplotypeCaller per interval VCFs into single sample-level GVCFs creating by creating task inputs, adjusting compute resources and submitting the PBS script: sh gatk4gathervcfsmakeinput.sh /path/to/cohort.config qsub gatk4gathervcfsrunparallel.pbs 4. Check GatherVcfs job. The script checks that all `<sample>.g.vcf.gz` and `<sample>.g.vcf.gz.tbi` files exist, and for any error files in `Logs/GATK4_GatherVCFs_error_capture`. Any failed tasks will be written to `Inputs/gatk4_gathervcfs_missing.inputs`. If there are failed tasks, investigate cause of errors using sample interval log files in `Logs/GATK4_GatherVCFs` sh gatk4gathervcfscheck.sh

Only run the job below if there were tasks that failed

qsub gatk4gathervcfsmissingrunparallel.pbs ``` Sample GVCFs can be used again if you wish perform multi-sample calling with a bigger cohort (e.g. when you sequence new samples), so we recommend backing these up.

GenomicsDBImport

Consolidate interval VCFs using GATK’s GenomicsDBImport by creating task inputs, adjusting compute resources and submitting the PBS script: sh gatk4_genomicsdbimport_make_input.sh /path/to/cohort.config qsub gatk4_genomicsdbimport_run_parallel.pbs`
Check interval GenomicsDBImport databases are present, check log files for errors for each interval present in the scatter list file. Report interval duration and Runtime.totalMemory() to Logs/GATK4_GenomicsDBImport/GATK_duration_memory.txt. ``` sh gatk4genomicsdbimportcheck.sh /path/to/cohort.config

Only run the job below if there were tasks that failed

qsub gatk4genomicsdbimportmissingrunparallel.pbs ``__Tip__ - if you have tasks that failed, it is likely that it needs more memory. The memory and task duration to process each interval is variable to the number and coverage of your samples. Thegatk4genomicsdbimportcheck.shscript will output duration and memory used per task from step 1 inLogs/GATK4GenomicsDBImport/GATKdurationmemory.txtwhich is handy for benchmarking. Compute resources in thegatk4genomicsdbimportmissing.shtask script (run in parallel bygatk4genomicsdbimportmissingrun_parallel.pbs) by default allocates more memory per task, but you may want to make further adjustments to the--java-options -Xmx58g` flag.

GenotypeGVCFs

Perform joint calling using GATK's GenotypeGVCFs by creating task inputs, adjusting compute resources and submitting the PBS script: sh gatk4_genotypegvcfs_make_input.sh /path/to/cohort.config qsub gatk4_genotypegvcfs_run_parallel.pbs
Check all GenotypeGVCFs scatter outputs <cohort>.<interval>.vcf.gz and <cohort>.<interval>.vcf.gz.tbi exists and are not empty. Report interval duration and Runtime.totalMemory() to Logs/GATK4_GenotypeGVCFs/GATK_duration_memory.txt. ``` sh gatk4genotypegvcfscheck.sh /path/to/cohort.config

Only run the job below if there were tasks that failed

qsub gatk4genotypegvcfsmissingrunparallel.pbs ``3. Gather joint-genotyped interval VCFs into a multisample GVCF. By default the jobfs allocation for this job is set to 8GB. GATK uses jobfs as TMPDIR during the gather and sort steps. Users may need to increase the jobfs allocation, depending on their specific dataset. To do this edit the#PBS -l jobfs=` variable at the top of the script. See the NCI Gadi queue limits guide for queue-specific limits.

```

Change the config file name:

cohort=/path/to/cohort.conifg

Adjust the resource requests, then submit:

qsub gatk4gathersort_vcfs.pbs ```

Variant Quality Score Recalibration

The gatk4_vqsr.pbs script runs a series of single core commands that performs the workflow described in GATK's documentation - 1. VQSR: filter a cohort callset with VariantRecalibrator & ApplyVQSR . This includes:

Filtering excessive heterozygous sites. This is only recommended for large cohorts, please read GATK's recommendations for more detail and only use this if it applies to your cohort
Create variant-sites only VCF
Perform VariantRecalibrator for indels and SNPs
Perform ApplyVQSR for indels and SNPs to get final, indexed cohort.final.recalibrated.vcf.gz
Perform CollectVariantCallingMetrics on the final VCFs to get metrics in cohort.final.recalibrated.metrics.variant_calling_detail_metrics

The VQSR script requires R version 3.6.1 with the following packages: ggplot2, gplots, reshape, and gsalib. These are available on NCI Gadi with module load R/3.6.1. GATK's VariantRecalibrator tool is run separately for both SNPs and Indels, it generates an rscript file to aid users in visualising their input data and learned model. See the GATK VariantRecalibrator documentation for more details.

Run these steps editing gatk4_vqsr by: ``` Change cohort=/path/to/cohort.conifg

Adjusting memory, more memory is required for larger cohorts (more variants)

qsub gatk4_vqsr.pbs``

Back up cohort, genotyped, recalibrated GVCFs and varaint calling metrics.

Set up

Tools

The scripts have been tested and are compatible with:

gatk/4.1.2.0
gatk/4.1.8.1
samtools/1.10
R/3.6.1

GATK periodically likes to change flag names and how they are called, please keep this in mind if using a different version of GATK :).

Required inputs

Please ensure that you have the required inputs below. If you have used the Fastq-to-BAM pipeline, you should the inputs set up correctly but you should check your Reference directory for scatter-gathering operations.

Change to the working directory where your final bams were created. The required inputs are:
<cohort>.config file. This is a tab-delimited file including #SampleID LabSampleID SeqCentre Library(default=1) (the same config or a subset of samples from the config used in Fastq-to-BAM). Sample GVCFs and multi-sample VCFs will be created for samples included in <cohort>.config. Output files and directories will be named using the config prefix <cohort>.
- Disclaimer LabSampleID's ending in -T, -P or -M see cancer studies will be ignored by default.
Final_bams directory, containing <labsampleid>.final.bam and <labsampleid>.final.bai files. should match LabSampleID column in your <cohort>.config file. The output of Fastq-to-BAM will be structured this way, otherwise you can easily create this structure by creating symbolic links to your sample BAM and BAI files in Final_bams
A Reference directory, containing an indexed copy of the reference genome you aligned your reads to, and scatter-gather intervals for scattering tasks. This pipeline has been pre-set up and optimised for the GRCh38/hg38 + ALT contigs reference genome (required references are available through Fastq-to-BAM. Please follow the steps below if you have aligned your reads to a different reference genome

### For non-human organisms or for BAMs not aligned to GRCh38/hg38 + ALT contigs

Create a list of intervals for scattering tasks by: * Changing to your Reference directory, containing the reference genome you wish to use * Load the version of GATK 4 that you wish to use, e.g. module load gatk/4.1.8.1 * Run gatk SplitIntervals -R <reference.fa> --scatter-count <number_of_scatter_intervals> -XL <exclude_intervals.bed> -O ShortV_intervals * -XL <exclude_intervals.bed> is optional - it allows exclusion of intervals that can impede on compute efficiency and performance (e.g. centromeres, telomeres, unplaced and unlocalised contigs, etc). * It is recommended to set such that the reference genome size/ = 1000000 (as benchmarking metrics are based of 1Mb sized intervals). I do not recommend going any smaller than that as the tasks become too short in duration and can cause extra overhead. * Run find ShortV_intervals/ -name "*interval_list" -printf "%f\n" > ShortV_intervals/interval.list to create a single text file containing the interval_list files created with gatk SplitIntervals * The order of the intervals in this file are used to order tasks in the job * To optimise the pipeline for your reference genome/species of interest, I would recommend running the pipeline to HaplotypeCaller on a small dataset, and ordering this list from longest to shortest duration, before running this on the full dataset. You can get task duration for a sample using perl gatk4_duration_mem.pl Logs/GATK4_HC/<labsampleid>

Directory structure

Your current working directory should resemble the following at minimum:

bash |-- cohort.config |-- Final_bams (containing indexed sample BAM files) |-- Reference (containing an indexed reference genome to which the BAMs were created with, scatter intervals and optionally known variants for VQSR) |-- Germline-ShortV

You can now refer to the main user guide.

Benchmarking metrics

Summary - 6 human samples

The benchmarking results below were obtained from a human dataset (6 Platinum Genomes) with an average coverage of 92X.

| | CPUsused | Memused | CPUtime | Walltimeused | JobFSused | CPU Efficiency | Memory Efficiency | NCPUSpertask | Service_units | Queue | Parallel tasks | Total tasks | |------------------|-----------|----------|-----------|---------------|------------|----------------|-------------------|----------------|---------------|----------|----------------|-------------| | HaplotypeCaller | 864 | 2514 GB | 472:45:36 | 0:36:50 | 2425550kb | 0.89 | 0.74 | 1 | 1066 | normal | 864 | 19200 | | GatherVCFs | 6 | 47.46GB | 0:52:33 | 0:17:36 | 11.67MB | 0.5 | 0.73 | 1 | 2.61 | normalbw | 6 | 6 | | GenomicsDBImport | 112 | 256.69GB | 29:35:15 | 27:56.7 | 746.52MB | 0.95 | 0.59 | 1 | 39.01 | normalbw | 112 | 6400 | | GenotypeGVCFs | 48 | 99.18GB | 119:37:17 | 3:55:41 | 1.41GB | 0.63 | 0.68 | 1 | 565.64 | normal | 48 | 3200 | | Gather and sort | 1 | 16.43GB | 0:08:26 | 0:11:51 | 0B | 0.71 | 0.50 | 1 | 9.48 | express | 1 | 1 | | VQSR | 1 | 28.73GB | 0:45:13 | 0:45:43 | 1.02MB | 0.99 | 0.58 | 1 | 18.29 | normal | 1 | 1 |

Please note that multi-sample steps GenomicsDBImport and GenotypeGVCFs scale with cohort size.

Summary - 20 human samples

The benchmarking results below were obtained from a human dataset with:

Median coverage: 34.3 X
Mapping rate: 99.8%
Average total raw FASTQ size: 71.7 GB
Average final BAM size: 92.2 GB

| #JobName | Queue | CPUsused | Memused | CPUtime | Walltimeused | JobFSused | CPU Efficiency | MEM Efficiency | Serviceunits | Number of tasks | NCPUs per task | Average mem allowed per task(GB) | Java mem (if applicable) | |------------------------------|---------|-----------|----------|------------|---------------|------------|----------------|----------------|---------------|-----------------|----------------|----------------------------------|--------------------------| | gatk4hc960 | normal | 960 | 3.29TB | 1675:26:10 | 1:46:42 | 20.43MB | 0.98 | 0.88 | 3414.4 | 64000 | 1 | 4 | | | gatk4hc1920 | normal | 1920 | 6.52TB | 1693:46:28 | 0:54:08 | 20.55MB | 0.98 | 0.87 | 3464.53 | 64000 | 1 | 4 | | | gatk4hc2880 | normal | 2880 | 9.15TB | 1744:28:32 | 0:39:07 | 20.55MB | 0.93 | 0.81 | 3755.2 | 64000 | 1 | 4 | | | gatk4hc3840 | normal | 3840 | 11.42TB | 2154:15:26 | 0:40:05 | 20.55MB | 0.84 | 0.76 | 5130.67 | 64000 | 1 | 4 | | | gatk4gathervcfs5 | hugemem | 5 | 160.0GB | 10:22:28 | 2:43:03 | 8.05MB | 0.76 | 1.00 | 40.76 | 20 | 1 | 32 | | | gatk4gathervcfs10 | hugemem | 10 | 320.0GB | 9:56:00 | 1:17:49 | 8.07MB | 0.77 | 1.00 | 38.91 | 20 | 1 | 32 | | | gatk4gathervcfs15 | hugemem | 15 | 480.0GB | 13:28:19 | 1:44:05 | 8.08MB | 0.52 | 1.00 | 78.06 | 20 | 1 | 32 | | | gatk4gathervcfs20 | hugemem | 20 | 486.12GB | 28:21:05 | 1:49:46 | 8.09MB | 0.77 | 0.76 | 109.77 | 20 | 1 | 32 | | | gatk4genomicsdbimport48 | hugemem | 48 | 746.06GB | 84:38:42 | 2:29:53 | 8.92MB | 0.71 | 0.51 | 359.72 | 3200 | 1 | 31.25 | -Xmx40g | | gatk4genomicsdbimport192n | normal | 192 | 519.53GB | 119:33:16 | 3:41:05 | 8.93MB | 0.17 | 0.68 | 1414.93 | 3200 | 4 | 16 | -Xmx40g | | gatk4genomicsdbimport96 | hugemem | 96 | 1.48TB | 155:27:45 | 2:28:12 | 8.92MB | 0.66 | 0.51 | 711.36 | 3200 | 1 | 31.25 | -Xmx40g | | gatk4genomicsdbimport144 | hugemem | 144 | 931.71GB | 122:04:35 | 3:14:25 | 8.92MB | 0.26 | 0.21 | 1399.8 | 3200 | 1 | 31.25 | -Xmx40g | | gatk4genotypegvcfs48 | hugemem | 48 | 326.98GB | 94:44:02 | 5:21:39 | 8.88MB | 0.37 | 0.22 | 771.96 | 3200 | 2 | 62.5 | -Xmx58g | | gatk4genotypegvcfs96 | hugemem | 96 | 701.23GB | 92:53:48 | 2:43:46 | 8.88MB | 0.35 | 0.24 | 786.08 | 3200 | 2 | 62.5 | -Xmx58g | | gatk4genotypegvcfs144 | hugemem | 144 | 1.05TB | 92:09:53 | 1:58:00 | 8.88MB | 0.33 | 0.24 | 849.6 | 3200 | 2 | 62.5 | -Xmx58g |

Step 1. Benchmarking: HaplotypeCaller

The benchmarking metrics below were obtained using human datasets with an average coverage of 35X.

Scalability to compute resources

| Samples processed | CPUsused | CPUspersample | Memused | CPUtime | Walltimeused | JobFSused | Efficiency | Serviceunits(CPUhours) | Queue | Average SUs per sample | Exit_status | |-------------------|-----------|-----------------|----------|------------|---------------|------------|------------|--------------------------|--------|------------------------|-------------| | 20 | 960 | 48 | 3.29TB | 1675:26:10 | 1:46:42 | 20.43MB | 0.98 | 3414 | normal | 171 | 0 | | 20 | 1920 | 96 | 6.52TB | 1693:46:28 | 0:54:08 | 20.55MB | 0.98 | 3465 | normal | 173 | 0 | | 20 | 2880 | 144 | 9.15TB | 1744:28:32 | 0:39:07 | 20.55MB | 0.93 | 3755 | normal | 188 | 0 | | 20 | 3840 | 192 | 11.42TB | 2154:15:26 | 0:40:05 | 20.55MB | 0.84 | 5131 | normal | 257 | 0 |

Scalability to cohort size

| Samples processed | CPUsused | Memused | CPUtime | Walltimeused | JobFSused | Efficiency | Serviceunits(CPUhours) | Queue | Average SUs per sample | Exit_status | |-------------------|-----------|----------|-------------|---------------|------------|------------|--------------------------|--------|------------------------|-------------| | 20 | 1920 | 6.52TB | 1693:46:28 | 0:54:08 | 20.55MB | 0.98 | 3465 | normal | 173 | 0 | | 30 | 2880 | 9.56TB | 2490:01:45 | 0:53:19 | 26.97MB | 0.97 | 5118 | normal | 171 | 0 | | 40 | 3840 | 12.85TB | 3757:02:48 | 1:00:11 | 33.24MB | 0.98 | 7703 | normal | 193 | 0 | | 60 | 5760 | 19.66TB | 7581:22:59 | 2:01:12 | 46.05MB | 0.65 | 23270 | normal | 388 | 271 | | 80 | 7680 | 26.92TB | 11522:42:48 | 2:01:07 | 62.59MB | 0.74 | 31006 | normal | 388 | 271 |

Cite us to support us!

Chew, T., Willet, C., Samaha, G., Menadue, B. J., Downton, M., Sun, Y., Kobayashi, R., & Sadsad, R. (2021). Germline-ShortV (Version 1.0) [Computer software]. https://doi.org/10.48546/workflowhub.workflow.143.1

Acknowledgements

Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities.

Suggested acknowledgements

The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia. The authors acknowledge the use of the National Computational Infrastructure (NCI) supported by the Australian Government and the Sydney Informatics Hub HPC Allocation Scheme, supported by the Deputy Vice-Chancellor (Research), University of Sydney and the ARC LIEF, 2019: Smith, Muller, Thornber et al., Sustaining and strengthening merit-based access to National Computational Infrastructure (LE190100021).

References

GATK 4: Van der Auwera et al. 2013 https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/0471250953.bi1110s43

OpenMPI: Graham et al. 2015 https://dl.acm.org/doi/10.1007/11752578_29

Owner

Name: Sydney Informatics Hub
Login: Sydney-Informatics-Hub
Kind: organization
Email: sih.admin@sydney.edu.au
Location: University of Sydney, Sydney Australia

Website: https://sydney.edu.au/sydney-informatics-hub
Twitter: Sydney_CRF
Repositories: 189
Profile: https://github.com/Sydney-Informatics-Hub

The Sydney Informatics Hub is a Core Research Facility of the University of Sydney, providing training and expertise on research data, analysis and computing.

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Chew"
    given-names: "Tracy"
    orcid: "https://orcid.org/0000-0001-9529-7705"
  - family-names: "Willet"
    given-names: "Cali"
    orcid: "https://orcid.org/0000-0001-8449-1502"
  - family-names: "Samaha"
    given-names: "Georgina"
    orcid: "https://orcid.org/0000-0003-0419-1476"
  - family-names: "Menadue"
    given-names: "Benjamin J"
    orcid: "https://orcid.org/0000-0001-5013-6350"
  - family-names: "Downton"
    given-names: "Matthew"
    orcid: ""
  - family-names: "Kobayashi"
    given-names: "Rika"
    orcid: "https://orcid.org/0000-0002-0672-833X"
  - family-names: "Sadsad"
    given-names: "Rosemarie"
    orcid: "https://orcid.org/0000-0003-2488-953X"
title: "Germline-ShortV"
version: 1.0
doi: 10.48546/workflowhub.workflow.143.1
date-released: 2021-06-18
url: "https://github.com/Sydney-Informatics-Hub/Germline-ShortV"

GitHub Events

Total

Issues event: 1
Push event: 1

Last Year

Issues event: 1
Push event: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 12
Total pull requests: 1
Average time to close issues: about 1 month
Average time to close pull requests: N/A
Total issue authors: 3
Total pull request authors: 1
Average comments per issue: 1.17
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 1
Average comments per issue: 3.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

germline-shortv

Science Score: 75.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

Germline-ShortV

⚠️ Notice of workflow deprecation

Description

Human datasets: GRCh38/hg38 + ALT contigs reference

Excluded sites

Cancer studies

Workflow Diagram

User guide

HaplotypeCaller

Only run the job below if there were tasks that failed

Only run the job below if there were tasks that failed

GenomicsDBImport

Only run the job below if there were tasks that failed

GenotypeGVCFs

Only run the job below if there were tasks that failed

Change the config file name:

Adjust the resource requests, then submit:

Variant Quality Score Recalibration

Adjusting memory, more memory is required for larger cohorts (more variants)

Set up

Tools

Required inputs

Directory structure

Benchmarking metrics

Summary - 6 human samples

Summary - 20 human samples

Step 1. Benchmarking: HaplotypeCaller

Cite us to support us!

Acknowledgements

Suggested acknowledgements

References

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels