sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing

https://gitlab.com/rtourdot/sarek

Science Score: 41.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Repository

Basic Info
  • Host: gitlab.com
  • Owner: rtourdot
  • Default Branch: master
Statistics
  • Stars: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 5 years ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation Codeowners

README.md

nf-core/sarek

An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing

Introduction

Sarek is a workflow designed to detect variants in whole genome or targeted sequencing data. Initially designed for human and mouse data, it can work on any species with a reference genome. Sarek can also handle tumour/normal pairs and can include additional relapse samples.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It comes with Docker containers making installation trivial and results highly reproducible.

It's listed on Elixir - Tools and Data Services Registry and Dockstore.

Quick Start

  1. Install Nextflow (>=20.04.0)

  2. Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (please only use Conda as a last resort; see docs)

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run nf-core/sarek -profile test,<docker/singularity/podman/shifter/charliecloud/conda/institute>

    Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either Docker or Singularity and set the appropriate execution settings for your local compute environment.

  4. Start running your own analysis!

    nextflow run nf-core/sarek -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> --input '*.tsv' --genome GRCh38

See usage docs for all of the available options when running the pipeline.
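The `--input '*.tsv'` argument in step 4 expects a tab-separated sample sheet. The sketch below writes a two-row sheet for one tumour/normal pair and prints the matching command instead of running it. The column order (subject, sex, status, sample, lane, fastq1, fastq2) is an assumption based on the Sarek 2.x usage documentation, not something this README states, and every ID and path is a hypothetical placeholder. Check the usage docs before relying on it.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample sheet; all IDs and paths are placeholders.
# Assumed column order (no header row), to be verified against the usage docs:
#   subject  sex  status (0 = normal, 1 = tumour)  sample  lane  fastq1  fastq2
samplesheet="samples.tsv"

# printf reuses the format string for each group of seven arguments,
# so this writes one row per sample.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  SUBJECT1 XX 0 SUBJECT1_N 1 /data/normal_R1.fastq.gz /data/normal_R2.fastq.gz \
  SUBJECT1 XX 1 SUBJECT1_T 1 /data/tumour_R1.fastq.gz /data/tumour_R2.fastq.gz \
  > "$samplesheet"

# Sanity check: exit non-zero if any row does not have exactly seven fields.
awk -F'\t' 'NF != 7 { bad = 1 } END { exit bad }' "$samplesheet"

# Print the command rather than running it, so the sketch works without Nextflow.
echo "nextflow run nf-core/sarek -profile docker --input $samplesheet --genome GRCh38"
```

Grouping the normal and tumour rows under one subject ID is what lets a sheet like this describe a pair; further samples would be added as extra rows.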

Pipeline Summary

By default, the pipeline currently performs the following:

  • Sequencing quality control (FastQC)
  • Map Reads to Reference (BWA mem)
  • Mark Duplicates (GATK MarkDuplicatesSpark)
  • Base (Quality Score) Recalibration (GATK BaseRecalibrator, GATK ApplyBQSR)
  • Preprocessing quality control (samtools stats)
  • Preprocessing quality control (Qualimap bamqc)
  • Overall pipeline run summaries (MultiQC)
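
To make the default steps above more concrete, here is a rough dry-run sketch of what they look like as standalone tool invocations. It only prints the commands. These are not the pipeline's actual process definitions: the file names, the read group string and the `known_sites` resource are hypothetical, and the real workflow adds sharding, resource settings and further options.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs; none of these names come from the pipeline itself.
sample="sample1"
ref="genome.fasta"
known_sites="dbsnp.vcf.gz"
r1="${sample}_R1.fastq.gz"
r2="${sample}_R2.fastq.gz"

# Print each command instead of executing it, so no tools need to be installed.
run() { printf '%s\n' "$*"; }

default_steps() {
  # Sequencing quality control
  run fastqc "$r1" "$r2"
  # Map reads to the reference; GATK needs a read group, so one is set here
  run "bwa mem -R '@RG\tID:$sample\tSM:$sample\tPL:ILLUMINA' $ref $r1 $r2 | samtools sort -o $sample.bam -"
  # Mark duplicates
  run gatk MarkDuplicatesSpark -I "$sample.bam" -O "$sample.md.bam"
  # Base quality score recalibration: build the table, then apply it
  run gatk BaseRecalibrator -I "$sample.md.bam" -R "$ref" --known-sites "$known_sites" -O "$sample.recal.table"
  run gatk ApplyBQSR -I "$sample.md.bam" -R "$ref" --bqsr-recal-file "$sample.recal.table" -O "$sample.recal.bam"
  # Preprocessing quality control
  run "samtools stats $sample.recal.bam > $sample.stats"
  run qualimap bamqc -bam "$sample.recal.bam" -outdir "qualimap_$sample"
  # Aggregate all reports found under the current directory
  run multiqc .
}

default_steps
```

Running the script lists eight commands in the same order as the summary, with base recalibration split into its two GATK calls and the two QC tools shown separately.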

Documentation

The nf-core/sarek pipeline comes with documentation about the pipeline: usage and output.

Credits

Sarek was developed at the National Genomics Infrastructure and National Bioinformatics Infrastructure Sweden, which are both platforms at SciLifeLab, with the support of The Swedish Childhood Tumor Biobank (Barntumörbanken). QBiC later joined and helped with further development.

Main authors:

Helpful contributors:

Contributions & Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #sarek channel (you can join with this invite), or contact us: Maxime Garcia, Szilvester Juhos

CHANGELOG

Acknowledgements

Barntumörbanken | SciLifeLab
:-:|:-:
National Genomics Infrastructure | National Bioinformatics Infrastructure Sweden
QBiC |

Citations

If you use nf-core/sarek for your analysis, please cite the Sarek article as follows:

Garcia M, Juhos S, Larsson M et al. Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants [version 2; peer review: 2 approved] F1000Research 2020, 9:63 doi: 10.12688/f1000research.16665.2.

You can cite the Sarek Zenodo record for a specific version using the following DOI: 10.5281/zenodo.3476426

In addition, references of tools and data used in this pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Citation (CITATIONS.md)

# nf-core/sarek: Citations

## [nf-core/sarek](https://pubmed.ncbi.nlm.nih.gov/32269765/)

> Garcia MU, Juhos S, Larsson M, Olason PI, Martin M, Eisfeldt J, DiLorenzo S, Sandgren J, Díaz De Ståhl T, Ewels PA, Wirta V, Nistér M, Käller M, Nystedt B. Sarek: A portable workflow for whole-genome sequencing analysis of germline and somatic variants. F1000Res. 2020 Jan 29;9:63. eCollection 2020. doi: 10.12688/f1000research.16665.2. PubMed PMID: 32269765.

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

* [ASCAT](https://pubmed.ncbi.nlm.nih.gov/20837533/)
  > Van Loo P, Nordgard SH, Lingjærde OC, et al.: Allele-specific copy number analysis of tumors. Proc Natl Acad Sci USA. 2010 Sep 28;107(39):16910-5. doi: 10.1073/pnas.1009843107. Epub 2010 Sep 13. PubMed PMID: 20837533; PubMed Central PMCID: PMC2947907.

* [BCFTools](https://pubmed.ncbi.nlm.nih.gov/21903627/)
  > Li H: A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011 Nov 1;27(21):2987-93. doi: 10.1093/bioinformatics/btr509. Epub 2011 Sep 8. PubMed PMID: 21903627; PubMed Central PMCID: PMC3198575.

* [BWA-MEM](https://arxiv.org/abs/1303.3997v2)
  > Li H: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv 1303.3997v2. 2013

* [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

* [Control-FREEC](https://pubmed.ncbi.nlm.nih.gov/22155870/)
  > Boeva V, Popova T, Bleakley K, et al.: Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012; 28(3): 423–5. doi: 10.1093/bioinformatics/btr670. Epub 2011 Dec 6. PubMed PMID: 22155870; PubMed Central PMCID: PMC3268243.

* [GATK](https://pubmed.ncbi.nlm.nih.gov/20644199/)
  > McKenna A, Hanna M, Banks E, et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303. doi: 10.1101/gr.107524.110. Epub 2010 Jul 19. PubMed PMID: 20644199; PubMed Central PMCID: PMC2928508.

* [Manta](https://pubmed.ncbi.nlm.nih.gov/26647377/)
  > Chen X, Schulz-Trieglaff O, Shaw R, et al.: Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016 Apr 15;32(8):1220-2. doi: 10.1093/bioinformatics/btv710. Epub 2015 Dec 8. PubMed PMID: 26647377.

* [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

* [Qualimap 2](https://pubmed.ncbi.nlm.nih.gov/26428292/)
  > Okonechnikov K, Conesa A, García-Alcalde F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016 Jan 15;32(2):292-4. doi: 10.1093/bioinformatics/btv566. Epub 2015 Oct 1. PubMed PMID: 26428292; PubMed Central PMCID: PMC4708105.

* [SAMtools](https://pubmed.ncbi.nlm.nih.gov/19505943/)
  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

* [snpEff](https://pubmed.ncbi.nlm.nih.gov/22728672/)
  > Cingolani P, Platts A, Wang le L, et al.: A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). Apr-Jun 2012;6(2):80-92. doi: 10.4161/fly.19695. PubMed PMID: 22728672; PubMed Central PMCID: PMC3679285.

* [Strelka2](https://pubmed.ncbi.nlm.nih.gov/30013048/)
  > Kim S, Scheffler K, Halpern AL, et al.: Strelka2: fast and accurate calling of germline and somatic variants. Nat Methods. 2018 Aug;15(8):591-594. doi: 10.1038/s41592-018-0051-x. Epub 2018 Jul 16. PubMed PMID: 30013048.

* [TIDDIT](https://pubmed.ncbi.nlm.nih.gov/28781756/)
  > Eisfeldt J, Vezzi F, Olason P, et al.: TIDDIT, an efficient and comprehensive structural variant caller for massive parallel sequencing data. F1000Res. 2017 May 10;6:664. doi: 10.12688/f1000research.11168.2. eCollection 2017. PubMed PMID: 28781756; PubMed Central PMCID: PMC5521161.

* [Trim Galore!](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)

* [VCFTools](https://pubmed.ncbi.nlm.nih.gov/21653522/)
  > Danecek P, Auton A, Abecasis G, et al.: The variant call format and VCFtools. Bioinformatics. 2011 Aug 1;27(15):2156-8. doi: 10.1093/bioinformatics/btr330. Epub 2011 Jun 7. PubMed PMID: 21653522; PubMed Central PMCID: PMC3137218.

* [VEP](https://pubmed.ncbi.nlm.nih.gov/27268795/)
  > McLaren W, Gil L, Hunt SE, et al.: The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. doi: 10.1186/s13059-016-0974-4. PubMed PMID: 27268795; PubMed Central PMCID: PMC4893825.

## R packages

* [R](https://www.R-project.org/)
  > R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

* [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html)
  > H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

* [optparse](https://CRAN.R-project.org/package=optparse)
  > Trevor L Davis (2018). optparse: Command Line Option Parser.

* [RColorBrewer](https://CRAN.R-project.org/package=RColorBrewer)
  > Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes.

## Software packaging/containerisation tools

* [Anaconda](https://anaconda.com)
  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

* [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)
  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

* [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)
  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

* [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

* [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

Dependencies

containers/snpeff/environment.yml conda
  • snpeff 4.3.1t.*
containers/vep/environment.yml conda
  • ensembl-vep 99.2.*
  • genesplicer 1.0.*
environment.yml conda
  • ascat 2.5.2
  • bcftools 1.9
  • bwa 0.7.17
  • bwa-mem2 2.0
  • cancerit-allelecount 4.0.2
  • cnvkit 0.9.6
  • control-freec 11.6
  • ensembl-vep 99.2
  • fastqc 0.11.9
  • fgbio 1.1.0
  • freebayes 1.3.2
  • gatk4-spark 4.1.7.0
  • genesplicer 1.0
  • htslib 1.9
  • llvm-openmp 8.0.1
  • manta 1.6.0
  • markdown 3.1.1
  • msisensor 0.5
  • multiqc 1.8
  • openmp 8.0.1
  • pigz 2.3.4
  • pygments 2.5.2
  • pymdown-extensions 6.0
  • qualimap 2.2.2d
  • r-ggplot2 3.3.0
  • samblaster 0.1.24
  • samtools 1.9
  • snpeff 4.3.1t
  • strelka 2.9.10
  • tiddit 2.7.1
  • trim-galore 0.6.5
  • vcfanno 0.3.2
  • vcftools 0.1.16