covpipe2

SARS-CoV-2 genome reconstruction for Illumina data in Nextflow

https://github.com/rki-mf1/covpipe2

Repository

Basic Info

Host: GitHub
Owner: rki-mf1
License: gpl-3.0
Language: Nextflow
Default Branch: main
Homepage:
Size: 16.1 MB

Statistics

Stars: 4
Watchers: 3
Forks: 1
Open Issues: 12
Releases: 26

Created almost 4 years ago · Last pushed about 1 year ago

CoVpipe2

CoVpipe2 is a Nextflow pipeline for reference-based genome reconstruction of SARS-CoV-2 from NGS data. In principle, it can also be used for other viruses.

Table of contents

- [CoVpipe2](#covpipe2)
- [Quick installation](#quick-installation)
- [Call help](#call-help)
- [Test run](#test-run)
- [Update the pipeline](#update-the-pipeline)
- [Use a certain release](#use-a-certain-release)
- [Quick run examples](#quick-run-examples)
- [Example 1:](#example-1)
- [Example 2:](#example-2)
- [Example sample sheet](#example-sample-sheet)
- [Manual](#manual)
- [Changes to CoVpipe](#changes-to-covpipe)
- [Workflow](#workflow)
- [Citations](#citations)
- [Acknowledgement, props and inspiration](#acknowledgement-props-and-inspiration)

Quick installation

The pipeline is written in Nextflow, which can be used on any POSIX-compatible system (Linux, OS X, etc.). Windows is supported through WSL. You need Nextflow installed and either conda, Docker, or Singularity to run the steps of the pipeline:

Install Nextflow via self-installing package (bash one-liner):

```bash
wget -qO- https://get.nextflow.io | bash

# In case you don't have wget
curl -s https://get.nextflow.io | bash
```

OR

Install Nextflow via conda
Commands for Miniconda3 on Linux 64-bit:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n nextflow -c bioconda nextflow
conda activate nextflow
```

All other dependencies and tools will be installed within the pipeline via conda, Docker or Singularity.

:warning: Important for conda/mamba users: Make sure that your conda channels are configured according to the bioconda usage:

Check your current channel list:

```bash
conda config --show channels
```

Change your channel list:

```bash
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```

Please check the bioconda usage documentation for the latest configuration!

Call help

```bash
nextflow run rki-mf1/CoVpipe2 -r <version> --help
```

Test run

Validate your installation with a test run:

```bash
# for a Conda installation
# the Conda channel configuration needs to conform to the bioconda recommendations
nextflow run rki-mf1/CoVpipe2 -r <version> -profile local,conda,test --cores 4 --max_cores 8

# for a Singularity installation
nextflow run rki-mf1/CoVpipe2 -r <version> -profile local,singularity,test --cores 4 --max_cores 8

# for a Docker installation
nextflow run rki-mf1/CoVpipe2 -r <version> -profile local,docker,test --cores 4 --max_cores 8
```

For more configuration options, see the complete help message in the Manual section.

Update the pipeline

```bash
nextflow pull rki-mf1/CoVpipe2
```

Use a certain release

We recommend using a stable release of the pipeline:

```bash
nextflow pull rki-mf1/CoVpipe2 -r <RELEASE>
```

Quick run examples

Example 1:

```bash
nextflow run rki-mf1/CoVpipe2 -r <version> \
    --reference 'sars-cov-2' \
    --fastq my_samples.csv --list \
    --kraken \
    --cores 4 --max_cores 8
```

- Read input from a sample sheet
- Perform taxonomic classification to remove non-SARS-CoV-2 reads
- Local execution with conda, using at most 8 cores in total

Example 2:

```bash
nextflow run rki-mf1/CoVpipe2 -r <version> \
    --reference 'sars-cov-2' \
    --fastq '*R{1,2}.fastq.gz' \
    --adapter /path/to/repo/data/adapters/NexteraTransposase.fasta \
    --primer_version V4.1 \
    -profile slurm,singularity
```

- Remove adapters
- Clip primers (ARTIC version V4.1)
- Execution on a SLURM system with Singularity
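Before launching a run with a glob such as `'*R{1,2}.fastq.gz'`, it can help to preview which files would form complete read pairs. The following is a minimal sketch in Python, not part of CoVpipe2: the function name `pair_reads` and the pairing rule (everything before `R1`/`R2` is treated as the shared prefix) are assumptions made for illustration, and the sample names the pipeline assigns internally may differ.

```python
import re
from collections import defaultdict

# Mirrors the glob '*R{1,2}.fastq.gz': any prefix, then R1 or R2, then the suffix.
PAIR_PATTERN = re.compile(r"^(?P<prefix>.*)R(?P<mate>[12])\.fastq\.gz$")


def pair_reads(filenames):
    """Group file names by shared prefix and keep only complete R1/R2 pairs."""
    mates = defaultdict(dict)
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match:
            mates[match["prefix"]][match["mate"]] = name
    # Drop prefixes where one of the two mates is missing.
    return {
        prefix: (found["1"], found["2"])
        for prefix, found in sorted(mates.items())
        if found.keys() >= {"1", "2"}
    }


if __name__ == "__main__":
    files = ["a_R1.fastq.gz", "a_R2.fastq.gz", "b_R1.fastq.gz", "notes.txt"]
    for prefix, (r1, r2) in pair_reads(files).items():
        print(prefix, r1, r2)
```

In this example only the `a_` pair is reported, because `b_R1.fastq.gz` has no matching R2 file and `notes.txt` does not match the pattern.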

Example sample sheet

CoVpipe2 accepts a sample sheet in CSV format as input. It should look like this:

```
sample,fastq_1,fastq_2
sample1,/path/to/reads/id1_1.fastq.gz,/path/to/reads/id1_2.fastq.gz
sample2,/path/to/reads/id2_1.fastq.gz,/path/to/reads/id2_2.fastq.gz
sample3,/path/to/reads/id3_1.fastq.gz,/path/to/reads/id3_2.fastq.gz
sample4,/path/to/reads/id4_1.fastq.gz,/path/to/reads/id4_2.fastq.gz
```

The header is required. Make sure to set unique sample names!
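The two requirements above (mandatory header, unique sample names) can be checked before starting a run. This is a minimal sketch in Python and not part of CoVpipe2: the function name `check_sample_sheet` is hypothetical, and it assumes paired-end input where every row has three non-empty fields.

```python
import csv
import io
from collections import Counter

EXPECTED_HEADER = ["sample", "fastq_1", "fastq_2"]


def check_sample_sheet(handle):
    """Return a list of human-readable problems found in a sample sheet."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header!r}"]

    problems = []
    names = []
    for line_no, row in enumerate(reader, start=2):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        if len(row) != len(EXPECTED_HEADER) or not all(field.strip() for field in row):
            problems.append(f"line {line_no}: expected 3 non-empty fields")
            continue
        names.append(row[0])

    # Report every sample name that occurs more than once.
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"duplicate sample name: {name} ({count} times)")
    return problems


if __name__ == "__main__":
    sheet = io.StringIO(
        "sample,fastq_1,fastq_2\n"
        "s1,a_1.fastq.gz,a_2.fastq.gz\n"
        "s1,b_1.fastq.gz,b_2.fastq.gz\n"
    )
    print(check_sample_sheet(sheet))
```

For a real file, pass an open handle instead of the `io.StringIO` object, for example `open("my_samples.csv", newline="")`.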

Manual

The complete help message:

```
Robert Koch Institute, MF1 Bioinformatics

Workflow: CoVpipe2

Usage examples:
nextflow run CoVpipe2.nf --fastq '*R{1,2}.fastq.gz' --cores 4 --max_cores 8
or
nextflow run rki-mf1/CoVpipe2 -r <version> --fastq '*R{1,2}.fastq.gz' --ref_genome ref.fasta --cores 4 --max_cores 8

Reference, required:
--reference                         Currently supported: 'sars-cov-2' (MN908947.3) [default: sars-cov-2]
OR
--ref_genome                        Reference FASTA file.
--ref_annotation                    Reference GFF file.

Illumina read data, required:
--fastq                             e.g.: 'sample{1,2}.fastq' or '*.fastq.gz' or '*/*.fastq.gz'

Optional input settings:
--list                              This flag activates csv input for --fastq [default: false]
                                        style and header of the csv is: sample,fastq_1,fastq_2
--mode                              Switch between 'paired'- and 'single'-end FASTQ; 'single' is experimental [default: paired]
--run_id                            Run ID [default: ]

Adapter clipping:
--adapter                           Define the path of a FASTA file containing the adapter sequences to be clipped. [default: false]

Trimming and QC:
--fastp_additional_parameters       Additional parameters for fastp [default: --qualified_quality_phred 20 --length_required 50]
                                        For shorter/longer amplicon length than 156 nt, adjust --length_required

Taxonomic read filter:
--kraken                            Activate taxonomic read filtering to exclude reads not classified with specific taxonomic ID (see --taxid) [default: false]
                                        A pre-processed kraken2 database will be automatically downloaded from
                                        https://zenodo.org/record/3854856 and stored locally.
--kraken_db_custom                  Path to a custom Kraken2 database. [default: ]
--taxid                             Taxonomic ID used together with the kraken2 database for read filtering [default: 2697049]

Lineage detection on read level with LCS:
Uses this fork https://github.com/rki-mf1/LCS of https://github.com/rvalieris/LCS
--read_linage                       Lineage detection on read level [default: false]
--lcs_ucsc_version                  Create marker table based on a specific UCSC SARS-CoV-2 tree (e.g. '2022-05-01').
                                        Use 'predefined' to use the marker table from the repo (most probably not up-to-date) [default: predefined]
                                        See https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2 for available trees.
--lcs_ucsc_predefined               If '--lcs_ucsc_version 'predefined'', select pre-calculated UCSC table [default: 2022-01-31]
                                        See https://github.com/rki-mf1/LCS/tree/master/data/pre-generated-marker-tables
--lcs_ucsc_update                   Use latest UCSC SARS-CoV-2 tree for marker table update. Overwrites --lcs_ucsc_version [default: false]
                                        Automatically checks https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.version.txt
--lcs_ucsc_downsampling             Downsample sequences when updating marker table to save resources. Use 'None' to turn off [default: 10000]
                                        Attention! Updating without downsampling needs a lot of resources in terms of memory and might fail.
                                        Consider downsampling or increase the memory for this process.
--lcs_variant_groups                Provide path to custom variant groups table (TSV) for marker table update. Use 'default' for predefined groups from repo
                                        (https://github.com/rki-mf1/LCS/blob/master/data/variant_groups.tsv) [default: default]
--lcs_cutoff                        Plot lineages above this threshold [default: 0.03]

Mapping:
--isize_filter                      Insert size threshold for mapping. All BAM file entries with an insert size above this threshold
                                        are filtered out. Deactivated by default. [default: false]

Primer detection:
--bamclipper_additional_parameters  Additional parameters for BAMClipper [default: false]
                                        Use -u INT and -d INT to adjust the primer detection window of BAMClipper:
                                        extend upstream (-u) or downstream (-d) from the 5' most nt of primer [default from BAMClipper: -u 1 -d 5]
--primer_bedpe                      Provide the path to the primer BEDPE file. [default: false]
                                        TAB-delimited text file containing at least 6 fields, see here:
                                        https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format
OR
--primer_bed                        Provide the path to the primer BED file. A BEDPE file will be generated automatically.
                                        The name of each entry has to match this pattern: primerID[_LEFT|_RIGHT]_ampliconID [default: false]
OR
--primer_version                    Provide a primer version. Currently supported ARTIC versions: V1, V2, V3, V4, V4.1 [default: false]

Variant calling:
--vcount                            Minimum number of reads at a position to be considered for variant calling. [default: 10]
--cov                               Minimum number of supporting reads which are required to call a variant. [default: 20]
--frac                              Minimum percentage of supporting reads at the respective position required to call a variant.
                                        In turn, variants supported by (1 - frac)*100% reads will be explicitly called. [default: 0.1]
--vois                              Compare called variants to a VCF file with your variants of interest [default: false]

Variant hard filtering:
--var_mqm                           Minimal mean mapping quality of observed alternate alleles (MQM). The mapping quality (MQ)
                                        measures how good reads align to the respective reference genome region. Good mapping qualities are
                                        around MQ 60. GATK recommends hard filtering of variants with MQ less than 40. [default: 40]
--var_sap                           Maximal strand balance probability for the alternate allele (SAP). The SAP is the Phred-scaled
                                        probability that there is strand bias at the respective site. A value near 0 indicates little or
                                        no strand bias. Amplicon data usually has a high, WGS data a low bias. [default: false]
                                        Disable (default) for amplicon sequencing; for WGS GATK recommends 60
--var_qual                          Minimal variant call quality. Freebayes produces a general judgement of the
                                        variant call. [default: 10]

Consensus generation:
--cns_min_cov                       Minimum number of reads required so that the respective position in the consensus sequence
                                        is NOT hard masked. [default: 20]
--cns_gt_adjust                     Minimum fraction of reads supporting a variant which leads to an explicit call of this
                                        variant (genotype adjustment). The value has to be greater than 0.5 but not greater than 1.
                                        To turn genotype adjustment off, set the value to 0. [default: 0.9]
--cns_indel_filter                  Minimum fraction of reads supporting an indel which leads to an integration to the consensus sequence.
                                        Low frequency indels can be false positives introducing frameshifts. Since the IUPAC code is not
                                        able to model a base-or-gap case, those indels would be integrated in the IUPAC and masked consensus.
                                        To turn indel filtering off, set the value to 0. [default: 0.6]

Update for lineage assignment and mutation calling:
--update                            Update pangolin and nextclade [default: false]
                                        Depending on the chosen profile either the conda environment (profiles 'standard', 'conda', 'mamba')
                                        or the container (profiles 'docker', 'singularity') is updated.
--pangolin_docker_default           Default container tag for pangolin [default: rkimf1/pangolin:4.2-1.18.1.1--e24af6d]
--nextclade_docker_default          Default container tag for nextclade [default: rkimf1/nextclade2:2.13.1--ddb9e60]
--pangolin_conda_default            Default conda packages for pangolin [default: bioconda::pangolin=4.2 bioconda::pangolin-data=1.18.1.1]
--nextclade_conda_default           Default conda packages for nextclade [default: bioconda::nextclade=2.13.1]
--nextclade_dataset_name            Default dataset name for nextclade [default: sars-cov-2]
--nextclade_dataset_tag             Default dataset tag for nextclade [default: 2023-04-18T12:00:00Z]

Computing options:
--cores                             Max cores per process for local use [default: 4]
--max_cores                         Max cores used on the machine for local use [default: 12]
--memory                            Max memory in GB for local use [default: 12]

Output options:
--output                            Name of the result folder [default: results]
--publish_dir_mode                  Mode of output publishing: 'copy', 'symlink' [default: copy]
                                        With 'symlink' results are lost when removing the work directory.

Caching:
--databases                         Location for auto-download data like databases [default: nextflow-autodownload-databases]
--conda_cache_dir                   Location for storing the conda environments [default: conda]
--singularity_cache_dir             Location for storing the singularity images [default: singularity]

Execution/Engine profiles:
The pipeline supports profiles to run via different Executors and Engines e.g.: -profile local,conda

Executor (choose one):
  local
  slurm

Engines (choose one):
  conda
  mamba
  docker
  singularity

Misc:
  cluster                           Loads resource configs more suitable for cluster execution.
                                    Has to be combined with an engine and an executor.

Per default: -profile local,conda is executed.

Test profile:
Test the pipeline with a small test dataset:
nextflow run rki-mf1/CoVpipe2 -r <version> -profile executor,engine,test
```
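The help text constrains `--cns_gt_adjust` to either 0 (genotype adjustment off) or a value greater than 0.5 and not greater than 1. The check below is a minimal Python sketch of that stated rule, useful for validating a value in a wrapper script before submitting a run. The function name `check_cns_gt_adjust` is hypothetical and not part of CoVpipe2, and the sketch does not claim to reproduce how the pipeline itself validates the parameter.

```python
def check_cns_gt_adjust(value):
    """Return True if value is allowed for --cns_gt_adjust per the help text.

    Allowed: exactly 0 (turns genotype adjustment off),
    or any value greater than 0.5 and not greater than 1.
    """
    return value == 0 or 0.5 < value <= 1


if __name__ == "__main__":
    for candidate in (0, 0.5, 0.9, 1, 1.1):
        status = "ok" if check_cns_gt_adjust(candidate) else "rejected"
        print(f"--cns_gt_adjust {candidate}: {status}")
```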

Changes to CoVpipe

- Workflow management framework: Snakemake -> Nextflow
- Docker/Singularity and conda support for each step
- Container/conda update feature for pangolin and nextclade
- HPC/slurm profile provided
- Fixes:
  - Subtract only deletions from low coverage mask for consensus generation
- New features:
  - nextclade (mutation calling, clade assignment)
  - LCS (lineage decomposition)
  - Restructured report
  - krona plots (visualization of Kraken2 output)
  - president (genome quality control)
- Version update (status CoVpipe2 v0.2.1):
  - bcftools: 1.11 -> 1.14
    - Note: https://github.com/samtools/bcftools/issues/1708
  - liftoff: 1.5.2 -> 1.6.2
  - kraken2: 2.1.0 -> 2.1.2
  - freebayes: 1.3.2 -> 1.3.6
  - fastp: 0.20.1 -> 0.23.2
  - bedtools: 2.29.2 -> 2.30.0

Workflow

Workflow overview:

Components originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).

More detailed overview with process names:

![workflow](/data/figures/covpipe2_processes.png)

Components originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).

Even more detailed overview with process names and parameters:

![workflow](/data/figures/covpipe2_processes_params.png)

Components originally designed by James A. Fellows Yates & nf-core under a CC0 license (public domain).

Citations

If you use CoVpipe2 in your work, please consider citing our publication:

Lataretu, M., Drechsel, O., Kmiecinski, R., Trappe, K., Hölzer, M., & Fuchs, S. Lessons learned: overcoming common challenges in reconstructing the SARS-CoV-2 genome from short-read sequencing data via CoVpipe2 [version 1; peer review: awaiting peer review]. F1000Research 2023, 12:1091 (https://doi.org/10.12688/f1000research.136683.1)

Additionally, an extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Acknowledgement, props and inspiration

Owner

Name: RKI MF1 Bioinformatics
Login: rki-mf1
Kind: organization
Location: Germany

Repositories: 9
Profile: https://github.com/rki-mf1

Bioinformatics code of MF1

Citation (CITATIONS.md)

# CoVpipe2: Citations

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [ARTIC network](https://github.com/artic-network)

- [BAMClipper](https://pubmed.ncbi.nlm.nih.gov/28484262/)

  > Au CH, Ho DN, Kwong A, Chan TL, Ma ESK. BAMClipper: removing primers from alignments to minimize false-negative mutations in amplicon next-generation sequencing. Sci Rep. 2017 May 8;7(1):1567. doi: 10.1038/s41598-017-01703-6. PMID: 28484262; PMCID: PMC5431517.

- [BCFtools](https://www.ncbi.nlm.nih.gov/pubmed/21903627/)

  > Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011 Nov 1;27(21):2987-93. doi: 10.1093/bioinformatics/btr509. Epub 2011 Sep 8. PubMed PMID: 21903627; PubMed Central PMCID: PMC3198575.

- [BEDTools](https://www.ncbi.nlm.nih.gov/pubmed/20110278/)

  > Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

- [BWA](https://arxiv.org/abs/1303.3997)

  > Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]

- [fastp](https://www.ncbi.nlm.nih.gov/pubmed/30423086/)

  > Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281.

- [freebayes](https://arxiv.org/abs/1207.3907)

  > Garrison E, Marth G. (2012) Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN]

- [Kraken 2](https://www.ncbi.nlm.nih.gov/pubmed/31779668/)

  > Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019 Nov 28;20(1):257. doi: 10.1186/s13059-019-1891-0. PubMed PMID: 31779668; PubMed Central PMCID: PMC6883579.

- [Krona](https://pubmed.ncbi.nlm.nih.gov/21961884/)

  > Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011 Sep 30;12:385. doi: 10.1186/1471-2105-12-385. PMID: 21961884; PMCID: PMC3190407.

- [LCS](https://pubmed.ncbi.nlm.nih.gov/35104309/)

  > Valieris R, Drummond RD, Defelicibus A, Dias-Neto E, Rosales RA, Tojal da Silva I. A mixture model for determining SARS-Cov-2 variant composition in pooled samples. Bioinformatics. 2022 Mar 28;38(7):1809-1815. doi: 10.1093/bioinformatics/btac047. PMID: 35104309.

- [ncov-recombinant](https://github.com/ktmeaton/ncov-recombinant)

- [Nextstrain](https://pubmed.ncbi.nlm.nih.gov/29790939/)

  > Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018 Dec 1;34(23):4121-4123. doi: 10.1093/bioinformatics/bty407. PubMed PMID: 29790939; PubMed Central PMCID: PMC6247931.

- [pangolin](https://github.com/cov-lineages/pangolin)

  > Áine O'Toole, Emily Scher, Anthony Underwood, Ben Jackson, Verity Hill, JT McCrone, Chris Ruis, Khali Abu-Dahab, Ben Taylor, Corin Yeats, Louis du Plessis, David Aanensen, Eddie Holmes, Oliver Pybus, Andrew Rambaut. pangolin: lineage assignment in an emerging pandemic as an epidemiological tool. Publication in preparation.

- [PRESIDENT](https://github.com/rki-mf1/president)

- [Python](https://www.python.org/) 

- [SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)

  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

- [sc2rf](https://github.com/lenaschimmel/sc2rf)

- [SnpEff](https://www.ncbi.nlm.nih.gov/pubmed/22728672/)

  > Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012 Apr-Jun;6(2):80-92. doi: 10.4161/fly.19695. PubMed PMID: 22728672; PubMed Central PMCID: PMC3679285.

## R packages

- [R](https://www.R-project.org/)

  > R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

- [ggplot2](https://cran.r-project.org/web/packages/ggplot2/index.html)

  > H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.