viralassembly

Assemble and QC viral reads with a reference scheme

https://github.com/phac-nml/viralassembly

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: phac-nml
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Size: 12.1 MB
Statistics
  • Stars: 1
  • Watchers: 4
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created almost 2 years ago · Last pushed 9 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

viralassembly

A generic viral assembly and QC pipeline that re-implements the artic pipeline as separate steps, giving greater control over tool versions and over how data moves through each process. It can serve as a starting point for analyses of viruses that do not already have a dedicated workflow.

This pipeline is intended to be run on either Nanopore Amplicon Sequencing data or Basic Nanopore NGS Sequencing data that can use a reference genome for mapping, variant calling, and other downstream analyses. It generates variant calls, consensus sequences, and quality control information based on the reference. Three variant callers are available: clair3, medaka, and nanopolish (for R9.4.1 flowcells and below only).

Some of the goals of this pipeline are:

  1. Rework the artic nanopore pipeline steps as nextflow modules to deal with specific bugs and version incompatibilities
    • Example: BCFtools consensus error seen in artic pipeline sometimes
    • Allows adding in clair3 as a new variant calling tool
    • Potentially eventually work to remove artic as a dependency
  2. Allow the pipeline to be used on other viruses with or without amplicon schemes
    • Due to the QC steps there is unfortunately a current limitation at working with segmented viruses
      • The pipeline will automatically exit after assembly and not generate QC and Reports for these at this time
      • This will hopefully be fully implemented at some point in the future
  3. Provide Run level and Sample level final reports

Installation

  1. Download and install nextflow

    1. Download and install with conda
      • Conda command: conda create -n nextflow -c conda-forge -c bioconda nextflow
    2. Install with the instructions at https://www.nextflow.io/
  2. Run the pipeline with one of the following profiles to handle dependencies (or use your own profile if you have one!):

    • conda
    • mamba
    • singularity
    • docker

Running Commands

Simple commands to run input data. Input data can be provided in three different ways:

  1. Passing --fastq_pass </PATH/TO/fastq_pass> where fastq_pass is a directory containing barcode## subdirectories with fastq files
  2. Passing --fastq_pass </PATH/TO/fastqs> where fastqs is a directory containing .fastq* files
  3. Passing --input <samplesheet.csv> where samplesheet.csv is a CSV file with two columns
    1. sample - The name of the sample
    2. fastq_1 - Path to one fastq file per sample in .fastq* format
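For the samplesheet option, the CSV can be generated from a directory of reads. The following is a minimal sketch and not part of the pipeline: `make_samplesheet` is a hypothetical helper, and taking the sample name from the file name up to the first `.fastq` is an assumption. Only the `sample` and `fastq_1` column names come from the description above.

```shell
# Print a two-column samplesheet (sample,fastq_1) for every .fastq* file in a directory.
# Assumption: the sample name is the file name up to the first ".fastq".
make_samplesheet() {
  local fastq_dir="$1"
  local fq name
  echo "sample,fastq_1"
  for fq in "$fastq_dir"/*.fastq*; do
    [ -e "$fq" ] || continue   # skip the unexpanded pattern when nothing matches
    name="$(basename "$fq")"
    echo "${name%%.fastq*},${fq}"
  done
}

# Example (paths are placeholders):
#   make_samplesheet /PATH/TO/fastqs > samplesheet.csv
```

The resulting file could then be passed with --input in place of --fastq_pass.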

The basic examples below use the --fastq_pass input, but the --input CSV file can be substituted if preferred.

All detailed running information is available in the usage docs

Nanopore - Clair3

Running the pipeline with Clair3 for variant calls requires fastq files and a clair3 model. When running, the pipeline will either:

  • Look for subdirectories of the input "--fastq_pass" directory called barcode## to be used in the pipeline
  • Look for fastq files in the input "--fastq_pass" directory called *.fastq* to be used in the pipeline
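To check in advance which of the two layouts a directory matches, something like the following could be used. It is only a sketch of the behaviour described above, not the pipeline's own detection code; `detect_layout` is a hypothetical helper, and giving barcode## subdirectories precedence over loose fastq files is an assumption.

```shell
# Report which input layout a --fastq_pass directory appears to use.
# Prints "barcodes", "fastqs", or "none".
detect_layout() {
  local dir="$1"
  local entry
  # Layout 1: barcode## subdirectories (two digits, e.g. barcode01)
  for entry in "$dir"/barcode[0-9][0-9]; do
    if [ -d "$entry" ]; then
      echo "barcodes"
      return 0
    fi
  done
  # Layout 2: fastq files directly inside the directory
  for entry in "$dir"/*.fastq*; do
    if [ -f "$entry" ]; then
      echo "fastqs"
      return 0
    fi
  done
  echo "none"
}

# Example (path is a placeholder):
#   detect_layout /PATH/TO/fastq_pass
```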

This pipeline utilizes the same steps as the artic fieldbioinformatics minion pipeline but with each step run using nextflow to allow clair3 to be easily slotted in. See the clair3 section of the usage docs for more information

Basic command:

    nextflow run /PATH/TO/artic-generic-nf/main.nf \
        -profile <PROFILE(s)> \
        --variant_caller 'clair3' \
        --fastq_pass </PATH/TO/fastq_pass> \
        --reference <REF.fa> \
        <OPTIONAL INPUTS>

Optional inputs could include:

  • Amplicon scheme instead of just a reference fasta file
  • Metadata
  • Filtering options
  • Running SnpEff for variant consequence prediction
  • Output reporting options

Nanopore - Medaka

Running the pipeline with medaka for variant calls requires fastq files and a medaka model. When running, the pipeline will either:

  • Look for subdirectories of the input "--fastq_pass" directory called barcode## to be used in the pipeline
  • Look for fastq files in the input "--fastq_pass" directory called *.fastq* to be used in the pipeline

See the medaka section of the usage docs for more information

Basic command:

    nextflow run /PATH/TO/artic-generic-nf/main.nf \
        -profile <PROFILE(s)> \
        --variant_caller 'medaka' \
        --fastq_pass </PATH/TO/fastq_pass> \
        --medaka_model <Medaka Model> \
        --reference <REF.fa> \
        <OPTIONAL INPUTS>

Optional inputs could include:

  • Amplicon scheme instead of just a reference fasta file
  • Metadata
  • Filtering options
  • Using base artic minion instead of nextflow implementation
  • Running SnpEff for variant consequence prediction
  • Output reporting options

Medaka model information can be found here

Nanopore - Nanopolish

Running the pipeline with nanopolish for variant calls requires fastq files, fast5 files, and the sequencing summary file instead of a model. Nanopolish requires that the read IDs in the fastq files are linked by the sequencing summary file to their signal-level data in the fast5 files. This makes the pipeline much easier to run with barcoded directories, although it can also be run with individual read files.
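The link between read IDs and the sequencing summary can be checked before a run. The sketch below is not part of the pipeline: `count_unlinked_reads` is a hypothetical helper. It assumes an uncompressed, four-line-per-record fastq file and a tab-separated sequencing summary whose header contains a `read_id` column.

```shell
# Count fastq reads whose ID does not appear in the sequencing summary.
# Assumptions: uncompressed 4-line fastq records; tab-separated summary
# with a header row that includes a "read_id" column.
count_unlinked_reads() {
  local fastq="$1"
  local summary="$2"
  awk -F'\t' '
    # First file: the sequencing summary.
    NR == FNR {
      if (FNR == 1) {
        # Locate the read_id column from the header row
        for (i = 1; i <= NF; i++) {
          if ($i == "read_id") col = i
        }
        next
      }
      if (col) known[$col] = 1
      next
    }
    # Second file: the fastq. Header lines are lines 1, 5, 9, ...
    FNR % 4 == 1 {
      split($0, parts, " ")      # the ID is the first whitespace-delimited token
      id = substr(parts[1], 2)   # drop the leading "@"
      if (!(id in known)) missing++
    }
    END { print missing + 0 }
  ' "$summary" "$fastq"
}

# Example (paths are placeholders):
#   count_unlinked_reads reads.fastq sequencing_summary.txt
```

A result of 0 suggests every read in the fastq file has an entry in the summary; it does not confirm that the matching fast5 files are present.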

See the nanopolish section of the usage docs for more information

Basic command:

    nextflow run /PATH/TO/artic-generic-nf/main.nf \
        -profile <PROFILE(s)> \
        --variant_caller 'nanopolish' \
        --fastq_pass </PATH/TO/fastq_pass> \
        --fast5_pass </PATH/TO/fast5_pass> \
        --sequencing_summary </PATH/TO/sequencing_summary.txt> \
        --reference <REF.fa> \
        <OPTIONAL INPUTS>

Optional inputs could include:

  • Amplicon scheme instead of just a reference fasta file
  • Metadata
  • Filtering options
  • Using base artic minion instead of nextflow implementation
  • Running SnpEff for variant consequence prediction
  • Output reporting options
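The three basic commands differ only in their caller-specific flags. The sketch below gathers those differences in one place; `caller_flags` is a hypothetical wrapper helper, not part of the pipeline. It uses only the flags shown in the basic commands above, so no clair3 model flag appears because none is shown there.

```shell
# Print the caller-specific flags for a run, or fail if a required input is missing.
# Usage: caller_flags clair3
#        caller_flags medaka <medaka_model>
#        caller_flags nanopolish <fast5_pass> <sequencing_summary>
caller_flags() {
  local caller="${1:-}"
  case "$caller" in
    clair3)
      echo "--variant_caller clair3"
      ;;
    medaka)
      [ -n "${2:-}" ] || { echo "medaka requires a model" >&2; return 1; }
      echo "--variant_caller medaka --medaka_model $2"
      ;;
    nanopolish)
      [ -n "${2:-}" ] && [ -n "${3:-}" ] || {
        echo "nanopolish requires a fast5 directory and a sequencing summary" >&2
        return 1
      }
      echo "--variant_caller nanopolish --fast5_pass $2 --sequencing_summary $3"
      ;;
    *)
      echo "unknown variant caller: $caller" >&2
      return 1
      ;;
  esac
}

# Example (paths are placeholders; the unquoted substitution is deliberate
# so the flags split into separate arguments, which assumes paths without spaces):
#   nextflow run /PATH/TO/artic-generic-nf/main.nf -profile conda \
#     $(caller_flags medaka "<Medaka Model>") \
#     --fastq_pass /PATH/TO/fastq_pass --reference REF.fa
```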

Outputs

Outputs are organized by tool or file format and written to the results/ directory by default.

Outputs include:

  • Consensus fasta files
  • VCF files
  • Bam files
  • HTML summary files (either custom or MultiQC)

More output information on pipeline steps and output files can be found in the output docs

Limitations

Current limitations include:

  1. Nanopore data only at this time
  2. Currently runs for viruses using a reference genome
    • Segmented viruses will exit before the QC section for now
  3. Custom report can only work when running with conda
  4. SnpEff issues in running and database building/downloading
    • Database building/downloading requires one of three things:
      • The reference ID is in the SnpEff database
        • This allows the database to be downloaded
      • A gff3 file
        • This is used with the reference sequence to build a database
      • A well annotated NCBI genome matching the reference ID
        • This will pull the genbank file and use that to build a database
    • Running SnpEff with singularity sometimes leads to a lock issue which is hopefully fixed

Citations

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, references of tools and data used in this pipeline are as follows:

Detailed citations for utilized tools are found in citations.md

Contributing

Contributions are welcome through creating PRs or Issues

Legal

Copyright 2023 Government of Canada

Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner

  • Name: National Microbiology Laboratory
  • Login: phac-nml
  • Kind: organization

Citation (CITATIONS.md)

# phac-nml/viralassembly: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [ARTIC network](https://github.com/artic-network)

- [BCFtools](https://www.ncbi.nlm.nih.gov/pubmed/21903627/)

  > Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011 Nov 1;27(21):2987-93. doi: 10.1093/bioinformatics/btr509. Epub 2011 Sep 8. PubMed PMID: 21903627; PubMed Central PMCID: PMC3198575.

- [BEDTools](https://www.ncbi.nlm.nih.gov/pubmed/20110278/)

  > Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

- [Chopper](https://academic.oup.com/bioinformatics/article/39/5/btad311/7160911?login=false)

  > Wouter De Coster, Rosa Rademakers, NanoPack2: population-scale evaluation of long-read sequencing data, Bioinformatics, Volume 39, Issue 5, May 2023, btad311, https://doi.org/10.1093/bioinformatics/btad311

- [Csvtk](https://github.com/shenwei356/csvtk)

- [Longshot](https://www.nature.com/articles/s41467-019-12493-y)

  > Edge, P., Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat Commun 10, 4660 (2019). https://doi.org/10.1038/s41467-019-12493-y

- [Minimap2](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778)

  > Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191

- [MultiQC](https://www.ncbi.nlm.nih.gov/pubmed/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [Qualimap](https://academic.oup.com/bioinformatics/article/28/20/2678/206551)

  > Fernando García-Alcalde, Konstantin Okonechnikov, José Carbonell, Luis M. Cruz, Stefan Götz, Sonia Tarazona, Joaquín Dopazo, Thomas F. Meyer, Ana Conesa, Qualimap: evaluating next-generation sequencing alignment data, Bioinformatics, Volume 28, Issue 20, October 2012, Pages 2678–2679, https://doi.org/10.1093/bioinformatics/bts503

- [R](https://www.R-project.org/)

  > R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

- [SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)

  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

- [SnpEff](https://www.ncbi.nlm.nih.gov/pubmed/22728672/)

  > Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012 Apr-Jun;6(2):80-92. doi: 10.4161/fly.19695. PubMed PMID: 22728672; PubMed Central PMCID: PMC3679285.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Watch event: 1
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 8
  • Pull request event: 5
  • Fork event: 1
  • Create event: 3
Last Year
  • Watch event: 1
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 8
  • Pull request event: 5
  • Fork event: 1
  • Create event: 3

Issues and Pull Requests

All Time
  • Total issues: 0
  • Total pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 26 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 26 days
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • DarianHole (5)

Dependencies

.github/workflows/branch.yml actions
  • mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite
.github/workflows/ci.yml actions
  • actions/checkout b4ffde65f46336ab88eb53be808477a3936bae11 composite
  • jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
  • nf-core/setup-nextflow b9f764e8ba5c76b712ace14ecbfcef0e40ae2dd8 composite
.github/workflows/linting.yml actions
  • actions/checkout b4ffde65f46336ab88eb53be808477a3936bae11 composite
  • actions/setup-python 0a5c61591373683505ea898e09a3ea4f39ef2b9c composite
  • actions/upload-artifact 5d5d22a31266ced268874388b861e4b58bb5c2f3 composite
  • nf-core/setup-nextflow b9f764e8ba5c76b712ace14ecbfcef0e40ae2dd8 composite
.github/workflows/linting_comment.yml actions
  • dawidd6/action-download-artifact f6b0bace624032e30a85a8fd9c1a7f8f611f5737 composite
  • marocchino/sticky-pull-request-comment 331f8f5b4215f0445d3c07b4967662a32a2d3e31 composite
modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan
modules/nf-core/samtools/flagstat/meta.yml cpan
pyproject.toml pypi
modules/local/artic/guppyplex/environment.yml pypi
modules/local/artic/minion/environment.yml pypi
modules/local/artic_subcommands/environment.yml pypi
modules/local/bcftools/consensus/environment.yml pypi
modules/local/bcftools/norm/environment.yml pypi
modules/local/bcftools/stats/environment.yml pypi
modules/local/bedtools/coverage/environment.yml pypi
modules/local/chopper/environment.yml pypi
modules/local/longshot/environment.yml pypi
modules/local/minimap2/environment.yml pypi
modules/local/multiqc/environment.yml pypi
modules/local/nanostat/environment.yml pypi
modules/local/qc/environment.yml pypi
modules/local/qualimap/bamqc/environment.yml pypi
modules/local/samtools/depth/environment.yml pypi
modules/local/samtools/reheader/environment.yml pypi
modules/local/snpeff/environment.yml pypi
modules/nf-core/custom/dumpsoftwareversions/environment.yml pypi
modules/nf-core/samtools/flagstat/environment.yml pypi