viralassembly

Assemble and QC viral reads with a reference scheme

https://github.com/phac-nml/viralassembly

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.9%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: phac-nml
License: mit
Language: Nextflow
Default Branch: main
Size: 12.1 MB

Statistics

Stars: 1
Watchers: 4
Forks: 1
Open Issues: 1
Releases: 0

Created over 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog Contributing License Citation

viralassembly

A generic viral assembly and QC pipeline. It re-implements the artic pipeline as separate steps, which allows greater control over tool versions and over how data is run through the processes. This pipeline can be used as a starting point for analyses of viruses that do not already have a dedicated workflow.

This pipeline is intended to be run on either Nanopore Amplicon Sequencing data or Basic Nanopore NGS Sequencing data that can utilize a reference genome for mapping, variant calling, and other downstream analyses. It generates variant calls, consensus sequences, and quality control information based on the reference. Three variant callers are available: clair3, medaka, and nanopolish (for R9.4.1 flowcells and below only).

Some of the goals of this pipeline are:
1. Rework the artic nanopore pipeline steps as nextflow modules to deal with specific bugs and version incompatibilities
  - Example: BCFtools consensus error seen in artic pipeline sometimes
  - Allows adding in clair3 as a new variant calling tool
  - Potentially eventually work to remove artic as a dependency
2. Allow the pipeline to be used on other viruses with or without amplicon schemes
  - Due to the QC steps there is unfortunately a current limitation at working with segmented viruses
    - The pipeline will automatically exit after assembly and not generate QC and Reports for these at this time
    - This will hopefully be fully implemented at some point in the future
3. Provide Run level and Sample level final reports

Installation

Download and install nextflow using one of the following:
1. Download and install with conda
  - Conda command: conda create -n nextflow -c conda-forge -c bioconda nextflow
2. Install with the instructions at https://www.nextflow.io/

Run the pipeline with one of the following profiles to handle dependencies (or use your own profile if you have one!):
- conda
- mamba
- singularity
- docker
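
Before launching a run, it can help to confirm that nextflow and the executable behind the chosen profile are on the PATH. The sketch below is only an illustration: the function name `missing_tools` is hypothetical, and the assumption that each profile needs an executable of the same name (conda, mamba, singularity, docker) is not stated by the pipeline itself.

```shell
# Hypothetical helper (not part of viralassembly): print every tool name
# given as an argument that cannot be found on the PATH.
missing_tools() (
    for tool in "$@"; do
        # command -v exits non-zero when the tool is not found
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
)
```

For example, `missing_tools nextflow singularity` prints nothing when both are installed and prints the missing names otherwise.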

Running Commands

Simple commands to run input data. Input data can be provided in three different ways:
1. Passing --fastq_pass </PATH/TO/fastq_pass> where fastq_pass is a directory containing barcode## subdirectories with fastq files
2. Passing --fastq_pass </PATH/TO/fastqs> where fastqs is a directory containing .fastq* files
3. Passing --input <samplesheet.csv> where samplesheet.csv is a CSV file with two columns
  1. sample - The name of the sample
  2. fastq_1 - Path to one fastq file per sample in .fastq* format
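
For the --input option, a samplesheet with the two columns described above could be generated from a directory of fastq files along these lines. This is a minimal sketch: the function name `make_samplesheet` and the rule of naming each sample after its file name up to ".fastq" are assumptions made for illustration, and the usage docs remain the reference for what the pipeline accepts.

```shell
# Hypothetical helper (not part of viralassembly): write a samplesheet with
# the columns sample,fastq_1 to stdout, one row per .fastq* file found.
make_samplesheet() (
    fastq_dir="$1"
    echo "sample,fastq_1"
    for f in "$fastq_dir"/*.fastq*; do
        [ -e "$f" ] || continue        # skip when the glob matched nothing
        name="$(basename "$f")"
        name="${name%%.fastq*}"        # assumed rule: sample = name before ".fastq"
        echo "${name},${f}"
    done
)
```

Running `make_samplesheet /PATH/TO/fastqs > samplesheet.csv` would then give a file that can be passed with --input.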

The basic examples show how to run the pipeline using the --fastq_pass input, but the --input CSV file can be substituted if wanted.

All detailed running information is available in the usage docs

Nanopore - Clair3

Running the pipeline with Clair3 for variant calls requires fastq files and a clair3 model. When running, the pipeline will either:
- Look for subdirectories off of the input "--fastq_pass" directory called barcode## to be used in the pipeline
- Look for fastq files in the input "--fastq_pass" directory called *.fastq* to be used in the pipeline
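
To see in advance which of the two layouts a directory has, a check along the following lines can be used. It is a sketch only: the function name `detect_fastq_layout` is hypothetical, it assumes barcode## means exactly two digits, and it is not the pipeline's own detection logic.

```shell
# Hypothetical helper (not part of viralassembly): report whether a directory
# holds barcode## subdirectories, loose .fastq* files, or neither.
detect_fastq_layout() (
    dir="$1"
    for d in "$dir"/barcode[0-9][0-9]; do
        if [ -d "$d" ]; then echo "barcoded"; exit 0; fi
    done
    for f in "$dir"/*.fastq*; do
        if [ -f "$f" ]; then echo "flat"; exit 0; fi
    done
    echo "none"
)
```

Calling `detect_fastq_layout /PATH/TO/fastq_pass` prints `barcoded`, `flat`, or `none`.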

This pipeline utilizes the same steps as the artic fieldbioinformatics minion pipeline but with each step run using nextflow to allow clair3 to be easily slotted in. See the clair3 section of the usage docs for more information

Basic command:

    nextflow run /PATH/TO/artic-generic-nf/main.nf \
        -profile <PROFILE(s)> \
        --variant_caller 'clair3' \
        --fastq_pass </PATH/TO/fastq_pass> \
        --reference <REF.fa> \
        <OPTIONAL INPUTS>

Optional inputs could include:
- Amplicon scheme instead of just a reference fasta file
- Metadata
- Filtering options
- Running SnpEff for variant consequence prediction
- Output reporting options

Nanopore - Medaka

Running the pipeline with medaka for variant calls requires fastq files and a medaka model. When running, the pipeline will either:
- Look for subdirectories off of the input "--fastq_pass" directory called barcode## to be used in the pipeline
- Look for fastq files in the input "--fastq_pass" directory called *.fastq* to be used in the pipeline

See the medaka section of the usage docs for more information

Basic command:

    nextflow run /PATH/TO/artic-generic-nf/main.nf \
        -profile <PROFILE(s)> \
        --variant_caller 'medaka' \
        --fastq_pass </PATH/TO/fastq_pass> \
        --medaka_model <Medaka Model> \
        --reference <REF.fa> \
        <OPTIONAL INPUTS>

Optional inputs could include:
- Amplicon scheme instead of just a reference fasta file
- Metadata
- Filtering options
- Using base artic minion instead of nextflow implementation
- Running SnpEff for variant consequence prediction
- Output reporting options

Medaka model information can be found here

Nanopore - Nanopolish

Running the pipeline with nanopolish for variant calls requires fastq files, fast5 files, and the sequencing summary file instead of providing a model. As such, nanopolish requires that the read ids in the fastq files are linked by the sequencing summary file to their signal-level data in the fast5 files. This makes it a lot easier to run using barcoded directories, but it can also be run with individual read files.
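
Since nanopolish depends on that link, a quick pre-flight check can count fastq reads whose ids do not appear in the sequencing summary. The sketch below makes several assumptions: the function name `count_missing_read_ids` is hypothetical, the summary is taken to be a tab-separated file with a header column named read_id, and the fastq is taken to be uncompressed with four lines per record. It is not a step of the pipeline.

```shell
# Hypothetical helper (not part of viralassembly): print how many reads in an
# uncompressed fastq have no matching read_id in the sequencing summary.
count_missing_read_ids() (
    fastq="$1"
    summary="$2"
    awk -F'\t' '
        # First file: the summary. Find the read_id column in the header row.
        NR == FNR {
            if (FNR == 1) {
                for (i = 1; i <= NF; i++) if ($i == "read_id") col = i
                next
            }
            if (col) seen[$col] = 1
            next
        }
        # Second file: the fastq. Lines 1, 5, 9, ... are record headers.
        FNR % 4 == 1 {
            id = $1
            sub(/^@/, "", id)     # drop the leading "@"
            sub(/ .*/, "", id)    # keep only the id before the first space
            if (!(id in seen)) missing++
        }
        END { print missing + 0 }
    ' "$summary" "$fastq"
)
```

A result of 0 from `count_missing_read_ids reads.fastq sequencing_summary.txt` suggests every read can be traced through the summary; any other value points to reads that may lack signal-level data.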

See the nanopolish section of the usage docs for more information

Basic command:

    nextflow run /PATH/TO/artic-generic-nf/main.nf \
        -profile <PROFILE(s)> \
        --variant_caller 'nanopolish' \
        --fastq_pass </PATH/TO/fastq_pass> \
        --fast5_pass </PATH/TO/fast5_pass> \
        --sequencing_summary </PATH/TO/sequencing_summary.txt> \
        --reference <REF.fa> \
        <OPTIONAL INPUTS>

Outputs

Outputs are separated by tool or file format and are found in the results/ directory by default.

Outputs include:
- Consensus fasta files
- VCF files
- Bam files
- HTML summary files (either custom or MultiQC)

More output information on pipeline steps and output files can be found in the output docs

Limitations

Current limitations include:

- Nanopore data only at this time
- Currently runs for viruses using a reference genome
  - Segmented viruses will exit before the QC section for now
- Custom report can only work when running with conda
- SnpEff issues in running and database building/downloading
  - Database building/downloading requires one of three things:
    - The reference ID is in the SnpEff database
      - This allows the database to be downloaded
    - A gff3 file
      - This is used with the reference sequence to build a database
    - A well annotated NCBI genome matching the reference ID
      - This will pull the genbank file and use that to build a database
  - Running SnpEff with singularity sometimes leads to a lock issue which is hopefully fixed

Citations

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, references of tools and data used in this pipeline are as follows:

Detailed citations for utilized tools are found in citations.md

Contributing

Contributions are welcome through creating PRs or Issues

Legal

Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner

Name: National Microbiology Laboratory
Login: phac-nml
Kind: organization

Website: https://www.nml-lnm.gc.ca/
Repositories: 50
Profile: https://github.com/phac-nml

Citation (CITATIONS.md)

# phac-nml/viralassembly: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [ARTIC network](https://github.com/artic-network)

- [BCFtools](https://www.ncbi.nlm.nih.gov/pubmed/21903627/)

  > Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011 Nov 1;27(21):2987-93. doi: 10.1093/bioinformatics/btr509. Epub 2011 Sep 8. PubMed PMID: 21903627; PubMed Central PMCID: PMC3198575.

- [BEDTools](https://www.ncbi.nlm.nih.gov/pubmed/20110278/)

  > Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

- [Chopper](https://academic.oup.com/bioinformatics/article/39/5/btad311/7160911?login=false)

  > Wouter De Coster, Rosa Rademakers, NanoPack2: population-scale evaluation of long-read sequencing data, Bioinformatics, Volume 39, Issue 5, May 2023, btad311, https://doi.org/10.1093/bioinformatics/btad311

- [Csvtk](https://github.com/shenwei356/csvtk)

- [Longshot](https://www.nature.com/articles/s41467-019-12493-y)

  > Edge, P., Bansal, V. Longshot enables accurate variant calling in diploid genomes from single-molecule long read sequencing. Nat Commun 10, 4660 (2019). https://doi.org/10.1038/s41467-019-12493-y

- [Minimap2](https://academic.oup.com/bioinformatics/article/34/18/3094/4994778)

  > Heng Li, Minimap2: pairwise alignment for nucleotide sequences, Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100, https://doi.org/10.1093/bioinformatics/bty191

- [MultiQC](https://www.ncbi.nlm.nih.gov/pubmed/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [Qualimap](https://academic.oup.com/bioinformatics/article/28/20/2678/206551)

  > Fernando García-Alcalde, Konstantin Okonechnikov, José Carbonell, Luis M. Cruz, Stefan Götz, Sonia Tarazona, Joaquín Dopazo, Thomas F. Meyer, Ana Conesa, Qualimap: evaluating next-generation sequencing alignment data, Bioinformatics, Volume 28, Issue 20, October 2012, Pages 2678–2679, https://doi.org/10.1093/bioinformatics/bts503

- [R](https://www.R-project.org/)

  > R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

- [SAMtools](https://www.ncbi.nlm.nih.gov/pubmed/19505943/)

  > Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

- [SnpEff](https://www.ncbi.nlm.nih.gov/pubmed/22728672/)

  > Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012 Apr-Jun;6(2):80-92. doi: 10.4161/fly.19695. PubMed PMID: 22728672; PubMed Central PMCID: PMC3679285.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Watch event: 1
Delete event: 1
Issue comment event: 1
Push event: 8
Pull request event: 5
Fork event: 1
Create event: 3

Last Year

Watch event: 1
Delete event: 1
Issue comment event: 1
Push event: 8
Pull request event: 5
Fork event: 1
Create event: 3

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 0
Total pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: 26 days
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: 26 days
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Pull Request Authors

DarianHole (5)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

.github/workflows/branch.yml actions

mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite

.github/workflows/ci.yml actions

actions/checkout b4ffde65f46336ab88eb53be808477a3936bae11 composite
jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
nf-core/setup-nextflow b9f764e8ba5c76b712ace14ecbfcef0e40ae2dd8 composite

.github/workflows/linting.yml actions

actions/checkout b4ffde65f46336ab88eb53be808477a3936bae11 composite
actions/setup-python 0a5c61591373683505ea898e09a3ea4f39ef2b9c composite
actions/upload-artifact 5d5d22a31266ced268874388b861e4b58bb5c2f3 composite
nf-core/setup-nextflow b9f764e8ba5c76b712ace14ecbfcef0e40ae2dd8 composite

.github/workflows/linting_comment.yml actions

dawidd6/action-download-artifact f6b0bace624032e30a85a8fd9c1a7f8f611f5737 composite
marocchino/sticky-pull-request-comment 331f8f5b4215f0445d3c07b4967662a32a2d3e31 composite

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

modules/nf-core/samtools/flagstat/meta.yml cpan

pyproject.toml pypi

modules/local/artic/guppyplex/environment.yml pypi

modules/local/artic/minion/environment.yml pypi

modules/local/artic_subcommands/environment.yml pypi

modules/local/bcftools/consensus/environment.yml pypi

modules/local/bcftools/norm/environment.yml pypi

modules/local/bcftools/stats/environment.yml pypi

modules/local/bedtools/coverage/environment.yml pypi

modules/local/chopper/environment.yml pypi

modules/local/longshot/environment.yml pypi

modules/local/minimap2/environment.yml pypi

modules/local/multiqc/environment.yml pypi

modules/local/nanostat/environment.yml pypi

modules/local/qc/environment.yml pypi

modules/local/qualimap/bamqc/environment.yml pypi

modules/local/samtools/depth/environment.yml pypi

modules/local/samtools/reheader/environment.yml pypi

modules/local/snpeff/environment.yml pypi

modules/nf-core/custom/dumpsoftwareversions/environment.yml pypi

modules/nf-core/samtools/flagstat/environment.yml pypi
