mikrokondo

A simple pipeline for bacterial assembly and quality control

https://github.com/phac-nml/mikrokondo

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Keywords

annotation assembly bacteria bioinformatics contamination-detection metagenomics nextflow pipelines quality-control

Repository

A simple pipeline for bacterial assembly and quality control

Basic Info
Statistics
  • Stars: 17
  • Watchers: 7
  • Forks: 3
  • Open Issues: 29
  • Releases: 19
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md


Introduction

What is mikrokondo?

Mikrokondo is a tidy workflow for performing routine bioinformatic tasks such as read pre-processing, contamination assessment, assembly and quality assessment of assemblies. It is easily configurable, provides dynamic dispatch of species-specific workflows and produces common outputs.

Is mikrokondo right for me?

Mikrokondo is purpose-built to give sequencing and clinical laboratories a single standardized workflow for the initial quality assessment of sequencing reads and assemblies and for initial pathogen-specific typing. It is designed to be configurable, so new tools and quality metrics can be incorporated easily, allowing these routine tasks to be automated regardless of the pathogen of interest. It currently accepts Illumina, Nanopore or Pacbio sequencing data (Pacbio data has only been partially tested). It is capable of hybrid assembly and also accepts pre-assembled genomes.

This workflow will detect which pathogen(s) are present and apply the applicable metrics and genotypic typing where appropriate, generating easy-to-read reports. If your group regularly sequences or analyzes genomic sequences, implementing this workflow will automate the hands-on time usually required for these common bioinformatic tasks.

Citation

This software (currently unpublished) can be cited as:

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Contact

[Matthew Wells] : matthew.wells@phac-aspc.gc.ca

Installing mikrokondo

Step 1: Installing Nextflow

Nextflow is required to run mikrokondo (requires Linux). Instructions for its installation can be found at either Nextflow Home or Nextflow Documentation.

Step 2: Choose a Container Engine

Nextflow and mikrokondo only support running the pipeline using containers such as: Docker, Singularity (now Apptainer), podman, gitpod, shifter and charliecloud. Currently only usage with Singularity has been fully tested (Docker and Apptainer have only been partially tested), but support for each of the container services exists.

[!Note] Singularity was adopted by the Linux Foundation and is now called Apptainer. Singularity still exists, but it is likely newer installs will use Apptainer.

Docker or Singularity?

Docker requires root privileges, which can make it a hassle to install on computing clusters (there are workarounds). Apptainer/Singularity does not, so Apptainer/Singularity is the recommended way to run the pipeline.

Step 3: Install dependencies

Besides the Nextflow runtime (which requires Java) and a container engine, the dependencies required by mikrokondo are fairly minimal: only Python 3.10 (more recent Python versions will work as well) is needed to run it.

Dependencies listed

  • Python (>=3.10)
  • Nextflow (>=22.10.1)
  • Container service (Docker, Singularity, Apptainer have been tested)
  • The source code: git clone https://github.com/phac-nml/mikrokondo.git
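
The dependency list above can be checked with a short shell sketch. It only tests that the commands are on the PATH and that Python is at least 3.10; the helper name `check_dep` is hypothetical, and the Nextflow version is not verified here.

```bash
# Report whether a command is available on PATH (hypothetical helper).
check_dep() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Nextflow is required; "|| true" keeps the script going if it is absent.
check_dep nextflow || true

# Any one container engine is enough; stop at the first one found.
check_dep singularity || check_dep apptainer || check_dep docker || true

# Python must be 3.10 or newer.
if python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)' 2>/dev/null; then
    echo "python3 is 3.10 or newer"
else
    echo "python3 is missing or older than 3.10"
fi
```

The script reports rather than exits, so every missing dependency is shown in one pass.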

Step 4: Further resources to download

  • GTDB Mash Sketch: required for speciation and determination if sample is metagenomic
  • Decontamination Index: Required for decontamination of reads (it is simply a minimap2 index)
  • Kraken2 database: Required for binning of metagenomic data and is an alternative to using Mash for speciation
  • Bakta database: Running Bakta is optional, which makes downloading this database optional as well. A light database option exists, but the full one is recommended. You will have to unzip and un-tar the database before use.
  • StarAMR database: Running StarAMR is optional. You can download the StarAMR databases, but the StarAMR container also includes a database, which mikrokondo will use by default if none is specified, so this download is optional.
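
Unpacking a downloaded database archive, as needed for the Bakta database, might look like the sketch below. The helper name `unpack_db` and the example paths are hypothetical, and the archive is assumed to be a compressed tarball that `tar` can read directly.

```bash
# Unpack a downloaded database archive into a target directory.
# "tar -xf" lets tar detect the compression itself (GNU tar and bsdtar).
unpack_db() {
    archive=$1
    dest=$2
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
}

# Example call with hypothetical paths:
# unpack_db /PATH/TO/DOWNLOADS/db.tar.gz /PATH/TO/BAKTA
```

The resulting directory is what the database path in the configuration should point to.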

Configuration and settings:

The paths to the resources downloaded above must be set in the params section of your nextflow.config. The entries to update are listed below:

```
// Bakta db path, note the quotation marks
bakta_db = "/PATH/TO/BAKTA/DB"

// Decontamination minimap2 index, note the quotation marks
dehosting_idx = "/PATH/TO/DECONTAMINATION/INDEX"

// kraken db path, note the quotation marks
kraken2_db = "/PATH/TO/KRAKEN/DATABASE/"

// GTDB Mash sketch, note the quotation marks
mash_sketch = "/PATH/TO/MASH/SKETCH/"

// STARAMR database path, note the quotation marks
// Passing in a StarAMR database is optional; if one is not specified, the database in the container will be used.
// You can leave the db option as null if you do not wish to pass one.
staramr_db = "/PATH/TO/STARAMR/DB"
```

The above parameters can also be passed to the pipeline as command-line arguments if they are not set in the nextflow.config file.
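
As a sketch of the command-line route, the parameters named above can be supplied as `--name value` pairs, which is how Nextflow passes pipeline parameters. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences, and all paths are the same placeholders used in the config.

```bash
# Print the command instead of executing it unless DRY_RUN=0 is set
# (hypothetical helper, useful for checking the arguments first).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Database paths given on the command line instead of in nextflow.config.
run nextflow run main.nf \
    --input PATH_TO_SAMPLE_SHEET \
    --outdir OUTPUT_DIR \
    --platform SEQUENCING_PLATFORM \
    -profile CONTAINER_TYPE \
    --mash_sketch /PATH/TO/MASH/SKETCH/ \
    --dehosting_idx /PATH/TO/DECONTAMINATION/INDEX \
    --kraken2_db /PATH/TO/KRAKEN/DATABASE/
```

Set `DRY_RUN=0` in the environment to execute the command once the printed arguments look right.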

Getting Started

Usage

nextflow run main.nf --input PATH_TO_SAMPLE_SHEET --outdir OUTPUT_DIR --platform SEQUENCING_PLATFORM -profile CONTAINER_TYPE

Please check out the documentation for complete usage instructions here: docs

Under the usage section you can find example commands, instructions for configuration and a reference to a utility script to reduce command line bloat!

Data Input/formats

Mikrokondo requires two things as input:

  1. Sample files - fastq and fasta must be in gzip format
  2. Sample sheet - this FOFN (file of file names) contains sample names and allows the user to combine read-sets. The following header fields are accepted:
    • sample
    • fastq1
    • fastq2
    • long_reads
    • assembly
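
A sample sheet using these header fields could be written as in the sketch below. The comma-separated layout, the empty cells for unused fields, and all sample names and paths are assumptions made for illustration; the usage docs remain the reference for the exact format.

```bash
# Write a two-sample sheet; names and file paths are hypothetical.
cat > samplesheet.csv <<'EOF'
sample,fastq1,fastq2,long_reads,assembly
sampleA,/data/sampleA_R1.fastq.gz,/data/sampleA_R2.fastq.gz,,
sampleB,,,/data/sampleB_nanopore.fastq.gz,
EOF

# Flag any listed file that does not end in .gz, since inputs must be gzipped.
awk -F',' 'NR > 1 {
    for (i = 2; i <= NF; i++)
        if ($i != "" && $i !~ /\.gz$/) print "not gzipped: " $i
}' samplesheet.csv
```

The check prints nothing when every listed file has a `.gz` suffix; it looks only at file names, not file contents.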

For more information see the usage docs.

Output/Results

All output files will be written into the outdir (specified by the user). More explicit tool results can be found in both the Workflow and Subworkflow sections of the docs. Here is a brief description of the outdir structure (in general, the deeper you go into the structure, the later in the workflow the tool was run):

  • Assembly - contains all output files generated as a result of read assembly and tools using assembled contigs as input
    • Annotation - contains output files generated from tools applying annotation and/or gene characterization from assembled contigs
    • Assembling - contains output files generated as a part of the assembly process in nested order
    • FinalAssembly - this directory will always contain the final output contig files from the last step in the assembly process (will take into account any skip flags in the process)
    • PostProcessing - contains output files from intermediary tools that run after assembly but before annotation takes place in the workflow
    • Quality - contains all output files generated as a result of quality tools after assembly
  • Subtyping - contains all output files from workflow subtyping tools, based off assembled contigs
  • FinalReports - contains assorted reports including aggregated and flat reports
  • pipeline_info - includes tool versions and other pipeline specific information
  • Reads - contains all output files generated as a result of read processing and tools using reads as input
    • FinalReads - this directory will contain the final output read files from the last step in read processing (taking into account any skip flags used in the run)
    • Processing - contains output files from tools run to process reads in nested order
    • Quality - contains all output files generated from read quality tools
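
After a run, the layout described above can be checked with a small sketch like this one. The helper name `missing_dirs` is hypothetical, and the nesting follows the list as shown here. A directory may legitimately be absent when skip flags are used or when the input type does not produce it, so treat the output as a hint rather than an error.

```bash
# Print each expected output directory that is absent under the given outdir.
missing_dirs() {
    outdir=$1
    for d in \
        Assembly Assembly/Annotation Assembly/Assembling \
        Assembly/FinalAssembly Assembly/PostProcessing Assembly/Quality \
        Subtyping FinalReports pipeline_info \
        Reads Reads/FinalReads Reads/Processing Reads/Quality
    do
        [ -d "$outdir/$d" ] || echo "$d"
    done
}

# Example call with a placeholder path:
# missing_dirs OUTPUT_DIR
```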

Run example data

Four test profiles with example data are provided and can be run like so:

  • Assembly test profile: nextflow run main.nf -profile test_assembly,<docker/singularity> --outdir <OUTDIR>
  • Illumina test profile: nextflow run main.nf -profile test_illumina,<docker/singularity> --outdir <OUTDIR>
  • Nanopore test profile: nextflow run main.nf -profile test_nanopore,<docker/singularity> --outdir <OUTDIR>
  • Pacbio test profile: nextflow run main.nf -profile test_pacbio,<docker/singularity> --outdir <OUTDIR>
    • The pacbio workflow has only been partially tested as it fails at Flye due to not enough reads being present.

Testing

Integration tests are implemented using nf-test. In order to run tests locally, please do the following:

Install nf-test

```bash
# Only need to install package nf-test. Below is only for
# if you want to have nextflow and nf-test in a separate environment
conda create --name nextflow-testing nextflow nf-test
conda activate nextflow-testing
```

Run tests

```bash
# From mikrokondo root directory
nf-test test
```

Add --profile singularity to switch from using docker by default to using singularity.

Troubleshooting and FAQs:

Within release 0.1.0, Bakta is skipped by default; it can be enabled from the command line or within the nextflow.config (please check the docs for more information). It has been disabled because of an issue with amr_finder that affects the latest Bakta database releases. Fixes are available and older databases still work, but they have not been tested. A user can still enable Bakta themselves or fix the database. More information is provided here: https://github.com/oschwengers/bakta/issues/268

For a list of common issues or errors and their solutions, please read our FAQ section.

References

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

Legal and Compliance Information:

Copyright Government of Canada 2023

Written by: National Microbiology Laboratory, Public Health Agency of Canada

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Updates and Release Notes:

Owner

  • Name: National Microbiology Laboratory
  • Login: phac-nml
  • Kind: organization

Citation (CITATIONS.md)

# mk-kondo/mikrokondo: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [Abricate](https://github.com/tseemann/abricate)
  > Seemann T, Abricate, Github https://github.com/tseemann/abricate

- [Bakta](https://github.com/oschwengers/bakta#citation)
  > Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685

- [Bandage](https://github.com/rrwick/Bandage.git)
  > Wick R.R., Schultz M.B., Zobel J. & Holt K.E. (2015). Bandage: interactive visualisation of de novo genome assemblies. Bioinformatics, 31(20), 3350-3352.

- [CheckM](https://github.com/Ecogenomics/CheckM)
  > Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2014. Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25: 1043-1055.

- [ECTyper](https://github.com/phac-nml/ecoli_serotyping)
  > Bessonov K, Laing C, Robertson J, Yong I, Ziebell K, Gannon VPJ, Nichani A, Arya G, Nash JHE, Christianson S. ECTyper: in silico Escherichia coli serotype and species prediction from raw and assembled whole-genome sequence data. Microb Genom. 2021 Dec;7(12):000728. doi: 10.1099/mgen.0.000728. PMID: 34860150; PMCID: PMC8767331.

- [FastP](https://doi.org/10.1093/bioinformatics/bty560)
  > Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu; fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 1 September 2018, Pages i884-i890, https://doi.org/10.1093/bioinformatics/bty560

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [Flye](https://github.com/fenderglass/Flye)
  > Mikhail Kolmogorov, Derek M. Bickhart, Bahar Behsaz, Alexey Gurevich, Mikhail Rayko, Sung Bong Shin, Kristen Kuhn, Jeffrey Yuan, Evgeny Polevikov, Timothy P. L. Smith and Pavel A. Pevzner "metaFlye: scalable long-read metagenome assembly using repeat graphs", Nature Methods, 2020 doi:10.1038/s41592-020-00971-x

- [GTDB](https://gtdb.ecogenomic.org/)
  > Donovan H Parks, Maria Chuvochina, Christian Rinke, Aaron J Mussig, Pierre-Alain Chaumeil, Philip Hugenholtz, GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D785–D794, https://doi.org/10.1093/nar/gkab776

- [Kleborate](https://github.com/klebgenomics/Kleborate)
  > Lam, MMC. et al. A genomic surveillance framework and genotyping tool for Klebsiella pneumoniae and its related species complex. Nature Communications (2021)

- [Kraken2](https://github.com/DerrickWood/kraken2)
  > Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0

- [LisSero](https://github.com/MDU-PHL/LisSero)
  > Doumith et al. Differentiation of the major Listeria monocytogenes serovars by multiplex PCR. J Clin Microbiol, 2004; 42:8; 3819-22

- [Mash](https://github.com/marbl/Mash)
  > Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.

- [Medaka](https://github.com/nanoporetech/medaka)

- [Minimap2](https://github.com/lh3/minimap2)
  > Li, H. (2021). New strategies to improve minimap2 alignment accuracy. Bioinformatics, 37:4572-4574. doi:10.1093/bioinformatics/btab705

- [mlst](https://github.com/tseemann/mlst)
  > Seemann T, mlst Github https://github.com/tseemann/mlst

- [mob-suite](https://github.com/phac-nml/mob-suite)
  > Robertson J, Nash JHE. MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microb Genom. 2018 Aug;4(8):e000206. doi: 10.1099/mgen.0.000206. Epub 2018 Jul 27. PMID: 30052170; PMCID: PMC6159552.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [Pilon](https://github.com/broadinstitute/pilon)
  > Bruce J. Walker, Thomas Abeel, Terrance Shea, Margaret Priest, Amr Abouelliel, Sharadha Sakthikumar, Christina A. Cuomo, Qiandong Zeng, Jennifer Wortman, Sarah K. Young, Ashlee M. Earl (2014) Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS ONE 9(11): e112963. doi:10.1371/journal.pone.0112963

- [Quast](https://github.com/ablab/quast)
  > Alla Mikheenko, Andrey Prjibelski, Vladislav Saveliev, Dmitry Antipov, Alexey Gurevich, Versatile genome assembly evaluation with QUAST-LG, Bioinformatics (2018) 34 (13): i142-i150. doi: 10.1093/bioinformatics/bty266 First published online: June 27, 2018

- [Racon](https://github.com/isovic/racon)
  > Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017 May;27(5):737-746. doi: 10.1101/gr.214270.116. Epub 2017 Jan 18. PMID: 28100585; PMCID: PMC5411768.

- [Samtools](https://github.com/samtools/samtools)
  > Twelve years of SAMtools and BCFtools Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li GigaScience, Volume 10, Issue 2, February 2021, giab008, https://doi.org/10.1093/gigascience/giab008

- [Seqtk](https://github.com/lh3/seqtk.git)

- [Shigatyper](https://github.com/CFSAN-Biostatistics/shigatyper)
  > Wu Y, Lau HK, Lee T, Lau DK, Payne J In Silico Serotyping Based on Whole-Genome Sequencing Improves the Accuracy of Shigella Identification. Applied and Environmental Microbiology, 85(7). (2019)

- [Sistr](https://github.com/phac-nml/sistr_cmd)
  > The Salmonella In Silico Typing Resource (SISTR): an open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies. Catherine Yoshida, Peter Kruczkiewicz, Chad R. Laing, Erika J. Lingohr, Victor P.J. Gannon, John H.E. Nash, Eduardo N. Taboada. PLoS ONE 11(1): e0147101. doi: 10.1371/journal.pone.0147101

- [Spades](https://github.com/ablab/spades)
  > Bankevich A., Nurk S., Antipov D., Gurevich A., Dvorkin M., Kulikov A. S., Lesin V., Nikolenko S., Pham S., Prjibelski A., Pyshkin A., Sirotkin A., Vyahhi N., Tesler G., Alekseyev M. A., Pevzner P. A. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 2012

- [Spatyper](https://github.com/HCGB-IGTP/spaTyper)
  > Jose Francisco Sanchez-Herrero, & mjsull. (2020). spaTyper: Staphylococcal protein A (spa) characterization pipeline (v0.3.1). Zenodo. https://doi.org/10.5281/zenodo.4063625

- [StarAMR](https://github.com/phac-nml/staramr)
  > Bharat A, Petkau A, Avery BP, Chen JC, Folster JP, Carson CA, Kearney A, Nadon C, Mabon P, Thiessen J, Alexander DC, Allen V, El Bailey S, Bekal S, German GJ, Haldane D, Hoang L, Chui L, Minion J, Zahariadis G, Domselaar GV, Reid-Smith RJ, Mulvey MR. Correlation between Phenotypic and In Silico Detection of Antimicrobial Resistance in Salmonella enterica in Canada Using Staramr. Microorganisms. 2022; 10(2):292. https://doi.org/10.3390/microorganisms10020292

- [Unicycler](https://github.com/rrwick/Unicycler)
  > Wick RR, Judd LM, Gorrie CL, Holt KE (2017) Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology 13(6): e1005595. https://doi.org/10.1371/journal.pcbi.1005595

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Create event: 28
  • Release event: 8
  • Issues event: 21
  • Watch event: 5
  • Delete event: 17
  • Issue comment event: 68
  • Push event: 177
  • Pull request review comment event: 140
  • Pull request review event: 158
  • Pull request event: 54
  • Fork event: 1
Last Year
  • Create event: 28
  • Release event: 8
  • Issues event: 21
  • Watch event: 5
  • Delete event: 17
  • Issue comment event: 68
  • Push event: 177
  • Pull request review comment event: 140
  • Pull request review event: 158
  • Pull request event: 54
  • Fork event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 135
  • Total Committers: 2
  • Avg Commits per committer: 67.5
  • Development Distribution Score (DDS): 0.052
Past Year
  • Commits: 135
  • Committers: 2
  • Avg Commits per committer: 67.5
  • Development Distribution Score (DDS): 0.052
Top Committers
Name Email Commits
Matthew Wells m****9@s****a 128
Matthew Wells m****s@c****a 7
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 42
  • Total pull requests: 115
  • Average time to close issues: 29 days
  • Average time to close pull requests: 4 days
  • Total issue authors: 4
  • Total pull request authors: 8
  • Average comments per issue: 0.83
  • Average comments per pull request: 1.14
  • Merged pull requests: 89
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 15
  • Pull requests: 45
  • Average time to close issues: 11 days
  • Average time to close pull requests: 4 days
  • Issue authors: 3
  • Pull request authors: 7
  • Average comments per issue: 0.73
  • Average comments per pull request: 1.09
  • Merged pull requests: 32
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mattheww95 (36)
  • apetkau (3)
  • jrober84 (2)
  • sgsutcliffe (1)
Pull Request Authors
  • mattheww95 (77)
  • apetkau (24)
  • emarinier (4)
  • sgsutcliffe (4)
  • ChristyPeterson (3)
  • j3551ca (1)
  • dfornika (1)
  • kylacochrane (1)
Top Labels
Issue Labels
enhancement (24) bug (16) no-repro (1) question (1)
Pull Request Labels
enhancement (1)