monty_prerelease_info

Information about the MONTY metagenomics pipeline - coming soon!

https://github.com/charlesfoster/monty_prerelease_info

Repository

Basic Info

Host: GitHub
Owner: charlesfoster
License: mit
Default Branch: main
Size: 7.78 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

Purpose of this repository

This repository is intended to provide a simple overview of the MONTY pipeline, including the motivations behind its development, its goals, and a simplified explanation of how it works.

While the pipeline will be fully open-source, the code has not yet been publicly released because further bug testing, refinement and optimisation are still under way. Release is planned for the end of 2024. Please feel free to check in on the following repository, where the pipeline will be available upon release: https://github.com/charlesfoster/monty. Once you stop seeing a 404 error, the pipeline can be considered ready to go!

While waiting for the pipeline to be released, please click here to enjoy footage of the real life Monty (see: "What's in a name?").

Related media

Files related to presentation of this pipeline, either via talks, posters, or other, are as follows:

ABACBS 2024 poster: click here

Pipeline introduction

MONTY is a bioinformatics pipeline designed to make complex workflows for the analysis of metagenomic sequencing data simple to run, with a focus on the virome component. Based on an input spreadsheet, the pipeline takes raw reads (short reads: single-end or paired-end; long reads: under development), then conducts quality control, kmer-based taxonomy assignment, de novo assembly, virus identification, mapping-based taxonomy assignment, and estimation of taxon relative abundance.

Currently the taxon count matrices output by the pipeline are raw counts, but upcoming development is planned to enable the generation of normalised counts. Although virus-focused, the pipeline will also work with any metagenomics dataset, provided the databases used within the pipeline (both default and otherwise) are adjusted accordingly.
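To illustrate the difference between raw and normalised counts, here is a minimal Python sketch that rescales a raw count matrix to counts per million within each sample. Counts per million is only one common choice and is used here as an assumption; the normalisation MONTY will eventually implement is not specified in this document, and the function name and taxon/sample labels are hypothetical.

```python
from typing import Dict

# taxon -> sample -> raw read count (hypothetical layout for illustration)
CountMatrix = Dict[str, Dict[str, int]]


def counts_per_million(counts: CountMatrix) -> Dict[str, Dict[str, float]]:
    """Scale each sample's raw counts so that they sum to one million."""
    # Total assigned reads per sample (column sums of the matrix).
    totals: Dict[str, int] = {}
    for per_sample in counts.values():
        for sample, n in per_sample.items():
            totals[sample] = totals.get(sample, 0) + n

    normalised: Dict[str, Dict[str, float]] = {}
    for taxon, per_sample in counts.items():
        normalised[taxon] = {
            # Guard against samples with zero assigned reads.
            sample: (n / totals[sample] * 1_000_000 if totals[sample] else 0.0)
            for sample, n in per_sample.items()
        }
    return normalised


# Placeholder taxa and samples, not real pipeline output.
raw = {
    "Taxon A": {"sample1": 30, "sample2": 5},
    "Taxon B": {"sample1": 70, "sample2": 15},
}
cpm = counts_per_million(raw)
```

Scaling within each sample makes samples of different sequencing depth comparable, which raw counts alone do not allow.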

What's in a name?

Bioinformaticians have a long history of adopting interesting/funky names for their tools to stand out from the crowd and aid in searchability. So, instead of the original generic pipeline name of "viromics", we decided to name the pipeline after one of our lab's unofficial mascots: my dog Monty. After that, it was just a matter of forcing a backronym onto the pipeline, and, hence, the "Metagenomic Analysis of Existing and Novel Threats in Virology" name was birthed.

Why develop another new metagenomics pipeline?

As the field of metagenomics has grown, so too has the breadth of new bioinformatics pipelines to help analyse metagenomic data. Within our lab group we wanted the ability to control in fine detail how we analyse our data to avoid relying on external pipelines that might not quite do what we want, or might even cease to be supported. Additionally, while the number of metagenomics pipelines that consider viruses is growing, in some cases (certainly not all) the inclusion of virus-focused analyses seems like an afterthought, with the primary focus being the characterisation of bacterial taxa.

What goals have driven development?

As with other fields in science, conducting metagenomic analyses can seem like a bit of a black box. A given workflow might provide you results, but how reliable are those results? There are many steps that can introduce biases into results, such as the choice of tools and databases. Accordingly, we sought to develop a virus-focused workflow that is transparent, reliable, repeatable, freely available, open-source, highly scalable, and easy to run.

While several excellent workflow languages exist, a natural choice was to implement the workflow in the Nextflow domain-specific language, given Nextflow's strong growth and surging uptake in the bioinformatics community. We also chose to follow the 'nf-core framework' for Nextflow workflow development, given its adherence to best practices and provision of automated pipeline testing, deployment and synchronization. We do not intend at this stage to propose MONTY as an official nf-core pipeline, given some overlaps with existing nf-core workflows (see: Credits), but continued development will mirror nf-core development procedures.

By developing the pipeline in Nextflow, MONTY will be able to be executed on high-performance computing infrastructures (including cloud-based services like AWS, Google Cloud Batch etc.), as well as having native support for container technologies such as Docker and Singularity.

Pipeline steps

Quality Control

Raw read QC (FastQC)
Read deduplication (optional) (BBtools Clumpify)
Read filtering/trimming (optional)
- Short reads: (fastp)
Assessment of virome enrichment (optional) (ViromeQC)
Removal of contaminant reads from the host and/or PhiX spike-in (optional) (bowtie2 or hostile)
Clean read QC (FastQC)

Kmer-based Taxonomy Assignment

Assignment of taxonomy to reads using kraken2 and/or centrifuge
- Re-estimation of kraken2 counts using bracken
Visualisation with KRONA

De Novo Assembly

Assembly of reads de novo into contigs/scaffolds using MEGAHIT, SPAdes or PLASS PENGUIN
Assessment of assembly quality (QUAST)

Virus Identification

Identification of virus contigs from de novo assemblies (geNomad and/or cenotetaker3)
Assessment of the quality/completeness of identified viruses (CheckV)
Binning of virus genomes (vRhyme)

Counts Estimation

Assessment of coverage of contigs (either all contigs or just virus contigs) (bowtie2, samtools, CoverM)
Taxonomic assignment of reads and contigs using diamond and/or mmseqs2
Reformatting and cleaning of taxids (taxonkit)
Conversion of results into counts matrices aggregated at various user-defined taxonomic levels (custom script)
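The final step above, aggregating counts at a user-defined taxonomic level, can be sketched in Python as follows. This is not the pipeline's custom script, which has not been released; the record layout (sample, lineage, count), the function name, and the placeholder taxon names are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical input record: (sample, lineage, read count),
# where lineage maps a rank name to a taxon name.
Record = Tuple[str, Dict[str, str], int]


def aggregate_counts(records: List[Record], rank: str) -> Dict[str, Dict[str, int]]:
    """Sum counts per taxon at the requested rank: taxon -> sample -> count."""
    matrix: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for sample, lineage, count in records:
        # Hits lacking the requested rank are pooled rather than dropped.
        taxon = lineage.get(rank, "unclassified")
        matrix[taxon][sample] += count
    # Convert nested defaultdicts to plain dicts for the caller.
    return {taxon: dict(per_sample) for taxon, per_sample in matrix.items()}


# Placeholder lineages, not real taxonomy.
records: List[Record] = [
    ("s1", {"family": "FamilyA", "genus": "GenusA1"}, 10),
    ("s1", {"family": "FamilyA", "genus": "GenusA2"}, 5),
    ("s2", {"family": "FamilyB", "genus": "GenusB1"}, 7),
    ("s2", {"family": "FamilyA"}, 3),
]
by_family = aggregate_counts(records, "family")
by_genus = aggregate_counts(records, "genus")
```

Pooling hits without the requested rank under "unclassified" keeps the per-sample totals identical across ranks, which is one possible design choice among several.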

Roadmap

Implementation of a long reads workflow
Allow users to include some samples with paired-end reads and some samples with single-end reads (currently one 'type' or the other must be used for all samples)
Improved normalisation of output count matrices
In-depth benchmarking against existing tools
Provision of MONTY as an online service via AWS and/or Seqera Platform
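The current restriction that all samples share one read layout could be checked up front before a run. The Python sketch below assumes an nf-core-style sample sheet with columns `sample`, `fastq_1` and `fastq_2`; this column layout is an assumption, as the MONTY input spreadsheet format is not described here, and the function names and file names are hypothetical.

```python
import csv
import io
from typing import Dict


def read_layouts(samplesheet_text: str) -> Dict[str, str]:
    """Return sample -> 'paired' or 'single', based on whether fastq_2 is filled."""
    layouts: Dict[str, str] = {}
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        # A missing or blank fastq_2 field is treated as single-end.
        has_r2 = bool((row.get("fastq_2") or "").strip())
        layouts[row["sample"]] = "paired" if has_r2 else "single"
    return layouts


def check_single_layout(layouts: Dict[str, str]) -> str:
    """Raise if the sheet mixes layouts; otherwise return the shared layout."""
    kinds = set(layouts.values())
    if not kinds:
        raise ValueError("Sample sheet contains no samples")
    if len(kinds) > 1:
        raise ValueError(f"Mixed read layouts are not supported: {sorted(kinds)}")
    return kinds.pop()


# Hypothetical sample sheet contents.
sheet = """sample,fastq_1,fastq_2
s1,s1_R1.fastq.gz,s1_R2.fastq.gz
s2,s2_R1.fastq.gz,s2_R2.fastq.gz
"""
layout = check_single_layout(read_layouts(sheet))
```

Once mixed layouts are supported, a check like this would become a per-sample dispatch rather than a hard error.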

Credits

Code and inspiration

MONTY was originally written by Dr Charles S.P. Foster.

We thank the maintainers and developers of nf-core. The development of some pipeline sections and decisions has been inspired by, and overlaps with, several excellent nf-core workflows, such as nf-core/mag and nf-core/phageannotator. By design, these workflows are meant to be run sequentially, e.g. "nf-core/mag --> nf-core/phageannotator --> nf-core/differentialabundance --> ...", leaning into the strength of the interoperability of pipelines within the nf-core framework. However, in our case we wished to have an end-to-end pipeline with a focus on viruses that implements additional analytical tools and parameter values (both default values and possible values).

Funding

Development of the MONTY pipeline is supported in part by the following organisations:

NHMRC IDEAS grant (2021404)

JDRF Career Development award (5-CDA-2023-1332-S-B)
JDRF Australia EMCR Science Accelerator Award (2-SRA-2021-1083-M-B)

Helmsley Charitable Trust

Suggestions?

If you have any suggestions for development, please feel free to email me: charlesDOTfosterATunswDOTeduDOTau.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Login: charlesfoster
Kind: user

Repositories: 2
Profile: https://github.com/charlesfoster

Citation (CITATIONS.md)

# charlesfoster/monty: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [Bowtie2](https://dx.doi.org/10.1038/nmeth.1923)

  > Langmead, B. and Salzberg, S. L. 2012 Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), p. 357–359. doi: 10.1038/nmeth.1923.

- [Bracken](https://dx.doi.org/10.7717/peerj-cs.104)

  > Lu J, Breitwieser FP, Thielen P, Salzberg SL. 2017. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science 3:e104. https://doi.org/10.7717/peerj-cs.104

- [Cenotetaker3](https://doi.org/10.1093/ve/veaa100)

  > Michael J Tisza, Anna K Belford, Guillermo Domínguez-Huerta, Benjamin Bolduc, Christopher B Buck, Cenote-Taker 2 democratizes virus discovery and sequence annotation, Virus Evolution, Volume 7, Issue 1, January 2021, veaa100, https://doi.org/10.1093/ve/veaa100.

- [Centrifuge](https://doi.org/10.1101/gr.210641.116)

  > Kim, D., Song, L., Breitwieser, F. P., & Salzberg, S. L. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome research, 26(12), 1721-1729. doi: 10.1101/gr.210641.116.

- [CheckM](https://doi.org/10.1101/gr.186072.114)

  > Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. doi: 10.1101/gr.186072.114

- [CheckV](https://pubmed.ncbi.nlm.nih.gov/33349699/)

  > Nayfach S, Camargo AP, Schulz F, Eloe-Fadrosh E, Roux S, Kyrpides NC. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol. 2021 May;39(5):578-585. doi: 10.1038/s41587-020-00774-7. Epub 2020 Dec 21. PMID: 33349699; PMCID: PMC8116208.

- [Clumpify/BBTools](http://sourceforge.net/projects/bbmap/)

- [FastP](https://doi.org/10.1093/bioinformatics/bty560)

  > Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. doi: 10.1093/bioinformatics/bty560.

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [geNomad](https://doi.org/10.1101/2023.03.05.531206)

  > Camargo, A. P., et al. (2023). You can move, but you can’t hide: identification of mobile genetic elements with geNomad. bioRxiv preprint. doi: https://doi.org/10.1101/2023.03.05.531206

- [Kraken2](https://doi.org/10.1186/s13059-019-1891-0)

  > Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0.

- [Krona](https://doi.org/10.1186/1471-2105-12-385)

  > Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive metagenomic visualization in a Web browser. BMC bioinformatics, 12(1), 1-10. doi: 10.1186/1471-2105-12-385.

- [MEGAHIT](https://doi.org/10.1016/j.ymeth.2016.02.020)

  > Li, D., Luo, R., Liu, C. M., Leung, C. M., Ting, H. F., Sadakane, K., ... & Lam, T. W. (2016). MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods, 102, 3-11. doi: 10.1016/j.ymeth.2016.02.020.

- [MMseqs2](https://www.nature.com/articles/nbt.3988)

  > Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [Penguin](https://doi.org/10.1101/2024.03.29.587318)

  > Annika Jochheim, Florian A. Jochheim, Alexandra Kolodyazhnaya, Étienne Morice, Martin Steinegger, Johannes Söding. Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs. bioRxiv 2024.03.29.587318; doi: https://doi.org/10.1101/2024.03.29.587318.

- [QUAST](https://doi.org/10.1093/bioinformatics/btt086)

  > Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, Glenn Tesler, QUAST: quality assessment tool for genome assemblies, Bioinformatics, Volume 29, Issue 8, April 2013, Pages 1072–1075, https://doi.org/10.1093/bioinformatics/btt086.

- [SAMtools](https://doi.org/10.1093/bioinformatics/btp352)

  > Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. doi: 10.1093/bioinformatics/btp352.

- [Seqkit](https://doi.org/10.1371/journal.pone.0163962)

  > Shen W, Le S, Li Y, Hu F (2016) SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11(10): e0163962. https://doi.org/10.1371/journal.pone.0163962

- [Seqtk](https://github.com/lh3/seqtk)

- [SPAdes](https://doi.org/10.1101/gr.213959.116)

  > Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome research, 27(5), 824-834. doi: 10.1101/gr.213959.116.

- [ViromeQC](https://doi.org/10.1038/s41587-019-0334-5)

  > Zolfo, M., Pinto, F., Asnicar, F. et al. Detecting contamination in viromes using ViromeQC. Nat Biotechnol 37, 1408–1412 (2019). https://doi.org/10.1038/s41587-019-0334-5.

- [vRhyme](https://pubmed.ncbi.nlm.nih.gov/35544285/)

  > Kieft K, Adams A, Salamzade R, Kalan L, Anantharaman K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Res. 2022 Aug 12;50(14):e83. doi: 10.1093/nar/gkac341. PMID: 35544285; PMCID: PMC9371927.

## Data

- [nf-core/mag test data](https://github.com/nf-core/test-datasets/tree/mag)

  > Sabrina Krakau, Daniel Straub, Hadrien Gourlé, Gisela Gabernet, Sven Nahnsen, nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning, NAR Genomics and Bioinformatics, Volume 4, Issue 1, March 2022, lqac007, https://doi.org/10.1093/nargab/lqac007

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
