monty_prerelease_info
Information about the MONTY metagenomics pipeline - coming soon!
Basic Info
- Host: GitHub
- Owner: charlesfoster
- License: mit
- Default Branch: main
- Size: 7.78 MB
Metadata Files
README.md
Purpose of this repository
This repository is intended to provide a simple overview of the MONTY pipeline, including the motivations behind its development, its goals, and a simplified explanation of how it works.
While the pipeline will be fully open-source, the code has not yet been publicly released, as further bug testing, refinements and optimisations are still being conducted. The release is planned for the end of 2024. Please feel free to check in on the following repository, where the pipeline will be available upon release: https://github.com/charlesfoster/monty. Once you stop seeing a 404 error, the pipeline can be considered ready to go!
While waiting for the pipeline to be released, please click here to enjoy footage of the real life Monty (see: "What's in a name?").
Related media
Files related to presentation of this pipeline, either via talks, posters, or other, are as follows:
- ABACBS 2024 poster: click here
Pipeline introduction
MONTY is a bioinformatics pipeline designed to simplify the execution of complex workflows for the analysis of metagenomic sequencing data, with a focus on the virome component. Based on an input spreadsheet, the pipeline takes raw reads (short reads: single-end or paired-end; long reads: under development), then conducts quality control, kmer-based taxonomy assignment, de novo assembly, virus identification, mapping-based taxonomy assignment, and estimation of taxon relative abundance.
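To make the "input spreadsheet" idea concrete, here is a minimal Python sketch of how such a sheet might be read and checked. The column names (`sample`, `fastq_1`, `fastq_2`) and the helper functions are assumptions borrowed from common nf-core samplesheet conventions, not MONTY's published format. The uniform-layout check reflects the current restriction, noted in the Roadmap, that all samples must be either single-end or paired-end.

```python
import csv
import io

# Hypothetical samplesheet layout (sample, fastq_1, fastq_2); MONTY's real
# column names have not been published and may differ.
EXAMPLE_SHEET = """sample,fastq_1,fastq_2
gut_01,gut_01_R1.fastq.gz,gut_01_R2.fastq.gz
gut_02,gut_02_R1.fastq.gz,gut_02_R2.fastq.gz
"""


def read_layouts(sheet_text):
    """Return {sample: 'paired' | 'single'} for every row of the sheet."""
    layouts = {}
    for row in csv.DictReader(io.StringIO(sheet_text)):
        sample = row["sample"].strip()
        if not row["fastq_1"].strip():
            raise ValueError(f"{sample}: fastq_1 is required")
        # A missing or empty fastq_2 column means the sample is single-end.
        has_mate = bool((row.get("fastq_2") or "").strip())
        layouts[sample] = "paired" if has_mate else "single"
    return layouts


def check_uniform_layout(layouts):
    """Raise if single-end and paired-end samples are mixed in one run."""
    kinds = set(layouts.values())
    if len(kinds) > 1:
        raise ValueError("mixed single-end and paired-end samples")
    return kinds.pop() if kinds else None


layouts = read_layouts(EXAMPLE_SHEET)
print(check_uniform_layout(layouts))  # prints "paired"
```

Validating the sheet up front like this means a mixed run fails immediately with a clear message rather than partway through the workflow.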
Currently, the taxon count matrices output by the pipeline contain raw counts, but upcoming development is planned to enable the generation of normalised counts. Although virus-focused, the pipeline will also work with any metagenomics dataset, provided that the databases used within the pipeline (both default and otherwise) are adjusted accordingly.
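Until normalised outputs are available, raw counts can be normalised downstream. The Python sketch below shows one simple and widely used option, per-sample relative abundance (each count divided by the sample total). It is only an illustration: the function name and the nested-dictionary input are hypothetical, and this is not necessarily the normalisation MONTY will adopt.

```python
def to_relative_abundance(counts):
    """Convert {sample: {taxon: raw_count}} to per-sample proportions."""
    normalised = {}
    for sample, taxa in counts.items():
        total = sum(taxa.values())
        # Leave all-zero samples at zero rather than dividing by zero.
        normalised[sample] = {
            taxon: (n / total if total else 0.0) for taxon, n in taxa.items()
        }
    return normalised


# Toy raw counts for two samples (illustrative values only).
raw = {
    "gut_01": {"Norovirus": 30, "Rotavirus": 10},
    "gut_02": {"Norovirus": 5, "Rotavirus": 15},
}
print(to_relative_abundance(raw))
```

Relative abundance removes differences in sequencing depth between samples, but the resulting values are compositional, so other approaches may be preferable for differential abundance testing.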
What's in a name?
Bioinformaticians have a long history of adopting interesting/funky names for their tools to stand out from the crowd and aid in searchability. So, instead of the original generic pipeline name of "viromics", we decided to name the pipeline after one of our lab's unofficial mascots: my dog Monty. After that, it was just a matter of forcing a backronym onto the pipeline, and, hence, the "Metagenomic Analysis of Existing and Novel Threats in Virology" name was birthed.
Why develop another new metagenomics pipeline?
As the field of metagenomics has grown, so too has the breadth of new bioinformatics pipelines to help analyse metagenomic data. Within our lab group we wanted the ability to control in fine detail how we analyse our data to avoid relying on external pipelines that might not quite do what we want, or might even cease to be supported. Additionally, while the number of metagenomics pipelines that consider viruses is growing, in some cases (certainly not all) the inclusion of virus-focused analyses seems like an afterthought, with the primary focus being the characterisation of bacterial taxa.
What goals have driven development?
As with other fields in science, conducting metagenomic analyses can seem like a bit of a black box. A given workflow might provide you results, but how reliable are those results? There are many steps that can introduce biases into results, such as the choice of tools and databases. Accordingly, we sought to develop a virus-focused workflow that is transparent, reliable, repeatable, freely available, open-source, highly scalable, and easy to run.
While several excellent workflow languages exist, a natural choice was to implement the workflow using the Nextflow domain-specific language, given the strong growth of Nextflow, including a surging uptake in the bioinformatics community. We also chose to follow the 'nf-core framework' for Nextflow workflow development, given its adherence to best practices and provision of automated pipeline testing, deployment and synchronization. We do not intend at this stage to propose MONTY as an official nf-core pipeline, given some overlaps with existing nf-core workflows (see: Credits), but continued development will mirror nf-core development procedures.
Because the pipeline is developed in Nextflow, MONTY can be executed on high-performance computing infrastructure (including cloud-based services such as AWS and Google Cloud Batch), and has native support for container technologies such as Docker and Singularity.
Pipeline steps
Quality Control
- Raw read QC (FastQC)
- Read deduplication (optional) (BBtools Clumpify)
- Read filtering/trimming (optional)
  - Short reads (fastp)
- Assessment of virome enrichment (optional) (ViromeQC)
- Removal of contaminant reads from the host and/or PhiX spike-in (optional) (bowtie2 or hostile)
- Clean read QC (FastQC)
Kmer-based Taxonomy Assignment
- Assignment of taxonomy to reads using kraken2 and/or centrifuge
  - Re-estimation of kraken2 counts using bracken
- Visualisation with KRONA
De Novo Assembly
- Assembly of reads de novo into contigs/scaffolds using MEGAHIT, SPAdes or PLASS PENGUIN
- Assessment of assembly quality (QUAST)
Virus Identification
- Identification of virus contigs from de novo assemblies (geNomad and/or cenotetaker3)
- Assessment of the quality/completeness of identified viruses (CheckV)
- Binning of virus genomes (vRhyme)
Counts Estimation
- Assessment of coverage of contigs (either all contigs or just virus contigs) (bowtie2, samtools, CoverM)
- Taxonomic assignment of reads and contigs using diamond and/or mmseqs2
- Reformatting and cleaning of taxids (taxonkit)
- Conversion of results into counts matrices aggregated at various user-defined taxonomic levels (custom script)
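The final step, aggregating assignments into a counts matrix at a chosen taxonomic level, can be sketched in Python as follows. This is not MONTY's custom script, which has not been released. The input structure (one `(sample, lineage)` pair per classified read), the function name, and the toy data are all hypothetical.

```python
from collections import Counter

# Hypothetical per-read assignments: (sample, lineage), where the lineage
# maps a taxonomic rank to a taxon name. Toy data for illustration only.
ASSIGNMENTS = [
    ("gut_01", {"family": "Caliciviridae", "genus": "Norovirus"}),
    ("gut_01", {"family": "Caliciviridae", "genus": "Sapovirus"}),
    ("gut_02", {"family": "Caliciviridae", "genus": "Norovirus"}),
    ("gut_02", {"family": "Sedoreoviridae", "genus": "Rotavirus"}),
]


def counts_matrix(assignments, rank):
    """Return (taxa, samples, rows): one row of raw counts per taxon at `rank`."""
    tallies = Counter()
    for sample, lineage in assignments:
        # Reads lacking the requested rank are pooled as "unclassified".
        taxon = lineage.get(rank, "unclassified")
        tallies[(taxon, sample)] += 1
    taxa = sorted({t for t, _ in tallies})
    samples = sorted({s for _, s in tallies})
    # Counter returns 0 for taxon/sample pairs that were never observed.
    rows = [[tallies[(t, s)] for s in samples] for t in taxa]
    return taxa, samples, rows


taxa, samples, rows = counts_matrix(ASSIGNMENTS, "family")
print(samples)  # ['gut_01', 'gut_02']
for taxon, row in zip(taxa, rows):
    print(taxon, row)
```

Calling `counts_matrix(ASSIGNMENTS, "genus")` on the same input yields a finer-grained matrix, which is the sense in which the aggregation level is user-defined.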
Roadmap
- Implementation of a long reads workflow
- Allow users to include some samples with paired-end reads and some samples with single-end reads (currently one 'type' or the other must be used for all samples)
- Improved normalisation of output count matrices
- In-depth benchmarking against existing tools
- Provision of MONTY as an online service via AWS and/or Seqera Platform
Credits
Code and inspiration
MONTY was originally written by Dr Charles S.P. Foster.
We thank the maintainers and developers of nf-core. The development of some pipeline sections and decisions has been inspired by, and overlaps with, several excellent nf-core workflows, such as nf-core/mag and nf-core/phageannotator. By design, these workflows are meant to be run sequentially, e.g. "nf-core/mag --> nf-core/phageannotator --> nf-core/differentialabundance --> ...", leaning into the strength of the interoperability of pipelines within the nf-core framework. However, in our case we wished to have an end-to-end pipeline with a focus on viruses that implements additional analytical tools and parameter values (both default values and possible values).
Funding
Development of the MONTY pipeline is supported in part by the following organisations:
- NHMRC IDEAS grant (2021404)
- JDRF Career Development award (5-CDA-2023-1332-S-B)
- JDRF Australia EMCR Science Accelerator Award (2-SRA-2021-1083-M-B)
- Helmsley Charitable Trust
Suggestions?
If you have any suggestions for development, please feel free to email me: charlesDOTfosterATunswDOTeduDOTau.
Citations
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Login: charlesfoster
- Kind: user
- Repositories: 2
- Profile: https://github.com/charlesfoster
Citation (CITATIONS.md)
# charlesfoster/monty: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [Bowtie2](https://dx.doi.org/10.1038/nmeth.1923)

  > Langmead, B. and Salzberg, S. L. 2012 Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), p. 357–359. doi: 10.1038/nmeth.1923.

- [Bracken](https://dx.doi.org/10.7717/peerj-cs.104)

  > Lu J, Breitwieser FP, Thielen P, Salzberg SL. 2017. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science 3:e104. https://doi.org/10.7717/peerj-cs.104

- [Cenotetaker3](https://doi.org/10.1093/ve/veaa100)

  > Michael J Tisza, Anna K Belford, Guillermo Domínguez-Huerta, Benjamin Bolduc, Christopher B Buck, Cenote-Taker 2 democratizes virus discovery and sequence annotation, Virus Evolution, Volume 7, Issue 1, January 2021, veaa100, https://doi.org/10.1093/ve/veaa100.

- [Centrifuge](https://doi.org/10.1101/gr.210641.116)

  > Kim, D., Song, L., Breitwieser, F. P., & Salzberg, S. L. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome research, 26(12), 1721-1729. doi: 10.1101/gr.210641.116.

- [CheckM](https://doi.org/10.1101/gr.186072.114)

  > Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. doi: 10.1101/gr.186072.114

- [CheckV](https://pubmed.ncbi.nlm.nih.gov/33349699/)

  > Nayfach S, Camargo AP, Schulz F, Eloe-Fadrosh E, Roux S, Kyrpides NC. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol. 2021 May;39(5):578-585. doi: 10.1038/s41587-020-00774-7. Epub 2020 Dec 21. PMID: 33349699; PMCID: PMC8116208.

- [Clumpify/BBTools](http://sourceforge.net/projects/bbmap/)

- [FastP](https://doi.org/10.1093/bioinformatics/bty560)

  > Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. doi: 10.1093/bioinformatics/bty560.

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [geNomad](https://doi.org/10.1101/2023.03.05.531206)

  > Camargo, A. P., et al. (2023). You can move, but you can’t hide: identification of mobile genetic elements with geNomad. bioRxiv preprint. doi: https://doi.org/10.1101/2023.03.05.531206

- [Kraken2](https://doi.org/10.1186/s13059-019-1891-0)

  > Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0.

- [Krona](https://doi.org/10.1186/1471-2105-12-385)

  > Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive metagenomic visualization in a Web browser. BMC bioinformatics, 12(1), 1-10. doi: 10.1186/1471-2105-12-385.

- [MEGAHIT](https://doi.org/10.1016/j.ymeth.2016.02.020)

  > Li, D., Luo, R., Liu, C. M., Leung, C. M., Ting, H. F., Sadakane, K., ... & Lam, T. W. (2016). MEGAHIT v1.0: a fast and scalable metagenome assembler driven by advanced methodologies and community practices. Methods, 102, 3-11. doi: 10.1016/j.ymeth.2016.02.020.

- [MMseqs2](https://www.nature.com/articles/nbt.3988)

  > Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [Penguin](https://doi.org/10.1101/2024.03.29.587318)

  > Annika Jochheim, Florian A. Jochheim, Alexandra Kolodyazhnaya, Étienne Morice, Martin Steinegger, Johannes Söding. Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs. bioRxiv 2024.03.29.587318; doi: https://doi.org/10.1101/2024.03.29.587318.

- [QUAST](https://doi.org/10.1093/bioinformatics/btt086)

  > Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, Glenn Tesler, QUAST: quality assessment tool for genome assemblies, Bioinformatics, Volume 29, Issue 8, April 2013, Pages 1072–1075, https://doi.org/10.1093/bioinformatics/btt086.

- [SAMtools](https://doi.org/10.1093/bioinformatics/btp352)

  > Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. doi: 10.1093/bioinformatics/btp352.

- [Seqkit](https://doi.org/10.1371/journal.pone.0163962)

  > Shen W, Le S, Li Y, Hu F (2016) SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11(10): e0163962. https://doi.org/10.1371/journal.pone.0163962

- [Seqtk](https://github.com/lh3/seqtk)

- [SPAdes](https://doi.org/10.1101/gr.213959.116)

  > Nurk, S., Meleshko, D., Korobeynikov, A., & Pevzner, P. A. (2017). metaSPAdes: a new versatile metagenomic assembler. Genome research, 27(5), 824-834. doi: 10.1101/gr.213959.116.

- [ViromeQC](https://doi.org/10.1038/s41587-019-0334-5)

  > Zolfo, M., Pinto, F., Asnicar, F. et al. Detecting contamination in viromes using ViromeQC. Nat Biotechnol 37, 1408–1412 (2019). https://doi.org/10.1038/s41587-019-0334-5.

- [vRhyme](https://pubmed.ncbi.nlm.nih.gov/35544285/)

  > Kieft K, Adams A, Salamzade R, Kalan L, Anantharaman K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Res. 2022 Aug 12;50(14):e83. doi: 10.1093/nar/gkac341. PMID: 35544285; PMCID: PMC9371927.

## Data

- [nf-core/mag test data](https://github.com/nf-core/test-datasets/tree/mag)

  > Sabrina Krakau, Daniel Straub, Hadrien Gourlé, Gisela Gabernet, Sven Nahnsen, nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning, NAR Genomics and Bioinformatics, Volume 4, Issue 1, March 2022, lqac007, https://doi.org/10.1093/nargab/lqac007

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.