mashwrapper
Species identification for Legionella using Illumina data
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to ncbi.nlm.nih.gov
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary
Repository
Species identification for Legionella using Illumina data
Basic Info
- Host: GitHub
- Owner: CDCgov
- License: mit
- Language: Groovy
- Default Branch: main
- Homepage: https://www.cdc.gov/legionella/index.html
- Size: 67.4 MB
Statistics
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 6
Metadata Files
README.md
mashwrapper
Org: CDC/NCIRD/DBB/RDB/PSLB
Contact Email: jhamlin@cdc.gov
Exemption: None
Status: Maintenance
Introduction
mashwrapper is a wrapper around the program Mash and the NCBI Datasets command line tools (CLI). It identifies the most likely species from paired gzipped FASTQ reads using a Mash database.
You can provide the database for comparison in two ways:
1. --get_database: Used when downloading and building a new Mash database from genomes
2. --use_database: Used when you're skipping the build step and instead providing a prebuilt Mash database
The tool outputs a text file containing the top five matches from the Mash database for the input reads. This output includes standard Mash results, and the best species match is determined by a cutoff based on the Mash distance score. For Legionella, this cutoff is conservatively set to a Mash distance of < 0.05. If you're using the tool for a different species, you should adjust this cutoff value based on what is most appropriate for your organism.
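As a rough illustration of the cutoff logic described above, the following Python sketch ranks Mash hits by distance, keeps the closest five, and reports a species only when the closest hit falls below the cutoff. It assumes tab-separated rows in the standard `mash dist` column order (reference, query, distance, p-value, shared hashes). The function names and sample rows are hypothetical and are not taken from mashwrapper's code or output file.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class MashHit(NamedTuple):
    reference: str      # reference genome in the Mash database
    query: str          # query (the read set being identified)
    distance: float     # Mash distance, 0 (identical) to 1
    p_value: float
    shared_hashes: str  # e.g. "800/1000"


def parse_mash_dist(text: str) -> List[MashHit]:
    """Parse tab-separated `mash dist` rows, skipping malformed lines."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) != 5:
            continue
        ref, query, dist, pval, shared = row
        hits.append(MashHit(ref, query, float(dist), float(pval), shared))
    return hits


def top_hits(hits: List[MashHit], n: int = 5) -> List[MashHit]:
    """Return the n closest hits (smallest Mash distance first)."""
    return sorted(hits, key=lambda h: h.distance)[:n]


def best_species(hits: List[MashHit], cutoff: float = 0.05) -> Optional[str]:
    """Return the closest reference only if its distance is below the cutoff."""
    ranked = top_hits(hits, n=1)
    if ranked and ranked[0].distance < cutoff:
        return ranked[0].reference
    return None


# Made-up rows for illustration only; not real Mash results.
example = (
    "Legionella_pneumophila_example.fna\tsample1\t0.01\t0\t800/1000\n"
    "Legionella_longbeachae_example.fna\tsample1\t0.2\t1e-10\t20/1000\n"
)
hits = parse_mash_dist(example)
print(best_species(hits))                # closest hit passes the 0.05 cutoff
print(best_species(hits, cutoff=0.005))  # stricter cutoff: no species is called
```

Passing the cutoff as a parameter mirrors the advice above: for an organism other than Legionella, a different threshold can be supplied without changing the ranking logic.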
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible.
Pipeline summary
- Confirm input sample sheet (`--get_database` or `--use_database`)
- Confirm input organism sheet (optional)
- Download genomes from NCBI using NCBI datasets CLI (optional)
- Format downloaded genomes to be GenusSpeciesGenebankIdentifier.fna using NCBI dataformat CLI (optional)
- Build individual Mash sketches for all genomes (optional)
- Build Mash database from all Mash sketches (optional)
- Test FASTQ reads against a Mash database either built or provided (`--get_database` or `--use_database`)
- Collate results from each isolate of interest tested against the Mash database (`--get_database` or `--use_database`)
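The sketch, build, and test steps listed above correspond roughly to the `mash sketch`, `mash paste`, and `mash dist` subcommands. The Python sketch below only assembles those command lines as argument lists so they can be inspected without Mash installed. The helper names, file names, and the choice of `-r -m 2` for read sketching are assumptions for illustration; the README does not show the exact options mashwrapper passes, nor how it combines the two paired read files.

```python
import shlex
from pathlib import Path
from typing import List, Sequence


def sketch_genome(genome: str) -> List[str]:
    """Sketch one genome; Mash appends '.msh' to the output prefix."""
    prefix = str(Path(genome).with_suffix(""))
    return ["mash", "sketch", "-o", prefix, genome]


def paste_sketches(db_prefix: str, sketches: Sequence[str]) -> List[str]:
    """Combine individual sketches into one database sketch."""
    return ["mash", "paste", db_prefix, *sketches]


def sketch_reads(reads: str, prefix: str) -> List[str]:
    """Sketch a read set; -r marks input as reads, -m 2 drops single-copy k-mers."""
    return ["mash", "sketch", "-r", "-m", "2", "-o", prefix, reads]


def dist(db: str, query_sketch: str) -> List[str]:
    """Compare a query sketch against the database sketch."""
    return ["mash", "dist", db, query_sketch]


def plan(genomes: Sequence[str], db_prefix: str, reads: str, sample: str) -> List[List[str]]:
    """Return the ordered list of commands for build-then-test."""
    cmds = [sketch_genome(g) for g in genomes]
    sketches = [str(Path(g).with_suffix("")) + ".msh" for g in genomes]
    cmds.append(paste_sketches(db_prefix, sketches))
    cmds.append(sketch_reads(reads, sample))
    cmds.append(dist(db_prefix + ".msh", sample + ".msh"))
    return cmds


# Hypothetical file names, used only to show the shape of the commands.
for cmd in plan(["genomeA.fna", "genomeB.fna"], "myDatabase", "sample1.fastq.gz", "sample1"):
    print(shlex.join(cmd))
```

When a prebuilt database is supplied, only the last two commands would apply, which is why the earlier steps in the summary are marked optional.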
Quick Start
- Install Nextflow (>=21.10.3)
- Install either Docker or Singularity to ensure full pipeline reproducibility with Nextflow. (Conda may be used as a last resort; see docs)
- Clone or download the pipeline and test it on a minimal dataset:
This repository includes a test dataset with the following files:
- inputDB.txt - A plain text file of species to download when using the `-profile testGet` option. File does not include a header.
- inputReads.csv - A CSV file listing paired-end read files. It has the following header: sample,fastq1,fastq2
- myMashDatabase.msh - A prebuilt Mash database from isolates listed in inputDB.txt file and used with the `-profile testUse` option.
- subERR125190_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella fallonii
- subERR351242_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella pneumophila
- subSRR10019387_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella longbeachae
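To make the expected shape of the read sample sheet concrete, here is a minimal Python sketch that checks a CSV against the `sample,fastq1,fastq2` header described above. The function name, the file-extension check, and the example row are assumptions for illustration; the pipeline's own sample sheet validation may apply different or additional rules.

```python
import csv
import io
from typing import Dict, List

EXPECTED_HEADER = ["sample", "fastq1", "fastq2"]


def read_sample_sheet(text: str) -> List[Dict[str, str]]:
    """Parse sample sheet text and raise ValueError on obvious problems."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(
            f"expected header {EXPECTED_HEADER}, got {reader.fieldnames}"
        )
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for key in EXPECTED_HEADER:
            if not row.get(key):
                raise ValueError(f"line {line_no}: missing value for '{key}'")
        for key in ("fastq1", "fastq2"):
            # Assumption: reads are gzipped FASTQ, as stated in the introduction.
            if not row[key].endswith((".fastq.gz", ".fq.gz")):
                raise ValueError(f"line {line_no}: '{key}' is not a gzipped FASTQ")
        rows.append(row)
    return rows


# Example row built from the test file names; the sample name is hypothetical.
sheet = (
    "sample,fastq1,fastq2\n"
    "subERR125190,subERR125190_1.fastq.gz,subERR125190_2.fastq.gz\n"
)
for row in read_sample_sheet(sheet):
    print(row["sample"], row["fastq1"], row["fastq2"])
```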
Step-by-step example commands
```console
## Step 1: Clone the repository
git clone https://github.com/CDCgov/mashwrapper.git

## Step 2: Test downloading and building the database
## "YOURPROFILE" is your preferred execution environment (Docker, Singularity or Conda)
nextflow run mashwrapper -profile testGet,YOURPROFILE

## Step 3: Test using a prebuilt database
## "YOURPROFILE" is your preferred execution environment (Docker, Singularity or Conda)
nextflow run mashwrapper -profile testUse,YOURPROFILE
```

*You will likely need to adjust the [nfcore_custom.config](https://github.com/CDCgov/mashwrapper/blob/main/conf/nfcore_custom.config) file to work on your compute environment. To use it, specify the path to its directory using the `--custom_config_base` flag. This should point to the "conf" directory (e.g., ~/mashwrapper/conf).*
- Start running your analysis!
```console
## Build a Mash database for organism(s) of interest
nextflow run nf-core/mashwrapper -profile

## Use a prebuilt Mash database
nextflow run nf-core/mashwrapper -profile
```
Documentation
The nf-core/mashwrapper pipeline comes with documentation covering pipeline usage, parameters, and output.
Credits
mashwrapper is based heavily on previous work by Jason Caravas with the current version written by Jenna Hamlin.
We thank the following people for their extensive assistance in the development of this pipeline:
Contributions and Support
If you would like to contribute to this pipeline, please file an Issue.
Repository Usage and Legal Notices
Please see the notices page for detailed information.
Owner
- Name: Centers for Disease Control and Prevention
- Login: CDCgov
- Kind: organization
- Email: data@cdc.gov
- Location: Atlanta, GA
- Website: http://open.cdc.gov/
- Twitter: CDCgov
- Repositories: 114
- Profile: https://github.com/CDCgov
CDC's collaborative software projects to protect America from health, safety, and security threats, both foreign and in the U.S.
Citation (CITATIONS.md)
# nf-core/mashwrapper: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Create event: 3
- Release event: 3
- Issues event: 1
- Watch event: 2
- Delete event: 1
- Issue comment event: 1
- Push event: 29
- Pull request event: 1
Last Year
- Create event: 3
- Release event: 3
- Issues event: 1
- Watch event: 2
- Delete event: 1
- Issue comment event: 1
- Push event: 29
- Pull request event: 1
Dependencies
- nf-core/tower-action v2 composite
- nf-core/tower-action v2 composite
- mshick/add-pr-comment v1 composite
- actions/checkout v2 composite
- actions/checkout v2 composite
- actions/setup-node v2 composite
- actions/setup-python v1 composite
- actions/upload-artifact v2 composite
- dawidd6/action-download-artifact v2 composite
- marocchino/sticky-pull-request-comment v2 composite