mashwrapper

Species identification for Legionella using Illumina data

https://github.com/cdcgov/mashwrapper

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to ncbi.nlm.nih.gov
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: CDCgov
License: mit
Language: Groovy
Default Branch: main
Homepage: https://www.cdc.gov/legionella/index.html
Size: 67.4 MB

Statistics

Stars: 5
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 6

Created over 4 years ago · Last pushed 12 months ago

Metadata Files

Readme Changelog Contributing License Code of conduct Citation

mashwrapper

Org: CDC/NCIRD/DBB/RDB/PSLB
Contact Email: jhamlin@cdc.gov
Exemption: None
Status: Maintenance

Introduction

mashwrapper is a wrapper around the program Mash and the NCBI Datasets command line tools (CLI). It identifies the most likely species from paired gzipped FASTQ reads using a Mash database.

You can provide the database for comparison in two ways:

1. --get_database: Used when downloading genomes and building a new Mash database from them
2. --use_database: Used when you're skipping the build step and instead providing a prebuilt Mash database

The tool outputs a text file containing the top five matches from the Mash database for the input reads. This output includes standard Mash results, and the best species match is determined by a cutoff based on the Mash distance score. For Legionella, this cutoff is conservatively set to a Mash distance of < 0.05. If you're using the tool for a different species, you should adjust this cutoff value based on what is most appropriate for your organism.
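The cutoff logic described above can be sketched with standard shell tools. This is an illustration, not the pipeline's own code. It assumes a tab-delimited table in the five-column layout that `mash dist` prints (reference, query, distance, p-value, shared hashes). The file name `mash_results.tsv`, the helper functions `top_five` and `best_match`, and every row of data are hypothetical placeholders.

```shell
#!/bin/sh
# Illustrative only: file name, helper names, and all rows below are invented.
# Columns: reference, query, distance, p-value, shared hashes.
printf '%s\t%s\t%s\t%s\t%s\n' \
  Legionella_longbeachae.fna sample1 0.1542 0 20/1000 \
  Legionella_pneumophila.fna sample1 0.0118 0 640/1000 \
  Legionella_fallonii.fna    sample1 0.1354 0 30/1000 \
  Legionella_anisa.fna       sample1 0.1677 0 15/1000 \
  Legionella_micdadei.fna    sample1 0.1868 0 10/1000 \
  Legionella_bozemanae.fna   sample1 0.1438 0 25/1000 \
  > mash_results.tsv

TAB=$(printf '\t')

# Print the five rows with the smallest distance (column 3).
top_five() {
  sort -t "$TAB" -k3,3g "$1" | head -n 5
}

# Report the closest reference only if its distance is below the cutoff.
best_match() {
  top_five "$1" | awk -F '\t' -v cutoff="$2" '
    NR == 1 {
      if ($3 + 0 < cutoff + 0) print $1
      else print "no match below cutoff " cutoff
    }'
}

top_five mash_results.tsv
best_match mash_results.tsv 0.05
```

Passing the cutoff as an argument makes it easy to try a different threshold when working with an organism other than Legionella.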

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible.

Pipeline summary

1. Confirm input sample sheet (--get_database or --use_database)
2. Confirm input organism sheet (optional)
3. Download genomes from NCBI using NCBI datasets CLI (optional)
4. Format downloaded genomes to be GenusSpeciesGenebankIdentifier.fna using NCBI dataformat CLI (optional)
5. Build individual Mash sketches for all genomes (optional)
6. Build Mash database from all Mash sketches (optional)
7. Test FASTQ reads against a Mash database either built or provided (--get_database or --use_database)
8. Collate results from each isolate of interest tested against the Mash database (--get_database or --use_database)

Quick Start

Install Nextflow (>=21.10.3)
Install either Docker or Singularity to ensure full pipeline reproducibility with Nextflow. Conda may be used as a last resort (see docs).
Clone or download the pipeline and test it on a minimal dataset:

This repository includes a test dataset with the following files:

- inputDB.txt - A plain text file of species to download when using the -profile testGet option. The file does not include a header.
- inputReads.csv - A CSV file listing paired-end read files. It has the following header: sample,fastq1,fastq2
- myMashDatabase.msh - A prebuilt Mash database built from the isolates listed in inputDB.txt and used with the -profile testUse option.
- subERR125190_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella fallonii
- subERR351242_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella pneumophila
- subSRR10019387_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella longbeachae

Step-by-step example commands

```console
## Step 1: Clone the repository
git clone https://github.com/CDCgov/mashwrapper.git

## Step 2: Test downloading and building the database
## "YOURPROFILE" is your preferred execution environment (Docker, Singularity or Conda)
nextflow run mashwrapper -profile testGet,YOURPROFILE

## Step 3: Test using a prebuilt database
## "YOURPROFILE" is your preferred execution environment (Docker, Singularity or Conda)
nextflow run mashwrapper -profile testUse,YOURPROFILE
```

*You will likely need to adjust the [nfcore_custom.config](https://github.com/CDCgov/mashwrapper/blob/main/conf/nfcore_custom.config) file to work on your compute environment. To use it, specify the path to its directory using the `--custom_config_base` flag. This should point to the "conf" directory (e.g., ~/mashwrapper/conf).*

Start running your analysis!

```console
## Build a Mash database for organism(s) of interest
nextflow run nf-core/mashwrapper -profile YOURPROFILE --input samplesheet.csv --get_database organismsheet.txt --custom_config_base ~/mashwrapper/conf

## Use a prebuilt Mash database
nextflow run nf-core/mashwrapper -profile YOURPROFILE --input samplesheet.csv --use_database myMashDatabase.msh --custom_config_base ~/mashwrapper/conf
```
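A sample sheet such as the `samplesheet.csv` passed to `--input` can be generated from a directory of paired reads. The sketch below is a convenience helper, not part of the pipeline. It assumes the same header as the bundled inputReads.csv (sample,fastq1,fastq2) and read files named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`, as in the test dataset. The function name `make_samplesheet` and the demo directory are hypothetical, and whether the pipeline requires absolute paths is not confirmed here.

```shell
#!/bin/sh
# Write one CSV row for every <sample>_1.fastq.gz / <sample>_2.fastq.gz pair.
make_samplesheet() {
  reads_dir=$1
  printf 'sample,fastq1,fastq2\n'
  for r1 in "$reads_dir"/*_1.fastq.gz; do
    [ -e "$r1" ] || continue            # the glob matched nothing
    r2="${r1%_1.fastq.gz}_2.fastq.gz"
    if [ ! -e "$r2" ]; then
      printf 'skipping %s: mate file not found\n' "$r1" >&2
      continue
    fi
    sample=$(basename "$r1" _1.fastq.gz)
    printf '%s,%s,%s\n' "$sample" "$r1" "$r2"
  done
}

# Demo with empty placeholder files named after two of the bundled read sets,
# plus one unpaired file to show that it is skipped.
demo_dir=$(mktemp -d)
touch "$demo_dir/subERR125190_1.fastq.gz" "$demo_dir/subERR125190_2.fastq.gz" \
      "$demo_dir/subERR351242_1.fastq.gz" "$demo_dir/subERR351242_2.fastq.gz" \
      "$demo_dir/orphan_1.fastq.gz"
make_samplesheet "$demo_dir" > samplesheet.csv
cat samplesheet.csv
```

Skipping unpaired files with a warning on stderr keeps the sheet limited to complete read pairs, which matches the paired-end input the tool expects.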

Documentation

The nf-core/mashwrapper pipeline comes with documentation covering pipeline usage, parameters, and output.

Credits

mashwrapper is based heavily on previous work by Jason Caravas; the current version was written by Jenna Hamlin.

We thank the following people for their extensive assistance in the development of this pipeline:

Contributions and Support

If you would like to contribute to this pipeline, please file an Issue.

Repository Usage and Legal Notices

Please see the notices page for detailed information.

Owner

Name: Centers for Disease Control and Prevention
Login: CDCgov
Kind: organization
Email: data@cdc.gov
Location: Atlanta, GA

Website: http://open.cdc.gov/
Twitter: CDCgov
Repositories: 114
Profile: https://github.com/CDCgov

CDC's collaborative software projects to protect America from health, safety, and security threats, both foreign and in the U.S.

Citation (CITATIONS.md)

# nf-core/mashwrapper: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Create event: 3
Release event: 3
Issues event: 1
Watch event: 2
Delete event: 1
Issue comment event: 1
Push event: 29
Pull request event: 1

Last Year

Create event: 3
Release event: 3
Issues event: 1
Watch event: 2
Delete event: 1
Issue comment event: 1
Push event: 29
Pull request event: 1

Dependencies

.github/workflows/awsfulltest.yml actions

nf-core/tower-action v2 composite

.github/workflows/awstest.yml actions

nf-core/tower-action v2 composite

.github/workflows/branch.yml actions

mshick/add-pr-comment v1 composite

.github/workflows/ci.yml actions

actions/checkout v2 composite

.github/workflows/linting.yml actions

actions/checkout v2 composite
actions/setup-node v2 composite
actions/setup-python v1 composite
actions/upload-artifact v2 composite

.github/workflows/linting_comment.yml actions

dawidd6/action-download-artifact v2 composite
marocchino/sticky-pull-request-comment v2 composite

modules/nf-core/modules/custom/dumpsoftwareversions/meta.yml cpan
