mashwrapper

Species identification for Legionella using Illumina data

https://github.com/cdcgov/mashwrapper

Repository

Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 6
Created about 4 years ago · Last pushed 8 months ago
Metadata Files
Readme, Changelog, Contributing, License, Code of conduct, Citation

README.md

mashwrapper

Org: CDC/NCIRD/DBB/RDB/PSLB
Contact Email: jhamlin@cdc.gov
Exemption: None
Status: Maintenance

Introduction

mashwrapper is a wrapper around the program Mash and the NCBI Datasets command line tools (CLI). It identifies the most likely species from paired gzipped FASTQ reads using a Mash database.

You can provide the database for comparison in two ways:

  1. --get_database: Used when downloading and building a new Mash database from genomes
  2. --use_database: Used when you're skipping the build step and instead providing a prebuilt Mash database

The tool outputs a text file containing the top five matches from the Mash database for the input reads. This output includes standard Mash results, and the best species match is determined by a cutoff based on the Mash distance score. For Legionella, this cutoff is conservatively set to a Mash distance of < 0.05. If you're using the tool for a different species, you should adjust this cutoff value based on what is most appropriate for your organism.
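The selection logic described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: it assumes the results follow the five tab-separated columns printed by `mash dist` (reference, query, distance, p-value, shared hashes), and the function names `top_matches` and `best_species` as well as the example values are hypothetical.

```python
import csv
import io

# Assumption: rows follow the five tab-separated columns that `mash dist` prints:
# reference ID, query ID, Mash distance, p-value, shared hashes.
MASH_DIST_COLUMNS = ["reference", "query", "distance", "p_value", "shared_hashes"]


def top_matches(mash_dist_text, n=5):
    """Return the n rows with the smallest Mash distance (hypothetical helper)."""
    reader = csv.DictReader(
        io.StringIO(mash_dist_text), fieldnames=MASH_DIST_COLUMNS, delimiter="\t"
    )
    rows = [dict(row, distance=float(row["distance"])) for row in reader]
    return sorted(rows, key=lambda row: row["distance"])[:n]


def best_species(matches, cutoff=0.05):
    """Return the closest reference if its distance is below the cutoff, else None."""
    if matches and matches[0]["distance"] < cutoff:
        return matches[0]["reference"]
    return None


# Invented example values, not real pipeline output.
example = "\n".join([
    "Legionella_pneumophila.fna\tsample1\t0.012\t0\t780/1000",
    "Legionella_longbeachae.fna\tsample1\t0.21\t1e-20\t12/1000",
    "Legionella_fallonii.fna\tsample1\t0.19\t1e-25\t15/1000",
])

matches = top_matches(example)
print(best_species(matches))  # Legionella_pneumophila.fna
```

Keeping the cutoff as a parameter mirrors the advice above: 0.05 is the conservative value used for Legionella, and a different organism may call for a different threshold.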

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible.

Pipeline summary

  1. Confirm input sample sheet (--get_database or --use_database)
  2. Confirm input organism sheet (optional)
  3. Download genomes from NCBI using NCBI datasets CLI (optional)
  4. Format downloaded genomes to be GenusSpeciesGenebankIdentifier.fna using NCBI dataformat CLI (optional)
  5. Build individual Mash sketches for all genomes (optional)
  6. Build Mash database from all Mash sketches (optional)
  7. Test FASTQ reads against a Mash database either built or provided (--get_database or --use_database)
  8. Collate results from each isolate of interest tested against the Mash database (--get_database or --use_database)

Quick Start

  1. Install Nextflow (>=21.10.3)

  2. Install either Docker or Singularity to ensure full pipeline reproducibility with Nextflow. Conda may be used as a last resort (see docs).

  3. Clone or download the pipeline and test it on a minimal dataset:

This repository includes a test dataset with the following files:

  • inputDB.txt - A plain text file of species to download when using the -profile testGet option. The file does not include a header.
  • inputReads.csv - A CSV file listing paired-end read files. It has the following header: sample,fastq1,fastq2
  • myMashDatabase.msh - A prebuilt Mash database from isolates listed in the inputDB.txt file and used with the -profile testUse option.
  • subERR125190_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella fallonii
  • subERR351242_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella pneumophila
  • subSRR10019387_(1,2).fastq.gz - Subsampled reads (45,000 reads) from Legionella longbeachae
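For your own data, a sample sheet with the same header as inputReads.csv could be generated from a list of read files. The sketch below is a hypothetical helper, not part of the pipeline. It assumes mates are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz` like the bundled test reads, and that the sample name is the shared file prefix. Whether the pipeline expects bare file names or full paths in the fastq1 and fastq2 columns is not stated here, so adjust accordingly.

```python
import csv
import io
import re
from collections import defaultdict

# Assumption: paired files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz,
# as the bundled test reads are.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq\.gz$")


def build_sample_sheet(filenames):
    """Return CSV text with the header sample,fastq1,fastq2 for complete pairs."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            pairs[match["sample"]][match["mate"]] = name

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample", "fastq1", "fastq2"])
    for sample in sorted(pairs):
        mates = pairs[sample]
        if "1" in mates and "2" in mates:  # skip samples missing a mate
            writer.writerow([sample, mates["1"], mates["2"]])
    return buffer.getvalue()


sheet = build_sample_sheet([
    "subERR125190_1.fastq.gz", "subERR125190_2.fastq.gz",
    "subERR351242_1.fastq.gz", "subERR351242_2.fastq.gz",
    "subSRR10019387_1.fastq.gz", "subSRR10019387_2.fastq.gz",
])
print(sheet)
```

Samples with only one mate present are skipped rather than written with an empty column, since the pipeline is described as taking paired reads.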

Step-by-step example commands

```console
## Step 1: Clone the repository
git clone https://github.com/CDCgov/mashwrapper.git

## Step 2: Test downloading and building the database
## "YOURPROFILE" is your preferred execution environment (Docker, Singularity or Conda)
nextflow run mashwrapper -profile testGet,YOURPROFILE

## Step 3: Test using a prebuilt database
## "YOURPROFILE" is your preferred execution environment (Docker, Singularity or Conda)
nextflow run mashwrapper -profile testUse,YOURPROFILE
```

*You will likely need to adjust the [nfcore_custom.config](https://github.com/CDCgov/mashwrapper/blob/main/conf/nfcore_custom.config) file to work on your compute environment. To use it, specify the path to its directory using the `--custom_config_base` flag. This should point to the "conf" directory (e.g., ~/mashwrapper/conf).*

  4. Start running your analysis!

```console
## Build a Mash database for organism(s) of interest
## "YOURPROFILE" is your preferred execution environment (Docker, Singularity or Conda)
nextflow run nf-core/mashwrapper -profile YOURPROFILE --input samplesheet.csv --get_database organismsheet.txt --custom_config_base ~/mashwrapper/conf

## Use a prebuilt Mash database
nextflow run nf-core/mashwrapper -profile YOURPROFILE --input samplesheet.csv --use_database myMashDatabase.msh --custom_config_base ~/mashwrapper/conf
```
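If you launch runs from a script, the two invocations can be assembled programmatically. The following is a rough sketch under stated assumptions: `build_command` is a hypothetical helper, the flag names are the ones given in this README, and the rule that exactly one of the two database options is supplied is an assumption drawn from the "two ways" description rather than documented pipeline behavior. The sketch only builds the argument list and does not run Nextflow.

```python
import shlex


def build_command(pipeline, profile, samplesheet, config_dir,
                  get_database=None, use_database=None):
    """Assemble a nextflow argument list for exactly one database mode."""
    # Assumption: the two database options are mutually exclusive.
    if (get_database is None) == (use_database is None):
        raise ValueError("provide exactly one of get_database or use_database")
    command = [
        "nextflow", "run", pipeline,
        "-profile", profile,
        "--input", samplesheet,
    ]
    if get_database is not None:
        command += ["--get_database", get_database]
    else:
        command += ["--use_database", use_database]
    command += ["--custom_config_base", config_dir]
    return command


cmd = build_command(
    "mashwrapper", "docker", "samplesheet.csv", "mashwrapper/conf",
    use_database="myMashDatabase.msh",
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run` without a shell, which avoids quoting problems with file names that contain spaces.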

Documentation

The nf-core/mashwrapper pipeline comes with documentation covering pipeline usage, parameters, and output.

Credits

mashwrapper is based heavily on previous work by Jason Caravas, with the current version written by Jenna Hamlin.

Contributions and Support

If you would like to contribute to this pipeline, please file an issue.

Repository Usage and Legal Notices

Please see the notices page for detailed information.

Owner

  • Name: Centers for Disease Control and Prevention
  • Login: CDCgov
  • Kind: organization
  • Email: data@cdc.gov
  • Location: Atlanta, GA

CDC's collaborative software projects to protect America from health, safety, and security threats, both foreign and in the U.S.

Citation (CITATIONS.md)

# nf-core/mashwrapper: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Create event: 3
  • Release event: 3
  • Issues event: 1
  • Watch event: 2
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 29
  • Pull request event: 1
Last Year
  • Create event: 3
  • Release event: 3
  • Issues event: 1
  • Watch event: 2
  • Delete event: 1
  • Issue comment event: 1
  • Push event: 29
  • Pull request event: 1

Dependencies

.github/workflows/awsfulltest.yml actions
  • nf-core/tower-action v2 composite
.github/workflows/awstest.yml actions
  • nf-core/tower-action v2 composite
.github/workflows/branch.yml actions
  • mshick/add-pr-comment v1 composite
.github/workflows/ci.yml actions
  • actions/checkout v2 composite
.github/workflows/linting.yml actions
  • actions/checkout v2 composite
  • actions/setup-node v2 composite
  • actions/setup-python v1 composite
  • actions/upload-artifact v2 composite
.github/workflows/linting_comment.yml actions
  • dawidd6/action-download-artifact v2 composite
  • marocchino/sticky-pull-request-comment v2 composite
modules/nf-core/modules/custom/dumpsoftwareversions/meta.yml cpan