Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Zymo-Research
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 9.5 MB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 2
  • Open Issues: 4
  • Releases: 12
Created almost 3 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Citation

README.md

MGscan™ Shotgun Metagenomics Pipeline

Introduction

This is a bioinformatics analysis pipeline for shotgun metagenomic data, developed at Zymo Research. It was adapted from version 1.0.0 of the community-developed nf-core/taxprofiler pipeline. We made many changes to the original pipeline. Some reflect our experience or preferences, but more importantly, we want the pipeline and its results to be easier to use and understand for people without bioinformatics experience. The pipeline can be run on the point-and-click bioinformatics platform Aladdin Bioinformatics. Changes include, but are not limited to:

  • Changed the behavior of the pipeline so that the user chooses one taxonomy profiler instead of running all available taxonomy profilers. We found that some of the profilers perform worse or rely on outdated databases. We have temporarily disabled those profilers but kept the code to run them, with a plan to reactivate them if necessary. This is a philosophical change: we believe this approach offers simplicity and avoids confusion for researchers.
  • Added sourmash as the preferred taxonomy profiler.
  • Added a Zymo version of the sourmash database that includes common host genomes, so that the host removal step no longer needs to be run.
  • Upgraded MetaPhlAn3 to MetaPhlAn4.
  • Added antimicrobial resistance analysis with AMRplusplus and visualization of its results.
  • Added visualizations of sourmash and MetaPhlAn4 results to the report.
  • Added diversity analysis using Qiime2 and corresponding visualizations to the report.
  • Fixed, simplified, and improved the report.
  • Made the pipeline more resistant to bad samples, so that they don't stop the processing of others.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

Pipeline summary

  1. Run read QC (FastQC, or falco as an alternative option)
  2. Perform optional read pre-processing (the code for long reads is inherited from nf-core/taxprofiler, but we have not tested it separately yet)
  3. Perform host-read removal (this is skipped when sourmash-zymo-xxxx is selected as the database, because that database already contains host sequences)
  4. Merge runs when applicable
  5. Identify antimicrobial resistance genes in samples using the MEGARes version 3 database
    • Reads are aligned to the MEGARes reference (bwa mem)
    • Resistome statistics are quantified and compiled for each sample (AMRplusplus)
  6. Perform taxonomic profiling using one of: (nf-core/taxprofiler has more choices for this step; if there are tools you'd like for this step, please let us know.)
  7. Merge all taxonomic profiling results into one table (Qiime2) and generate a composition plot (Krona)
  8. Perform alpha/beta diversity analysis (Qiime2) and differential abundance analysis (ANCOM-BC)
  9. Present all results from the above steps in a report (MultiQC)

Quick Start

We recommend running this pipeline via the Aladdin Bioinformatics platform. It is much easier to use and requires no coding. Also, because the Zymo sourmash database is private, public users cannot use it via the command line. If you would still like to run the pipeline via the CLI, see the instructions below.

Prerequisites

  • Nextflow version 22.10.1 or later
  • At least 8 CPU threads and 60 GB of memory. The default config file nextflow.config has higher settings; please modify it to fit your device.
  • Docker if you are running locally.
  • Permissions to AWS S3 and Batch resources if you are running on AWS Batch.
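
As a rough aid, the version and CPU requirements above could be checked with a few lines of Python before launching a run. This is a minimal sketch, not part of the pipeline: the function names are hypothetical, the version string is assumed to be whatever text `nextflow -version` prints, and the memory requirement is not checked here.

```python
import os
import re

MIN_NEXTFLOW = (22, 10, 1)  # minimum Nextflow version listed above
MIN_CPUS = 8                # minimum CPU threads listed above


def parse_version(text):
    """Extract the first X.Y.Z version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def check_prerequisites(version_text, cpu_count):
    """Return a list of unmet prerequisites (empty when everything passes)."""
    problems = []
    if parse_version(version_text) < MIN_NEXTFLOW:
        problems.append(f"Nextflow {version_text.strip()} is older than 22.10.1")
    if cpu_count is None or cpu_count < MIN_CPUS:
        problems.append(f"only {cpu_count} CPU threads available, need {MIN_CPUS}")
    return problems


if __name__ == "__main__":
    # Hypothetical usage: replace the string with your installed version.
    for problem in check_prerequisites("22.10.1", os.cpu_count()):
        print(problem)
```

Versions are compared as integer tuples rather than strings, so that 22.10.1 correctly sorts after 22.4.0.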

Using AWS Batch

```bash
nextflow run Zymo-Research/aladdin-shotgun \
    -profile awsbatch \
    --design "<path to design CSV file>" \
    --database sourmash-zymo-2024 \
    --run_amr true \
    -work-dir "<work dir on S3>" \
    --awsregion "<AWS Batch region>" \
    --awsqueue "<SQS ARN>" \
    --outdir "<output dir on S3>" \
    -r "0.0.14" \
    -name "<report title>"
```

  1. The parameter `--design` is required. It must be a CSV file with the following format.

```
sample,read_1,read_2,group,run_accession
sample1,s1_run1_R1.fastq.gz,s1_run1_R2.fastq.gz,groupA,run1
sample1,s1_run2_R1.fastq.gz,s1_run2_R2.fastq.gz,groupA,run2
sample2,s2_run1_R1.fastq.gz,,groupB,
sample3,s3_run1_R1.fastq.gz,s3_run1_R2.fastq.gz,groupB,
```

    • The header line must be present.
    • The columns "sample", "read_1", "read_2", and "group" must be present. The column "run_accession" is optional.
    • The column "sample" contains the name/label for each sample. Names may be repeated; a repeated name means the same sample has multiple sequencing runs, and each of those rows is expected to have a different "run_accession" value. See "sample1" in the example above. Sample names must contain only alphanumeric characters or underscores, and must start with a letter.
    • The columns "read_1" and "read_2" contain the paths, including S3 paths, of Read 1 and Read 2 of Illumina paired-end data. They must be ".fastq.gz" or ".fq.gz" files. When your data are single-end Illumina or PacBio data, simply use the "read_1" column and leave the "read_2" column empty. FASTA files from Nanopore data are currently not supported.
    • The column "group" contains the group name/label for comparison purposes in the diversity analysis. If you don't have or need this information, simply leave the column empty, but the column must still be present. The rules for legal characters in sample names apply here too.
    • The column "run_accession" is optional. It is only required when there are duplicates in the "sample" column, where it marks the different run names for the sample.
  2. The parameter `--database` is used to change the taxonomy profiler and database. It has a default value of 'sourmash-zymo'. You can skip this if you don't want to change it.
  3. The parameter `--run_amr` is used to run antimicrobial resistance analysis. It is false by default. If you wish to skip this analysis, remove `--run_amr` from the command line or set it to "false".
  4. The parameters `--awsregion`, `--awsqueue`, `-work-dir`, and `--outdir` are required when running on AWS Batch; the latter two must be directories on S3.
  5. The parameter `-r` runs a specific release of the pipeline. If it is not specified, the `main` branch is run instead.
  6. The parameter `-name` defines the title of the MultiQC report.
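
The design-file rules listed under `--design` can be checked before submitting a run. The following is a minimal sketch of such a check, not the pipeline's own validation code: the function name `validate_design` is hypothetical, and it assumes that "read_1" must always be filled in and that the rules are exactly those stated above.

```python
import csv
import io
import re
from collections import defaultdict

REQUIRED_COLUMNS = ("sample", "read_1", "read_2", "group")
# Letters, digits and underscores only, starting with a letter.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
READ_SUFFIXES = (".fastq.gz", ".fq.gz")


def validate_design(text):
    """Return a list of problems found in the design CSV given as a string."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing required column(s): " + ", ".join(missing)]

    runs = defaultdict(list)  # sample name -> run_accession values seen
    for row in reader:
        where = f"line {reader.line_num}"
        if None in row:  # DictReader stores surplus fields under the key None
            problems.append(f"{where}: more fields than header columns")
        sample = (row.get("sample") or "").strip()
        group = (row.get("group") or "").strip()
        read_1 = (row.get("read_1") or "").strip()
        read_2 = (row.get("read_2") or "").strip()
        if not NAME_PATTERN.fullmatch(sample):
            problems.append(f"{where}: invalid sample name {sample!r}")
        if group and not NAME_PATTERN.fullmatch(group):
            problems.append(f"{where}: invalid group name {group!r}")
        if not read_1.endswith(READ_SUFFIXES):
            problems.append(f"{where}: read_1 must be a .fastq.gz or .fq.gz file")
        if read_2 and not read_2.endswith(READ_SUFFIXES):
            problems.append(f"{where}: read_2 must be a .fastq.gz or .fq.gz file")
        runs[sample].append((row.get("run_accession") or "").strip())

    # Repeated sample names need a distinct, non-empty run_accession per row.
    for sample, accessions in runs.items():
        if len(accessions) > 1 and (
            "" in accessions or len(set(accessions)) != len(accessions)
        ):
            problems.append(
                f"sample {sample!r}: repeated rows need distinct run_accession values"
            )
    return problems
```

Calling `validate_design(open("design.csv").read())` on a hypothetical local file returns an empty list when no rule is violated. File existence and S3 access are deliberately not checked, since the paths may point to remote storage.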

There are many other options built into the pipeline to customize your run and handle specific situations; please refer to the Usage Documentation.

Using Docker locally

```bash
nextflow run Zymo-Research/aladdin-shotgun \
    -profile docker,dev \
    --design "<path to design CSV file>" \
    --database sourmash-zymo-2024
```

Please see above for the requirements of the design CSV file.

Using SLURM on ZymoCloud

```bash
nextflow run Zymo-Research/aladdin-shotgun \
    -profile slurm \
    --design "<path to design CSV file>" \
    --database sourmash-zymo-2024 \
    --run_amr true \
    -work-dir "<work dir on ZymoCloud>" \
    --partition "<partition name on ZymoCloud>" \
    --outdir "<output dir on ZymoCloud>" \
    -r "0.0.14" \
    -name "<report title>"
```

Using SLURM on ZymoCloud is only supported after release 0.0.13.
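
The AWS Batch, Docker, and SLURM invocations differ only in their profile and a handful of options, so a script that launches runs could assemble them from one helper. The sketch below is illustrative only: `build_command` is a hypothetical helper, not something shipped with the pipeline, and it uses only the flags that appear in the commands in this README.

```python
import shlex

PIPELINE = "Zymo-Research/aladdin-shotgun"


def build_command(profile, design, database=None, run_amr=False,
                  release=None, work_dir=None, outdir=None, name=None,
                  **pipeline_params):
    """Assemble the argument list for a `nextflow run` invocation."""
    cmd = ["nextflow", "run", PIPELINE, "-profile", profile, "--design", design]
    if database:
        cmd += ["--database", database]
    if run_amr:
        cmd += ["--run_amr", "true"]
    # Single-dash options are handled by Nextflow itself.
    for flag, value in (("-work-dir", work_dir), ("-r", release), ("-name", name)):
        if value:
            cmd += [flag, value]
    if outdir:
        cmd += ["--outdir", outdir]
    # Remaining keyword arguments become double-dash pipeline parameters,
    # for example awsregion, awsqueue, or partition.
    for key, value in pipeline_params.items():
        cmd += [f"--{key}", str(value)]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command line for a local Docker run.
    local = build_command("docker,dev", "design.csv", database="sourmash-zymo-2024")
    print(shlex.join(local))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting issues, while `shlex.join` gives a printable form for logs.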

Credits

This pipeline was adapted from nf-core/taxprofiler version 1.0.0. Please refer to its credits for the list of original contributors. Contributors from Zymo Research include:

  • Zymo Research microbiomics team (source code, database, review)
  • Nora Sharp (pipeline coding)
  • Zhenfeng Liu (pipeline coding)

Owner

  • Name: Zymo Research
  • Login: Zymo-Research
  • Kind: organization
  • Location: Irvine, CA, USA

Innovative & Quality tools for Epigenetics Research and DNA/RNA Purification.

Citation (CITATIONS.md)

# nf-core/taxprofiler: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. *Nat Biotechnol.* 2020;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. *Nat Biotechnol.* 2017;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. *Bioinformatics.* 2016;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [falco](https://doi.org/10.12688/f1000research.21142.2)

  > de Sena Brandine G & Smith AD. Falco: high-speed FastQC emulation for quality control of sequencing data. *F1000Research.* 2021; 8:1874.

- [fastp](https://doi.org/10.1093/bioinformatics/bty560)

  > Chen S, Zhou Y, Chen Y, Gu J. Fastp: An Ultra-Fast All-in-One FASTQ Preprocessor. *Bioinformatics.* 2018;34(17):i884-90. doi: 10.1093/bioinformatics/bty560.

- [AdapterRemoval2](https://doi.org/10.1186/s13104-016-1900-2)

  > Schubert M, Lindgreen S, and Orlando L. AdapterRemoval v2: Rapid Adapter Trimming, Identification, and Read Merging. *BMC Research Notes.* 2016;9:88. doi:10.1186/s13104-016-1900-2.

- [Porechop](https://github.com/rrwick/Porechop)

- [FILTLONG](https://github.com/rrwick/Filtlong)

- [BBTools](http://sourceforge.net/projects/bbmap/)

- [PRINSEQ++](https://doi.org/10.7287/peerj.preprints.27553v1)

  > Cantu VA, Sadural J, and Edwards R. PRINSEQ++, a Multi-Threaded Tool for Fast and Efficient Quality Control and Preprocessing of Sequencing Datasets. *PeerJ Preprints.* 2019;e27553v1. doi: 10.7287/peerj.preprints.27553v1.

- [Bowtie2](https://doi.org/10.1038/nmeth.1923)

  > Langmead B & Salzberg SL. Fast gapped-read alignment with Bowtie 2. *Nature Methods.* 2012;9(4):357–359. doi: 10.1038/nmeth.1923

- [minimap2](https://doi.org/10.1093/bioinformatics/bty191)

  > Li H. Minimap2: pairwise alignment for nucleotide sequences. *Bioinformatics.* 2018;34(18):3094–3100. doi: 10.1093/bioinformatics/bty191

- [SAMTools](https://doi.org/10.1093/gigascience/giab008)

  > Danecek P, Bonfield JK, Liddle J, et al. Twelve years of SAMtools and BCFtools. *GigaScience.* 2021;10(2). doi: 10.1093/gigascience/giab008

- [MetaPhlAn4](https://doi.org/10.7554/eLife.65088)

  > Beghini F, McIver L, Blanco-Míguez A, et al. Integrating Taxonomic, Functional, and Strain-Level Profiling of Diverse Microbial Communities with BioBakery 3. *ELife.* 2021;10:e65088. doi: 10.7554/eLife.65088

- [Sourmash](https://doi.org/10.12688/f1000research.19675.1)

  > Pierce NT, Irber L, Reiter T, Brooks P, Brown CT. Large-scale sequence comparisons with sourmash. *F1000Research.* 2019;8:1006. doi: 10.12688/f1000research.19675.1

- [bwa - Burrows-Wheeler Alignment Tool](https://bio-bwa.sourceforge.net/bwa.shtml)

- [AMRplusplus](https://doi.org/10.1093/nar/gkac1047)

  > Bonin N, Doster E, Worley H, et al. MEGARes and AMR++, v3.0: an updated comprehensive database of antimicrobial resistance determinants and an improved software pipeline for classification using high-throughput sequencing. *Nucleic Acids Research.* 2023;51(1):744-52. doi: 10.1093/nar/gkac1047

- [QIIME2](https://doi.org/10.1038%2Fs41587-019-0209-9)

  > Bolyen E, Rideout JR, Dillon MR, et al. Reproducible, interactive, scalable, and extensible microbiome data science using QIIME 2. *Nat Biotechnol.* 2019;37(8):852-857. doi: 10.1038/s41587-019-0209-9

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. *Nat Methods.* 2018;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, et al. BioContainers: an open-source and community-driven framework for software standardization. *Bioinformatics.* 2017;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)
  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. *PLoS One.* 2017;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

## Data

- [Maixner (2021)](https://doi.org/10.1016/j.cub.2021.09.031) (CI Test Data)

  > Maixner F, Sarhan M, Huang K, et al. Hallstatt Miners Consumed Blue Cheese and Beer during the Iron Age and Retained a Non-Westernized Gut Microbiome until the Baroque Period. *Current Biology.* 2021;31(23): 5149–62.e6. doi: 10.1016/j.cub.2021.09.031.

- [Meslier (2022)](https://doi.org/10.1038/s41597-022-01762-z) (AWS Full Test data)

  > Meslier V, Quinquis B, Da Silva K, et al. Benchmarking Second and Third-Generation Sequencing Platforms for Microbial Metagenomics. *Scientific Data.* 2022;9(1):694. doi: 10.1038/s41597-022-01762-z.

GitHub Events

Total
  • Create event: 14
  • Issues event: 5
  • Release event: 1
  • Delete event: 14
  • Push event: 79
  • Pull request review event: 4
  • Pull request event: 22
Last Year
  • Create event: 14
  • Issues event: 5
  • Release event: 1
  • Delete event: 14
  • Push event: 79
  • Pull request review event: 4
  • Pull request event: 22

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 11
  • Average time to close issues: about 1 year
  • Average time to close pull requests: 2 days
  • Total issue authors: 2
  • Total pull request authors: 3
  • Average comments per issue: 0.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 11
  • Average time to close issues: 3 months
  • Average time to close pull requests: 2 days
  • Issue authors: 1
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • zxl124 (12)
  • nsharp2 (1)
Pull Request Authors
  • nsharp2 (17)
  • zxl124 (15)
  • annabelledamerum (3)
Top Labels
Issue Labels
bug (6)
Pull Request Labels

Dependencies

modules/nf-core/adapterremoval/meta.yml cpan
modules/nf-core/bbmap/bbduk/meta.yml cpan
modules/nf-core/bowtie2/align/meta.yml cpan
modules/nf-core/bowtie2/build/meta.yml cpan
modules/nf-core/bracken/bracken/meta.yml cpan
modules/nf-core/bracken/combinebrackenoutputs/meta.yml cpan
modules/nf-core/cat/fastq/meta.yml cpan
modules/nf-core/centrifuge/centrifuge/meta.yml cpan
modules/nf-core/centrifuge/kreport/meta.yml cpan
modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan
modules/nf-core/diamond/blastx/meta.yml cpan
modules/nf-core/falco/meta.yml cpan
modules/nf-core/fastp/meta.yml cpan
modules/nf-core/fastqc/meta.yml cpan
modules/nf-core/filtlong/meta.yml cpan
modules/nf-core/gunzip/meta.yml cpan
modules/nf-core/kaiju/kaiju/meta.yml cpan
modules/nf-core/kaiju/kaiju2krona/meta.yml cpan
modules/nf-core/kaiju/kaiju2table/meta.yml cpan
modules/nf-core/kraken2/kraken2/meta.yml cpan
modules/nf-core/krakentools/combinekreports/meta.yml cpan
modules/nf-core/krakentools/kreport2krona/meta.yml cpan
modules/nf-core/krakenuniq/preloadedkrakenuniq/meta.yml cpan
modules/nf-core/krona/ktimporttaxonomy/meta.yml cpan
modules/nf-core/krona/ktimporttext/meta.yml cpan
modules/nf-core/malt/run/meta.yml cpan
modules/nf-core/megan/rma2info/meta.yml cpan
modules/nf-core/metaphlan4/mergemetaphlantables/meta.yml cpan
modules/nf-core/metaphlan4/metaphlan4/meta.yml cpan
modules/nf-core/metaphlan4/qiimeprep/meta.yml cpan
modules/nf-core/metaphlan4/unmapped/meta.yml cpan
modules/nf-core/minimap2/align/meta.yml cpan
modules/nf-core/minimap2/index/meta.yml cpan
modules/nf-core/motus/merge/meta.yml cpan
modules/nf-core/motus/profile/meta.yml cpan
modules/nf-core/multiqc/meta.yml cpan
modules/nf-core/porechop/porechop/meta.yml cpan
modules/nf-core/prinseqplusplus/meta.yml cpan
modules/nf-core/qiime/import/meta.yml cpan
modules/nf-core/samtools/bam2fq/meta.yml cpan
modules/nf-core/samtools/index/meta.yml cpan
modules/nf-core/samtools/stats/meta.yml cpan
modules/nf-core/samtools/view/meta.yml cpan
modules/nf-core/taxpasta/merge/meta.yml cpan
modules/nf-core/untar/meta.yml cpan
Dockerfile docker
  • python 3.8-slim build
assets/mqc_plugins/setup.py pypi
  • multiqc ==1.14
environment.yml pypi