Science Score: 65.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization ebi-metagenomics has institutional domain (www.ebi.ac.uk)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: EBI-Metagenomics
  • License: apache-2.0
  • Language: Nextflow
  • Default Branch: main
  • Size: 12 MB
Statistics
  • Stars: 7
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Citation

README.md

MGnify amplicon analysis pipeline

This repository contains the v6.0 MGnify amplicon analysis pipeline. It is, first and foremost, a refactor of the existing v5.0 amplicon analysis pipeline, replacing CWL with Nextflow as its workflow management system. This pipeline re-implements all existing closed-reference v5.0 features, and makes multiple significant changes and additions.

V6 Schema

Pipeline description

Features

The amplicon analysis pipeline v6.0 re-implements all of the existing features from v5.0:

  • Reads quality control
  • rRNA sequence extraction using Infernal/cmsearch
  • Closed-reference-based taxonomic classification and visualisation of rRNA using MAPseq and Krona

The amplicon analysis pipeline v6.0 also contains multiple significant changes:

  • Refactoring from CWL to Nextflow for pipeline definition
  • Simplification of the reads quality control using fastp
  • Automatic amplified region inference for 16S and 18S rRNA
  • Automatic primer identification, trimming, and validation
  • Addition of Amplicon Sequence Variant (ASV) calling using DADA2
  • Taxonomic classification and visualisation of ASVs using MAPseq and Krona to complement the existing closed-reference analysis
  • Addition of PR2 as a reference database
  • Updating of existing reference databases (SILVA, UNITE, ITSoneDB, Rfam)

Valid amplicons

At this stage, the only sequence amplicons that this pipeline is built for are:

| Amplicon | Closed-reference analysis | ASV analysis |
| :------: | :-----------------------: | :----------: |
| 16S | ✓ | ✓ |
| 18S | ✓ | ✓ |
| LSU | ✓ | ✗ |
| ITS | ✓ | ✗ |

Tools

| Tool | Version | Purpose |
| --- | --- | --- |
| fastp | 0.23.4 | Read quality control |
| SeqFu | 1.20.3 | FASTQ sanity checking |
| seqtk | 1.3-r106 | FASTQ file manipulation |
| SeqKit | 2.9.0 | FASTQ file manipulation |
| easel | 0.49 | FASTA file manipulation |
| bedtools | 2.30.0 | FASTA sequence masking |
| Infernal/cmsearch | 1.1.5 | rRNA sequence searching |
| cmsearchtbloutdeoverlap | 0.09 | Deoverlapping of cmsearch results |
| MAPseq | 2.1.1b | Reference-based taxonomic classification of rRNA |
| Krona | 2.8.1 | Krona chart visualisation |
| cutadapt | 4.6 | Primer trimming |
| R | 4.3.3 | R programming language (runs DADA2) |
| DADA2 | 1.30.0 | ASV calling |
| MultiQC | 1.24.1 | Result aggregation into HTML reports |
| mgnify-pipelines-toolkit | 0.1.8 | Toolkit containing various in-house processing scripts |
| PIMENTO | 1.0.0 | Primer inference toolkit used in the pipeline |

Reference databases

This pipeline uses five different reference databases. The files the pipeline uses are processed from the raw files available on each database's website, for use by MAPseq and cmsearch. We provide ready-made versions of these processed files on our FTP, which you can find here:

| Reference database | Version | Purpose | Processed file paths |
| --- | --- | --- | --- |
| SILVA | 138.1 | 16S+18S+LSU rRNA database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/silva-ssu/ https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/silva-lsu/ |
| PR2 | 5.0 | Protist-focused 18S+16S rRNA database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/pr2/ |
| UNITE | 9.0 | ITS database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/unite/ |
| ITSoneDB | 1.141 | ITS database | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/itsonedb/ |
| Rfam | 14.10 | rRNA covariance models | https://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/rfam/ |

> [!NOTE]
> The preprocessed databases are generated with the Microbiome Informatics reference-databases-preprocessing-pipeline.

How to run

Requirements

At the moment, the only prerequisites for running the pipeline are Nextflow and either Docker or Singularity, since all of the Nextflow processes use pre-built containers.

Input shape

The input data for the pipeline is amplicon sequencing reads (either paired-end or single-end) in the form of FASTQ files. These files should be specified using a .csv samplesheet file with this format:

    sample,fastq_1,fastq_2,single_end
    SRR9674618,/path/to/reads/SRR9674618.fastq.gz,,true
    SRR17062740,/path/to/reads/SRR17062740_1.fastq.gz,/path/to/reads/SRR17062740_2.fastq.gz,false
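A quick sanity check of the samplesheet before launching a run can catch simple mistakes early. The sketch below is not part of the pipeline; `validate_samplesheet` is a hypothetical helper, and its rules (the exact header, `single_end` being `true` or `false`, and `fastq_2` being empty only for single-end rows) are assumptions inferred from the example above rather than the pipeline's own schema.

```python
import csv
import io

# Column order taken from the example samplesheet in this README.
EXPECTED_HEADER = ["sample", "fastq_1", "fastq_2", "single_end"]


def validate_samplesheet(text):
    """Return a list of human-readable problems found in samplesheet text.

    The checks are assumptions inferred from the README example,
    not the pipeline's official validation rules.
    """
    rows = [row for row in csv.reader(io.StringIO(text)) if row]  # drop blank lines
    if not rows or rows[0] != EXPECTED_HEADER:
        return ["header must be: " + ",".join(EXPECTED_HEADER)]

    problems = []
    for row_no, row in enumerate(rows[1:], start=2):  # header counts as row 1
        if len(row) != len(EXPECTED_HEADER):
            problems.append(f"row {row_no}: expected 4 columns, found {len(row)}")
            continue
        sample, fastq_1, fastq_2, single_end = (field.strip() for field in row)
        if not sample or not fastq_1:
            problems.append(f"row {row_no}: sample and fastq_1 are required")
        if single_end not in ("true", "false"):
            problems.append(f"row {row_no}: single_end must be 'true' or 'false'")
        elif single_end == "true" and fastq_2:
            problems.append(f"row {row_no}: single-end row should leave fastq_2 empty")
        elif single_end == "false" and not fastq_2:
            problems.append(f"row {row_no}: paired-end row needs fastq_2")
    return problems
```

Calling `validate_samplesheet(open("samplesheet.csv").read())` returns an empty list when none of these checks fail.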

Execution

You can run the current version of the pipeline on SLURM like this:

    nextflow run ebi-metagenomics/amplicon-pipeline \
        -r main \
        -profile codon_slurm \
        --input /path/to/samplesheet.csv \
        --outdir /path/to/outputdir

If you want to run the pipeline on deeply-sequenced reads, DADA2 can become a serious bottleneck. To counter this on SLURM, you can specify the large_samples profile, which substantially increases the resources requested by those processes. We plan to make this more dynamic in the future; for now, use it with caution to avoid bringing the cluster to a standstill. Here's an example:

    nextflow run ebi-metagenomics/amplicon-pipeline \
        -r main \
        -profile codon_slurm,large_samples \
        --input /path/to/samplesheet.csv \
        --outdir /path/to/outputdir
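If runs are launched from a script, the two invocations above differ only in the profile list, so the command can be assembled programmatically. The following is a minimal sketch: `build_run_command` is a hypothetical helper, and it uses only the options already shown in this README (`-r`, `-profile`, `--input`, `--outdir`).

```python
import shlex


def build_run_command(samplesheet, outdir, large_samples=False, revision="main"):
    """Assemble the nextflow command from this README as an argument list.

    Hypothetical helper: it only mirrors the documented invocations and
    does not check that Nextflow or the input files exist.
    """
    profiles = ["codon_slurm"]
    if large_samples:
        # Extra resources for DADA2-heavy runs; use with caution on shared clusters.
        profiles.append("large_samples")
    return [
        "nextflow", "run", "ebi-metagenomics/amplicon-pipeline",
        "-r", revision,
        "-profile", ",".join(profiles),
        "--input", str(samplesheet),
        "--outdir", str(outdir),
    ]


if __name__ == "__main__":
    command = build_run_command(
        "/path/to/samplesheet.csv", "/path/to/outputdir", large_samples=True
    )
    # Print a shell-safe version of the command for review before running it.
    print(shlex.join(command))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without shell quoting concerns.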

Outputs

Output directory structure

Example output directory structure for one run (ERR4334351):

    ├── ERR4334351
    │   ├── amplified-region-inference
    │   │   ├── ERR4334351.16S.V3-V4.txt
    │   │   └── ERR4334351.tsv
    │   ├── asv
    │   │   ├── 16S-V3-V4
    │   │   │   └── ERR4334351_16S-V3-V4_asv_read_counts.tsv
    │   │   ├── ERR4334351_asv_seqs.fasta
    │   │   ├── ERR4334351_DADA2-PR2_asv_tax.tsv
    │   │   ├── ERR4334351_DADA2-SILVA_asv_tax.tsv
    │   │   └── ERR4334351_dada2_stats.tsv
    │   ├── primer-identification
    │   │   ├── ERR4334351.cutadapt.json
    │   │   ├── ERR4334351_primers.fasta
    │   │   └── ERR4334351_primer_validation.tsv
    │   ├── qc
    │   │   ├── ERR4334351.fastp.json
    │   │   ├── ERR4334351.merged.fastq.gz
    │   │   ├── ERR4334351_multiqc_report.html
    │   │   ├── ERR4334351_seqfu.tsv
    │   │   └── ERR4334351_suffix_header_err.json
    │   ├── sequence-categorisation
    │   │   ├── ERR4334351_SSU.fasta
    │   │   ├── ERR4334351_SSU_rRNA_archaea.RF01959.fa
    │   │   ├── ERR4334351_SSU_rRNA_bacteria.RF00177.fa
    │   │   └── ERR4334351.tblout.deoverlapped
    │   └── taxonomy-summary
    │       ├── DADA2-PR2
    │       │   ├── ERR4334351_16S-V3-V4_DADA2-PR2_asv_krona_counts.txt
    │       │   ├── ERR4334351_16S-V3-V4.html
    │       │   └── ERR4334351_DADA2-PR2.mseq
    │       ├── DADA2-SILVA
    │       │   ├── ERR4334351_16S-V3-V4_DADA2-SILVA_asv_krona_counts.txt
    │       │   ├── ERR4334351_16S-V3-V4.html
    │       │   └── ERR4334351_DADA2-SILVA.mseq
    │       ├── PR2
    │       │   ├── ERR4334351.html
    │       │   ├── ERR4334351_PR2.mseq
    │       │   ├── ERR4334351_PR2.tsv
    │       │   └── ERR4334351_PR2.txt
    │       └── SILVA-SSU
    │           ├── ERR4334351.html
    │           ├── ERR4334351_SILVA-SSU.mseq
    │           ├── ERR4334351_SILVA-SSU.tsv
    │           └── ERR4334351_SILVA-SSU.txt
    ├── pipeline_info
    │   ├── execution_report_2025-03-25_14-13-55.html
    │   ├── execution_timeline_2025-03-25_14-13-55.html
    │   ├── execution_trace_2025-03-25_14-13-55.txt
    │   ├── pipeline_dag_2025-03-25_14-13-55.html
    │   └── software_versions.yml
    ├── bco.json
    ├── study_multiqc_report.html
    ├── qc_passed_runs.csv
    ├── qc_failed_runs.csv
    └── primer_validation_summary.json

For a more detailed description of the different output files, see the OUTPUTS_DESCRIPTION.md file.
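Because every run gets its own directory with the same sub-folders, results can be collected across a whole study with a simple glob. The sketch below assumes the layout shown in the example above and assumes that the `.html` files under `taxonomy-summary` are the Krona charts; `find_krona_charts` is a hypothetical helper, not a pipeline function.

```python
from pathlib import Path


def find_krona_charts(outdir):
    """Map each run ID to the HTML files under its taxonomy-summary folders.

    Assumes the layout <outdir>/<run>/taxonomy-summary/<database>/<file>.html
    from the example directory tree; other layouts are simply not matched.
    """
    charts = {}
    for html in sorted(Path(outdir).glob("*/taxonomy-summary/*/*.html")):
        database = html.parent.name   # e.g. SILVA-SSU or DADA2-PR2
        run_id = html.parents[2].name  # e.g. ERR4334351
        charts.setdefault(run_id, {}).setdefault(database, []).append(html.name)
    return charts
```

Top-level folders such as `pipeline_info` are skipped automatically because they contain no `taxonomy-summary` directory.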

Large samples profile

When working with deeply sequenced data or complex biomes, it is recommended to use the large_samples profile.

This profile is specifically designed to accommodate the increased computational demands associated with such datasets, especially in DADA2.

When running the pipeline, add it to the -profile option, as in the SLURM example above:

    $ nextflow run ... -profile large_samples ...

Citations

This pipeline uses code developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

  • Name: MGnify
  • Login: EBI-Metagenomics
  • Kind: organization
  • Email: metagenomics-help@ebi.ac.uk
  • Location: Genome Campus, UK

MGnify (formerly known as EBI Metagenomics) is a free resource for the assembly, analysis, archiving and browsing of all types of microbiome-derived sequence data.

Citation (CITATIONS.md)

# nf-core/ampliconpipeline: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Issues event: 2
  • Watch event: 3
  • Delete event: 10
  • Issue comment event: 9
  • Push event: 95
  • Pull request event: 39
  • Pull request review comment event: 37
  • Pull request review event: 51
  • Create event: 11
Last Year
  • Issues event: 2
  • Watch event: 3
  • Delete event: 10
  • Issue comment event: 9
  • Push event: 95
  • Pull request event: 39
  • Pull request review comment event: 37
  • Pull request review event: 51
  • Create event: 11

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 21
  • Average time to close issues: 1 day
  • Average time to close pull requests: 7 days
  • Total issue authors: 1
  • Total pull request authors: 4
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.24
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 21
  • Average time to close issues: 1 day
  • Average time to close pull requests: 7 days
  • Issue authors: 1
  • Pull request authors: 4
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.24
  • Merged pull requests: 18
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • SandyRogers (1)
  • xylapple2013 (1)
Pull Request Authors
  • chrisAta (15)
  • mberacochea (5)
  • Ge94 (5)
  • SandyRogers (2)
Top Labels
Issue Labels
Pull Request Labels
documentation (1)