assembly-analysis-pipeline

MGnify Assembly Analysis Pipeline

https://github.com/ebi-metagenomics/assembly-analysis-pipeline

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.6%) to scientific vocabulary

Keywords

genomics metagenomics nextflow pipeline science
Last synced: 7 months ago

Repository

Statistics
  • Stars: 0
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme Contributing License Citation

README.md

ebi-metagenomics/assembly-analysis-pipeline


Introduction

MGnify assembly analysis pipeline

This repository contains the MGnify assembly analysis pipeline, from version 6.0.0 onwards. For version 5.0 of the pipeline, please follow this link.

V6 Schema

Pipeline description

Features

The MGnify assembly analysis pipeline, version 6.0.0 and onwards, provides the following key features:

  • Assembly Quality Control: The pipeline performs quality control on the assembled contigs and includes optional decontamination functionality to remove human, PhiX, and custom contaminant sequences.
  • CDS Prediction: The pipeline utilizes the MGnify Combined Gene Caller to predict coding sequences (CDS) within the assembled contigs.
  • Taxonomic Assignment: The pipeline assigns taxonomic classifications to the assembled contigs using Contig Annotation Tool (CAT).
  • Functional Annotation:
    • InterProScan: Identifies protein domains, families, and functional sites.
    • eggNOG Mapper: Assigns Clusters of Orthologous Groups (COG) annotations and eggNOG functional descriptions.
    • GO Slims: The pipeline maps the protein sequences to Gene Ontology (GO) Slim terms.
    • run_dbCAN: Annotates carbohydrate-active enzymes.
    • KEGG Orthologs: Assigns KEGG Orthologs (KO) identifiers using HMMER.
    • RHEA: Assigns RHEA IDs to proteins.
  • Biosynthetic Gene Cluster Annotation: The pipeline uses antiSMASH and SanntiS to identify and annotate biosynthetic gene clusters associated with secondary metabolite production.
  • KEGG Modules completeness: The pipeline analyzes the KEGG Orthologs annotations to infer the presence and completeness of KEGG modules.
  • Consolidated annotation: The pipeline aggregates all the generated annotations into a single consolidated GFF file.

Tools

| Tool | Version | Purpose |
| --- | --- | --- |
| antiSMASH | 8.0.1 | Tool for the identification and annotation of secondary metabolite biosynthesis gene clusters |
| CAT_pack | 6.0 | Taxonomic classification of the contigs in the assembly |
| cmsearchtbloutdeoverlap | 0.09 | Deoverlapping of cmsearch results |
| csvtk | 0.31.0 | A cross-platform, efficient, and practical CSV/TSV toolkit |
| Combined Gene Caller - Merge | 1.2.0 | Combined gene caller merge script used to combine predictions of Pyrodigal and FragGeneScanRS (this tool is part of the mgnify-pipelines-toolkit) |
| Diamond | 2.1.11 | Used to match predicted CDS against the CAT reference database for the taxonomic classification of the contigs |
| DRAM | 13.5 | Summarizes annotations from multiple tools like KEGG, Pfam, and CAZy |
| easel | 0.49 | Extracts FASTA sequences by name from a cmsearch deoverlap result |
| extractcoords | 1.2.0 | Processes output from easel-sfetch to extract SSU and LSU sequences (this tool is part of the mgnify-pipelines-toolkit) |
| FragGeneScanRs | 1.1.0 | CDS calling; this tool specializes in calling fragmented CDS |
| generategaf | 1.2.0 | Script that generates a GO Annotation File (GAF) from an InterProScan result TSV file (this tool is part of the mgnify-pipelines-toolkit) |
| Genome Properties | 2.0 | Uses protein signatures as evidence to determine the presence of each step within a property |
| Infernal - cmscan | 1.1.5 | RNA sequence searching |
| InterProScan | 5.73-104.0 | Functionally characterizes nucleotide or protein sequences by scanning them against the InterPro database |
| HMMER | 3.4 | Used to annotate CDS with KO |
| Krona | 2.8.1 | Krona chart visualization |
| kegg-pathways-completeness | 1.3.0 | Computes the completeness of each KEGG pathway module based on KEGG orthologue (KO) annotations |
| MGnify pipelines toolkit | 1.2.0 | Collection of tools and scripts used in MGnify pipelines |
| minimap2 | 2.29-r1283 | A versatile pairwise aligner for genomic and spliced nucleotide sequences. Used in the assembly decontamination subworkflow |
| MultiQC | 1.29 | Tool to aggregate bioinformatic analysis results |
| Owltools | 2024-06-12T00:00:00Z | Tool used to map GO terms to GO Slims |
| Pyrodigal | 3.6.3 | CDS calling |
| pigz | 2.3.4 | A parallel implementation of gzip for modern multi-processor, multi-core systems |
| QUAST | 5.2.0 | Evaluates genome assemblies; part of the pipeline QC module |
| run_dbCAN | 5.1.2 | Annotation tool for the Carbohydrate-Active enZYmes Database (CAZy) |
| SeqKit | 2.8.0 | Used to manipulate FASTA files |
| SanntiS | 0.9.4.1 | Tool used to identify biosynthetic gene clusters |
| tabix | 1.21 | Generic indexer for TAB-delimited genome position files |
| Genome Tools - gff3validator | 1.6.5 | Used to validate the analysis summary GFF file |
| jq | 1.5 | Used to concatenate the chunked antiSMASH JSON results |

Reference databases

This pipeline uses several reference databases, listed in the following table. The databases marked with * are downloaded and post-processed by the Microbiome Informatics reference-databases-preprocessing-pipeline. Our team also stores ready-to-use versions of these databases on EBI's FTP server.

| Reference database | Version | Purpose | Download |
| --- | --- | --- | --- |
| Rfam covariance models | 15 | rRNA covariance models | ftp://ftp.ebi.ac.uk/pub/databases/Rfam/15.0/Rfam.cm.gz |
| Rfam clan info | 15 | rRNA clan information | ftp://ftp.ebi.ac.uk/pub/databases/Rfam/15.0/Rfam.clanin |
| InterProScan | 5.73-104.0 | InterProScan reference database | ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.73-104.0/ |
| eggNOG-mapper | 5.0.2 | eggNOG-mapper annotation databases and Diamond | https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.12#requirements |
| antiSMASH | 8.0.1 | The antiSMASH reference database | https://docs.antismash.secondarymetabolites.org/install/#antismash-standalone-lite |
| KOFAM* | 2025-04 | KOfam - HMM profiles for KEGG/KO. Our reference generation pipeline generates the required files | https://github.com/EBI-Metagenomics/reference-databases-preprocessing-pipeline |
| GO Slims* | 20160705 | Metagenomics GO Slims | ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/goslim/20160705/goslim_20160705.tar.gz |
| run_dbCAN | 4.1.4-V13 | Pre-built runDBCan reference database | ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/dbcan/dbcan4.1.3V12.tar.gz |
| CAT_Pack | 202501 | CAT/BAT/RAT NCBI taxonomy pre-made reference database | https://github.com/MGXlab/CAT_pack?tab=readme-ov-file#downloading-preconstructed-database-files |
| DRAM | 1.3.0 | DRAM databases | https://github.com/WrightonLabCSU/DRAM/wiki#dram-setup |

Reference genomes

The pipeline includes an optional decontamination step that requires reference genomes (e.g., human, PhiX174, or any user-supplied genome). Frequently used reference genomes are available on our FTP server.

Use the following pipeline options to configure references:

  • --reference_genomes_folder: Path to a folder containing all reference genome subfolders.

  • --human_reference, --phyx_reference, --contaminant_reference: Names of the subfolders (not paths) for each specific reference.

Each genome should be organized as follows:

    <reference_genomes_folder>/
    ├── <genome_prefix>/
    │   └── <genome_prefix>.fna

> [!IMPORTANT]
> FASTA files must use the .fna extension.
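The layout above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: `check_reference_genome` is a hypothetical helper name, and it only assumes the `<reference_genomes_folder>/<genome_prefix>/<genome_prefix>.fna` convention described in this section.

```bash
# Hypothetical helper (not shipped with the pipeline): verify that a reference
# genome follows <reference_genomes_folder>/<genome_prefix>/<genome_prefix>.fna
check_reference_genome() {
    folder="$1"   # value intended for --reference_genomes_folder
    prefix="$2"   # subfolder name, e.g. the value intended for --human_reference
    fasta="${folder%/}/${prefix}/${prefix}.fna"

    if [ -f "$fasta" ]; then
        echo "OK: $fasta"
    else
        echo "MISSING: $fasta" >&2
        return 1
    fi
}
```

For example, `check_reference_genome /path/to/reference_genomes my_genome` (placeholder values) prints the resolved FASTA path when the file is in place and returns a non-zero status otherwise.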

How to run

Requirements

Currently, the only prerequisites for running the pipeline are Nextflow and Docker or Singularity, since all the Nextflow processes use pre-built containers.

Input shape

The pipeline takes metagenomic assembly FASTA files as input. List these files in a .csv samplesheet with the following format:

    sample,assembly_fasta,contaminant_reference,human_reference,phix_reference
    ERZ999,/path/to/assembly/ERZ999.fasta.gz,,,
    ERZ998,/path/to/assembly/ERZ998.fasta.gz,,,
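A samplesheet can be sanity-checked before a run. The sketch below is illustrative only: `validate_samplesheet` is a hypothetical helper, and it assumes that the header must match the example above exactly, that every row has five comma-separated columns, and that `sample` and `assembly_fasta` must not be empty. The pipeline performs its own validation, which may differ.

```bash
# Hypothetical helper (not shipped with the pipeline): structural check of a
# samplesheet against the five-column format shown above.
validate_samplesheet() {
    sheet="$1"
    expected="sample,assembly_fasta,contaminant_reference,human_reference,phix_reference"
    awk -F',' -v expected="$expected" '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        NR == 1 {
            if ($0 != expected) { print "unexpected header: " $0; bad = 1 }
            next
        }
        $0 == "" { next }                        # skip blank lines
        NF != 5 { printf "line %d: expected 5 columns, found %d\n", NR, NF; bad = 1 }
        $1 == "" || $2 == "" {
            printf "line %d: sample and assembly_fasta must not be empty\n", NR; bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$sheet"
}
```

Running `validate_samplesheet /path/to/samplesheet.csv` prints one message per problem and returns a non-zero status if any were found.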

Execution

You can run the current version of the pipeline with:

    nextflow run ebi-metagenomics/assembly-analysis-pipeline \
      -r main \
      --input /path/to/samplesheet.csv \
      --outdir /path/to/outputdir
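As a rough illustration of how the decontamination options from the Reference genomes section might be combined with the basic command, the sketch below prints a command line without executing it. `print_decontamination_run` is a hypothetical helper, and the sketch assumes that `--reference_genomes_folder` and `--human_reference` can be passed alongside `--input` and `--outdir`; check the usage file for the authoritative options.

```bash
# Hypothetical helper: print (do not run) a nextflow command that adds the
# decontamination options described under "Reference genomes".
print_decontamination_run() {
    samplesheet="$1"       # path to the .csv samplesheet
    outdir="$2"            # output directory
    genomes_folder="$3"    # folder containing the reference genome subfolders
    human_subfolder="$4"   # subfolder name (not a path) of the human reference
    printf 'nextflow run ebi-metagenomics/assembly-analysis-pipeline -r main --input %s --outdir %s --reference_genomes_folder %s --human_reference %s\n' \
        "$samplesheet" "$outdir" "$genomes_folder" "$human_subfolder"
}
```

Printing the command first makes it easy to review the paths and subfolder names before starting a long-running analysis.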

This pipeline supports nf-core shared configuration files.

For a more detailed description of how to use the pipeline, see the usage file.

Outputs

For a more detailed description of the different output files, see the outputs file.

Citations

Richardson L, Allen B, Baldi G, Beracochea M, Bileschi ML, Burdett T, et al. MGnify: the microbiome sequence data analysis resource in 2023 [Internet]. Vol. 51, Nucleic Acids Research. Oxford University Press (OUP); 2022. p. D753–9. Available from: http://dx.doi.org/10.1093/nar/gkac1080

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

  • Name: MGnify
  • Login: EBI-Metagenomics
  • Kind: organization
  • Email: metagenomics-help@ebi.ac.uk
  • Location: Genome Campus, UK

MGnify (formerly known as EBImetagenomics) is a free resource for the assembly, analysis, archiving and browsing of all types of microbiome-derived sequence data.

GitHub Events

Total
  • Delete event: 22
  • Issue comment event: 12
  • Push event: 214
  • Pull request review event: 192
  • Pull request review comment event: 195
  • Pull request event: 56
  • Create event: 30
Last Year
  • Delete event: 22
  • Issue comment event: 12
  • Push event: 214
  • Pull request review event: 192
  • Pull request review comment event: 195
  • Pull request event: 56
  • Create event: 30

Dependencies

.github/workflows/branch.yml actions
  • mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite
.github/workflows/ci.yml actions
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • conda-incubator/setup-miniconda a4260408e20b96e80095f42ff7f1a15b27dd94ca composite
  • eWaterCycle/setup-apptainer main composite
  • jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
  • nf-core/setup-nextflow v2 composite
.github/workflows/clean-up.yml actions
  • actions/stale 28ca1036281a5e5922ead5184a1bbf96e5fc984e composite
.github/workflows/download_pipeline.yml actions
  • actions/setup-python 0b93645e9fea7318ecaed2b359559ac225c90a2b composite
  • eWaterCycle/setup-apptainer 4bb22c52d4f63406c49e94c804632975787312b3 composite
  • jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
  • nf-core/setup-nextflow v2 composite
.github/workflows/fix-linting.yml actions
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • actions/setup-python 0b93645e9fea7318ecaed2b359559ac225c90a2b composite
  • peter-evans/create-or-update-comment 71345be0265236311c031f5c7866368bd1eff043 composite
.github/workflows/linting.yml actions
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • actions/setup-python 0b93645e9fea7318ecaed2b359559ac225c90a2b composite
  • actions/upload-artifact b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 composite
  • nf-core/setup-nextflow v2 composite
  • pietrobolcato/action-read-yaml 1.1.0 composite
.github/workflows/linting_comment.yml actions
  • dawidd6/action-download-artifact 80620a5d27ce0ae443b965134db88467fc607b43 composite
  • marocchino/sticky-pull-request-comment 331f8f5b4215f0445d3c07b4967662a32a2d3e31 composite
.github/workflows/template_version_comment.yml actions
  • actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
  • mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite
  • nichmor/minimal-read-yaml v0.0.2 composite
modules/nf-core/multiqc/meta.yml cpan
subworkflows/nf-core/utils_nextflow_pipeline/meta.yml cpan
subworkflows/nf-core/utils_nfcore_pipeline/meta.yml cpan
subworkflows/nf-core/utils_nfschema_plugin/meta.yml cpan
modules/nf-core/multiqc/environment.yml conda
  • multiqc 1.25.1.*