assembly-analysis-pipeline
MGnify Assembly Analysis Pipeline
https://github.com/ebi-metagenomics/assembly-analysis-pipeline
Science Score: 39.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 7 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (7.6%) to scientific vocabulary
Repository
MGnify Assembly Analysis Pipeline
Basic Info
- Host: GitHub
- Owner: EBI-Metagenomics
- License: apache-2.0
- Language: JavaScript
- Default Branch: dev
- Homepage: https://www.ebi.ac.uk/metagenomics
- Size: 70.4 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
ebi-metagenomics/assembly-analysis-pipeline
Introduction
MGnify assembly analysis pipeline
This repository contains the MGnify assembly analysis pipeline, from version 6.0.0 onwards. For version 5.0 of the pipeline, please follow this link.

Pipeline description
Features
The MGnify assembly analysis pipeline, version 6.0.0 and onwards, provides the following key features:
- Assembly Quality Control: The pipeline performs quality control on the assembled contigs and includes optional decontamination functionality to remove human, PhiX, and custom contaminant sequences.
- CDS Prediction: The pipeline utilizes the MGnify Combined Gene Caller to predict coding sequences (CDS) within the assembled contigs.
- Taxonomic Assignment: The pipeline assigns taxonomic classifications to the assembled contigs using the Contig Annotation Tool (CAT).
- Functional Annotation:
  - InterProScan: Identifies protein domains, families, and functional sites.
  - eggNOG Mapper: Assigns Clusters of Orthologous Groups (COG) annotations and eggNOG functional descriptions.
  - GO Slims: Maps the protein sequences to Gene Ontology (GO) Slim terms.
  - run_dbCAN: Annotates carbohydrate-active enzymes.
  - KEGG Orthologs: Assigns KEGG Ortholog (KO) identifiers using HMMER.
  - RHEA: Assigns RHEA IDs to proteins.
- Biosynthetic Gene Cluster Annotation: The pipeline uses antiSMASH and SanntiS to identify and annotate biosynthetic gene clusters associated with secondary metabolite production.
- KEGG Modules completeness: The pipeline analyzes the KEGG Orthologs annotations to infer the presence and completeness of KEGG modules.
- Consolidated annotation: The pipeline aggregates all the generated annotations into a single consolidated GFF file.
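Because the consolidated annotation is a GFF file, downstream scripts can read it one feature per line. The following is a minimal Python sketch that assumes the file follows the standard nine-column GFF3 layout. The example feature line and the `interpro` attribute key are hypothetical and are not taken from the pipeline's actual output; check the outputs file for the real attribute names.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict.

    Returns None for blank lines and for comment/directive lines.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # Column 9 holds key=value pairs separated by ';'. Multiple values for
    # one key are comma-separated, and reserved characters are URL-encoded.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[key] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Hypothetical feature line, for illustration only.
example = (
    "contig_1\tPyrodigal\tCDS\t10\t400\t.\t+\t0\t"
    "ID=contig_1_1;interpro=IPR000001,IPR000002"
)
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"])
print(feature["attributes"]["interpro"])
```

For large files, iterating over the file handle and calling `parse_gff3_line` on each line keeps memory use flat.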
Tools
| Tool | Version | Purpose |
| --- | --- | --- |
| antiSMASH | 8.0.1 | Tool for the identification and annotation of secondary metabolite biosynthesis gene clusters |
| CAT_pack | 6.0 | Taxonomic classification of the contigs in the assembly |
| cmsearchtbloutdeoverlap | 0.09 | Deoverlapping of cmsearch results |
| csvtk | 0.31.0 | A cross-platform, efficient, and practical CSV/TSV toolkit |
| Combined Gene Caller - Merge | 1.2.0 | Combined gene caller merge script used to combine predictions of Pyrodigal and FragGeneScanRs (this tool is part of the mgnify-pipelines-toolkit) |
| Diamond | 2.1.11 | Used to match predicted CDS against the CAT reference database for the taxonomic classification of the contigs |
| DRAM | 13.5 | Summarizes annotations from multiple tools like KEGG, Pfam, and CAZy |
| easel | 0.49 | Extracts FASTA sequences by name from a cmsearch deoverlap result |
| extractcoords | 1.2.0 | Processes output from easel-sfetch to extract SSU and LSU sequences (this tool is part of the mgnify-pipelines-toolkit) |
| FragGeneScanRs | 1.1.0 | CDS calling; this tool specializes in calling fragmented CDS |
| generategaf | 1.2.0 | Script that generates a GO Annotation File (GAF) from an InterProScan result TSV file (this tool is part of the mgnify-pipelines-toolkit) |
| Genome Properties | 2.0 | Uses protein signatures as evidence to determine the presence of each step within a property |
| Infernal - cmscan | 1.1.5 | RNA sequence searching |
| InterProScan | 5.73-104.0 | Functionally characterizes nucleotide or protein sequences by scanning them against the InterPro database |
| HMMER | 3.4 | Used to annotate CDS with KO |
| Krona | 2.8.1 | Krona chart visualization |
| kegg-pathways-completeness | 1.3.0 | Computes the completeness of each KEGG pathway module based on KEGG orthologue (KO) annotations |
| MGnify pipelines toolkit | 1.2.0 | Collection of tools and scripts used in MGnify pipelines |
| minimap2 | 2.29-r1283 | A versatile pairwise aligner for genomic and spliced nucleotide sequences. Used in the assembly decontamination subworkflow |
| MultiQC | 1.29 | Tool to aggregate bioinformatic analysis results |
| Owltools | 2024-06-12T00:00:00Z | Tool used to map GO terms to GO Slims |
| Pyrodigal | 3.6.3 | CDS calling |
| pigz | 2.3.4 | A parallel implementation of gzip for modern multi-processor, multi-core systems |
| QUAST | 5.2.0 | Evaluates genome assemblies; part of the pipeline QC module |
| run_dbCAN | 5.1.2 | Annotation tool for the Carbohydrate-Active enZYmes Database (CAZy) |
| SeqKit | 2.8.0 | Used to manipulate FASTA files |
| SanntiS | 0.9.4.1 | Tool used to identify biosynthetic gene clusters |
| tabix | 1.21 | Generic indexer for TAB-delimited genome position files |
| Genome Tools - gff3validator | 1.6.5 | Used to validate the analysis summary GFF file |
| jq | 1.5 | Used to concatenate the chunked antiSMASH JSON results |
Reference databases
This pipeline uses several reference databases, listed in the following table. The databases marked with * are downloaded and post-processed by the Microbiome Informatics reference-databases-preprocessing-pipeline. Our team also stores ready-to-use versions of these databases on EBI's FTP server.
| Reference database | Version | Purpose | Download |
| --- | --- | --- | --- |
| Rfam covariance models | 15 | rRNA covariance models | ftp://ftp.ebi.ac.uk/pub/databases/Rfam/15.0/Rfam.cm.gz |
| Rfam clan info | 15 | rRNA clan information | ftp://ftp.ebi.ac.uk/pub/databases/Rfam/15.0/Rfam.clanin |
| InterProScan | 5.73-104.0 | InterProScan reference database | ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.73-104.0/ |
| eggNOG-mapper | 5.0.2 | eggNOG-mapper annotation databases and Diamond | https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.12#requirements |
| antiSMASH | 8.0.1 | The antiSMASH reference database | https://docs.antismash.secondarymetabolites.org/install/#antismash-standalone-lite |
| KOFAM* | 2025-04 | KOfam - HMM profiles for KEGG/KO. Our reference generation pipeline generates the required files | https://github.com/EBI-Metagenomics/reference-databases-preprocessing-pipeline |
| GO Slims* | 20160705 | Metagenomics GO Slims | ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/goslim/20160705/goslim_20160705.tar.gz |
| run_dbCAN | 4.1.4-V13 | Pre-built runDBCan reference database | ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/dbcan/dbcan4.1.3V12.tar.gz |
| CAT_Pack | 202501 | CAT/BAT/RAT NCBI taxonomy pre-made reference database | https://github.com/MGXlab/CAT_pack?tab=readme-ov-file#downloading-preconstructed-database-files |
| DRAM | 1.3.0 | DRAM databases | https://github.com/WrightonLabCSU/DRAM/wiki#dram-setup |
Reference genomes
The pipeline includes an optional decontamination step that requires reference genomes (e.g., human, PhiX174, or any user-supplied genome). Frequently used reference genomes are available on our FTP server.
Use the following pipeline options to configure references:
- --reference_genomes_folder: Path to a folder containing all reference genome subfolders.
- --human_reference, --phyx_reference, --contaminant_reference: Names of the subfolders (not paths) for each specific reference.
Each genome should be organized as follows:
<reference_genomes_folder>/
├── <genome_prefix>/
│ └── <genome_prefix>.fna
[!IMPORTANT] FASTA files must use the .fna extension.
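Before launching a run with decontamination enabled, it can help to confirm that the reference folder matches the layout above. The following Python sketch checks only what this README states: each genome lives in `<reference_genomes_folder>/<genome_prefix>/<genome_prefix>.fna`. The function name `missing_reference_fastas` is hypothetical, and the pipeline may expect additional files (for example, indexes) that this check does not cover.

```python
from pathlib import Path


def missing_reference_fastas(reference_genomes_folder, genome_prefixes):
    """Return the expected .fna paths that do not exist on disk.

    Each prefix is a subfolder name (not a path), mirroring the values
    passed to the per-reference pipeline options.
    """
    root = Path(reference_genomes_folder)
    missing = []
    for prefix in genome_prefixes:
        # Layout from the README: <root>/<prefix>/<prefix>.fna
        expected = root / prefix / f"{prefix}.fna"
        if not expected.is_file():
            missing.append(expected)
    return missing
```

An empty list means every listed genome has a FASTA file in the expected place, for example `missing_reference_fastas("/path/to/reference_genomes", ["human", "phix"])`, where both folder names are placeholders.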
How to run
Requirements
Currently, the only prerequisites are Nextflow and Docker or Singularity, because all Nextflow processes run in pre-built containers.
Input shape
The pipeline takes metagenomic assembly FASTA files as input. List them in a .csv samplesheet with this format:
sample,assembly_fasta,contaminant_reference,human_reference,phix_reference
ERZ999,/path/to/assembly/ERZ999.fasta.gz,,,
ERZ998,/path/to/assembly/ERZ998.fasta.gz,,,
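For many assemblies, the samplesheet can be generated rather than typed by hand. The following Python sketch writes the five columns shown above with the standard `csv` module. It assumes that `sample` and `assembly_fasta` are required and that the three reference columns may be left empty, as in the example rows; the helper name `write_samplesheet` is hypothetical.

```python
import csv
import io

SAMPLESHEET_COLUMNS = [
    "sample",
    "assembly_fasta",
    "contaminant_reference",
    "human_reference",
    "phix_reference",
]


def write_samplesheet(rows, handle):
    """Write samplesheet rows (dicts keyed by column name) to a text handle."""
    # restval="" leaves unspecified reference columns empty.
    # DictWriter raises ValueError if a row contains an unknown column.
    writer = csv.DictWriter(
        handle, fieldnames=SAMPLESHEET_COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        if not row.get("sample") or not row.get("assembly_fasta"):
            raise ValueError(f"row needs 'sample' and 'assembly_fasta': {row}")
        writer.writerow(row)


# Reproduce the two example rows from the README in memory.
buffer = io.StringIO()
write_samplesheet(
    [
        {"sample": "ERZ999", "assembly_fasta": "/path/to/assembly/ERZ999.fasta.gz"},
        {"sample": "ERZ998", "assembly_fasta": "/path/to/assembly/ERZ998.fasta.gz"},
    ],
    buffer,
)
samplesheet_text = buffer.getvalue()
print(samplesheet_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as the handle.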
Execution
You can run the current version of the pipeline with:
```bash
nextflow run ebi-metagenomics/assembly-analysis-pipeline \
  -r main \
  --input /path/to/samplesheet.csv \
  --outdir /path/to/outputdir
```
This pipeline supports nf-core shared configuration files.
For a more detailed description of how to use the pipeline, see the usage file.
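When the pipeline is launched from a script, building the command as an argument list avoids shell-quoting problems with unusual paths. The following Python sketch assembles the same command shown above and only prints it. The helper `build_nextflow_command` and its `extra_params` argument are hypothetical conveniences; the only pipeline options used are ones named in this README.

```python
import shlex


def build_nextflow_command(samplesheet, outdir, revision="main", extra_params=None):
    """Return the nextflow invocation as a list of arguments."""
    command = [
        "nextflow", "run", "ebi-metagenomics/assembly-analysis-pipeline",
        "-r", revision,
        "--input", str(samplesheet),
        "--outdir", str(outdir),
    ]
    # Pipeline parameters use a double dash, e.g. --reference_genomes_folder.
    for name, value in (extra_params or {}).items():
        command += [f"--{name}", str(value)]
    return command


command = build_nextflow_command(
    "/path/to/samplesheet.csv",
    "/path/to/outputdir",
    extra_params={"reference_genomes_folder": "/path/to/reference_genomes"},
)
# Print a shell-safe rendering; nothing is executed here.
print(shlex.join(command))
```

To actually start the run, the list could be passed to `subprocess.run(command, check=True)`.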
Outputs
For a more detailed description of the different output files, see the outputs file.
Citations
Richardson L, Allen B, Baldi G, Beracochea M, Bileschi ML, Burdett T, et al. MGnify: the microbiome sequence data analysis resource in 2023 [Internet]. Vol. 51, Nucleic Acids Research. Oxford University Press (OUP); 2022. p. D753–9. Available from: http://dx.doi.org/10.1093/nar/gkac1080
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Name: MGnify
- Login: EBI-Metagenomics
- Kind: organization
- Email: metagenomics-help@ebi.ac.uk
- Location: Genome Campus, UK
- Website: https://www.ebi.ac.uk/metagenomics/
- Twitter: MGnifyDB
- Repositories: 153
- Profile: https://github.com/EBI-Metagenomics
MGnify (formerly known as EBImetagenomics) is a free resource for the assembly, analysis, archiving and browsing of all types of microbiome-derived sequence data.
GitHub Events
Total
- Delete event: 22
- Issue comment event: 12
- Push event: 214
- Pull request review event: 192
- Pull request review comment event: 195
- Pull request event: 56
- Create event: 30
Last Year
- Delete event: 22
- Issue comment event: 12
- Push event: 214
- Pull request review event: 192
- Pull request review comment event: 195
- Pull request event: 56
- Create event: 30
Dependencies
- mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- conda-incubator/setup-miniconda a4260408e20b96e80095f42ff7f1a15b27dd94ca composite
- eWaterCycle/setup-apptainer main composite
- jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
- nf-core/setup-nextflow v2 composite
- actions/stale 28ca1036281a5e5922ead5184a1bbf96e5fc984e composite
- actions/setup-python 0b93645e9fea7318ecaed2b359559ac225c90a2b composite
- eWaterCycle/setup-apptainer 4bb22c52d4f63406c49e94c804632975787312b3 composite
- jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
- nf-core/setup-nextflow v2 composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- actions/setup-python 0b93645e9fea7318ecaed2b359559ac225c90a2b composite
- peter-evans/create-or-update-comment 71345be0265236311c031f5c7866368bd1eff043 composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- actions/setup-python 0b93645e9fea7318ecaed2b359559ac225c90a2b composite
- actions/upload-artifact b4b15b8c7c6ac21ea08fcf65892d2ee8f75cf882 composite
- nf-core/setup-nextflow v2 composite
- pietrobolcato/action-read-yaml 1.1.0 composite
- dawidd6/action-download-artifact 80620a5d27ce0ae443b965134db88467fc607b43 composite
- marocchino/sticky-pull-request-comment 331f8f5b4215f0445d3c07b4967662a32a2d3e31 composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite
- nichmor/minimal-read-yaml v0.0.2 composite
- multiqc 1.25.1.*