mettannotator

https://github.com/ebi-metagenomics/mettannotator


Repository

Basic Info

Host: GitHub
Owner: EBI-Metagenomics
License: apache-2.0
Language: Python
Default Branch: main
Size: 10.8 MB

Statistics

Stars: 75
Watchers: 5
Forks: 5
Open Issues: 2
Releases: 5

Created over 2 years ago · Last pushed 10 months ago


mettannotator

Introduction
Workflow and tools
Installation and dependencies
- Reference databases
Usage
Test
Outputs
Preparing annotations for ENA or GenBank submission
Mobilome annotation
Credits
Contributions and Support
Citation

Introduction

mettannotator is a bioinformatics pipeline that generates an exhaustive annotation of prokaryotic genomes using existing tools. The output is a GFF file that integrates the results of all pipeline components. Results of each individual tool are also provided.

Workflow and tools

The workflow uses the following tools and databases:

| Tool/Database | Version | Purpose |
| --- | --- | --- |
| Prokka | 1.14.6 | CDS calling and functional annotation (default) |
| Bakta | 1.9.3 | CDS calling and functional annotation (if --bakta flag is used) |
| Bakta db | 2024-01-19 with AMRFinderPlus DB 2024-01-31.1 | Bakta DB (when Bakta is used as the gene caller) |
| Pseudofinder | v1.1.0 | Identification of possible pseudogenes |
| Swiss-Prot | 202406 | Database for Pseudofinder |
| InterProScan | 5.62-94.0 | Protein annotation (InterPro, Pfam) |
| eggNOG-mapper | 2.1.11 | Protein annotation (eggNOG, KEGG, COG, GO-terms) |
| eggNOG DB | 5.0.2 | Database for eggNOG-mapper |
| UniFIRE | 2023.4 | Protein annotation |
| AMRFinderPlus | 3.12.8 | Antimicrobial resistance gene annotation; virulence factors, biocide, heat, acid, and metal resistance gene annotation |
| AMRFinderPlus DB | 3.12 2024-01-31.1 | Database for AMRFinderPlus |
| DefenseFinder | 1.2.0 | Annotation of anti-phage systems |
| DefenseFinder models | 1.2.3 | Database for DefenseFinder |
| GECCO | 0.9.8 | Biosynthetic gene cluster annotation |
| antiSMASH | 7.1.0 | Biosynthetic gene cluster annotation |
| SanntiS | 0.9.3.4 | Biosynthetic gene cluster annotation |
| run_dbCAN | 4.1.2 | PUL prediction |
| dbCAN DB | V12 | Database for run_dbCAN |
| CRISPRCasFinder | 4.3.2 | Annotation of CRISPR arrays |
| cmscan | 1.1.5 | ncRNA predictions |
| Rfam | 14.9 | Identification of SSU/LSU rRNA and other ncRNAs |
| tRNAscan-SE | 2.0.9 | tRNA predictions |
| pyCirclize | 1.4.0 | Visualise the merged GFF file |
| VIRify | 2.0.0 | Viral sequence annotation (runs separately) |
| Mobilome annotation pipeline | 2.0 | Mobilome annotation (runs separately) |

Installation and dependencies

This workflow is built using Nextflow. It uses containers (Docker or Singularity), which makes installation simple and results highly reproducible.

Install Nextflow version >=21.10
Install Singularity
Install Docker

Although it's possible to run the pipeline on a personal computer, due to the compute requirements, we encourage users to run it on HPC clusters. Any HPC scheduler supported by Nextflow is compatible; however, our team primarily uses Slurm and IBM LSF for the EBI HPC cluster, so those are the profiles we ship with the pipeline.

Reference databases

The pipeline needs reference databases in order to work; together they take up roughly 180 GB of disk space.

| Path | Size |
| --- | --- |
| amrfinder | 217M |
| antismash | 9.4G |
| bakta | 71G |
| dbcan | 7.5G |
| defensefinder | 242M |
| eggnog | 48G |
| interproscan | 45G |
| interproentrylist | 2.6M |
| rfammodels | 637M |
| pseudofinder | 273M |
| total | 182G |

mettannotator has an automated mechanism to download the databases using the --dbs <db_path> flag. When this flag is provided, the pipeline inspects the folder to verify if the required databases are already present. If any of the databases are missing, the pipeline will automatically download them.
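The folder inspection described above can be approximated before launching a run. The following is a minimal sketch, not the pipeline's own check: it assumes each database lives in a sub-folder of `<db_path>` named as in the size table, and the function name `missing_databases` is hypothetical.

```python
from pathlib import Path

# Folder names copied from the size table above. Assumption: the real
# pipeline may lay the databases out differently or check version files too.
EXPECTED_DBS = [
    "amrfinder", "antismash", "bakta", "dbcan", "defensefinder",
    "eggnog", "interproscan", "interproentrylist", "rfammodels", "pseudofinder",
]


def missing_databases(db_path, expected=EXPECTED_DBS):
    """Return the expected database folders that are absent or empty under db_path."""
    root = Path(db_path)
    missing = []
    for name in expected:
        folder = root / name
        # Treat a non-existent or empty directory as "not downloaded yet".
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing
```

Calling `missing_databases("/path/to/dbs")` on a fresh folder would list every entry, which gives a rough idea of how much still has to be downloaded on the first run.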

Users can also provide individual paths to each reference database and its version if needed. For detailed instructions, please refer to the Reference databases section in the --help of the pipeline.

Note that the --dbs flag cannot be combined with individual database paths and versions; the two approaches are mutually exclusive. We recommend running the pipeline with the --dbs flag pointing to an appropriate path the first time, rather than downloading the individual databases separately.

Usage

Input file

First, prepare an input file in the CSV format that looks as follows:

assemblies_sheet.csv:

```csv
prefix,assembly,taxid
BU_ATCC8492VPI0062,/path/to/BU_ATCC8492VPI0062_NT5002.fa,820
EC_ASM584v2,/path/to/GCF_000005845.2.fna,562
...
```

Here, prefix is the prefix and the locus tag that will be assigned to output files and proteins during the annotation process; maximum length is 24 characters;

assembly is the path to where the assembly file in FASTA format is located;

taxid is the NCBI TaxId (if the species-level TaxId is not known, a TaxId for a higher taxonomic level can be used). If the taxonomy is known, look up the TaxID here.
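Before starting a run, it can help to check the sheet against the three rules above. This is a minimal sketch rather than part of mettannotator: the function name `check_sheet` is hypothetical, and it only checks the header, the 24-character prefix limit, and that the TaxId is numeric.

```python
import csv
import io

MAX_PREFIX_LEN = 24  # limit stated in the field description above
EXPECTED_HEADER = ["prefix", "assembly", "taxid"]


def check_sheet(handle):
    """Return a list of problems found in an assemblies sheet (empty if none)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        prefix = (row["prefix"] or "").strip()
        assembly = (row["assembly"] or "").strip()
        taxid = (row["taxid"] or "").strip()
        if not prefix or len(prefix) > MAX_PREFIX_LEN:
            problems.append(f"line {line_no}: prefix must be 1-{MAX_PREFIX_LEN} characters")
        if not assembly:
            problems.append(f"line {line_no}: missing assembly path")
        if not taxid.isdigit():
            problems.append(f"line {line_no}: taxid {taxid!r} is not numeric")
    return problems


example = io.StringIO(
    "prefix,assembly,taxid\n"
    "EC_ASM584v2,/path/to/GCF_000005845.2.fna,562\n"
)
print(check_sheet(example))  # prints []
```

The sketch does not verify that the FASTA files exist or that the TaxIds are valid NCBI identifiers; those checks would need file system and network access.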

Finding TaxIds

If NCBI taxonomies of input genomes are not known, a tool such as CAT/BAT can be used. Follow the instructions for getting the tool and downloading the NCBI nr database for it.

If using CAT/BAT, here is the suggested process for making the mettannotator input file:

```bash
# Run BAT on each input genome, saving all results to the same folder
CAT bins -b ${genome_name}.fna -d ${path_to_CAT_database} -t ${path_to_CAT_tax_folder} -o BAT_results/${genome_name}

# Optional: to check what taxa were assigned, you can add names to them
CAT add_names -i BAT_results/${genome_name}.bin2classification.txt -o BAT_results/${genome_name}.name.txt -t ${path_to_CAT_tax_folder}
```

To generate an input file for mettannotator, use generate_input_file.py:

```
python3 preprocessing/generate_input_file.py -h
usage: generate_input_file.py [-h] -i INFILE -d INPUTDIR -b BATDIR -o OUTFILE [--no-prefix]

The script takes a list of genomes and the taxonomy results generated by BAT
and makes a mettannotator input csv file. The user has the option to either
use the genome file name (minus the extension) as the prefix for mettannotator
or leave the prefix off and fill it out themselves after the script generates
an input file with just the FASTA location and the taxid. It is expected that
for all genomes, BAT results are stored in the same folder and are named as
{fastabasename}.bin2classification.txt. The script will use the lowest-level
taxid without an asterisk as the taxid for the genome.

optional arguments:
  -h, --help    show this help message and exit
  -i INFILE     A file containing a list of genome files to include (file name
                only, with file extension, unzipped, one file per line).
  -d INPUTDIR   Full path to the directory where the input FASTA files are
                located.
  -b BATDIR     Folder with BAT results. Results for all genomes should be in
                the same folder and should be named
                {fastabasename}.bin2classification.txt
  -o OUTFILE    Path to the file where the output will be saved to.
  --no-prefix   Skip prefix generation and leave the first column of the
                output file empty for the user to fill out. Default: False
```

For example:

```bash
python3 generate_input_file.py -i list_of_genome_fasta_files.txt -d /path/to/the/fasta/files/folder/ -b BAT_results/ -o mettannotator_input.csv
```

It is always best to check the outputs to ensure the results are as expected. Correct any wrongly detected taxa before starting mettannotator.

Note that by default the script uses FASTA file names as prefixes and truncates them to 24 characters if they exceed the limit.
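The two rules the script applies (prefix from the file name, truncated to 24 characters, and the lowest-level taxid without an asterisk) can be sketched as follows. This is an illustration, not the script's code: both function names are hypothetical, and the lineage is assumed to be a semicolon-separated string of taxids ordered from root to leaf, with an asterisk appended to the flagged entries.

```python
from pathlib import Path

MAX_PREFIX_LEN = 24  # prefix limit stated above


def prefix_from_fasta(path):
    """File name minus its last extension, truncated to the 24-character limit."""
    return Path(path).stem[:MAX_PREFIX_LEN]


def lowest_confident_taxid(lineage):
    """Return the last taxid in the lineage that carries no asterisk, or None."""
    # Assumption: lineage looks like "1;131567;2;976;200643*", root first.
    confident = [t for t in lineage.split(";") if t and not t.endswith("*")]
    return confident[-1] if confident else None


print(prefix_from_fasta("/path/to/GCF_000005845.2.fna"))  # prints GCF_000005845.2
print(lowest_confident_taxid("1;131567;2;976;200643*"))   # prints 976
```

Because truncation can make two long file names collapse to the same prefix, it is worth scanning the generated sheet for duplicates before starting the pipeline.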

Running mettannotator

Running mettannotator with the --help option will pull the repository and display the help message:

[!NOTE] We use the -latest flag with the nextflow run command, which ensures that the latest available version of the pipeline is pulled. If you encounter any issues with the nextflow run command, please refer to the Nextflow documentation.

```angular2html
$ nextflow run -latest ebi-metagenomics/mettannotator/main.nf --help
N E X T F L O W  ~  version 23.04.3
Launching `mettannotator/main.nf` [disturbed_davinci] DSL2 - revision: f2a0e51af6

ebi-metagenomics/mettannotator

Typical pipeline command:

  nextflow run ebi-metagenomics/mettannotator --input assemblies_sheet.csv -profile docker

Input/output options
  --input [string] Path to comma-separated file containing information about the assemblies with the prefix to be used.
  --outdir [string] The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.
  --fast [boolean] Run the pipeline in fast mode. In this mode, InterProScan, UniFIRE, and SanntiS won't be executed, saving resources and speeding up the pipeline.
  --email [string] Email address for completion summary.
  --multiqc_title [string] MultiQC report title. Printed as page header, used for filename if not otherwise specified.

Reference databases
  --dbs [string] Folder for the tools' reference databases used by the pipeline for downloading. It's important to note that mixing the --dbs flag with individual database paths and versions is not allowed; they are mutually exclusive.
  --interproscandb [string] The InterProScan reference database, ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/
  --interproscandbversion [string] The InterProScan reference database version. [default: 5.62-94.0]
  --interproentrylist [string] TSV file listing basic InterPro entry information - the accessions, types and names, ftp://ftp.ebi.ac.uk/pub/databases/interpro/releases/94.0/entry.list
  --interproentrylistversion [string] InterPro entry list version [default: 94]
  --eggnogdb [string] The EggNOG reference database folder, https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.12#requirements
  --eggnogdbversion [string] The EggNOG reference database version. [default: 5.0.2]
  --rfamncrnamodels [string] Rfam ncRNA models, ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/genomes-pipeline/ncrna/
  --rfamncrnamodelsrfamversion [string] Rfam release version where the models come from. [default: 14.9]
  --amrfinderplusdb [string] AMRFinderPlus reference database, https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobialresistance/AMRFinderPlus/database/. Go to the following documentation for the db setup https://github.com/ncbi/amr/wiki/Upgrading#database-updates.
  --amrfinderplusdbversion [string] The AMRFinderPlus reference database version. [default: 2023-02-23.1]
  --defensefinderdb [string] Defense Finder reference models, https://github.com/mdmparis/defense-finder#updating-defensefinder. The Microbiome Informatics team provides a pre-indexed version of the models for version 1.2.3 on this ftp location: ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/defense-finder/defense-finder-models1.2.3.tar.gz.
  --defensefinderdbversion [string] The Defense Finder models version. [default: 1.2.3]
  --antismashdb [string] antiSMASH reference database, go to this documentation to do the database setup https://docs.antismash.secondarymetabolites.org/install/#installing-the-latest-antismash-release.
  --antismashdbversion [string] The antiSMASH reference database version. [default: 7.1.0]
  --dbcandb [string] dbCAN indexed reference database, please go to the documentation for the setup https://dbcan.readthedocs.io/en/latest/. The Microbiome Informatics team provides a pre-indexed version of the database for version 4.0 on this ftp location: ftp://ftp.ebi.ac.uk/pub/databases/metagenomics/pipelines/tool-dbs/dbcan/dbcan4.0.tar.gz
  --dbcandbversion [string] The dbCAN reference database version. [default: 4.1.3V12]
  --pseudofinderdb [string] Pseudofinder reference database. Mettannotator uses SwissProt as the database for Pseudofinder.
  --pseudofinderdbversion [string] SwissProt version. [default: 2024_06]

Generic options
  --multiqcmethodsdescription [string] Custom MultiQC yaml file containing HTML including a methods description.

Other parameters
  --bakta [boolean] Use Bakta instead of Prokka for CDS annotation. Prokka will still be used for archaeal genomes.

!! Hiding 17 params, use --validationShowHiddenParams to show them !!

If you use ebi-metagenomics/mettannotator for your analysis please cite:

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/ebi-metagenomics/mettannotator/blob/master/CITATIONS.md
```

Now, you can run the pipeline using:

```bash
nextflow run ebi-metagenomics/mettannotator \
    -profile <docker/singularity/...> \
    --input assemblies_sheet.csv \
    --outdir <OUTDIR> \
    --dbs <PATH/TO/WHERE/DBS/WILL/BE/SAVED>
```

[!WARNING] Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Running the pipeline from the source code

If the Nextflow integration with Git does not work, users can download the tarball from the releases page. After extracting the tarball, the pipeline can be run directly by executing the following command:

```bash
$ nextflow run path-to-source-code/main.nf --help
```

Local execution

The pipeline can be run on a desktop or laptop, with the caveat that it will take a few hours to complete depending on the resources. There is a local profile in the Nextflow config that limits the total resources the pipeline can use to 8 cores and 12 GB of RAM. In order to run it (Docker or Singularity are still required):

```bash
nextflow run -latest ebi-metagenomics/mettannotator \
    -profile local,<docker or singularity> \
    --input assemblies_sheet.csv \
    --outdir <OUTDIR> \
    --dbs <PATH/TO/WHERE/DBS/WILL/BE/SAVED>
```

Gene caller choice

By default, mettannotator uses Prokka to identify protein-coding genes. Users can choose to use Bakta instead by running mettannotator with the --bakta flag. mettannotator runs Bakta without ncRNA and CRISPR annotation as these are produced by separate tools in the pipeline. Archaeal genomes will continue to be annotated using Prokka as Bakta is only intended for annotation of bacterial genomes.

Fast mode

To reduce the compute time and the amount of resources used, the pipeline can be executed with the --fast flag. When run in fast mode, mettannotator will skip InterProScan, UniFIRE and SanntiS. This could be a suitable option for a first pass of annotation or if computational resources are limited; however, we recommend running the full version of the pipeline whenever possible.

When generating an input file for a fast mode run, it is sufficient to indicate the taxid of the superkingdom (2 for bacteria and 2157 for Archaea) in the "taxid" column rather than the taxid of the lowest known taxon.

Test

To run the pipeline using a test dataset, execute the following command:

```bash
wget https://raw.githubusercontent.com/EBI-Metagenomics/mettannotator/master/tests/test.csv

nextflow run -latest ebi-metagenomics/mettannotator \
    -profile <docker/singularity/...> \
    --input test.csv \
    --outdir <OUTDIR> \
    --dbs <PATH/TO/WHERE/DBS/WILL/BE/SAVED>
```

Outputs

The output folder structure will look as follows:

```
└─<PREFIX>
   ├─antimicrobialresistance
   │  └─amrfinderplus
   ├─antiphagedefense
   │  └─defensefinder
   ├─biosyntheticgeneclusters
   │  ├─antismash
   │  ├─gecco
   │  └─sanntis
   ├─functional_annotation
   │  ├─dbcan
   │  ├─eggnogmapper
   │  ├─interproscan
   │  ├─merged_gff
   │  ├─prokka
   │  ├─pseudofinder
   │  └─unifire
   ├─mobilome
   │  └─crisprcasfinder
   ├─quast
   │  └─
   │      ├─basicstats
   │      └─icarusviewers
   ├─rnas
   │  ├─ncrna
   │  └─trna
├─multiqc
│  ├─multiqcdata
│  └─multiqcplots
│      ├─pdf
│      ├─png
│      └─svg
├─pipelineinfo
│  ├─softwareversions.yml
│  ├─executionreport.txt
│  ├─executionreport.html
│  ├─executiontimeline.txt
│  ├─executiontimeline.html
│  ├─executiontrace.txt
│  ├─executiontrace.html
│  └─pipelinedag.html
```

Merged GFF

The two main output files for each genome are located in <OUTDIR>/<PREFIX>/functional_annotation/merged_gff/:

<PREFIX>_annotations.gff: annotations produced by all tools merged into a single file
<PREFIX>_annotations_with_descriptions.gff: a version of the GFF file above that includes descriptions of all InterPro terms to make the annotations human-readable. Not generated if --fast flag was used.

Both files include the genome sequence in the FASTA format at the bottom of the file.

Additionally, for genomes with no more than 50 annotated contigs, a Circos plot of the <PREFIX>_annotations.gff file is generated and included in the same folder. An example of such a plot is shown below:

Data sources

Below is an explanation of how each field in columns 3 and 9 of the final GFF file is populated. In most cases, information is taken as is from the reporting tool's output.

| Feature (column 3) | Attribute Name (column 9) | Reporting Tool | Description |
| --- | --- | --- | --- |
| ncRNA | all* | cmscan + Rfam | ncRNA annotation (excluding tRNA) |
| tRNA | all* | tRNAscan-SE | tRNA annotation |
| LeftFLANK, RightFLANK | all* | CRISPRCasFinder | CRISPR array flanking sequence |
| CRISPRdr | all* | CRISPRCasFinder | Direct repeat region of a CRISPR array |
| CRISPRspacer | all* | CRISPRCasFinder | CRISPR spacer |
| CDS | `ID`, `eC_number`, `Name`, `Dbxref`, `gene`, `inference`, `locus_tag` | Prokka/Bakta | Protein annotation |
| CDS | `product` | mettannotator | Product assigned as described in Determining the product |
| CDS | `product_source` | mettannotator | Tool that reported the product chosen by mettannotator |
| CDS | `eggNOG` | eggNOG-mapper | Seed ortholog from eggNOG |
| CDS | `cog` | eggNOG-mapper | COG category |
| CDS | `kegg` | eggNOG-mapper | KEGG orthology term |
| CDS | `Ontology_term` | eggNOG-mapper | GO associations |
| CDS | `pfam` | InterProScan | Pfam accessions |
| CDS | `interpro` | InterProScan | InterPro accessions. In `<PREFIX>_annotations_with_descriptions.gff` each accession is followed by its description and entry type: Domain [D], Family [F], Homologous Superfamily [H], Repeat [R], Site [S] |
| CDS | `nearest_MiBIG` | SanntiS | MiBIG accession of the nearest BGC to the cluster in the MiBIG space |
| CDS | `nearest_MiBIG_class` | SanntiS | BGC class of `nearest_MiBIG` |
| CDS | `geccobgctype` | GECCO | BGC type |
| CDS | `antismashbgcfunction` | antiSMASH | BGC function |
| CDS | `amrfinderplusgenesymbol` | AMRFinderPlus | Gene symbol according to AMRFinderPlus |
| CDS | `amrfinderplussequencename` | AMRFinderPlus | Product description |
| CDS | `amrfinderplusscope` | AMRFinderPlus | AMRFinderPlus database (core or plus) |
| CDS | `elementtype`, `elementsubtype` | AMRFinderPlus | Functional category |
| CDS | `drugclass`, `drugsubclass` | AMRFinderPlus | Class and subclass of drugs to which this gene is known to confer resistance |
| CDS | `dbcanprottype` | run_dbCAN | Predicted protein function: transporter (TC), transcription factor (TF), signal transduction protein (STP), CAZyme |
| CDS | `dbcanprotfamily` | run_dbCAN | Predicted protein family |
| CDS | `substrate_dbcan-pul` | run_dbCAN | Substrate predicted by dbCAN-PUL search |
| CDS | `substrate_dbcan-sub` | run_dbCAN | Substrate predicted by dbCAN-subfam |
| CDS | `defensefindertype`, `defensefindersubtype` | DefenseFinder | Type and subtype of the anti-phage system found |
| CDS | `ufprotrecfullname`, `ufprotrecshortname`, `ufprotrececnumber` | UniFIRE | Protein recommended full name, short name and EC number according to UniFIRE |
| CDS | `ufprotaltfullname`, `ufprotaltshortname`, `ufprotaltecnumber` | UniFIRE | Protein alternative full name, short name and EC number according to UniFIRE |
| CDS | `ufchebi` | UniFIRE | ChEBI identifiers |
| CDS | `ufontologyterm` | UniFIRE | GO associations |
| CDS | `ufkeyword` | UniFIRE | UniFIRE keywords |
| CDS | `ufgenename`, `ufgenenamesynonym` | UniFIRE | Gene name and gene name synonym according to UniFIRE |
| CDS | `ufpirsrcofactor` | UniFIRE | Cofactor names from PIRSR |

*all attributes in column 9 are populated by the tool
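To explore these attributes programmatically, the merged GFF can be read with a few lines of standard-library Python. The sketch below is an illustration under stated assumptions: the function names are hypothetical, attributes are assumed to follow the usual GFF3 `key=value;key=value` layout, and parsing stops at the `##FASTA` marker because the file carries the genome sequence at the bottom.

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute string into a dict, decoding %-escapes in values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = unquote(value)
    return attrs


def iter_features(lines):
    """Yield (feature_type, attributes) for every annotation line."""
    for line in lines:
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            yield cols[2], parse_attributes(cols[8])


def count_product_sources(lines):
    """Count CDS features per product_source value ('none' if the attribute is absent)."""
    return Counter(
        attrs.get("product_source", "none")
        for ftype, attrs in iter_features(lines)
        if ftype == "CDS"
    )
```

With an open file handle, `count_product_sources(handle)` gives a quick view of which tools supplied the product names in a given genome.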

Determining the product

The following logic is used by mettannotator to fill out the product field in the 9th column of the GFF:

If the pipeline is executed with the --fast flag, only the output of eggNOG-mapper is used to determine the product of proteins that were labeled as hypothetical by the gene caller.

Detection of pseudogenes and spurious ORFs

mettannotator uses several approaches to detect pseudogenes and spurious ORFs:

- If Bakta is used as the initial annotation tool, mettannotator will inherit the pseudogene labels assigned by Bakta.
- mettannotator runs Pseudofinder and labels genes that Pseudofinder predicts to be pseudogenes by adding "pseudo=true" to the 9th column of the final merged GFF file. If there is a disagreement between Pseudofinder and Bakta and one of the tools calls a gene a pseudogene, it will be labeled as a pseudogene.
- AntiFam, which is a part of InterPro, is used to identify potential spurious ORFs. If an ORF has an AntiFam hit, mettannotator will remove it from the final merged GFF file. These ORFs will still appear in the raw outputs of Bakta/Prokka and may appear in other tool outputs.

mettannotator produces a report file which is located in the merged_gff folder and includes a list of CDS with AntiFam hits and pseudogenes. For each pseudogene, the report shows which tool predicted it.
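Independently of the report file, the pseudogene labels can be pulled straight from the merged GFF. The following is a minimal sketch, assuming the label appears as the attribute `pseudo=true` in column 9 as described above; the function name `pseudogene_ids` is hypothetical.

```python
def pseudogene_ids(lines):
    """Return the ID of every feature whose 9th column carries pseudo=true."""
    ids = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # genome sequence starts here; no features follow
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        # Build a key -> value mapping from the attribute column.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if attrs.get("pseudo") == "true":
            ids.append(attrs.get("ID"))
    return ids
```

This sketch does not say which tool made the call; for that, the report file in the merged_gff folder remains the reference.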

Contents of the tool output folders

The output folders of each individual tool contain select output files of the third-party tools used by mettannotator. For file descriptions, please refer to the tool documentation. For some tools that don't output a GFF, mettannotator converts the output into a GFF.

Note: if the pipeline completed without errors but some of the tool-specific output folders are empty, those particular tools did not generate any annotations to output.
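To see at a glance which tools produced no annotations for a genome, the empty folders can be listed. This is a minimal sketch with a hypothetical function name; it takes the path to one genome's result folder (`<OUTDIR>/<PREFIX>`) and reports every sub-folder that contains no files at any depth.

```python
from pathlib import Path


def empty_tool_folders(prefix_dir):
    """Return relative paths of sub-folders under prefix_dir that hold no files."""
    root = Path(prefix_dir)
    empty = []
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        # A folder counts as empty when no regular file exists anywhere below it.
        if not any(child.is_file() for child in folder.rglob("*")):
            empty.append(folder.relative_to(root).as_posix())
    return empty
```

An empty result here is expected for some genomes, for example those without any predicted biosynthetic gene clusters, so the list is a prompt for inspection rather than an error report.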

Preparing annotations for ENA or GenBank submission

mettannotator produces a final annotation file in GFF3 format. To submit the annotations to data archives, it is first necessary to convert the GFF3 file into the required format, using third-party tools available. mettannotator outputs a specially formatted GFF3 file, named <prefix>_submission.gff to be used with converters.

ENA

ENA accepts annotations in the EMBL flat-file format. Please use EMBLmyGFF3 to perform the conversion; the repository includes detailed instructions. The two files required for conversion are:

the genome FASTA file
<mettannotator_results_folder>/<prefix>/functional_annotation/merged_gff/<prefix>_submission.gff

Please note that it is necessary to register the project and locus tags in ENA prior to conversion. Follow links in the EMBLmyGFF3 repository for more details.

GenBank

To convert annotations for GenBank submission, please use table2asn. Three files are required:

the genome FASTA file
<mettannotator_results_folder>/<prefix>/functional_annotation/merged_gff/<prefix>_submission.gff
Submission template file (can be generated here)

More instructions on running table2asn are available via GenBank.

Mobilome annotation

The mobilome annotation workflow is not currently integrated into mettannotator. However, the outputs produced by mettannotator can be used to run VIRify and the mobilome annotation pipeline and the outputs of these tools can be integrated back into the GFF file produced by mettannotator.

After installing both tools, follow these steps to add the mobilome annotation:

Run the viral annotation pipeline:

```bash
nextflow run \
    emg-viral-pipeline/virify.nf \
    -profile <profile> \
    --fasta <genome_fasta.fna> \
    --output <prefix>
```

Run the mobilome annotation pipeline:

```bash
# Pass --virify, --vir_gff and --vir_checkv only if the VIRify GFF and CheckV
# files exist; otherwise remove those three lines.
nextflow run mobilome-annotation-pipeline/main.nf \
    --assembly <genome_fasta.fna> \
    --user_genes true \
    --prot_gff <mettannotator_results_folder>/<prefix>/functional_annotation/merged_gff/<prefix>_annotations.gff \
    --virify true \
    --vir_gff Virify_output_folder/08-final/gff/<prefix>_virify.gff \
    --vir_checkv Virify_output_folder/07-checkv/\*quality_summary.tsv \
    --outdir <mobilome_output_folder> \
    --skip_crispr true \
    --skip_amr true \
    -profile <profile>
```

Integrate the output into the mettannotator GFF

```bash
# Add mobilome to the merged GFF produced by mettannotator
python3 postprocessing/addmobilometogff.py \
    -m <mobilome_output_folder>/gffoutputfiles/mobilomenogenes.gff \
    -i <mettannotator_results_folder>/<prefix>/functional_annotation/merged_gff/<prefix>_annotations.gff \
    -o annotationswithmobilome.gff

# Add mobilome to the GFF with descriptions produced by mettannotator
python3 postprocessing/addmobilometogff.py \
    -m <mobilome_output_folder>/gffoutputfiles/mobilomenogenes.gff \
    -i <mettannotator_results_folder>/<prefix>/functional_annotation/merged_gff/<prefix>_annotations_with_descriptions.gff \
    -o annotationswithdescriptionswithmobilome.gff
```

Optional: regenerate the Circos plot with the mobilome track added

```bash
pip install pycirclize
pip install matplotlib

python3 bin/circosplot.py \
    -i annotationswithmobilome.gff \
    -o plot.png \
    -p \
    --mobilome
```

Credits

ebi-metagenomics/mettannotator was originally written by the Microbiome Informatics Team at EMBL-EBI.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use the software, please cite:

Gurbich TA, Beracochea M, De Silva NH, Finn RD. mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies. bioRxiv 2024.07.11.603040; doi: https://doi.org/10.1101/2024.07.11.603040

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

Name: MGnify
Login: EBI-Metagenomics
Kind: organization
Email: metagenomics-help@ebi.ac.uk
Location: Genome Campus, UK

Website: https://www.ebi.ac.uk/metagenomics/
Twitter: MGnifyDB
Repositories: 153
Profile: https://github.com/EBI-Metagenomics

MGnify (formerly known as EBImetagenomics) is a free resource for the assembly, analysis, archiving and browsing all types of microbiome derived sequence data

Citation (CITATIONS.md)

# ebi-metagenomics/mettannotator: Citations

## [mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies (pre-print)](https://doi.org/10.1101/2024.07.11.603040)

> Gurbich TA, Beracochea M, De Silva NH, Finn RD. mettannotator: a comprehensive and scalable Nextflow annotation pipeline for prokaryotic assemblies. doi: https://doi.org/10.1101/2024.07.11.603040

## [MGnify Genomes](https://pubmed.ncbi.nlm.nih.gov/36806692/)

> Gurbich TA, Almeida A, Beracochea M, Burdett T, Burgin J, Cochrane G, Raj S, Richardson L, Rogers AB, Sakharova E, Salazar GA and Finn RD. MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues. J Mol Biol. 2023 Jul; 435(14). doi: https://doi.org/10.1016/j.jmb.2023.168016. PubMed PMID:

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [Prokka](https://pubmed.ncbi.nlm.nih.gov/24642063/)

  > Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063.

- [Bakta](https://pubmed.ncbi.nlm.nih.gov/34739369/)

  > Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom. 2021 Nov;7(11):000685. doi: 10.1099/mgen.0.000685. PMID: 34739369; PMCID: PMC8743544.

- [InterProScan](https://pubmed.ncbi.nlm.nih.gov/24451626/)

  > Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014 May 1;30(9):1236-40. doi: 10.1093/bioinformatics/btu031. Epub 2014 Jan 21. PMID: 24451626; PMCID: PMC3998142.

- [eggNOG-mapper](https://pubmed.ncbi.nlm.nih.gov/34597405/)

  > Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol. 2021 Dec 9;38(12):5825-5829. doi: 10.1093/molbev/msab293. PMID: 34597405; PMCID: PMC8662613.

- [UniFIRE](https://pubmed.ncbi.nlm.nih.gov/32399560/)

  > MacDougall A, Volynkin V, Saidi R, Poggioli D, Zellner H, Hatton-Ellis E, Joshi V, O'Donovan C, Orchard S, Auchincloss AH, Baratin D, Bolleman J, Coudert E, de Castro E, Hulo C, Masson P, Pedruzzi I, Rivoire C, Arighi C, Wang Q, Chen C, Huang H, Garavelli J, Vinayaka CR, Yeh LS, Natale DA, Laiho K, Martin MJ, Renaux A, Pichler K; UniProt Consortium. UniRule: a unified rule resource for automatic annotation in the UniProt Knowledgebase. Bioinformatics. 2020 Nov 1;36(17):4643-4648. doi: 10.1093/bioinformatics/btaa485. Erratum in: Bioinformatics. 2021 Apr 1;36(22-23):5562. PMID: 32399560; PMCID: PMC7750954.

- [AMRFinderPlus](https://pubmed.ncbi.nlm.nih.gov/34135355/)

  > Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, Hoffmann M, Pettengill JB, Prasad AB, Tillman GE, Tyson GH, Klimke W. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021 Jun 16;11(1):12728. doi: 10.1038/s41598-021-91456-0. PMID: 34135355; PMCID: PMC8208984.

- [DefenseFinder](https://pubmed.ncbi.nlm.nih.gov/35538097/)

  > Tesson F, Hervé A, Mordret E, Touchon M, d'Humières C, Cury J, Bernheim A. Systematic and quantitative view of the antiviral arsenal of prokaryotes. Nat Commun. 2022 May 10;13(1):2561. doi: 10.1038/s41467-022-30269-9. PMID: 35538097; PMCID: PMC9090908.

- [GECCO](https://www.biorxiv.org/content/10.1101/2021.05.03.442509v1)

  > Carroll LM, Larralde M, Fleck JS, Ponnudurai R, Milanese A, Barazzone EC, Zeller G. Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509

- [antiSMASH](https://pubmed.ncbi.nlm.nih.gov/37140036/)

  > Blin K, Shaw S, Augustijn HE, Reitz ZL, Biermann F, Alanjary M, Fetter A, Terlouw BR, Metcalf WW, Helfrich EJN, van Wezel GP, Medema MH, Weber T. antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation. Nucleic Acids Res. 2023 Jul 5;51(W1):W46-W50. doi: 10.1093/nar/gkad344. PMID: 37140036; PMCID: PMC10320115.

- [SanntiS](https://www.biorxiv.org/content/10.1101/2023.05.23.540769v2)

  > Sanchez S, Rogers JD, Rogers AB, Nassar M, McEntyre J, Welch M, Hollfelder F, Finn RD. Expansion of novel biosynthetic gene clusters from diverse environments using SanntiS. bioRxiv 2023.05.23.540769; doi: 10.1101/2023.05.23.540769.

- [Pseudofinder](https://doi.org/10.1093/molbev/msac153)

  > Syberg-Olsen MJ, Garber AI, Keeling PJ, McCutcheon JP, Husnik F. Pseudofinder: Detection of Pseudogenes in Prokaryotic Genomes. Mol Biol Evol. 2022 Jul 2;39(7):msac153. doi: 10.1093/molbev/msac153. PMID: 35801562; PMCID: PMC9336565.

- [run_dbCAN](https://pubmed.ncbi.nlm.nih.gov/37125649/)

  > Zheng J, Ge Q, Yan Y, Zhang X, Huang L, Yin Y. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Res. 2023 Jul 5;51(W1):W115-W121. doi: 10.1093/nar/gkad328. PMID: 37125649; PMCID: PMC10320055.

- [CRISPRCasFinder](https://pubmed.ncbi.nlm.nih.gov/29790974/)

  > Couvin D, Bernheim A, Toffano-Nioche C, Touchon M, Michalik J, Néron B, Rocha EPC, Vergnaud G, Gautheret D, Pourcel C. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res. 2018 Jul 2;46(W1):W246-W251. doi: 10.1093/nar/gky425. PMID: 29790974; PMCID: PMC6030898.

- [Infernal](https://pubmed.ncbi.nlm.nih.gov/24008419/)

  > Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013 Nov 15;29(22):2933-5. doi: 10.1093/bioinformatics/btt509. Epub 2013 Sep 4. PMID: 24008419; PMCID: PMC3810854.

- [Rfam](https://pubmed.ncbi.nlm.nih.gov/29112718/)

  > Kalvari I, Argasinska J, Quinones-Olvera N, Nawrocki EP, Rivas E, Eddy SR, Bateman A, Finn RD, Petrov AI. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 2018 Jan 4;46(D1):D335-D342. doi: 10.1093/nar/gkx1038. PMID: 29112718; PMCID: PMC5753348.

- [tRNAscan-SE](https://pubmed.ncbi.nlm.nih.gov/34417604/)

  > Chan PP, Lin BY, Mak AJ, Lowe TM. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 2021 Sep 20;49(16):9077-9096. doi: 10.1093/nar/gkab688. PMID: 34417604; PMCID: PMC8450103.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PMID: 28379341; PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PMID: 28494014; PMCID: PMC5426675.

GitHub Events

Total

Create event: 7
Issues event: 4
Release event: 1
Watch event: 20
Delete event: 5
Issue comment event: 14
Push event: 108
Pull request review event: 16
Pull request review comment event: 9
Pull request event: 14

Last Year

Create event: 7
Issues event: 4
Release event: 1
Watch event: 20
Delete event: 5
Issue comment event: 14
Push event: 108
Pull request review event: 16
Pull request review comment event: 9
Pull request event: 14

Issues and Pull Requests

All Time

Total issues: 2
Total pull requests: 1
Average time to close issues: 3 months
Average time to close pull requests: N/A
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 1
Average time to close issues: 3 months
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

davidmadariaga (2)
xeniaroda29 (1)
Andrewab9 (1)
tgurbich (1)
zhanxw (1)
vicru93 (1)
ndreey (1)

Pull Request Authors

tgurbich (29)
mberacochea (11)
nds (3)
Ales-ibt (1)

Top Labels

Issue Labels

bug (4)
help wanted (2)
stale (1)

Pull Request Labels

bug (1)

mettannotator

Science Score: 75.0%
