bakta

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

https://github.com/oschwengers/bakta


Keywords

annotation bacteria bacterial-genomes bioinformatics genome-annotation mag metagenome-assembled-genomes microbial-genomics plasmids

Repository


Basic Info
  • Host: GitHub
  • Owner: oschwengers
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 99.4 MB
Statistics
  • Stars: 544
  • Watchers: 15
  • Forks: 67
  • Open Issues: 20
  • Releases: 49

README.md

DOI: 10.1099/mgen.0.000685 · License: GPL v3


Bakta: rapid & standardized annotation of bacterial genomes, MAGs & plasmids

Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. It provides dbxref-rich, sORF-including and taxon-independent annotations in machine-readable JSON & bioinformatics standard file formats for automated downstream analysis.

Description

  • Comprehensive & taxonomy-independent database Bakta provides a large and taxonomy-independent database using UniProt's entire UniRef protein sequence cluster universe. Thus, it achieves favourable annotations in terms of sensitivity and specificity along the broad continuum ranging from well-studied species to unknown genomes from MAGs.

  • Protein sequence identification Bakta exactly identifies known identical protein sequences (IPS) from RefSeq and UniProt allowing the fine-grained annotation of gene alleles (AMR) or closely related but distinct protein families. This is achieved via an alignment-free sequence identification (AFSI) approach using full-length MD5 protein sequence hash digests.

  • Fast This AFSI approach substantially accelerates the annotation process by avoiding computationally expensive homology searches for identified genes. Thus, Bakta can annotate a typical bacterial genome in 10 ± 5 min on a laptop, and plasmids in a couple of seconds to minutes.

  • Database cross-references Fostering the FAIR principles, Bakta exploits its AFSI approach to annotate CDS with database cross-references (dbxref) to RefSeq (WP_*), UniRef100 (UniRef100_*) and UniParc (UPI*). By doing so, IPS allow the surveillance of distinct gene alleles and streamline comparative analysis as well as posterior (external) annotations of putative & hypothetical protein sequences, which can be mapped back to existing CDS via these exact & stable identifiers (e.g. E. coli gene ymiA). Currently, Bakta identifies ~350 million, ~330 million and ~290 million distinct protein sequences from UniParc, UniRef100 and RefSeq, respectively. Hence, for certain genomes, up to 99 % of all CDS can be identified this way, skipping computationally expensive sequence alignments.

  • FAIR annotations To provide standardized annotations adhering to FAIR principles, Bakta utilizes a versioned custom annotation database comprising UniProt's UniRef100 & UniRef90 protein clusters (FAIR -> DOI/DOI) enriched with dbxrefs (GO, COG, EC) and annotated by specialized niche databases. For each DB version we provide a comprehensive log file of all imported sequences and annotations.

  • Small proteins / short open reading frames Bakta detects and annotates small proteins/short open reading frames (sORF) which are not predicted by tools like Prodigal.

  • Expert annotation systems To provide high quality annotations for certain proteins of higher interest, e.g. AMR & VF genes, Bakta includes & merges different expert annotation systems. Currently, Bakta uses NCBI's AMRFinderPlus for AMR gene annotations as well as a generalized protein sequence expert system with distinct coverage, identity and priority values for each sequence, currently comprising the VFDB as well as NCBI's BlastRules.

  • Comprehensive workflow Bakta annotates ncRNA cis-regulatory regions, oriC/oriV/oriT and assembly gaps as well as standard feature types: tRNA, tmRNA, rRNA, ncRNA genes, CRISPR, CDS and pseudogenes.

  • GFF3 & INSDC conform annotations Bakta writes GFF3 and INSDC-compliant (Genbank & EMBL) annotation files ready for submission (checked via GenomeTools GFF3Validator, table2asn_GFF and ENA Webin-CLI for GFF3 and EMBL file formats, respectively for representative genomes of all ESKAPE species).

  • Bacteria & plasmids only Bakta was designed to annotate bacteria (isolates & MAGs) and plasmids, only. This decision by design has been made in order to tweak the annotation process regarding tools, preferences & databases and to streamline further development & maintenance of the software.

  • Reasoning By annotating bacterial genomes in a standardized, taxonomy-independent, high-throughput and local manner, Bakta aims at a well-balanced tradeoff between fully featured but computationally demanding pipelines like PGAP and rapid highly customizable offline tools like Prokka. Indeed, Bakta is heavily inspired by Prokka (kudos to Torsten Seemann) and many command line options are compatible for the sake of interoperability and user convenience. Hence, if Bakta does not fit your needs, please consider trying Prokka.

Installation

Bakta can be installed via BioConda, Docker, Singularity and Pip. However, we encourage using Conda or Docker/Singularity to automatically install all required 3rd party dependencies.

In all cases a mandatory database must be downloaded.

BioConda

```bash
conda install -c conda-forge -c bioconda bakta
```

Podman (Docker)

We maintain a Docker image oschwengers/bakta providing an entrypoint, so that containers can be used like an executable:

```bash
podman pull oschwengers/bakta
podman run oschwengers/bakta --help
```

Installation instructions and get-started guides: Podman docs. For further convenience, we provide a shell script (bakta-podman.sh) handling Podman related parameters (volume mounting, user IDs, etc):

```bash
bakta-podman.sh --db <db-path> --output <output-path> <input>
```

For experienced users and full functionality (bakta_db & bakta_proteins), an image without entrypoint might be a better option. For these cases, please use one of the Biocontainer images:

```bash
export CONTAINER="quay.io/biocontainers/bakta:1.8.2--pyhdfd78af_0"
podman run -it --rm $CONTAINER bakta --help
podman run -it --rm $CONTAINER bakta_db --help
```

Pip

```bash
python3 -m pip install --user bakta
```

Bakta requires the following 3rd party software tools which must be installed and executable to use the full set of features:

Database download

Bakta requires a mandatory database which is publicly hosted at Zenodo. We provide two types: full and light. To get the best annotation results and to use all features, we recommend using the full type (default). If you seek maximum runtime performance or if download time/storage requirements are an issue, please try the light version. Further information is provided in the database section below.

List available DB versions (available as either full or light):

```bash
bakta_db list
...
```

To download the most recent compatible database version, we recommend using the internal database download & setup tool:

```bash
bakta_db download --output <output-path> --type [light|full]
```

Of course, the database can also be downloaded and installed manually:

```bash
wget https://zenodo.org/record/14916843/files/db-light.tar.xz
bakta_db install -i db-light.tar.xz
```

If required, or desired, the AMRFinderPlus DB can also be updated manually:

```bash
amrfinder_update --force_update --database db-light/amrfinderplus-db/
```

If you're using bakta on Docker:

```bash
docker run -v /path/to/desired-db-path:/db --entrypoint /bin/bash oschwengers/bakta:latest -c "bakta_db download --output /db --type [light|full]"
```

As an additional data repository backup, we provide the most recent database version via our institute servers: full, light. However, the bandwidth is limited. Hence, please use it with caution and only if Zenodo is temporarily unreachable or slow.

Update an existing database:

```bash
bakta_db update --db <existing-db-path> [--tmp-dir <tmp-directory>]
```

Update using Docker:

```bash
docker run -v /path/to/desired-db-path:/db --entrypoint /bin/bash oschwengers/bakta:latest -c "bakta_db update --db /db/db-[light|full]"
```

The database path can be provided either via parameter (--db) or environment variable (BAKTA_DB):

```bash
bakta --db <db-path> genome.fasta

export BAKTA_DB=<db-path>
bakta genome.fasta
```

For system-wide setups, the database can also be copied to the Bakta base directory:

```bash
cp -r db/ <bakta-installation-dir>
```

As Bakta takes advantage of AMRFinderPlus for the annotation of AMR genes, AMRFinderPlus needs to set up its own internal databases once in an <amrfinderplus-db> subfolder within the Bakta database <db-path> via amrfinder_update --force_update --database <db-path>/amrfinderplus-db/. To ease this process, we recommend using Bakta's internal download procedure.

Examples

Simple:

```bash
bakta --db <db-path> genome.fasta
```

Expert: verbose output writing results to results directory with ecoli123 file prefix and eco634 locus tag using an existing prodigal training file, using additional replicon information and 8 threads:

```bash
bakta --db <db-path> --verbose --output results/ --prefix ecoli123 --locus-tag eco634 --prodigal-tf eco.tf --replicons replicon.tsv --threads 8 genome.fasta
```

Input and Output

Input

Bakta accepts bacterial genomes and plasmids (complete / draft assemblies) in (zipped) fasta format. For a full description of how further genome information can be provided and workflow customizations can be set, please have a look at the Usage section or this manual.

Replicon meta data table

To fine-tune the very details of each sequence in the input fasta file, Bakta accepts a replicon meta data table provided in csv or tsv file format: --replicons <file.tsv>. Thus, complete replicons within partially completed draft assemblies can be marked & handled as such, e.g. detection & annotation of features spanning sequence edges.

Table format:

original sequence id | new sequence id | type | topology | name
----|----------------|----------------|----------------|----------------
old id | new id, <empty> | chromosome, plasmid, contig, <empty> | circular, linear, <empty> | name, <empty>

For each input sequence, recognized via its original sequence id, a new sequence id, the replicon type and the topology as well as a name can be explicitly set.

Shortcuts:

  • chromosome: c
  • plasmid: p
  • circular: c
  • linear: l

<empty> values (- or an empty field) will be replaced by defaults. If the new sequence id is <empty>, a new contig name will be autogenerated.

Defaults:

  • type: contig
  • topology: linear

Example:

original sequence id | new sequence id | type | topology | name
----|----------------|----------------|----------------|----------------
NODE_1 | chrom | chromosome | circular | -
NODE_2 | p1 | plasmid | c | pXYZ1
NODE_3 | p2 | p | c | pXYZ2
NODE_4 | special-contig-name-xyz | - | - | -
NODE_5 |  | - | - | -
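A replicon table like the one above can also be generated programmatically. The following is a minimal sketch, assuming a tab-separated file without a header row (whether Bakta expects or tolerates a header row is not stated here); the helper name `write_replicon_table` and the example ids are hypothetical.

```python
import csv
import io

# Allowed values taken from the table format above; "" and "-" both mean <empty>.
TYPES = {"chromosome", "plasmid", "contig", "c", "p", "", "-"}
TOPOLOGIES = {"circular", "linear", "c", "l", "", "-"}


def write_replicon_table(rows, handle):
    """Write (original id, new id, type, topology, name) rows as TSV.

    Hypothetical helper, not part of Bakta.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for original_id, new_id, rep_type, topology, name in rows:
        if rep_type not in TYPES:
            raise ValueError(f"unknown replicon type: {rep_type!r}")
        if topology not in TOPOLOGIES:
            raise ValueError(f"unknown topology: {topology!r}")
        writer.writerow([original_id, new_id, rep_type, topology, name])


# Example rows with made-up sequence ids.
rows = [
    ("seq1", "chrom", "chromosome", "circular", "-"),
    ("seq2", "p1", "p", "c", "pXYZ1"),
]
buffer = io.StringIO()
write_replicon_table(rows, buffer)
print(buffer.getvalue(), end="")
```

Writing to a real file instead of the in-memory buffer only requires passing an open file handle; the resulting file would then be handed to Bakta via --replicons.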

User-provided regions

Bakta accepts pre-annotated (a priori), user-provided feature regions via --regions in either GFF3 or GenBank format. These regions supersede all de novo-predicted regions, but are equally subject to the internal functional annotation process. Currently, only CDS are supported. A maximum overlap with de novo-predicted CDS of 30 bp is allowed. If you would like to provide custom functional annotations, you can provide these via --proteins which is described in the following section.

User-provided protein sequences

Bakta accepts user-provided trusted protein sequences via --proteins in either GenBank (CDS features) or Fasta format which are used in the functional annotation process. Using the Fasta format, each reference sequence can be provided in a short or long format:

```bash
# short:
>id gene~~~product~~~dbxrefs
MAQ...

# long:
>id min_identity~~~min_query_cov~~~min_subject_cov~~~gene~~~product~~~dbxrefs
MAQ...
```

Allowed values:

field | value(s) | example
----|----------------|----------------
min_identity | int, float | 80, 90.3
min_query_cov | int, float | 80, 90.3
min_subject_cov | int, float | 80, 90.3
gene | <empty>, string | msp
product | string | my special protein
dbxrefs | <empty>, db:id, comma-separated list | VFDB:VF0511

Protein sequences provided in short Fasta or GenBank format are searched with default thresholds of 90%, 80% and 80% for minimal identity, query and subject coverage, respectively.
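The short and long header formats can be assembled with a few lines of Python. This is only an illustrative sketch: the function `format_protein_header` and its parameter names are hypothetical, and it simply joins the fields described above with `~~~`.

```python
def format_protein_header(seq_id, product, gene="", dbxrefs=(),
                          min_identity=None, min_query_cov=None,
                          min_subject_cov=None):
    """Build a Fasta header in the short or long format (hypothetical helper)."""
    fields = [gene, product, ",".join(dbxrefs)]
    thresholds = (min_identity, min_query_cov, min_subject_cov)
    if any(t is not None for t in thresholds):
        # The long format carries all three thresholds in front of the short fields.
        if any(t is None for t in thresholds):
            raise ValueError("long format needs all three thresholds")
        fields = [str(t) for t in thresholds] + fields
    return f">{seq_id} " + "~~~".join(fields)


# Short format: default thresholds apply.
print(format_protein_header("id1", "my special protein",
                            gene="msp", dbxrefs=["VFDB:VF0511"]))
# Long format: explicit identity and coverage thresholds.
print(format_protein_header("id2", "my special protein", gene="msp",
                            min_identity=90.3, min_query_cov=80,
                            min_subject_cov=80))
```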

User-provided HMMs

Bakta accepts user-provided trusted HMMs via --hmms in HMMER's text format. If set, Bakta will adhere to the trusted cutoff specified in the HMM header. In addition, a max. evalue threshold of 1e-6 is applied. By default, Bakta uses the HMM description line as a product description. Further information can be provided via the HMM description line using the short format as explained above in the User-provided protein sequences section.

```bash
# default
HMMER3/f [3.1b2 | February 2015]
NAME  id
ACC   id
DESC  product
LENG  435
TC    600 600

# short
NAME  id
ACC   id
DESC  gene~~~product~~~dbxrefs
LENG  435
TC    600 600
```

Output

Annotation results are provided in standard bioinformatics file formats:

  • <prefix>.tsv: annotations as simple human readable TSV
  • <prefix>.gff3: annotations & sequences in GFF3 format
  • <prefix>.gbff: annotations & sequences in (multi) GenBank format
  • <prefix>.embl: annotations & sequences in (multi) EMBL format
  • <prefix>.fna: replicon/contig DNA sequences as FASTA
  • <prefix>.ffn: feature nucleotide sequences as FASTA
  • <prefix>.faa: CDS/sORF amino acid sequences as FASTA
  • <prefix>.inference.tsv: inference metrics (score, evalue, coverage, identity) for annotated accessions as TSV
  • <prefix>.hypotheticals.tsv: further information on hypothetical protein CDS as simple human readable tab separated values
  • <prefix>.hypotheticals.faa: hypothetical protein CDS amino acid sequences as FASTA
  • <prefix>.txt: summary as TXT
  • <prefix>.png: circular genome annotation plot as PNG
  • <prefix>.svg: circular genome annotation plot as SVG
  • <prefix>.json: all (internal) annotation & sequence information as JSON

The <prefix> can be set via --prefix <prefix>. If no prefix is set, Bakta uses the input file prefix.

Of note, Bakta provides all detailed (internal) information on each annotated feature in a standardized machine-readable JSON file <prefix>.json:

```json
{
    "genome": {
        "genus": "Escherichia",
        "species": "coli",
        ...
    },
    "stats": {
        "size": 5594605,
        "gc": 0.497,
        ...
    },
    "features": [
        {
            "type": "cds",
            "contig": "contig_1",
            "start": 971,
            "stop": 1351,
            "strand": "-",
            "gene": "lsoB",
            "product": "type II toxin-antitoxin system antitoxin LsoB",
            ...
        },
        ...
    ],
    "sequences": [
        {
            "id": "c1",
            "description": "[organism=Escherichia coli] [completeness=complete] [topology=circular]",
            "nt": "AGCTTT...",
            "length": 5498578,
            "complete": true,
            "type": "chromosome",
            "topology": "circular"
            ...
        },
        ...
    ]
}
```
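For downstream analysis, the JSON file can be read with Python's standard library alone. The sketch below assumes only the keys visible in the excerpt above (`features` and each feature's `type`); the function names are hypothetical and the rest of the schema is not relied upon.

```python
import gzip
import json
from collections import Counter


def load_bakta_json(path):
    """Load a Bakta JSON result, transparently handling gzipped files."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def count_feature_types(result):
    """Count annotated features per feature type."""
    return Counter(feature["type"] for feature in result["features"])


# Tiny in-memory stand-in for a real result file.
example = {"features": [{"type": "cds"}, {"type": "cds"}, {"type": "tRNA"}]}
for feature_type, count in sorted(count_feature_types(example).items()):
    print(feature_type, count)
```

With a real result, `count_feature_types(load_bakta_json("<prefix>.json"))` would give a quick overview of what was annotated.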

Bakta provides a helper function to create above mentioned output files from the (GNU-zipped) JSON result file, thus helping potential long-term or large-scale annotation projects to reduce overall storage requirements.

```bash
bakta_io --output <output-path> --prefix <file-prefix> result.json.gz

bakta_io --help
```

Exemplary annotation result files for several genomes (mostly ESKAPE species) are hosted at Zenodo: DOI

Usage

```bash
usage: bakta [--db DB] [--min-contig-length MIN_CONTIG_LENGTH] [--prefix PREFIX] [--output OUTPUT]
             [--force] [--genus GENUS] [--species SPECIES] [--strain STRAIN] [--plasmid PLASMID]
             [--complete] [--prodigal-tf PRODIGAL_TF] [--translation-table {11,4,25}] [--gram {+,-,?}]
             [--locus LOCUS] [--locus-tag LOCUS_TAG] [--locus-tag-increment {1,5,10}]
             [--keep-contig-headers] [--compliant] [--replicons REPLICONS] [--regions REGIONS]
             [--proteins PROTEINS] [--hmms HMMS] [--meta] [--skip-trna] [--skip-tmrna] [--skip-rrna]
             [--skip-ncrna] [--skip-ncrna-region] [--skip-crispr] [--skip-cds] [--skip-pseudo]
             [--skip-sorf] [--skip-gap] [--skip-ori] [--skip-filter] [--skip-plot] [--help]
             [--verbose] [--debug] [--threads THREADS] [--tmp-dir TMP_DIR] [--version]

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

positional arguments:
  Genome sequences in (zipped) fasta format

Input / Output:
  --db DB, -d DB
        Database path (default = /db). Can also be provided as BAKTA_DB environment variable.
  --min-contig-length MIN_CONTIG_LENGTH, -m MIN_CONTIG_LENGTH
        Minimum contig/sequence size (default = 1; 200 in compliant mode)
  --prefix PREFIX, -p PREFIX
        Prefix for output files
  --output OUTPUT, -o OUTPUT
        Output directory (default = current working directory)
  --force, -f
        Force overwriting existing output folder (except for current working directory)

Organism:
  --genus GENUS
        Genus name
  --species SPECIES
        Species name
  --strain STRAIN
        Strain name
  --plasmid PLASMID
        Plasmid name

Annotation:
  --complete
        All sequences are complete replicons (chromosome/plasmid[s])
  --prodigal-tf PRODIGAL_TF
        Path to existing Prodigal training file to use for CDS prediction
  --translation-table {11,4,25}
        Translation table: 11/4/25 (default = 11)
  --gram {+,-,?}
        Gram type for signal peptide predictions: +/-/? (default = ?)
  --locus LOCUS
        Locus prefix (default = 'contig')
  --locus-tag LOCUS_TAG
        Locus tag prefix (default = autogenerated)
  --locus-tag-increment {1,5,10}
        Locus tag increment: 1/5/10 (default = 1)
  --keep-contig-headers
        Keep original contig/sequence headers
  --compliant
        Force Genbank/ENA/DDBJ compliance
  --replicons REPLICONS, -r REPLICONS
        Replicon information table (tsv/csv)
  --regions REGIONS
        Path to pre-annotated regions in GFF3 or Genbank format (regions only, no functional annotations).
  --proteins PROTEINS
        Fasta file of trusted protein sequences for CDS annotation
  --hmms HMMS
        HMM file of trusted hidden markov models in HMMER format for CDS annotation
  --meta
        Run in metagenome mode. This only affects CDS prediction.

Workflow:
  --skip-trna
        Skip tRNA detection & annotation
  --skip-tmrna
        Skip tmRNA detection & annotation
  --skip-rrna
        Skip rRNA detection & annotation
  --skip-ncrna
        Skip ncRNA detection & annotation
  --skip-ncrna-region
        Skip ncRNA region detection & annotation
  --skip-crispr
        Skip CRISPR array detection & annotation
  --skip-cds
        Skip CDS detection & annotation
  --skip-pseudo
        Skip pseudogene detection & annotation
  --skip-sorf
        Skip sORF detection & annotation
  --skip-gap
        Skip gap detection & annotation
  --skip-ori
        Skip oriC/oriT detection & annotation
  --skip-filter
        Skip feature overlap filters
  --skip-plot
        Skip generation of circular genome plots

General:
  --help, -h
        Show this help message and exit
  --verbose, -v
        Print verbose information
  --debug
        Run Bakta in debug mode. Temp data will not be removed.
  --threads THREADS, -t THREADS
        Number of threads to use (default = number of available CPUs)
  --tmp-dir TMP_DIR
        Location for temporary files (default = system dependent auto detection)
  --version
        show program's version number and exit
```

Annotation Workflow

RNAs

  1. tRNA genes: tRNAscan-SE 2.0
  2. tmRNA genes: Aragorn
  3. rRNA genes: Infernal vs. Rfam rRNA covariance models
  4. ncRNA genes: Infernal vs. Rfam ncRNA covariance models
  5. ncRNA cis-regulatory regions: Infernal vs. Rfam ncRNA covariance models
  6. CRISPR arrays: PILER-CR

Bakta distinguishes ncRNA genes and (cis-regulatory) regions in order to enable the distinct handling thereof during the annotation process, i.e. feature overlap detection.

ncRNA gene types:

  • sRNA
  • antisense
  • ribozyme
  • antitoxin

ncRNA (cis-regulatory) region types:

  • riboswitch
  • thermoregulator
  • leader
  • frameshift element

Coding sequences

The structural prediction is conducted via Pyrodigal and complemented by a custom detection of sORF < 30 aa. In addition, superseding regions of pre-predicted CDS can be provided via --regions.

To rapidly identify known protein sequences with exact sequence matches and to conduct comprehensive annotations, Bakta utilizes a compact read-only SQLite database comprising protein sequence digests and pre-assigned annotations for millions of known protein sequences and clusters.

Conceptual terms:

  • UPS: unique protein sequences identified via length and MD5 hash digests (100% coverage & 100% sequence identity)
  • IPS: identical protein sequences comprising seeds of UniProt's UniRef100 protein sequence clusters
  • PSC: protein sequence clusters comprising seeds of UniProt's UniRef90 protein sequence clusters
  • PSCC: protein sequence clusters of clusters comprising annotations of UniProt's UniRef50 protein sequence clusters
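The idea behind UPS identification, an exact lookup keyed by sequence length and MD5 digest, can be sketched in a few lines of Python. This is an illustration of the principle rather than Bakta's implementation: the normalisation step (upper case, no trailing stop symbol) and the in-memory dictionary standing in for the SQLite database are assumptions.

```python
import hashlib


def protein_key(sequence):
    """Return (length, MD5 hex digest) for an amino acid sequence.

    Assumption: sequences are compared in upper case without a trailing
    stop symbol; Bakta's exact normalisation is not described here.
    """
    seq = sequence.strip().upper().rstrip("*")
    return len(seq), hashlib.md5(seq.encode("ascii")).hexdigest()


# Toy lookup table standing in for the database (hypothetical content).
known = {protein_key("MAQTK"): "example protein"}

print(known.get(protein_key("maqtk*")))  # identical sequence: hit
print(known.get(protein_key("MAQTR")))   # one residue differs: no hit
```

Because the key changes completely with a single residue substitution, such a lookup only ever reports 100 % identical, full-length matches, which is why no alignment is needed for these sequences.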

CDS:

  1. De novo-prediction via Pyrodigal respecting sequences' completeness (distinct prediction for complete replicons and uncompleted contigs)
  2. Discard spurious CDS via AntiFam
  3. Detect translational exceptions (selenocysteines)
  4. Import of superseding user-provided CDS regions (optional)
  5. Detection of UPSs via MD5 digests and lookup of related IPS and PSC
  6. Sequence alignments of remainder via Diamond vs. PSC (query/subject coverage=0.8, identity=0.5)
  7. Assignment to UniRef90 or UniRef50 clusters if alignment hits achieve identities larger than 0.9 or 0.5, respectively
  8. Execution of expert systems:
    • AMR: AMRFinderPlus
    • Expert proteins: NCBI BlastRules, VFDB
    • User proteins (optionally via --proteins <Fasta/GenBank>)
  9. Prediction of signal peptides (optionally via --gram <+/->)
  10. Detection of pseudogenes:
    1. Search for reference PSCs using hypothetical CDS as seed sequences
    2. Translated alignment (blastx) of reference PSCs against up-/downstream-elongated CDS regions
    3. Analysis of translated alignments and detection of pseudogenization causes & effects
  11. Combination of IPS, PSC, PSCC and expert system information favouring more specific annotations and avoiding redundancy
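The threshold logic of steps 6 and 7 can be written down as a small decision function. This is a sketch under stated assumptions: the function name is hypothetical, coverage thresholds are treated as inclusive and identity thresholds as strict ("larger than"), which the text above suggests but does not define precisely.

```python
def assign_cluster_level(identity, query_cov, subject_cov,
                         min_cov=0.8, psc_identity=0.9, pscc_identity=0.5):
    """Map an alignment hit to 'PSC' (UniRef90), 'PSCC' (UniRef50) or None.

    All values are fractions between 0 and 1. Hypothetical helper that
    mirrors the thresholds listed in steps 6 and 7.
    """
    if query_cov < min_cov or subject_cov < min_cov:
        return None  # hit does not cover enough of query or subject
    if identity > psc_identity:
        return "PSC"
    if identity > pscc_identity:
        return "PSCC"
    return None


print(assign_cluster_level(0.95, 0.90, 0.85))  # high identity
print(assign_cluster_level(0.70, 0.90, 0.85))  # intermediate identity
print(assign_cluster_level(0.95, 0.50, 0.85))  # insufficient query coverage
```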

CDS without IPS or PSC hits as well as those without gene symbols or product descriptions different from hypothetical will be marked as hypothetical.

Such hypothetical CDS are further analyzed:

  1. Detection of Pfam domains, repeats & motifs
  2. Calculation of protein sequence statistics, i.e. molecular weight, isoelectric point

sORFs:

  1. Custom sORF detection & extraction with amino acid lengths < 30 aa
  2. Apply strict feature type-dependent overlap filters
  3. Discard spurious sORF via AntiFam
  4. Detection of UPS via MD5 hashes and lookup of related IPS
  5. Sequence alignments of remainder via Diamond vs. an sORF subset of PSCs (coverage=0.9, identity=0.9)
  6. Exclude sORF without sufficient annotation information
  7. Prediction of signal peptides (optionally via --gram <+/->)

Due to the uncertain nature of sORF prediction, sORF not identified via IPS or PSC hits are discarded, as are all sORF without gene symbols or product descriptions different from hypothetical. Only the remaining sORF are included in the final annotation.

Miscellaneous

  1. Gaps: in-mem detection & annotation of sequence gaps
  2. oriC/oriV/oriT: Blast+ (cov=0.8, id=0.8) vs. MOB-suite oriT & DoriC oriC/oriV sequences. Annotations of ori regions take into account overlapping Blast+ hits and are conducted based on a majority vote heuristic. Region edges are fuzzy - use with caution!
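In-memory gap detection can be illustrated with a regular expression over the sequence string. The sketch assumes that assembly gaps are represented as runs of N and uses a hypothetical `min_length` parameter; Bakta's actual criteria are not specified here.

```python
import re


def find_gaps(sequence, min_length=1):
    """Return 1-based, inclusive (start, stop) coordinates of runs of N.

    Assumption: gaps are encoded as consecutive N characters.
    """
    gaps = []
    for match in re.finditer(r"N+", sequence.upper()):
        if match.end() - match.start() >= min_length:
            gaps.append((match.start() + 1, match.end()))
    return gaps


print(find_gaps("ACGTNNNNACGTNN"))                # [(5, 8), (13, 14)]
print(find_gaps("ACGTNNNNACGTNN", min_length=3))  # [(5, 8)]
```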

Database

The Bakta database comprises a set of AA & DNA sequence databases as well as HMM & covariance models. At its core Bakta utilizes a compact read-only SQLite DB storing protein sequence digests, lengths, pre-assigned annotations and dbxrefs of UPS, IPS and PSC from:

  • UPS: UniParc / UniProtKB (350,631,327)
  • IPS: UniProt UniRef100 (330,865,009)
  • PSC: UniProt UniRef90 (135,274,518)
  • PSCC: UniProt UniRef50 (37,008,138)

This allows the exact identification of protein sequences via MD5 digests & sequence lengths as well as the rapid subsequent lookup of related information. Protein sequence digests are checked for hash collisions during the DB creation process. IPS & PSC have been comprehensively pre-annotated, integrating annotations & dbxrefs from:

  • NCBI nonredundant proteins (UPS: 290,693,966)
  • NCBI COG DB (PSC: 3,513,643)
  • KEGG Kofams (PSC: 24,267,514)
  • SwissProt EC/GO terms (PSC: 337,264)
  • NCBI NCBIfams (PSC: 21,758,901)
  • PHROG (PSC: 11,717)
  • NCBI AMRFinderPlus (IPS: 8,382)
  • ISFinder DB (IPS: 155,449, PSC: 14,481)
  • Pfam families (PSC: 659,781)

To provide high quality annotations for distinct protein sequences of high importance (AMR, VF, etc) which cannot sufficiently be covered by the IPS/PSC approach, Bakta provides additional expert systems. For instance, AMR genes are annotated via NCBI's AMRFinderPlus. An expandable alignment-based expert system supports the incorporation of high quality annotations from multiple sources. This currently comprises NCBI's BlastRules as well as VFDB and will be complemented with more expert annotation sources over time. Internally, this expert system is based on a Diamond DB comprising the following information in a standardized format:

  • source: e.g. BlastRules
  • rank: a precedence rank
  • min identity
  • min query coverage
  • min model coverage
  • gene label
  • product description
  • dbxrefs

Rfam covariance models:

  • ncRNA: 779
  • ncRNA cis-regulatory regions: 288

ori sequences:

  • oriC/V: 6,690
  • oriT: 502

To provide FAIR annotations, the database releases are SemVer versioned (w/o patch level), i.e. <major>.<minor>. For each version we provide a comprehensive log file tracking all imported sequences as well as annotations thereof. The DB schema is represented by the <major> digit and automatically checked at runtime by Bakta in order to ensure compatibility. Content updates are tracked by the <minor> digit.
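The `<major>.<minor>` scheme makes a compatibility check straightforward: only the major digit has to match the schema version the software expects. The following is a hedged sketch of such a check; the function names and the way the required major version is passed in are hypothetical and do not reflect how Bakta performs its runtime check internally.

```python
def parse_db_version(version):
    """Split a '<major>.<minor>' version string into two integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(db_version, required_major):
    """A database is schema-compatible if its major version matches."""
    return parse_db_version(db_version)[0] == required_major


print(is_compatible("6.0", required_major=6))  # same schema
print(is_compatible("5.1", required_major=6))  # older schema
```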

As this taxonomically untargeted database is fairly demanding in terms of storage consumption, we also provide a lightweight DB type providing all non-coding feature information but only PSCC information from UniRef50 clusters for CDS. If download bandwidth or storage requirements become an issue, or if shorter runtimes are favored over more specific annotations, the light DB will do the job.

Latest database version: 6.0 DB types:

  • light: 1.3 Gb zipped, 3.9 Gb unzipped, MD5: 4a6e059ded39e9c5537ef4137d2f5648
  • full: 30 Gb zipped, 84 Gb unzipped, MD5: 4c1115e40abfa2b464ae5dd988bdd88e
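After a manual download, the published MD5 digests can be used to verify file integrity. The sketch below streams the file in chunks so that a 30 Gb archive does not need to fit into memory; it assumes that the digests listed above refer to the compressed archive files, which is not stated explicitly, and the file name in the usage note is only an example.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file by reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `verify_download("db-light.tar.xz", "4a6e059ded39e9c5537ef4137d2f5648")` would check a light database archive against the digest listed above.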

All database releases are hosted at Zenodo: DOI

Genome Submission

Most genomes annotated with Bakta should be ready to submit to the INSDC member databases GenBank and ENA. As a first step, please register your BioProject (e.g. PRJNA123456) and your locus_tag prefix (e.g. ESAKAI).

```bash
# annotate your genome in --compliant mode:
$ bakta --db <db-path> -v --genus Escherichia --species "coli O157:H7" --strain Sakai --complete --compliant --locus-tag ESAKAI test/data/GCF_000008865.2.fna.gz
```

GenBank

Genomes are submitted to GenBank via Fasta (.fna) and SQN files. Therefore, .sqn files can be created with NCBI's new table2asn tool via Bakta's .gff3 files. Please, have a look at the documentation and have all additional files (template.txt) prepared:

```bash
# download table2asn for Linux
$ wget https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn/linux64.table2asn.gz
$ gunzip linux64.table2asn.gz

# or MacOS
$ wget https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn/mac.table2asn.gz
$ gunzip mac.table2asn.gz

# make the unpacked binary executable (use mac.table2asn on MacOS)
$ chmod 755 linux64.table2asn

# create the SQN file:
$ ./linux64.table2asn -Z -W -M n -J -c w -t template.txt -V vbt -l paired-ends -i GCF_000008865.2.fna -f GCF_000008865.2.gff3 -o GCF_000008865.2.sqn
```

ENA

Genomes are submitted to ENA as EMBL (.embl) files via EBI's Webin-CLI tool. Please have all additional files (manifest.tsv, chrom-list.tsv) prepared as described here.

```bash
# download ENA Webin-CLI
$ wget https://github.com/enasequence/webin-cli/releases/download/8.1.0/webin-cli-8.1.0.jar

$ gzip -k GCF_000008865.2.embl
$ gzip -k chrom-list.tsv
$ java -jar webin-cli-8.1.0.jar -submit -userName=<username> -password <password> -context genome -manifest manifest.tsv
```

Exemplary manifest.tsv and chrom-list.tsv files might look like:

```bash
$ cat manifest.tsv
STUDY    PRJEB44484
SAMPLE    ERS6291240
ASSEMBLYNAME    GCF
ASSEMBLY_TYPE    isolate
COVERAGE    100
PROGRAM    SPAdes
PLATFORM    Illumina
MOLECULETYPE    genomic DNA
FLATFILE    GCF_000008865.2.embl.gz
CHROMOSOME_LIST    chrom-list.tsv.gz

$ cat chrom-list.tsv
contig1    contig1    circular-chromosome
contig2    contig2    circular-plasmid
contig3    contig3    circular-plasmid
```

Protein bulk annotation

For the direct bulk annotation of protein sequences aside from the genome, Bakta provides a dedicated CLI entry point bakta_proteins:

Examples:

```bash
bakta_proteins --db <db-path> input.fasta

bakta_proteins --db <db-path> --prefix test --output test --proteins special.faa --threads 8 input.fasta
```

Output

Annotation results are provided in standard bioinformatics file formats:

  • <prefix>.tsv: annotations as simple human readable TSV
  • <prefix>.faa: protein sequences as FASTA
  • <prefix>.hypotheticals.tsv: further information on hypothetical proteins as simple human readable tab separated values
  • <prefix>.json: all (internal) annotation & sequence information as JSON

The <prefix> can be set via --prefix <prefix>. If no prefix is set, Bakta uses the input file prefix.

Usage

```bash
usage: bakta_proteins [--db DB] [--output OUTPUT] [--prefix PREFIX] [--force] [--proteins PROTEINS]
                      [--help] [--verbose] [--debug] [--threads THREADS] [--tmp-dir TMP_DIR] [--version]

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

positional arguments:
  Protein sequences in (zipped) fasta format

Input / Output:
  --db DB, -d DB
        Database path (default = /db). Can also be provided as BAKTA_DB environment variable.
  --output OUTPUT, -o OUTPUT
        Output directory (default = current working directory)
  --prefix PREFIX, -p PREFIX
        Prefix for output files
  --force, -f
        Force overwriting existing output folder

Annotation:
  --proteins PROTEINS
        Fasta file of trusted protein sequences for annotation

General:
  --help, -h
        Show this help message and exit
  --verbose, -v
        Print verbose information
  --debug
        Run Bakta in debug mode. Temp data will not be removed.
  --threads THREADS, -t THREADS
        Number of threads to use (default = number of available CPUs)
  --tmp-dir TMP_DIR
        Location for temporary files (default = system dependent auto detection)
  --version, -V
        show program's version number and exit
```

Genome plots

Bakta allows the creation of circular genome plots via pyCirclize. Plots are generated as part of the default workflow and saved as PNG and SVG files. In addition to the default workflow, Bakta provides a dedicated CLI entry point bakta_plot:

Examples:

```bash
bakta_plot input.json

bakta_plot --output test --prefix test --config config.yaml --sequences 1,2 input.json
```

It accepts the results of a former annotation process in JSON format and allows the selection of distinct sequences, denoted either by their FASTA identifiers or by their sequential number starting at 1. Colors for each feature type can be adapted via a simple configuration file in YAML format, e.g. config.yaml. Currently, two default plot types are supported, i.e. features and cog. Examples for chromosomes and plasmids are provided here.

Usage

```bash
usage: bakta_plot [--config CONFIG] [--output OUTPUT] [--prefix PREFIX] [--sequences SEQUENCES] [--type {features,cog}] [--label LABEL] [--size {4,8,16}] [--dpi {150,300,600}] [--help] [--verbose] [--debug] [--tmp-dir TMP_DIR] [--version]

Rapid & standardized annotation of bacterial genomes, MAGs & plasmids

positional arguments:
  Bakta annotations in (zipped) JSON format

Input / Output:
  --config CONFIG, -c CONFIG
                        Plotting configuration in YAML format
  --output OUTPUT, -o OUTPUT
                        Output directory (default = current working directory)
  --prefix PREFIX, -p PREFIX
                        Prefix for output files

Plotting:
  --sequences SEQUENCES
                        Sequences to plot: comma separated number or name (default = all, numbers one-based)
  --type {features,cog}
                        Plot type: feature/cog (default = features)
  --label LABEL         Plot center label (for line breaks use '|')
  --size {4,8,16}       Plot size in inches: 4/8/16 (default = 8)
  --dpi {150,300,600}   Plot resolution as dots per inch: 150/300/600 (default = 300)

General:
  --help, -h            Show this help message and exit
  --verbose, -v         Print verbose information
  --debug               Run Bakta in debug mode. Temp data will not be removed.
  --tmp-dir TMP_DIR     Location for temporary files (default = system dependent auto detection)
  --version             show program's version number and exit
```

Description

Currently, there are two types of plots: features (the default) and cog. In the default mode (features), all features are plotted on two rings representing the forward and reverse strands (outer and inner ring, respectively), using the following feature colors:

  • CDS: #cccccc
  • tRNA/tmRNA: #b2df8a
  • rRNA: #fb8072
  • ncRNA: #fdb462
  • ncRNA-region: #80b1d3
  • CRISPR: #bebada
  • Gap: #000000
  • Misc: #666666

In the cog mode, all protein-coding genes (CDS) are colored according to their assigned COG functional categories. To better distinguish non-coding genes, these are plotted on an additional third ring.

In addition, both plot types share two innermost rings for GC content and GC skew. The first ring represents the GC content per sliding window over the entire sequence(s), with green (#33a02c) and red (#e31a1c) marking GC content above and below average, respectively. The second ring represents the GC skew in orange (#fdbf6f) and blue (#1f78b4). The GC skew gives hints on a replicon's replication bubble and hence on the completeness of the assembly. On a complete and circular bacterial chromosome, you normally see two inflection points: one at the origin of replication and one at the opposite region (see Wikipedia).
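The two quantities behind these rings can be sketched as follows. This is a generic illustration, not Bakta's or pyCirclize's implementation: the function name gc_windows, the window and step sizes, and the use of non-overlapping windows are assumptions. GC skew is computed with the common definition (G - C) / (G + C).

```python
def gc_windows(seq, window=1000, step=1000):
    """Return (start, gc_content, gc_skew) tuples for each full window.

    Hypothetical helper for illustration; window and step are arbitrary.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # GC content: fraction of G and C bases in the window.
        gc_content = (g + c) / len(chunk) if chunk else 0.0
        # GC skew: excess of G over C, normalized by their sum.
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        rows.append((start, gc_content, gc_skew))
    return rows


rows = gc_windows("GGGGCCCCATAT", window=4, step=4)
mean_gc = sum(r[1] for r in rows) / len(rows)
for start, gc, skew in rows:
    side = "above" if gc > mean_gc else "below"
    print(start, gc, skew, side)
```

Windows whose GC content exceeds the mean correspond to the green segments described above, the others to the red ones; a sign change in the skew along the sequence corresponds to an inflection point.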

Custom plot labels (text in the center) can be provided via --label:

```bash
bakta_plot --sequences 2 --dpi 300 --size 8 --prefix plot-cog-p2 --type cog --label="pO157|plasmid, 92.7 kbp"
```

Plot example of Bakta test genome.

Auxiliary scripts

Often, running Bakta is a necessary first step that is followed by deeper analyses implemented in custom scripts. In scripts, we'd like to collect and offer a pool of scripts addressing common tasks:

  • collect-annotation-stats.py: Collect annotation stats for a cohort of genomes and print a condensed TSV.
  • extract-region.py: Extract genome features within a given genomic range and export them as GFF3, EMBL, GenBank, FAA and FFN.

Of course, pull requests are welcome ;-)
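As a hedged example of the kind of downstream task such scripts address, the sketch below tallies feature types from a parsed Bakta JSON result. It assumes that the JSON document has a top-level "features" list whose entries carry a "type" field; this layout is an assumption here and should be checked against an actual result file. The function name count_feature_types is hypothetical.

```python
import json
from collections import Counter


def count_feature_types(result):
    """Count features per type in a parsed annotation result.

    Assumes result["features"] is a list of dicts with a "type" key.
    """
    features = result.get("features", [])
    return Counter(feature.get("type", "unknown") for feature in features)


# Small inline stand-in for json.load(open("<prefix>.json")).
example = json.loads('{"features": [{"type": "cds"}, {"type": "cds"}, {"type": "tRNA"}]}')
for feature_type, count in sorted(count_feature_types(example).items()):
    print(feature_type, count, sep="\t")
```

Reading a real result would replace the inline string with a file handle passed to json.load.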

Web version

For further convenience, we developed an accompanying web application available at https://bakta.computational.bio.

This web application provides an interactive genome browser, aggregated feature counts and a searchable data table with detailed information on each predicted feature, as well as dbxref-linked records to public databases.

Of note, this web application can also be used to visualize annotation results produced offline with the command line version. For this purpose, the web application provides an offline viewer that accepts JSON result files, which are parsed and visualized locally within the browser without sending any data to the server.

Citation

If you use Bakta in your research, please cite this paper:

Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685

Bakta is standing on the shoulders of giants, taking advantage of many great software tools and databases. If you find any of these useful for your research, please cite these primary sources as well.

Tools

Databases

FAQ

  • AMRFinder fails: If AMRFinder constantly crashes even on fresh setups and Bakta's database was downloaded manually, then AMRFinder needs to set up its own internal database. This is required only once: amrfinder_update --force_update --database <bakta-db>/amrfinderplus-db. Alternatively, Bakta's internal database download logic takes care of this automatically: bakta_db download --output <bakta-db>

  • DeepSig not found in Conda environment: For the prediction of signal peptides, Bakta (only up to v1.9.4) uses DeepSig, which is currently not available for MacOS. Therefore, we decided to exclude DeepSig from Bakta's default Conda dependencies because otherwise Bakta would not be installable on MacOS systems. On Linux systems, it can be installed via conda install -c conda-forge -c bioconda python=3.8 deepsig.

  • Nice, but I'm missing XYZ... Bakta is quite new and we're keen to constantly improve it and further expand its feature set. In case there's anything missing, please do not hesitate to open an issue and ask for it!

  • Bakta is running too long without CPU load... why? Bakta takes advantage of an SQLite DB which results in high storage IO loads. If this DB is stored on a remote / network volume, the lookup of IPS/PSC annotations might take a long time. In these cases, please, consider moving the DB to a local volume or hard drive.

Issues and Feature Requests

Bakta is new and, like any software, can be expected to contain some bugs. So, if you run into any issues with Bakta, we'd be happy to hear about them. Please run Bakta in debug mode (--debug) and do not hesitate to file an issue including as much information as possible:

  • a detailed description of the issue
  • command line output
  • log file (<prefix>.log)
  • result file (<prefix>.json) if possible
  • a reproducible example of the issue with an input file that you can share if possible

Owner

  • Name: Oliver Schwengers
  • Login: oschwengers
  • Kind: user
  • Location: Giessen, Germany
  • Company: @ag-computational-bio - JLU Giessen

Microbial bioinformatics, WGS bacteria, plasmids, PostDoc, father of 2, husband, astrophotographer

Citation (CITATION.bib)

@ARTICLE{Schwengers2021-fd,
title = "Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification",
author = "Schwengers, Oliver and Jelonek, Lukas and Dieckmann, Marius Alfred and Beyvers, Sebastian and Blom, Jochen and Goesmann, Alexander",
abstract = "Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.",
journal = "Microbial genomics",
volume =  7,
number =  11,
month =  nov,
year =  2021,
language = "en",
issn = "2057-5858",
pmid = "34739369",
doi = "10.1099/mgen.0.000685"
}

GitHub Events

Total
  • Create event: 13
  • Release event: 8
  • Issues event: 57
  • Watch event: 89
  • Delete event: 8
  • Issue comment event: 107
  • Push event: 64
  • Pull request review comment event: 5
  • Pull request review event: 10
  • Pull request event: 19
  • Fork event: 13
Last Year
  • Create event: 13
  • Release event: 8
  • Issues event: 57
  • Watch event: 89
  • Delete event: 8
  • Issue comment event: 107
  • Push event: 64
  • Pull request review comment event: 5
  • Pull request review event: 10
  • Pull request event: 19
  • Fork event: 13

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 1,015
  • Total Committers: 14
  • Avg Commits per committer: 72.5
  • Development Distribution Score (DDS): 0.028
Past Year
  • Commits: 98
  • Committers: 5
  • Avg Commits per committer: 19.6
  • Development Distribution Score (DDS): 0.061
Top Committers
Name Email Commits
Oliver Schwengers o****s@c****e 987
Julian Hahnfeld 8****d 7
Anna-Rehm 7****m 5
Mark Lubberts 1****s 3
Lukas Jelonek l****k 3
Ed Davis d****v@g****m 2
Alexandra Weisberg a****g@g****m 1
Eric Deveaud e****d@g****m 1
Michael Foster 5****r 1
Oskar Hickl 4****l 1
Sebastian Beyvers S****i 1
Sebastian Jaenicke s****n@j****g 1
Tadas Tamošauskas 2****t 1
bart 4****s 1
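The all-time averages and the Development Distribution Score listed above can be reproduced from the per-committer commit counts. The sketch below assumes that DDS is defined as one minus the top committer's share of all commits; this definition is not stated on this page and is an assumption, although it does reproduce the reported all-time value.

```python
# Commit counts per committer, copied from the "Top Committers" table.
commits = [987, 7, 5, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1]


def development_distribution_score(counts):
    """Assumed definition: 1 - (commits of top committer / total commits)."""
    total = sum(counts)
    return 1 - max(counts) / total if total else 0.0


total_commits = sum(commits)
avg_per_committer = total_commits / len(commits)
dds = development_distribution_score(commits)
print(total_commits, round(avg_per_committer, 1), round(dds, 3))
# prints: 1015 72.5 0.028
```

A score close to 0 indicates that nearly all commits come from a single committer, which matches the table.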

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 205
  • Total pull requests: 67
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 18 days
  • Total issue authors: 144
  • Total pull request authors: 17
  • Average comments per issue: 3.37
  • Average comments per pull request: 0.78
  • Merged pull requests: 49
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 48
  • Pull requests: 25
  • Average time to close issues: 28 days
  • Average time to close pull requests: 2 days
  • Issue authors: 38
  • Pull request authors: 4
  • Average comments per issue: 2.4
  • Average comments per pull request: 0.32
  • Merged pull requests: 17
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • oschwengers (23)
  • davidtong28 (4)
  • alexweisberg (4)
  • Dx-wmc (4)
  • Jigyasa3 (3)
  • azat-badretdin (3)
  • EricDeveaud (3)
  • splaisan (3)
  • EdderDaniel (2)
  • geboro (2)
  • AhmedElsherbini (2)
  • ohickl (2)
  • patriciatran (2)
  • Rridley7 (2)
  • jotech (2)
Pull Request Authors
  • oschwengers (44)
  • jhahnfeld (10)
  • mark-lubberts (6)
  • EricDeveaud (4)
  • bgruening (2)
  • ohickl (2)
  • mjfos2r (2)
  • bartns (1)
  • St4NNi (1)
  • standage (1)
  • marade (1)
  • lukasjelonek (1)
  • rujinlong (1)
  • TrellixVulnTeam (1)
  • alexweisberg (1)
Top Labels
Issue Labels
  • bug (78)
  • enhancement (48)
  • help wanted (40)
  • question (14)
  • feature (11)
  • info (2)
  • upstream (2)
  • maybe (2)
  • web (1)
Pull Request Labels
  • enhancement (35)
  • bug (17)
  • feature (11)
  • documentation (2)

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 461 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 94
  • Total maintainers: 1
proxy.golang.org: github.com/oschwengers/bakta
  • Versions: 42
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 9.0%
Average: 9.6%
Dependent repos count: 10.2%
Last synced: 6 months ago
pypi.org: bakta

Bakta: rapid & standardized annotation of bacterial genomes, MAGs & plasmids

  • Versions: 50
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 461 Last month
Rankings
Stargazers count: 3.4%
Forks count: 6.4%
Dependent packages count: 10.1%
Average: 10.9%
Downloads: 12.7%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 6 months ago
spack.io: py-bakta

Bakta: rapid & standardized annotation of bacterial genomes, MAGs & plasmids

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Stargazers count: 14.0%
Forks count: 20.8%
Average: 23.0%
Dependent packages count: 57.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • alive-progress *
  • biopython *
  • requests *
  • xopen *
.github/workflows/cd-docker-hub.yml actions
  • actions/checkout v2 composite
  • docker/build-push-action v1 composite
.github/workflows/cd-pypi.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish master composite
Dockerfile docker
  • alpine 3.12 build
.github/workflows/ci-lint.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/ci-package.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/ci-test.yml actions
  • actions/checkout v2 composite
  • mamba-org/setup-micromamba v1 composite
environment.yml conda
  • alive-progress >=3.0.1
  • aragorn >=1.2.41
  • biopython >=1.78
  • blast >=2.12.0
  • circos >=0.69.8
  • diamond >=2.0.14
  • infernal >=1.1.4
  • ncbi-amrfinderplus >=3.11.2
  • piler-cr
  • pyhmmer >=0.10.0
  • pyrodigal >=3.1.0
  • pyyaml >=6.0
  • requests >=2.25.1
  • trnascan-se >=2.0.11
  • xopen >=1.5.0