recognizer

A tool for domain-based annotation with databases from the Conserved Domains Database

https://github.com/iquasere/recognizer

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, sciencedirect.com
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords

annotation cog-assignment fasta functional-annotation genomics protein quantification

Repository

Basic Info
  • Host: GitHub
  • Owner: iquasere
  • License: bsd-3-clause
  • Language: HTML
  • Default Branch: master
  • Homepage:
  • Size: 61.4 MB
Statistics
  • Stars: 30
  • Watchers: 1
  • Forks: 2
  • Open Issues: 1
  • Releases: 51
Created about 6 years ago · Last pushed over 1 year ago

README.md

reCOGnizer

A tool for domain-based annotation with databases from the Conserved Domains Database.

Features

reCOGnizer performs domain-based annotation with RPS-BLAST and databases from CDD as reference.

* Reference databases currently implemented: CDD, NCBIfam, Pfam, TIGRFAM, Protein Clusters, SMART, COG and KOG.
* reCOGnizer performs multithread annotation with RPS-BLAST, significantly increasing the speed of annotation.
* After domain assignment to proteins, reCOGnizer converts CDD IDs to the IDs of the respective DBs, and obtains domain descriptions available at CDD.
* Further information is retrieved depending on the database in question:
    * NCBIfam, Pfam, TIGRFAM and Protein Clusters annotations are complemented with taxonomic classifications and EC numbers.
    * SMART annotations are complemented with SMART descriptions.
    * COG and KOG annotations are complemented with COG/KOG categories, EC numbers and KEGG Orthologs.

A detailed representation of reCOGnizer's workflow is presented in Fig. 1.
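The multithreaded annotation described above depends on the input being divisible into independent pieces, and the parameter list mentions "split FASTA inputs" among the intermediate files. The following is a minimal sketch of that idea only, not reCOGnizer's actual implementation: the function names `read_fasta` and `split_records` are hypothetical, and round-robin distribution is an assumed strategy.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, n_parts):
    """Distribute records round-robin into at most n_parts lists.

    Hypothetical helper: each resulting list could be written to its own
    FASTA file and annotated by a separate worker.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return [part for part in parts if part]  # drop empty chunks
```

Round-robin assignment keeps the chunks close in size without needing to know the number of sequences in advance; a real implementation might instead balance chunks by total sequence length.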

Installing reCOGnizer

To install reCOGnizer, simply run: `conda install -c conda-forge -c bioconda recognizer`

Annotation with reCOGnizer

The simplest way to run reCOGnizer is to specify the FASTA file and an output directory, although the output directory is optional: `recognizer -f input_file.faa -o output_directory`
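When reCOGnizer is called from a script, it can help to assemble the command as an argument list first. The sketch below uses only flags that appear in the parameter list of this README (`-f`, `-o`, `-t`, `--evalue`, `-dbs`); the helper `build_recognizer_command` is a hypothetical name, and the exact spelling of database names accepted by `-dbs` is an assumption.

```python
def build_recognizer_command(fasta, output=None, threads=None,
                             evalue=None, databases=None):
    """Return the recognizer invocation as a list of arguments.

    Hypothetical helper: options left as None are omitted, so reCOGnizer
    falls back to its own defaults for them.
    """
    cmd = ["recognizer", "-f", str(fasta)]
    if output is not None:
        cmd += ["-o", str(output)]
    if threads is not None:
        cmd += ["-t", str(int(threads))]
    if evalue is not None:
        cmd += ["--evalue", str(evalue)]
    if databases:
        # the parameter list describes this value as comma-separated
        cmd += ["-dbs", ",".join(databases)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with file names that contain spaces.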

Output

reCOGnizer takes a FASTA file of amino acid sequences (commonly .fasta or .faa) as input and writes two main outputs to the output directory:

* `reCOGnizer_results.tsv` and `reCOGnizer_results.xlsx`, tables with the annotations from every database for each protein
* `cog_quantification.tsv` and the respective Krona representation (Fig. 2), which describes the functional landscape of the proteins in the input file

Fig. 2. Krona plot with the quantification of COGs identified in the simulated dataset used to test MOSCA and reCOGnizer. Click on the plot to see the interactive version that is output by reCOGnizer.
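The tab-separated results table can be summarised with a few lines of standard-library Python. This README does not list the column names of `reCOGnizer_results.tsv`, so the sketch below takes the column as a parameter instead of assuming one; `count_by_column` is a hypothetical helper, not part of reCOGnizer.

```python
import csv
from collections import Counter


def count_by_column(handle, column):
    """Count rows per value of `column` in a tab-separated table.

    Rows where the column is empty or missing are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in table header")
    return Counter(row[column] for row in reader if row[column])
```

To use it, open the results file with `open(path, newline="")`, pass the handle together with a column name taken from the file's header, and call `.most_common(10)` on the result to list the most frequent values.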

Using previously gathered taxonomic information

reCOGnizer can make use of taxonomic information by filtering Markov Models for the specific taxa of interest. This is done by providing a file with the taxonomic information of the proteins. To try it with the test data shipped in the repository, run the following commands after installing reCOGnizer:

`git clone https://github.com/iquasere/reCOGnizer.git`

`cd reCOGnizer/ci`

`recognizer -f proteomes.fasta --tax-file UPIMAPI_results.tsv --tax-col 'Taxonomic lineage IDs (SPECIES)' --protein-id-col qseqid --species-taxids`

Running reCOGnizer this way will usually yield better results, but will likely take much longer to finish.

reCOGnizer parameters

```
options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Fasta file with protein sequences for annotation
  -t THREADS, --threads THREADS
                        Number of threads for reCOGnizer to use [max available]
  --evalue EVALUE       Maximum e-value to report annotations for [1e-3]
  -o OUTPUT, --output OUTPUT
                        Output directory [reCOGnizerresults]
  -dr DOWNLOADRESOURCES, --download-resources DOWNLOADRESOURCES
                        This parameter is deprecated. Please do not use it [None]
  -rd RESOURCESDIRECTORY, --resources-directory RESOURCESDIRECTORY
                        Output directory for storing databases and other
                        resources [~/recognizerresources]
  -dbs DATABASES, --databases DATABASES
                        Databases to include in functional annotation
                        (comma-separated) [all available]
  --custom-databases    If databases inputted were NOT produced by reCOGnizer
                        [False]. Default databases of reCOGnizer (e.g., COG,
                        TIGRFAM, ...) can't be used simultaneously with custom
                        databases. Use together with the '--databases' parameter.
  -mts MAXTARGETSEQS, --max-target-seqs MAXTARGETSEQS
                        Number of maximum identifications for each protein [1]
  --keep-spaces         BLAST ignores sequences IDs after the first space. This
                        option changes all spaces to underscores to keep the
                        full IDs.
  --no-output-sequences
                        Protein sequences from the FASTA input will be stored
                        in their own column.
  --no-blast-info       Information from the alignment will be stored in their
                        own columns.
  --output-rpsbproc-cols
                        Output columns obtained with RPSBPROC -
                        'Superfamilies', 'Sites' and 'Motifs'.
  -sd SKIPDOWNLOADED, --skip-downloaded SKIPDOWNLOADED
                        This parameter is deprecated. Please do not use it [None]
  --keep-intermediates  Keep intermediate annotation files generated in
                        reCOGnizer's workflow, i.e., ASN, RPSBPROC and BLAST
                        reports and split FASTA inputs.
  --quiet               Don't output download information, used mainly for CI.
  --debug               Print all commands running in the background, including
                        those of rpsblast and rpsbproc.
  --test-run            This parameter is only appropriate for reCOGnizer's
                        tests on GitHub. Should not be used.
  -v, --version         show program's version number and exit

Taxonomy Arguments:
  --tax-file TAXFILE    File with taxonomic identification of proteins inputted
                        (TSV). Must have one line per query, query name on
                        first column, taxid on second.
  --protein-id-col PROTEINIDCOL
                        Name of column with protein headers as in supplied
                        FASTA file [qseqid]
  --tax-col TAXCOL      Name of column with tax IDs of proteins [Taxonomic
                        identifier (SPECIES)]
  --species-taxids      If tax col contains Tax IDs of species (required for
                        running COG taxonomic)
```
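Before a taxonomy-aware run, it can be useful to check that every protein in the FASTA file has a tax ID in the file passed to `--tax-file`. The sketch below is an illustration under assumptions, not part of reCOGnizer: `missing_taxonomy` is a hypothetical helper, the file is assumed to be a TSV with a header row, and the default column names are simply the defaults shown in the help text above.

```python
import csv


def missing_taxonomy(fasta_ids, tax_handle, protein_id_col="qseqid",
                     tax_col="Taxonomic identifier (SPECIES)"):
    """Return the IDs from fasta_ids that have no tax ID in the TSV.

    Assumes a tab-separated file with a header row containing both
    protein_id_col and tax_col.
    """
    reader = csv.DictReader(tax_handle, delimiter="\t")
    fields = reader.fieldnames or []
    for col in (protein_id_col, tax_col):
        if col not in fields:
            raise ValueError(f"column {col!r} not found in taxonomy file")
    # keep only proteins whose tax ID cell is present and non-blank
    with_tax = {
        row[protein_id_col]
        for row in reader
        if (row[tax_col] or "").strip()
    }
    return [pid for pid in fasta_ids if pid not in with_tax]
```

The input order of `fasta_ids` is preserved in the returned list, which makes it straightforward to report the proteins that would be annotated without taxonomic filtering.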

Referencing reCOGnizer

If you use reCOGnizer, please cite its publication.

Owner

  • Name: João Sequeira
  • Login: iquasere
  • Kind: user
  • Location: Portugal
  • Company: University of Minho

PhD student | Universidade do Minho

Uncovering the role of conductive nanomaterials in anaerobic digestion

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Sequeira"
    given-names: "João C."
    orcid: "https://orcid.org/0000-0002-2691-9950"
  - family-names: "Rocha"
    given-names: "Miguel"
    orcid: "https://orcid.org/0000-0001-8439-8172"
  - family-names: "Alves"
    given-names: "M. Madalena"
    orcid: "https://orcid.org/0000-0002-9078-3613"
  - family-names: "Salvador"
    given-names: "Andreia F."
    orcid: "https://orcid.org/0000-0001-6037-4248"
title: "reCOGnizer - A tool for domain based annotation with databases from the Conserved Domains Database"
version: 1.6.4
doi: "10.1016/J.CSBJ.2022.03.042"
date-released: 2022-01-18
url: "https://github.com/iquasere/reCOGnizer"
preferred-citation:
  type: article
  authors:
    - family-names: "Sequeira"
      given-names: "João C."
      orcid: "https://orcid.org/0000-0002-2691-9950"
    - family-names: "Rocha"
      given-names: "Miguel"
      orcid: "https://orcid.org/0000-0001-8439-8172"
    - family-names: "Alves"
      given-names: "M. Madalena"
      orcid: "https://orcid.org/0000-0002-9078-3613"
    - family-names: "Salvador"
      given-names: "Andreia F."
      orcid: "https://orcid.org/0000-0001-6037-4248"
  doi: "10.1016/J.CSBJ.2022.03.042"
  journal: "Computational and Structural Biotechnology Journal"
  start: 1798
  end: 1810
  title: "UPIMAPI, reCOGnizer and KEGGCharter: Bioinformatics tools for functional annotation and visualization of (meta)-omics datasets"
  volume: 20
  year: 2022

GitHub Events

Total
  • Issues event: 3
  • Issue comment event: 2
Last Year
  • Issues event: 3
  • Issue comment event: 2

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 160
  • Total Committers: 1
  • Avg Commits per committer: 160.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 22
  • Committers: 1
  • Avg Commits per committer: 22.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
iquasere m****a@g****m 160

Dependencies

.github/workflows/main.yml actions
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/upload-artifact v2 composite
  • docker/build-push-action v2 composite
  • docker/setup-buildx-action v1 composite
envs/environment.yml conda
  • blast >=2.12
  • krona
  • lxml
  • openpyxl
  • pandas
  • python
  • pyyaml
  • requests
  • tqdm
  • wget
  • xlsxwriter
Dockerfile docker
  • continuumio/miniconda3 latest build