cogclassifier

A tool for classifying prokaryote protein sequences into COG (Cluster of Orthologous Genes) functional categories

https://github.com/moshi4/cogclassifier

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary

Keywords

bioinformatics cog comparative-genomics functional-analysis functional-annotation genome-analysis genomics microbial-genomics protein python visualization
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: moshi4
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 33.6 MB
Statistics
  • Stars: 71
  • Watchers: 2
  • Forks: 6
  • Open Issues: 0
  • Releases: 8
Created almost 4 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

COGclassifier

Overview

COG (Cluster of Orthologous Genes) is a database that plays an important role in the annotation, classification, and analysis of microbial gene function. Annotating and classifying each gene in a newly sequenced bacterial genome against the COG database is a common task. However, there was no easy-to-use command-line software for COG functional classification that could also produce publication-ready figures, so I developed COGclassifier to fill this need. COGclassifier automates the whole process: searching query sequences against the COG database, annotating and classifying gene functions, and generating publication-ready figures (see figures below).

ecoli_barchart_fig
Fig.1: Barchart of COG functional category classification results for E. coli

ecoli_piechart_fig
Fig.2: Piechart of COG functional category classification results for E. coli

Installation

Python 3.9 or later is required. RPS-BLAST (part of ncbi-blast+) must also be installed.

Install bioconda package:

conda install -c conda-forge -c bioconda cogclassifier

Install PyPI stable package:

pip install cogclassifier

Workflow

COGclassifier's automated workflow is described below. It is based in part on cdd2cog.

1. Setup COG & CDD resources

Download and load the 4 required COG & CDD files from the NCBI FTP site.

  • cog-24.fun.tab (https://ftp.ncbi.nih.gov/pub/COG/COG2024/data/cog-24.fun.tab)
    Descriptions of COG functional categories.
    This resource file is included in the package as cog_func_category.tsv.

    > Tab-delimited plain text file with descriptions of COG functional categories
    >
    > The categories form four functional groups:
    > 1. INFORMATION STORAGE AND PROCESSING
    > 2. CELLULAR PROCESSES AND SIGNALING
    > 3. METABOLISM
    > 4. POORLY CHARACTERIZED
    >
    > Columns:
    > 1. Functional category ID (one letter)
    > 2. Functional group (1-4, as above)
    > 3. Hexadecimal RGB color associated with the functional category
    > 4. Functional category description
    >
    > Each line corresponds to one functional category. The order of the categories is meaningful (reflects a hierarchy of functions; determines the order of display)

  • cog-24.def.tab (https://ftp.ncbi.nih.gov/pub/COG/COG2024/data/cog-24.def.tab)
    COG descriptions such as 'COG ID', 'COG functional category', 'COG name', etc...
    This resource file is included in the package as cog_definition.tsv.

    > Tab-delimited plain text file with COG descriptions
    >
    > Columns:
    > 1. COG ID
    > 2. COG functional category (could include multiple letters in the order of importance)
    > 3. COG name
    > 4. Gene name associated with the COG (optional)
    > 5. Functional pathway associated with the COG (optional)
    > 6. PubMed ID, associated with the COG (multiple entries are semicolon-separated; optional)
    > 7. PDB ID of the structure associated with the COG (multiple entries are semicolon-separated; optional)
    >
    > Each line corresponds to one COG. The order of the COGs is arbitrary (displayed in the lexicographic order)

  • cddid.tbl.gz (https://ftp.ncbi.nih.gov/pub/mmdb/cdd/)
    Summary information about the CD(Conserved Domain) model.

    > "cddid.tbl.gz" contains summary information about the CD models in this
    > distribution, which are part of the default "cdd" search database and are
    > indexed in NCBI's Entrez database. This is a tab-delimited text file, with a
    > single row per CD model and the following columns:
    >
    > PSSM-Id (unique numerical identifier)
    > CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'PRK' or 'CHL')
    > CD "short name"
    > CD description
    > PSSM-Length (number of columns, the size of the search model)

  • Cog_LE.tar.gz (https://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/)
    COG database, a part of CDD(Conserved Domain Database), for RPS-BLAST search.
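
As a rough illustration of the column layout described for cog-24.fun.tab, the following sketch parses a small inline sample into a dictionary keyed by category letter. The function name `parse_fun_tab` and the sample rows are assumptions made for this example (only the J row mirrors values shown elsewhere in this README), and this is not COGclassifier's own loading code.

```python
import csv
import io

# The four functional groups, numbered as in the column description above.
FUNCTIONAL_GROUPS = {
    1: "INFORMATION STORAGE AND PROCESSING",
    2: "CELLULAR PROCESSES AND SIGNALING",
    3: "METABOLISM",
    4: "POORLY CHARACTERIZED",
}

# Illustrative sample in the 4-column layout (letter, group, color, description).
# The color in the E row is a placeholder, not a value taken from the real file.
SAMPLE_FUN_TAB = (
    "J\t1\tFCCCFC\tTranslation, ribosomal structure and biogenesis\n"
    "E\t3\tCCCCCC\tAmino acid transport and metabolism\n"
)


def parse_fun_tab(text):
    """Parse fun.tab-style text into {letter: {group, color, description}}."""
    categories = {}
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        letter, group, color, description = row[:4]
        categories[letter] = {
            "group": FUNCTIONAL_GROUPS[int(group)],
            "color": "#" + color,
            "description": description,
        }
    return categories


categories = parse_fun_tab(SAMPLE_FUN_TAB)
print(categories["J"]["group"], categories["J"]["color"])
```

Because Python dictionaries preserve insertion order, the meaningful category order of the file is kept in the parsed result.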

2. RPS-BLAST search against COG database

Run an RPS-BLAST search of the query sequences against the COG database [Default: E-value = 1e-2]. The best hit (lowest E-value) for each query is extracted and used in the next functional classification step.
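
The best-hit extraction can be sketched as below, assuming the standard 12-column outfmt 6 layout in which column 1 is the query ID and column 11 is the E-value. The helper name `best_hits`, the `CDD:` prefix on subject IDs, and most sample values are assumptions for illustration; this is not necessarily how COGclassifier implements the step.

```python
import csv
import io

# Illustrative outfmt 6 rows. Only the first row's query ID, CDD ID, identity
# and E-value mirror example values from this README; the rest is made up.
SAMPLE_RPSBLAST = (
    "NP_414544.1\tCDD:223161\t45.806\t310\t160\t4\t1\t308\t1\t304\t2.5e-150\t420\n"
    "NP_414544.1\tCDD:999999\t30.000\t100\t70\t0\t5\t104\t2\t101\t1e-5\t45.0\n"
    "NP_000000.1\tCDD:888888\t25.000\t80\t60\t0\t1\t80\t1\t80\t0.5\t20.0\n"
)


def best_hits(outfmt6_text, max_evalue=1e-2):
    """Return {query_id: row} keeping the lowest E-value hit per query."""
    best = {}
    reader = csv.reader(io.StringIO(outfmt6_text), delimiter="\t")
    for row in reader:
        if len(row) < 12:
            continue  # not a complete outfmt 6 record
        query_id, evalue = row[0], float(row[10])
        if evalue > max_evalue:
            continue  # hit does not pass the E-value threshold
        if query_id not in best or evalue < float(best[query_id][10]):
            best[query_id] = row
    return best


hits = best_hits(SAMPLE_RPSBLAST)
for query_id, row in hits.items():
    print(query_id, row[1], row[10])
```

Ties are resolved in favor of the first row encountered, which matches the usual ordering of BLAST tabular output where stronger hits appear first.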

3. Classify query sequences into COG functional category

From the best-hit results, the relationship between each query sequence and its COG functional category is extracted as described below.

  1. Best-hit results -> CDD ID
  2. CDD ID -> COG ID (From cddid.tbl.gz)
  3. COG ID -> COG Functional Category Letter (From cog-24.def.tab)
  4. COG Functional Category Letter -> COG Functional Category Definition (From cog-24.fun.tab)

:warning: If a COG has a functional category with multiple letters, the first letter is treated as its functional category (e.g. COG4862 has the letters KTN, so K is treated as its functional category).

Using the above information, the number of query sequences classified into each COG functional category is calculated and functional annotation and classification results are output.
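
The lookup chain and the first-letter rule above can be sketched with plain dictionaries. All names here (`classify`, the lookup tables, `query_2`, and the CDD ID `000000` for COG4862) are hypothetical stand-ins for data that would be loaded from cddid.tbl.gz and cog-24.def.tab.

```python
from collections import Counter

# Hypothetical lookup tables; in practice these come from the resource files.
cdd_to_cog = {"223161": "COG0083", "000000": "COG4862"}  # "000000" is a placeholder CDD ID
cog_to_letters = {"COG0083": "E", "COG4862": "KTN"}

# Best-hit CDD ID per query sequence ("query_2" is a made-up query ID).
best_hit_cdd_ids = {"NP_414544.1": "223161", "query_2": "000000", "query_3": "111111"}


def classify(best_hit_cdd_ids, cdd_to_cog, cog_to_letters):
    """Map each query to a single COG functional category letter."""
    classified = {}
    for query_id, cdd_id in best_hit_cdd_ids.items():
        cog_id = cdd_to_cog.get(cdd_id)           # CDD ID -> COG ID
        letters = cog_to_letters.get(cog_id, "")  # COG ID -> category letters
        if letters:
            classified[query_id] = letters[0]     # only the first letter is used
    return classified


classified = classify(best_hit_cdd_ids, cdd_to_cog, cog_to_letters)
counts = Counter(classified.values())
print(classified)
print(counts)
```

Queries whose best hit cannot be mapped to a COG (such as `query_3` here) are simply left out of the counts in this sketch.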

Usage

Basic Command

COGclassifier -i [protein fasta file] -o [output directory]

Options

$ COGclassifier --help

Usage: COGclassifier [OPTIONS]                                                                                       

A tool for classifying prokaryote protein sequences into COG functional category                                     

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --infile        -i        Input query protein fasta file [required]                                             │
│ *  --outdir        -o        Output directory [required]                                                           │
│    --download_dir  -d        Download COG & CDD resources directory [default: /home/user/.cache/cogclassifier_v2]  │
│    --thread_num    -t        RPS-BLAST num_thread parameter [default: MaxThread - 1]                               │
│    --evalue        -e        RPS-BLAST e-value parameter [default: 0.01]                                           │
│    --quiet         -q        No print log on screen                                                                │
│    --version       -v        Print version information                                                             │
│    --help          -h        Show this message and exit.                                                           │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Example Command

Click here to download example protein fasta files.

COGclassifier -i ./example/ecoli.faa -o ./ecoli_cogclassifier
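
For batch runs, the same command can be assembled from Python. The sketch below uses only the `-i`, `-o`, `-t` and `-e` options listed in the help output above; the helper names `build_command` and `run_cogclassifier` are made up for this example and are not part of the COGclassifier package.

```python
import shutil
import subprocess


def build_command(infile, outdir, thread_num=None, evalue=None):
    """Build the COGclassifier argument list from the documented CLI options."""
    cmd = ["COGclassifier", "-i", str(infile), "-o", str(outdir)]
    if thread_num is not None:
        cmd += ["-t", str(thread_num)]
    if evalue is not None:
        cmd += ["-e", str(evalue)]
    return cmd


def run_cogclassifier(infile, outdir, **options):
    """Run COGclassifier, failing early if the executable is not on PATH."""
    if shutil.which("COGclassifier") is None:
        raise RuntimeError("COGclassifier executable not found on PATH")
    return subprocess.run(build_command(infile, outdir, **options), check=True)


# Only build and print the command here; call run_cogclassifier(...) to execute it.
command = build_command("./example/ecoli.faa", "./ecoli_cogclassifier", thread_num=4, evalue=1e-5)
print(" ".join(command))
```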

Output Contents

  • rpsblast.tsv (example)
    RPS-BLAST against COG database result (format = outfmt 6).

  • cog_classify.tsv (example)
    Query sequences classified into COG functional category result.
    This file contains all classified query sequences and associated COG information.

    Table of detailed tsv format information (9 columns)

    | Columns         | Contents                               | Example Value                       |
    | --------------- | -------------------------------------- | ----------------------------------- |
    | QUERY_ID        | Query ID                               | NP_414544.1                         |
    | COG_ID          | COG ID of RPS-BLAST top hit result     | COG0083                             |
    | CDD_ID          | CDD ID of RPS-BLAST top hit result     | 223161                              |
    | EVALUE          | RPS-BLAST top hit evalue               | 2.5e-150                            |
    | IDENTITY        | RPS-BLAST top hit identity             | 45.806                              |
    | GENE_NAME       | Abbreviated gene name                  | ThrB                                |
    | COG_NAME        | COG gene name                          | Homoserine kinase                   |
    | COG_LETTER      | Letter of COG functional category      | E                                   |
    | COG_DESCRIPTION | Description of COG functional category | Amino acid transport and metabolism |

  • cog_count.tsv (example)
    Count classified sequences per COG functional category result.

    Table of detailed tsv format information (5 columns)

    | Columns     | Contents                                | Example Value                                   |
    | ----------- | --------------------------------------- | ----------------------------------------------- |
    | LETTER      | Letter of COG functional category       | J                                               |
    | COUNT       | Count of COG classified sequence        | 259                                             |
    | GROUP       | COG functional group                    | INFORMATION STORAGE AND PROCESSING              |
    | COLOR       | Symbol color of COG functional category | #FCCCFC                                         |
    | DESCRIPTION | Description of COG functional category  | Translation, ribosomal structure and biogenesis |

  • cogclassifier.log (example)
    COGclassifier log file.

  • cog_count_barchart.[png|html]
    Barchart of COG functional category classification result.
    COGclassifier uses Altair visualization library for plotting charts.

cog_count_barchart

  • cog_count_piechart.[png|html]
    Piechart of COG functional category classification result.
    Functional categories that account for less than 1% are not labeled with their letter on the piechart.

cog_count_piechart
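
To inspect cog_count.tsv outside the bundled charts, a short sketch like the following computes each category's percentage and lists the categories that fall under the 1% threshold mentioned above. It assumes the file starts with a header row using the five column names from the table above; the inline sample (apart from the J count) and the function name `category_percentages` are illustrative assumptions.

```python
import csv
import io

# Illustrative cog_count.tsv content. Only the J count (259) comes from this
# README; the other counts and the placeholder colors are made up.
SAMPLE_COG_COUNT = (
    "LETTER\tCOUNT\tGROUP\tCOLOR\tDESCRIPTION\n"
    "J\t259\tINFORMATION STORAGE AND PROCESSING\t#FCCCFC\t"
    "Translation, ribosomal structure and biogenesis\n"
    "E\t736\tMETABOLISM\t#CCCCCC\tAmino acid transport and metabolism\n"
    "Z\t5\tCELLULAR PROCESSES AND SIGNALING\t#CCCCCC\tCytoskeleton\n"
)


def category_percentages(tsv_text):
    """Return {letter: percentage of all classified sequences}."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    total = sum(int(row["COUNT"]) for row in rows)
    if total == 0:
        return {}
    return {row["LETTER"]: 100 * int(row["COUNT"]) / total for row in rows}


percentages = category_percentages(SAMPLE_COG_COUNT)
below_one_percent = [letter for letter, pct in percentages.items() if pct < 1]
print(percentages)
print("Below 1%:", below_one_percent)
```

To read a real result file, pass the file's text instead of the inline sample, for example `category_percentages(open("cog_count.tsv").read())`.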

Customize Charts

COGclassifier also provides a barchart & piechart plotting API/CLI for customizing chart appearance. See the notebooks and commands below for details.

plot_cog_count_barchart

$ plot_cog_count_barchart --help

Usage: plot_cog_count_barchart [OPTIONS]                                                      

Plot COGclassifier count barchart figure                                                      

╭─ Options ───────────────────────────────────────────────────────────────────────────────────╮
│ *  --infile         -i        Input COG count result file ('cog_count.tsv') [required]      │
│ *  --outfile        -o        Output barchart figure file (*.png|*.svg|*.html) [required]   │
│    --width                    Figure pixel width [default: 440]                             │
│    --height                   Figure pixel height [default: 340]                            │
│    --bar_width                Figure bar width [default: 15]                                │
│    --y_limit                  Y-axis max limit value                                        │
│    --percent_style            Plot percent style instead of number count                    │
│    --sort                     Enable descending sort by number count                        │
│    --dpi                      Figure DPI [default: 100]                                     │
│    --help           -h        Show this message and exit.                                   │
╰─────────────────────────────────────────────────────────────────────────────────────────────╯

plot_cog_count_piechart

$ plot_cog_count_piechart --help

Usage: plot_cog_count_piechart [OPTIONS]                                                      

Plot COGclassifier count piechart figure                                                      

╭─ Options ───────────────────────────────────────────────────────────────────────────────────╮
│ *  --infile       -i        Input COG count result file ('cog_count.tsv') [required]        │
│ *  --outfile      -o        Output piechart figure file (*.png|*.svg|*.html) [required]     │
│    --width                  Figure pixel width [default: 380]                               │
│    --height                 Figure pixel height [default: 380]                              │
│    --show_letter            Show functional category lettter on piechart                    │
│    --sort                   Enable descending sort by number count                          │
│    --dpi                    Figure DPI [default: 100]                                       │
│    --help         -h        Show this message and exit.                                     │
╰─────────────────────────────────────────────────────────────────────────────────────────────╯

Owner

  • Name: moshi
  • Login: moshi4
  • Kind: user

Web Developer / Bioinformatics / GIS

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it as below.
authors:
  - family-names: Shimoyama
    given-names: Yuki
title: "COGclassifier: A tool for classifying prokaryote protein sequences into COG functional category"
date-released: 2022-03-20
url: https://github.com/moshi4/COGclassifier

GitHub Events

Total
  • Release event: 1
  • Watch event: 15
  • Delete event: 1
  • Push event: 9
  • Pull request event: 2
  • Create event: 2
Last Year
  • Release event: 1
  • Watch event: 15
  • Delete event: 1
  • Push event: 9
  • Pull request event: 2
  • Create event: 2

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 108
  • Total Committers: 1
  • Avg Commits per committer: 108.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 16
  • Committers: 1
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
moshi s****1@g****m 108

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 1
  • Total pull requests: 1
  • Average time to close issues: 4 days
  • Average time to close pull requests: less than a minute
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sowptika (1)
Pull Request Authors
  • moshi4 (2)
Top Labels
Issue Labels
question (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 47 last-month
  • Total docker downloads: 10
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 8
  • Total maintainers: 1
pypi.org: cogclassifier

A tool for classifying prokaryote protein sequences into COG functional category

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 47 Last month
  • Docker Downloads: 10
Rankings
Docker downloads count: 4.1%
Dependent packages count: 10.1%
Stargazers count: 11.9%
Forks count: 14.2%
Average: 14.7%
Dependent repos count: 21.6%
Downloads: 26.1%
Maintainers (1)
Last synced: 4 months ago

Dependencies

poetry.lock pypi
  • atomicwrites 1.4.0 develop
  • black 22.1.0 develop
  • click 8.0.4 develop
  • colorama 0.4.4 develop
  • coverage 6.3.2 develop
  • flake8 4.0.1 develop
  • iniconfig 1.1.1 develop
  • mccabe 0.6.1 develop
  • mypy-extensions 0.4.3 develop
  • packaging 21.3 develop
  • pathspec 0.9.0 develop
  • platformdirs 2.5.1 develop
  • pluggy 1.0.0 develop
  • py 1.11.0 develop
  • pycodestyle 2.8.0 develop
  • pydocstyle 6.1.1 develop
  • pyflakes 2.4.0 develop
  • pyparsing 3.0.7 develop
  • pytest 7.1.1 develop
  • pytest-cov 3.0.0 develop
  • snowballstemmer 2.2.0 develop
  • tomli 2.0.1 develop
  • typing-extensions 4.1.1 develop
  • altair 4.2.0
  • attrs 21.4.0
  • certifi 2021.10.8
  • charset-normalizer 2.0.12
  • entrypoints 0.4
  • idna 3.3
  • importlib-resources 5.4.0
  • jinja2 3.0.3
  • jsonschema 4.4.0
  • markupsafe 2.1.1
  • numpy 1.22.3
  • pandas 1.4.1
  • pyrsistent 0.18.1
  • python-dateutil 2.8.2
  • pytz 2022.1
  • requests 2.27.1
  • six 1.16.0
  • toolz 0.11.2
  • urllib3 1.26.9
  • zipp 3.7.0
pyproject.toml pypi
  • black ^22.1.0 develop
  • flake8 ^4.0.1 develop
  • pydocstyle ^6.1.1 develop
  • pytest ^7.1.1 develop
  • pytest-cov ^3.0.0 develop
  • altair ^4.2.0
  • pandas ^1.4.1
  • python ^3.8
  • requests ^2.27.1
.github/workflows/ci.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/publish_to_pypi.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v4 composite