https://github.com/aalto-ics-kepaco/go-ltr-prediction

Protein function prediction through latent tensor reconstruction

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Keywords

gene-ontology latent-tensor-reconstruction multi-view-learning protein-function-prediction
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: aalto-ics-kepaco
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 14.6 MB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

Protein Function Prediction through Multi-view Multi-label Latent Tensor Reconstruction

Cite this paper

@article{armah2024protein,
  title={Protein function prediction through multi-view multi-label latent tensor reconstruction},
  author={Armah-Sekum, Robert Ebo and Szedmak, Sandor and Rousu, Juho},
  journal={BMC bioinformatics},
  volume={25},
  number={1},
  pages={174},
  year={2024},
  publisher={Springer}
}

In this project, we use the latent tensor reconstruction (LTR) approach to model the joint interactions between different protein features and predict protein functional terms (i.e., Gene Ontology terms).

Software

The code is developed with Python >= 3.8. The main algorithm, ./scripts/goltrmain.py, is based on the LTR software, which is available at GO-LTR. The following packages are required to run the file:

  • numpy
  • scipy
  • itertools

numpy and scipy can be downloaded free of charge from PyPI. itertools is part of the Python standard library and needs no separate installation.

Scripts

  • ./scripts/ltrsolvermultiview0164.py - base LTR solver on which the goltr_main.py algorithm runs
  • ./scripts/goltrmain.py - main file for running GO-LTR and generating predictions

Dataset

The UniProtKB IDs of the manually reviewed Swiss-Prot protein sequences used in the study are in the ./dataset directory. With these IDs, one can find the full specification of each protein in the UniProtKB database. The accession numbers obtained from the UniProtKB search can then be used to query other databases, such as AlphaFoldDB and Rhea-DB, for specific protein feature information. The full set of manually reviewed Swiss-Prot sequences can be downloaded at https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/
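As a rough illustration of the lookup workflow described above, the sketch below parses a list of accessions and builds per-protein lookup URLs. It assumes the ID files hold one accession per line, and the two URL templates are assumptions about the public UniProt and AlphaFold DB endpoints, not something taken from this repository; check each database's documentation before relying on them. The function names are hypothetical.

```python
from typing import Dict, Iterable, List

# Assumed URL templates for public endpoints (not taken from this repository).
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"
ALPHAFOLD_API_URL = "https://alphafold.ebi.ac.uk/api/prediction/{accession}"


def parse_accessions(lines: Iterable[str]) -> List[str]:
    """Return unique, non-empty accessions in first-seen order.

    Assumes one accession per line; the real file layout may differ.
    """
    seen = set()
    accessions = []
    for line in lines:
        acc = line.strip()
        if acc and acc not in seen:
            seen.add(acc)
            accessions.append(acc)
    return accessions


def lookup_urls(accession: str) -> Dict[str, str]:
    """Build lookup URLs for one accession from the assumed templates."""
    return {
        "uniprot_fasta": UNIPROT_FASTA_URL.format(accession=accession),
        "alphafold": ALPHAFOLD_API_URL.format(accession=accession),
    }
```

In practice the lines would come from an open file handle on one of the .txt files in ./dataset, for example `parse_accessions(open(path_to_ids))`, where `path_to_ids` is whichever ID file is of interest.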

Sequences were clustered with MMseqs2.

Structure of repository

  • dataset: Contains the UniProtKB IDs of all sequences used in our experiments. There are .txt files for each ontology branch: Molecular Function Ontology (MFO), Cellular Component Ontology (CCO) and Biological Process Ontology (BPO)
  • images: Contains the image files for the workflow of the GO-LTR model
  • scripts: Contains the main script for training the GO-LTR model and generating predictions

Feature representation and parameter tensor factorization

[Figure: feature representation and parameter tensor factorization]

We leveraged three protein features: sequence embeddings generated by the ProtT5 protein language model, InterPro fingerprints, and protein-protein interaction (PPI) data from StringDB.

GO-LTR multiview framework

[Figure: GO-LTR multiview framework]

As shown above, the functions associated with a particular protein form a consistent subgraph of the Gene Ontology (GO) graph. The functional terms also follow the true-path annotation rule: a protein annotated to a term deep in the ontology is automatically annotated to all ancestors of that term.
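The true-path rule mentioned above can be sketched as an upward closure over the ontology graph. The example below uses a made-up toy hierarchy (the term names are not real GO identifiers) and a hypothetical `propagate` helper; it is an illustration of the rule, not code from this repository.

```python
from typing import Dict, Iterable, Set

# Toy child -> parents map; term names are invented for illustration only.
TOY_PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},
    "D": {"C"},
}


def propagate(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given terms together with all of their ancestors."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited via another path in the DAG
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed
```

Annotating a protein with the deep term "D" in this toy graph therefore also annotates it with "C", "A", "B" and "root".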

Mathematical formulations underpinning LTR

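Given a multi-view (multimodal) data sample

$$ \mathcal{S} = \{ ((\mathbf{x}_i^{(1)},\dots, \mathbf{x}_i^{(n_d)}), \mathbf{y}_i) \mid i\in [m] \}, \qquad \mathbf{x}^{(d)}_i \in \mathbb{R}^{n_{x_d}},\ d\in [n_d] $$

Here $m$ is the number of examples, $n_d$ is the number of views, and $n_{x_d}$ is the dimension of view $d$.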
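To make the notation above concrete, the following sketch lays out a toy multi-view sample as one NumPy matrix per view and evaluates a generic low-rank multilinear score over the views. The sizes, the rank, the parameter names and the score function itself are illustrative assumptions in the spirit of latent tensor reconstruction; the actual GO-LTR model, including its multi-label output, is defined in the cited papers and in ./scripts.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6                  # number of examples (toy value)
view_dims = [4, 3, 5]  # one dimension per view, n_d = 3 views (toy values)
rank = 2               # number of rank-one terms (hypothetical)

# One matrix per view; row i of views[d] holds the d-th view of example i.
views = [rng.normal(size=(m, n)) for n in view_dims]

# Hypothetical parameters: a (rank, dim) factor matrix per view plus weights.
factors = [rng.normal(size=(rank, n)) for n in view_dims]
weights = rng.normal(size=rank)


def multilinear_score(views, factors, weights):
    """Compute sum_t weights[t] * prod_d <factors[d][t], x^(d)> per example."""
    prod = np.ones((views[0].shape[0], len(weights)))
    for X, P in zip(views, factors):
        prod *= X @ P.T    # (m, rank) inner products for this view
    return prod @ weights  # one scalar score per example, shape (m,)
```

The product across views is what lets such a model capture joint interactions between features from different views rather than treating each view independently.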

Evaluation

We used the CAFA-evaluator script to evaluate the performance of the models considered in the study.
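For readers unfamiliar with CAFA-style metrics, the sketch below computes a plain protein-centric F-max on toy matrices. It is not the CAFA-evaluator code and omits its options (for example information-content weighting); the function name, threshold grid and input layout are assumptions made for illustration.

```python
import numpy as np


def fmax(y_true: np.ndarray, y_score: np.ndarray, thresholds=None) -> float:
    """Protein-centric F-max over a grid of score thresholds.

    y_true:  (n_proteins, n_terms) binary matrix of annotations.
    y_score: (n_proteins, n_terms) predicted scores in [0, 1].
    """
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)
    truth = y_true > 0
    has_truth = truth.sum(axis=1) > 0  # recall uses proteins with annotations
    best = 0.0
    for tau in thresholds:
        pred = y_score >= tau
        tp = (pred & truth).sum(axis=1)
        n_pred = pred.sum(axis=1)
        covered = n_pred > 0           # precision uses proteins with predictions
        if not covered.any() or not has_truth.any():
            continue
        precision = (tp[covered] / n_pred[covered]).mean()
        recall = (tp[has_truth] / truth[has_truth].sum(axis=1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)
```

Precision is averaged only over proteins that receive at least one prediction at a given threshold, while recall is averaged over all annotated proteins, which is why the two masks differ.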

Papers on LTR method

@article{szedmak2020solution,
  title={A solution for large scale nonlinear regression with high rank and degree at constant memory complexity via latent tensor reconstruction},
  author={Szedmak, Sandor and Cichonska, Anna and Julkunen, Heli and Pahikkala, Tapio and Rousu, Juho},
  journal={arXiv preprint arXiv:2005.01538},
  year={2020}
}

@article{wang2021modeling,
  title={Modeling drug combination effects via latent tensor reconstruction},
  author={Wang, Tianduanyi and Szedmak, Sandor and Wang, Haishan and Aittokallio, Tero and Pahikkala, Tapio and Cichonska, Anna and Rousu, Juho},
  journal={Bioinformatics},
  volume={37},
  number={Supplement\_1},
  pages={i93--i101},
  year={2021},
  publisher={Oxford University Press}
}

Owner

  • Name: KEPACO
  • Login: aalto-ics-kepaco
  • Kind: organization
  • Location: Espoo, Finland

Kernel Machines, Pattern Analysis and Computational Metabolomics - Research group at Aalto University

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1

Dependencies

requirements.txt pypi
  • Biopython >=1.81
  • goatools >=1.3.1
  • itertools *
  • numpy >=1.20.0
  • obonet >=1.0.0
  • scikit-bio >=0.5.8
  • scipy >=1.9