https://github.com/aalto-ics-kepaco/go-ltr-prediction

Protein function prediction through latent tensor reconstruction

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to: arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary

Keywords

gene-ontology latent-tensor-reconstruction multi-view-learning protein-function-prediction

Last synced: 6 months ago · JSON representation

Repository

Protein function prediction through latent tensor reconstruction

Basic Info

Host: GitHub
Owner: aalto-ics-kepaco
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 14.6 MB

Statistics

Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

gene-ontology latent-tensor-reconstruction multi-view-learning protein-function-prediction

Created about 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License

`Protein Function Prediction through Multi-view Multi-label Latent Tensor Reconstruction`

Cite this paper

@article{armah2024protein,
  title={Protein function prediction through multi-view multi-label latent tensor reconstruction},
  author={Armah-Sekum, Robert Ebo and Szedmak, Sandor and Rousu, Juho},
  journal={BMC bioinformatics},
  volume={25},
  number={1},
  pages={174},
  year={2024},
  publisher={Springer}
}

In this project, we used the latent tensor reconstruction (LTR) approach to model the joint interactions between different protein features and predict protein functional terms (i.e., Gene Ontology terms).

Software

The code is developed using Python >= 3.8. The main algorithm, ./scripts/goltrmain.py, is based on the LTR software, which is available at GO-LTR. The following packages are required to run the file:

* numpy
* scipy
* itertools

numpy and scipy can be downloaded free of charge from PyPI. itertools is part of the Python standard library and needs no separate installation.
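A quick way to confirm the environment before running the scripts is sketched below. It is only an illustration: the helper name `missing_packages` is hypothetical and not part of this repository, and the check covers just the third-party packages named above (itertools ships with Python itself).

```python
import importlib.util
import sys

# Third-party packages named in this README. itertools is left out on purpose
# because it is part of the Python standard library.
REQUIRED = ("numpy", "scipy")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The README states python>=3.8 as the supported version.
python_ok = sys.version_info >= (3, 8)
print("Python version OK:", python_ok)
print("Missing packages:", missing_packages() or "none")
```

Any name reported as missing can be installed from PyPI with pip.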

Scripts

./scripts/ltrsolvermultiview0164.py - base LTR solver on which the goltr_main.py algorithm runs
./scripts/goltrmain.py - main file for running GO-LTR and generating predictions

Dataset

The UniProtKB IDs of the manually reviewed Swiss-Prot protein sequences used in the study are in the ./dataset directory. Using these IDs, one can find the full specification of each protein in the UniProtKB database. The accession numbers obtained from the UniProtKB search can then be used to query other databases, such as AlphaFoldDB and Rhea-DB, for specific protein feature information. The full set of manually reviewed Swiss-Prot sequences can be downloaded at https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/
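As a minimal sketch of how the ID files might be used, the snippet below reads accessions and builds per-entry download links. Two things are assumptions rather than facts from this repository: that each .txt file holds one accession per line, and that entries are fetched through UniProt's public REST endpoint. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: UniProt's public REST endpoint for single-entry FASTA downloads.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_ids(text):
    """Return unique accessions in first-seen order, skipping blank lines."""
    seen = {}
    for line in text.splitlines():
        accession = line.strip()
        if accession:
            seen.setdefault(accession, None)
    return list(seen)


def read_ids(path):
    """Read accessions from a text file (assumed layout: one accession per line)."""
    return parse_ids(Path(path).read_text())


def fasta_url(accession):
    """Build the download URL for one accession."""
    return UNIPROT_FASTA_URL.format(accession=accession)


# Example with an in-memory list instead of a real file from ./dataset.
for accession in parse_ids("P12345\n\nQ9Y6K9\nP12345\n"):
    print(accession, fasta_url(accession))
```

For bulk work, downloading the complete Swiss-Prot file from the FTP link above and filtering it locally is likely kinder to the server than one request per protein.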

Sequences were clustered with MMseqs2.

Structure of repository

dataset: Contains the UniProtKB IDs of all sequences used in our experiments. There are .txt files for each ontology branch: Molecular Function Ontology (MFO), Cellular Component Ontology (CCO) and Biological Process Ontology (BPO)
images: Contains the image files for the workflow of the GO-LTR model
scripts: Contains the main script for training the GO-LTR model and generating predictions

Feature representation and parameter tensor factorization

[Figure: feature representation and parameter tensor factorization]

We used three protein features: sequence embeddings generated by the ProtT5 protein language model, InterPro fingerprints, and protein-protein interaction (PPI) data from StringDB.

GO-LTR multiview framework

[Figure: GO-LTR multiview framework]

As the figure shows, the functions associated with a particular protein form a consistent subgraph of the Gene Ontology (GO) graph. The functional terms also follow the true-path annotation rule: a protein annotated to a deep-level term in the ontology is automatically annotated to all ancestors of that term.
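The true-path rule can be illustrated on a toy ontology. The sketch below is not code from this repository: the term names and the `PARENTS` map are made up, and in practice the parent relations would come from the real GO graph.

```python
# Toy ontology: each term maps to its direct parents. GO is a DAG, so a term
# may have several parents. All term names here are made up for illustration.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic activity": ["root"],
    "ion binding": ["binding"],
    "metal ion binding": ["ion binding"],
    "kinase activity": ["catalytic activity"],
}


def propagate(terms, parents=PARENTS):
    """Close a set of terms under the true-path rule by adding all ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, []))
    return closed


# A protein annotated only with a deep term also receives every ancestor.
print(sorted(propagate({"metal ion binding"})))
```

Applying the same closure to both the training labels and the predictions keeps every annotation set consistent with the ontology.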

Mathematical formulations underpinning LTR

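The formulation starts from a multi-view (multimodal) data sample of $m$ examples, each described by $n_d$ views:

$$ \mathcal{S} = \{((\mathbf{x}_i^{(1)},\dots,\mathbf{x}_i^{(n_d)}), \mathbf{y}_i) \mid i\in [m]\}, \qquad \mathbf{x}_i^{(d)} \in \mathbb{R}^{n_{x_d}},\ d\in [n_d] $$

Here $\mathbf{x}_i^{(d)}$ is the feature vector of example $i$ in view $d$, and $\mathbf{y}_i$ is the corresponding label vector.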
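To make the multi-view notation concrete, the sketch below scores one example with $n_d = 2$ views. It assumes a rank-decomposed multilinear form, in which each rank-one component multiplies one inner product per view and the components are summed with weights. That form, the scalar output, and all names (`ltr_score`, `factors`, `weights`) are assumptions for illustration, not the exact GO-LTR model or its solver.

```python
from math import prod


def dot(u, v):
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vector lengths differ")
    return sum(a * b for a, b in zip(u, v))


def ltr_score(views, factors, weights):
    """Sum over ranks t of weights[t] * product over views d of <factors[t][d], views[d]>."""
    return sum(
        w * prod(dot(p, x) for p, x in zip(rank_factors, views))
        for w, rank_factors in zip(weights, factors)
    )


# One example with n_d = 2 views of dimensions 3 and 2.
views = [[1.0, 0.0, 2.0], [0.5, 1.0]]
# Two rank-one components: factors[t][d] has the same length as views[d].
factors = [
    [[1.0, 1.0, 0.0], [2.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 1.0]],
]
weights = [1.0, 0.5]

score = ltr_score(views, factors, weights)
print(score)
```

Because the score only ever needs inner products with the factor vectors, the full parameter tensor over all views never has to be materialised. A practical implementation would use numpy arrays rather than lists.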

Evaluation

We used the CAFA-evaluator script to evaluate the performance of the models considered in the study.
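For orientation, the snippet below sketches a protein-centric F-max, a measure commonly reported in CAFA assessments. It is a simplified illustration and not the CAFA-evaluator's implementation: the function name, the input layout, and the threshold grid are all assumptions, and ontology propagation is left out.

```python
def fmax(truth, scores, thresholds=None):
    """Simplified protein-centric F-max.

    truth:  {protein: set of true terms}
    scores: {protein: {term: score in [0, 1]}}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    # Assumption: only proteins with at least one true term are evaluated.
    truth = {p: terms for p, terms in truth.items() if terms}
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            predicted = {g for g, s in scores.get(protein, {}).items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision is averaged only over proteins with a prediction
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))  # recall is averaged over all proteins
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


truth = {"P1": {"a", "b"}, "P2": {"c"}}
scores = {"P1": {"a": 0.9, "x": 0.8}, "P2": {"c": 0.7}}
print(fmax(truth, scores))
```

Reported numbers should come from the CAFA-evaluator itself so that results stay comparable with other studies.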

Papers on LTR method

@article{szedmak2020solution,
  title={A solution for large scale nonlinear regression with high rank and degree at constant memory complexity via latent tensor reconstruction},
  author={Szedmak, Sandor and Cichonska, Anna and Julkunen, Heli and Pahikkala, Tapio and Rousu, Juho},
  journal={arXiv preprint arXiv:2005.01538},
  year={2020}
}

@article{wang2021modeling,
  title={Modeling drug combination effects via latent tensor reconstruction},
  author={Wang, Tianduanyi and Szedmak, Sandor and Wang, Haishan and Aittokallio, Tero and Pahikkala, Tapio and Cichonska, Anna and Rousu, Juho},
  journal={Bioinformatics},
  volume={37},
  number={Supplement\_1},
  pages={i93--i101},
  year={2021},
  publisher={Oxford University Press}
}

Owner

Name: KEPACO
Login: aalto-ics-kepaco
Kind: organization
Location: Espoo, Finland

Website: http://research.ics.aalto.fi/kepaco/
Repositories: 29
Profile: https://github.com/aalto-ics-kepaco

Kernel Machines, Pattern Analysis and Computational Metabolomics - Research group at Aalto University

GitHub Events

Total

Fork event: 1

Last Year

Fork event: 1

Dependencies

requirements.txt pypi

Biopython >=1.81
goatools >=1.3.1
itertools *
numpy >=1.20.0
obonet >=1.0.0
scikit-bio >=0.5.8
scipy >=1.9
