https://github.com/aalto-ics-kepaco/go-ltr-prediction
Protein function prediction through latent tensor reconstruction
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
README.md
Protein Function Prediction through Multi-view Multi-label Latent Tensor Reconstruction
Cite this paper
@article{armah2024protein,
title={Protein function prediction through multi-view multi-label latent tensor reconstruction},
author={Armah-Sekum, Robert Ebo and Szedmak, Sandor and Rousu, Juho},
journal={BMC Bioinformatics},
volume={25},
number={1},
pages={174},
year={2024},
publisher={Springer}
}
In this project, we use the latent tensor reconstruction (LTR) approach to model the joint interactions between different protein features in order to predict protein functional terms (i.e., Gene Ontology terms).
Software
The code is developed in Python (>=3.8).
The main algorithm, ./scripts/goltrmain.py, is based on the LTR software, which is available at
GO-LTR.
The following packages are required to run it. numpy and scipy can be installed free of charge from PyPI; itertools is part of the Python standard library and needs no separate installation:
* numpy
* scipy
* itertools
Scripts
- ./scripts/ltrsolvermultiview0164.py - base LTR solver on which the goltr_main.py algorithm runs
- ./scripts/goltrmain.py - main file for running GO-LTR and generating predictions
Dataset
The UniProtKB IDs of the manually reviewed Swiss-Prot protein sequences used in the study are in the ./dataset directory.
Using these IDs, one can look up the full record of each protein in the UniProtKB database.
The accession numbers obtained from the UniProtKB search can then be used to query other databases, such as AlphaFoldDB and Rhea-DB, for specific protein feature information.
The full set of manually reviewed Swiss-Prot sequences can be downloaded at https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/complete/
Sequences were clustered with MMseqs2.
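As a rough illustration of how the ID files in ./dataset might be used, the sketch below reads a list of UniProtKB accessions and builds per-protein FASTA URLs for the UniProt REST service. It assumes one accession per line; the file name `mfo_ids.txt`, the helper names, and the demo accessions are hypothetical, and the actual layout of the .txt files in this repository may differ. No network request is made.

```python
from pathlib import Path
import tempfile

# Base URL of the UniProt REST service (public endpoint; check the UniProt docs
# before relying on it in a pipeline).
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def read_ids(path):
    """Read one accession per line (assumed format), dropping blanks and duplicates."""
    seen = {}
    for line in Path(path).read_text().splitlines():
        accession = line.strip()
        if accession:
            seen.setdefault(accession, None)  # dict keeps first-seen order
    return list(seen)


def uniprot_fasta_url(accession):
    """Build the URL of the FASTA record for a single accession."""
    return f"{UNIPROT_REST}/{accession}.fasta"


# Demo on a temporary file standing in for a file from ./dataset (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "mfo_ids.txt"
    demo_file.write_text("P12345\nQ9Y6K9\n\nP12345\n")
    ids = read_ids(demo_file)
    urls = [uniprot_fasta_url(a) for a in ids]

for url in urls:
    print(url)
```

The same accession list could be reused to build queries against other resources such as AlphaFoldDB, whose URL schemes are not shown here.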
Structure of repository
- dataset: Contains the UniProtKB IDs of all sequences used in our experiments. There are .txt files for each ontology branch: Molecular Function Ontology (MFO), Cellular Component Ontology (CCO) and Biological Process Ontology (BPO)
- images: Contains the image files for the workflow of the GO-LTR model
- scripts: Contains the main script for training the GO-LTR model and generating predictions
Feature representation and parameter tensor factorization

We leveraged three protein features: sequence embeddings generated by the ProtT5 protein language model, InterPro fingerprints, and protein-protein interaction (PPI) data from StringDB.
GO-LTR multiview framework
As shown above, the functions associated with a particular protein form a consistent subgraph of the Gene Ontology (GO) graph. The functional terms also follow the true-path annotation rule: a protein annotated to a deep-level term in the ontology is automatically annotated to all ancestors of that term.
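The true-path rule can be illustrated with a small sketch that closes a set of annotated terms under the ancestor relation. The toy ontology and term names below are placeholders, not real GO terms, and the function is only an illustration of the rule, not code taken from this repository.

```python
def propagate_annotations(terms, parents):
    """Return the given terms together with all of their ancestors.

    `parents` maps each term to its direct parents in the ontology DAG.
    """
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited via another path in the DAG
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed


# Toy DAG with placeholder term names (hypothetical, for illustration only).
toy_parents = {
    "term_root": [],
    "term_a": ["term_root"],
    "term_b": ["term_a"],
    "term_c": ["term_b", "term_a"],  # a term may have several parents
    "term_other": ["term_root"],
}

# A protein annotated only to the deep term ends up with every ancestor as well.
annotated = propagate_annotations({"term_c"}, toy_parents)
print(sorted(annotated))
```

Applying this closure to every protein yields label vectors that are consistent with the ontology, which is the property described above.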
Mathematical formulations underpinning LTR
Given a multi-view (multimodal) data sample
$$ \mathcal{S} = \{ ((\mathbf{x}_i^{(1)},\dots, \mathbf{x}_i^{(n_d)}), \mathbf{y}_i) \mid i\in [m] \}, \qquad \mathbf{x}_i^{(d)} \in \mathbb{R}^{n_{x_d}},\ d\in [n_d] $$
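To make the notation concrete, the sketch below lays out a multi-view sample with m examples and three views of different dimensions, then evaluates a generic low-rank multilinear predictor: for each rank component, the views are projected to scalars, the projections are multiplied across views, and the result is mapped to label scores. This is an assumed, simplified form chosen for illustration; the names `P`, `Q`, `rank` and `predict` are hypothetical, and the parameterization actually used in the GO-LTR scripts may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 5                   # number of samples, i in [m]
view_dims = [8, 6, 4]   # n_{x_d} for each view d in [n_d]
n_labels = 3            # dimension of each label vector y_i
rank = 2                # number of rank-one components (assumed hyperparameter)

# One feature matrix per view, with rows aligned across views by sample index.
X = [rng.standard_normal((m, n)) for n in view_dims]
Y = rng.integers(0, 2, size=(m, n_labels))  # toy binary multi-label targets

# Hypothetical parameters: one factor matrix per view plus a label map.
P = [rng.standard_normal((n, rank)) for n in view_dims]
Q = rng.standard_normal((rank, n_labels))


def predict(X_views, P_views, Q):
    """Scores = (elementwise product over views of X_d @ P_d) @ Q."""
    joint = np.ones((X_views[0].shape[0], P_views[0].shape[1]))
    for Xd, Pd in zip(X_views, P_views):
        joint *= Xd @ Pd  # multiply per-view projections, component by component
    return joint @ Q


scores = predict(X, P, Q)
print(scores.shape)
```

The product across views is what lets such a model capture joint interactions between features of different views without ever forming the full interaction tensor.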
Evaluation
We used the CAFA-evaluator script to evaluate the performance of the models considered in the study.
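For readers unfamiliar with CAFA-style evaluation, the following is a minimal sketch of the protein-centric Fmax measure commonly reported in that setting. It is not the CAFA-evaluator's code: the function name, the threshold grid, and the toy inputs are assumptions, and the real tool may additionally handle ontology propagation and weighted variants.

```python
import numpy as np


def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax: the best F1 over a grid of score thresholds.

    Precision is averaged over proteins with at least one prediction at the
    threshold; recall is averaged over proteins with at least one true term.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)  # assumed grid

    n_true = y_true.sum(axis=1)
    has_true = n_true > 0
    if not has_true.any():
        return 0.0

    best = 0.0
    for tau in thresholds:
        y_pred = y_score >= tau
        tp = (y_pred & y_true).sum(axis=1)
        n_pred = y_pred.sum(axis=1)
        has_pred = n_pred > 0
        if not has_pred.any():
            continue  # no protein has a prediction at this threshold
        precision = (tp[has_pred] / n_pred[has_pred]).mean()
        recall = (tp[has_true] / n_true[has_true]).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example: two proteins, two terms, scores that separate the labels cleanly.
toy_true = [[1, 0], [0, 1]]
toy_score = [[0.9, 0.1], [0.2, 0.8]]
toy_fmax = fmax(toy_true, toy_score)
print(toy_fmax)
```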
Papers on LTR method
@article{szedmak2020solution,
title={A solution for large scale nonlinear regression with high rank and degree at constant memory complexity via latent tensor reconstruction},
author={Szedmak, Sandor and Cichonska, Anna and Julkunen, Heli and Pahikkala, Tapio and Rousu, Juho},
journal={arXiv preprint arXiv:2005.01538},
year={2020}
}
@article{wang2021modeling,
title={Modeling drug combination effects via latent tensor reconstruction},
author={Wang, Tianduanyi and Szedmak, Sandor and Wang, Haishan and Aittokallio, Tero and Pahikkala, Tapio and Cichonska, Anna and Rousu, Juho},
journal={Bioinformatics},
volume={37},
number={Supplement\_1},
pages={i93--i101},
year={2021},
publisher={Oxford University Press}
}
Owner
- Name: KEPACO
- Login: aalto-ics-kepaco
- Kind: organization
- Location: Espoo, Finland
- Website: http://research.ics.aalto.fi/kepaco/
- Repositories: 29
- Profile: https://github.com/aalto-ics-kepaco
Kernel Machines, Pattern Analysis and Computational Metabolomics - Research group at Aalto University
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1
Dependencies
- Biopython >=1.81
- goatools >=1.3.1
- itertools *
- numpy >=1.20.0
- obonet >=1.0.0
- scikit-bio >=0.5.8
- scipy >=1.9