https://github.com/compomics/tissue_prediction_manuscript

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: CompOmics
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 71.5 MB
Statistics
  • Stars: 1
  • Watchers: 8
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Archived
Created over 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme

README.md

Tissue prediction manuscript

This GitHub repository contains the code used in the manuscript on tissue prediction:

Machine Learning on Large-Scale Proteomics Data Identifies Tissue and Cell-Type Specific Proteins
Tine Claeys, Maxime Menu, Robbin Bouwmeester, Kris Gevaert, Lennart Martens
Journal of Proteome Research (2023) doi:10.1021/acs.jproteome.2c00644

The datasets and database are available for download from Zenodo: doi.org/10.5281/zenodo.7135199.

Abstract

"Using data from 183 public human data sets from PRIDE, a machine learning model was trained to identify tissue and cell-type specific protein patterns. PRIDE projects were searched with ionbot and tissue/cell type annotation was manually added. Data from physiological samples were used to train a Random Forest model on protein abundances to classify samples into tissues and cell types. Subsequently, a one-vs-all classification and feature importance were used to analyse the most discriminating protein abundances per class. Based on protein abundance alone, the model was able to predict tissues with 98% accuracy, and cell types with 99% accuracy. The F-scores describe a clear view on tissue-specific proteins and tissue-specific protein expression patterns. In-depth feature analysis shows slight confusion between physiologically similar tissues, demonstrating the capacity of the algorithm to detect biologically relevant patterns. These results can in turn inform downstream uses, from identification of the tissue of origin of proteins in complex samples such as liquid biopsies, to studying the proteome of tissue-like samples such as organoids and cell lines."

Notebooks

Model training notebooks

These notebooks contain code used to train multiple classifiers on the filtered and complete datasets:

* Tissuepredictorsfiltered
* Tissuepredictorsfull
* Celltypepredictorsfiltered
* Celltypepredictorsfull
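For orientation, the workflow described in the abstract (a Random Forest trained on protein abundances, followed by one-vs-all classification and feature importance) could be sketched roughly as below. This is a minimal illustration and not the code in the notebooks: the abundance matrix is synthetic, the tissue labels, matrix dimensions, and hyperparameters are assumptions made for the example, and scikit-learn is assumed as the library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a samples x proteins abundance matrix.
# This is NOT the manuscript data; labels and sizes are hypothetical.
rng = np.random.default_rng(0)
tissues = ["liver", "brain", "heart"]
n_per_class, n_proteins = 40, 30

X_parts, y_parts = [], []
for i, tissue in enumerate(tissues):
    block = rng.normal(0.0, 1.0, size=(n_per_class, n_proteins))
    block[:, i] += 5.0  # make protein i artificially specific to this tissue
    X_parts.append(block)
    y_parts.extend([tissue] * n_per_class)
X = np.vstack(X_parts)
y = np.array(y_parts)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Multi-class model: classify each sample into a tissue.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

# One-vs-all: one binary model per tissue, then read off which
# protein (column index) carries the highest feature importance.
top_protein = {}
for tissue in tissues:
    ova = RandomForestClassifier(n_estimators=200, random_state=0)
    ova.fit(X_train, y_train == tissue)
    top_protein[tissue] = int(np.argmax(ova.feature_importances_))

print(f"accuracy: {accuracy:.2f}")
print(top_protein)
```

In this toy setup each tissue has exactly one shifted protein column, so the one-vs-all importances should point back to that column. Real abundance data would need the preprocessing and filtering steps used in the notebooks themselves.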

License

All data and downstream results of this research are licensed under CC-BY-NC-4.0.

Owner

  • Name: Computational Omics and Systems Biology Group
  • Login: CompOmics
  • Kind: organization
  • Email: compomics.list@gmail.com

The CompOmics group, headed by Prof. Dr. Lennart Martens, specializes in the management, analysis and integration of high-throughput Omics data.
