xai-proteins

Insights into the inner workings of transformer models for protein function prediction

https://github.com/markuswenzel/xai-proteins

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 14 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Insights into the inner workings of transformer models for protein function prediction

Basic Info
  • Host: GitHub
  • Owner: MarkusWenzel
  • Language: Python
  • Default Branch: main
  • Size: 145 KB
Statistics
  • Stars: 13
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed about 2 years ago
Metadata Files
Readme Citation

README.md

Insights into the inner workings of transformer models for protein function prediction

About

Finetuning pretrained universal protein language models on downstream tasks provides large benefits in protein function prediction. At the same time, these neural networks are notorious for having millions, and sometimes billions, of trainable parameters, which makes it very difficult to interpret their decision-making logic or strategy.

Consequently, explainable machine learning is starting to gain traction in proteomics as well. We are exploring how explainability methods can help shed light on the inner workings of transformers for protein function prediction.

Attribution methods, such as integrated gradients, make it possible to identify the features in the input space that the model apparently focuses on, because these features turn out to be relevant for its final classification decision. We extended integrated gradients so that latent representations inside the transformer can be inspected as well (separately for each head and layer).
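The core idea of integrated gradients is to accumulate the gradient of the model output along a straight path from a baseline to the actual input. This is not the repository's code (see the GO and EC folders for that); it is a minimal NumPy sketch of the principle, using a toy differentiable function in place of a transformer:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Approximate integrated gradients of a scalar-valued model f at input x.

    IG_i(x) = (x_i - b_i) * integral_0^1 (df/dx_i)(b + a*(x - b)) da,
    approximated here with a midpoint Riemann sum over `steps` points.
    """
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    diff = x - baseline
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * diff)  # gradient at interpolated input
    return diff * total / steps

# Toy model: f(x) = sum(x**2), so grad f(x) = 2*x.
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(lambda v: 2.0 * v, x, baseline)
# Completeness axiom: the attributions sum to f(x) - f(baseline).
```

In the paper's setting the gradients are taken with respect to latent activations of individual heads and layers rather than the raw input; the path-integral machinery stays the same.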

To find out whether the identified relevant sequence regions match expectations informed by biological or chemical knowledge, we combined this method with a subsequent statistical analysis across proteins, correlating the obtained relevance with annotations of interest from sequence databases. In this way, we identified heads inside the transformer architecture that are specialized for specific protein function prediction tasks.
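One simple way to correlate continuous relevance scores with binary sequence annotations is a point-biserial correlation. The sketch below uses synthetic, hypothetical data (random relevance scores and an invented annotation mask) purely to illustrate the shape of such an analysis; it is not the statistical procedure from the paper:

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)

# Hypothetical per-residue relevance scores for one protein of length 100.
relevance = rng.normal(size=100)

# Hypothetical binary annotation mask (e.g. residues annotated as an
# active site in a sequence database).
annotation = np.zeros(100, dtype=int)
annotation[40:50] = 1

# Inject a signal so that annotated residues tend to receive more relevance.
relevance[annotation == 1] += 1.5

# Point-biserial correlation between the binary mask and the relevance scores.
r, p = pointbiserialr(annotation, relevance)
```

Repeating such a test per head and per layer across many proteins (with appropriate multiple-comparison correction) is what allows specialized heads to be singled out.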

The two folders of this repository are dedicated to the explainability analysis for Gene Ontology (GO) term and Enzyme Commission (EC) number prediction (see the GO and EC README files).

Publication

You can find more information in our article:

Markus Wenzel, Erik Grüner, Nils Strodthoff (2024). Insights into the inner workings of transformer models for protein function prediction. Bioinformatics, btae031.

@article{10.1093/bioinformatics/btae031,
  author  = {Wenzel, Markus and Grüner, Erik and Strodthoff, Nils},
  title   = "{Insights into the inner workings of transformer models for protein function prediction}",
  journal = {Bioinformatics},
  pages   = {btae031},
  year    = {2024},
  month   = {01},
  issn    = {1367-4811},
  doi     = {10.1093/bioinformatics/btae031},
  url     = {https://doi.org/10.1093/bioinformatics/btae031}
}

Related works

If you are interested in this topic, you are welcome to have a look at our related papers:

  • Nils Strodthoff, Patrick Wagner, Markus Wenzel, and Wojciech Samek (2020). UDSMProt: universal deep sequence models for protein classification. Bioinformatics, 36(8), 2401–2409.
  • Johanna Vielhaben, Markus Wenzel, Wojciech Samek, and Nils Strodthoff (2020). USMPep: universal sequence models for major histocompatibility complex binding affinity prediction. BMC Bioinformatics, 21, 1–16.

Datasets

EC and GO data were preprocessed as detailed at https://github.com/nstrodt/UDSMProt with https://github.com/nstrodt/UDSMProt/blob/master/code/create_datasets.sh, resulting in six files for EC40 and EC50 at levels L0, L1, and L2, and in two files for GO "2016" (a.k.a. "temporalsplit") and GO "CAFA3". Preprocessed data can be accessed here (EC) and here (GO).

Authors

Markus Wenzel, Erik Grüner, Nils Strodthoff (2024)

Owner

  • Login: MarkusWenzel
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - given-names: Markus
    family-names: Wenzel
  - given-names: Erik
    family-names: Grüner
  - given-names: Nils
    family-names: Strodthoff
title: "Insights into the inner workings of transformer models for protein function prediction"
version: 1
doi: 10.1093/bioinformatics/btae031
date-released: 2024-01-19
url: "https://github.com/MarkusWenzel/xai-proteins/"
preferred-citation:
  type: article
  authors:
  - given-names: Markus
    family-names: Wenzel
  - given-names: Erik
    family-names: Grüner
  - given-names: Nils
    family-names: Strodthoff
  doi: "10.1093/bioinformatics/btae031"
  journal: "Bioinformatics"
  month: 1
  start: btae031
  title: "Insights into the inner workings of transformer models for protein function prediction"
  year: 2024

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2