adopt

ADOPT: intrinsic protein disorder prediction through deep bidirectional transformers

https://github.com/peptoneltd/adopt

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.1%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: PeptoneLtd
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 18.5 MB
Statistics
  • Stars: 18
  • Watchers: 1
  • Forks: 1
  • Open Issues: 13
  • Releases: 13
Created over 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation Codeowners

README.md

Attention based DisOrder PredicTor

This repository contains the code and the trained models for intrinsic protein disorder prediction through deep bidirectional transformers from Peptone Ltd.

ADOPT was introduced in our paper ADOPT: intrinsic protein disorder prediction through deep bidirectional transformers, and it is also available as a web server at adopt.peptone.io.

Our disorder predictor is made up of two main blocks: a self-supervised encoder and a supervised disorder predictor. We use Facebook’s Evolutionary Scale Modeling (ESM) library to extract dense residue-level representations, which feed the supervised machine-learning-based predictor.

The ESM library exploits a set of deep Transformer encoder models, which process character sequences of amino acids as inputs.

ADOPT makes use of two datasets: the CheZoD “1325” and the CheZoD “117” databases containing 1325 and 117 sequences, respectively, together with their residue level Z-scores.

Intrinsic disorder trained models

| Model | Pre-trained model | Datasets | Split level | CV |
|-------|-------------------|----------|-------------|----|
| lasso_esm-1b_cleared_residue | ESM-1b | Chezod 1325 cleared and Chezod 117 | residue | :x: |
| lasso_esm-1v_cleared_residue | ESM-1v | Chezod 1325 cleared and Chezod 117 | residue | :x: |
| lasso_esm-msa_cleared_residue | ESM-MSA | Chezod 1325 cleared and Chezod 117 | residue | :x: |
| lasso_combined_cleared_residue | Combined | Chezod 1325 cleared and Chezod 117 | residue | :x: |
| lasso_esm-1b_residue_cv | ESM-1b | Chezod 1325 | residue | :white_check_mark: |
| lasso_esm-1v_residue_cv | ESM-1v | Chezod 1325 | residue | :white_check_mark: |
| lasso_esm-msa_residue_cv | ESM-MSA | Chezod 1325 | residue | :white_check_mark: |
| lasso_esm-1b_cleared_residue_cv | ESM-1b | Chezod 1325 cleared | residue | :white_check_mark: |
| lasso_esm-1v_cleared_residue_cv | ESM-1v | Chezod 1325 cleared | residue | :white_check_mark: |
| lasso_esm-msa_cleared_residue_cv | ESM-MSA | Chezod 1325 cleared | residue | :white_check_mark: |
| lasso_esm-1b_cleared_sequence_cv | ESM-1b | Chezod 1325 cleared | sequence | :white_check_mark: |
| lasso_esm-1v_cleared_sequence_cv | ESM-1v | Chezod 1325 cleared | sequence | :white_check_mark: |
| lasso_esm-msa_cleared_sequence_cv | ESM-MSA | Chezod 1325 cleared | sequence | :white_check_mark: |
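
The Split level column separates models whose folds were drawn per residue from models whose folds were drawn per sequence. The sketch below illustrates the difference under an assumption that is not stated in this README: a residue-level split pools residues from all proteins before assigning folds, while a sequence-level split keeps every residue of a protein in the same fold. The function names and the toy proteins are hypothetical.

```python
import random

def residue_level_folds(proteins, k, seed=0):
    """Pool every (protein_id, residue_index) pair, then deal them into k folds."""
    residues = [(pid, i) for pid, seq in proteins.items() for i in range(len(seq))]
    random.Random(seed).shuffle(residues)
    return [residues[f::k] for f in range(k)]

def sequence_level_folds(proteins, k, seed=0):
    """Deal whole proteins into k folds, so a protein never spans two folds."""
    ids = sorted(proteins)
    random.Random(seed).shuffle(ids)
    return [
        [(pid, i) for pid in ids[f::k] for i in range(len(proteins[pid]))]
        for f in range(k)
    ]

# Toy data for illustration only
toy = {"p1": "SLQDG", "p2": "VRQS", "p3": "RASDKQ"}
res_folds = residue_level_folds(toy, k=3)
seq_folds = sequence_level_folds(toy, k=3)
```

With a sequence-level split, no residue of a held-out protein is seen during training, which is typically the stricter of the two evaluations.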

Usage

Quick start

Prerequisites (we suggest creating a dedicated python venv or conda env)

```bash
pip install \
    pandas \
    fair-esm \
    biopython \
    bertviz \
    skl2onnx \
    onnxruntime \
    spacy \
    plotly \
    wandb
```

Install the adopt package:

Run

```bash
git clone https://github.com/PeptoneInc/ADOPT.git
cd ADOPT
git submodule update --init --recursive
python setup.py install
```

Then you can predict the intrinsic disorder of each residue in a protein sequence as follows:

```python
from adopt import MultiHead, ZScorePred

# Prepare protein sequence and name, i.e. brmid
SEQUENCE = "SLQDGVRQSRASDKQTLLPNDQLYQPLKDREDDQYSHLQGNQLRRN"
PROTID = "Protein 18890"

# Choose model type and training strategy
MODEL_TYPE = "esm-1b"
STRATEGY = "train_on_cleared_1325_test_on_117_residue_split"

# Extract residue level representations
multi_head = MultiHead(MODEL_TYPE)
representation, tokens = multi_head.get_representation(SEQUENCE, PROTID)

# Predict the Z score related to each residue in the sequence specified above
z_score_pred = ZScorePred(STRATEGY, MODEL_TYPE)
predicted_z_scores = z_score_pred.get_z_score(representation)
```
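
The prediction yields one Z score per residue, with lower values indicating more disorder. A minimal sketch of turning such scores into per-residue labels follows. The cut-off of 8.0 is an assumption and is not specified in this README, and `label_residues` is a hypothetical helper that is not part of the adopt package.

```python
def label_residues(sequence, z_scores, threshold=8.0):
    """Pair each residue with its Z score and a disordered/ordered label.

    threshold is an assumed cut-off: scores below it are labelled disordered.
    """
    if len(sequence) != len(z_scores):
        raise ValueError("expected one Z score per residue")
    return [
        (aa, z, "disordered" if z < threshold else "ordered")
        for aa, z in zip(sequence, z_scores)
    ]

# Toy values for illustration only; real scores come from the ZScorePred call above
labels = label_residues("SLQD", [2.1, 7.9, 8.0, 12.4])
```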

MSA setting (optional)

To enable the esm-msa based variant of ADOPT, MSAs for each sequence are also required. We provide a standalone, Docker-based tool that you must use to access ADOPT's MSA-related functionality.

First time setup

As a prerequisite, you must have Docker installed.

Clone the ADOPT repository, go to the ADOPT directory and run the MSA scripts you are interested in.

Notes

The $LOCAL_MSA_DIR in the MSA scripts serves as the main directory for the MSA related procedures and can be empty initially when running the above scripts. Under the hood, each MSA script will:

  1. Download the uniclust dataset (in this case "2020.06") into the $LOCAL_MSA_DIR/databases subdirectory.
     NOTE: ADOPT first checks whether uniclust is already in this subdirectory. If it is not, the download can take several hours, since the dataset is approximately 180 GB. The download step is skipped only if the $LOCAL_MSA_DIR/databases folder is non-empty and the tar file (UniRef30_2020_06_hhsuite.tar.gz) is found in the $LOCAL_MSA_DIR folder.

  2. Once the relevant uniclust is there, a docker image named msa-gen-adopt is run with the volume $LOCAL_MSA_DIR mounted on it.
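
The skip condition described in step 1 can be sketched as below. This illustrates the rule as stated, not the logic of the actual MSA scripts: `uniclust_download_needed` is a hypothetical name, and the archive file name is passed in by the caller rather than hard-coded.

```python
from pathlib import Path

def uniclust_download_needed(local_msa_dir, archive_name):
    """Return True unless databases/ is non-empty AND the archive is present."""
    root = Path(local_msa_dir)
    databases = root / "databases"
    has_databases = databases.is_dir() and any(databases.iterdir())
    has_archive = (root / archive_name).is_file()
    return not (has_databases and has_archive)
```

Both conditions must hold for the download to be skipped, so an interrupted run that left only one of the two behind would trigger a fresh download.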

Note that this setup procedure creates four subfolders:

+-- $LOCAL_MSA_DIR
|   +-- databases
|   +-- msas
|   +-- msa_fastas
|   +-- esm_msa_reprs

  • databases will hold the uniclust
  • msas is where MSAs (.a3m files) will be saved later, see STEP 2 below
  • msa_fastas is where .fasta files already used for MSA queries will be saved
  • esm_msa_reprs is allocated for potential esm-msa representations

The MSAs will be placed in the $LOCAL_MSA_DIR/msas folder.
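
The folder layout above can also be created or checked by hand. A minimal sketch, assuming only the four subfolder names listed; the MSA scripts create these themselves, so `ensure_msa_layout` is a hypothetical convenience and not part of ADOPT.

```python
from pathlib import Path

MSA_SUBDIRS = ("databases", "msas", "msa_fastas", "esm_msa_reprs")

def ensure_msa_layout(local_msa_dir):
    """Create the four subfolders if missing and return their paths by name."""
    root = Path(local_msa_dir)
    paths = {name: root / name for name in MSA_SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths
```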

More notes

You can set ESM_MODELS_DIR and ADOPT_MODELS_DIR to the paths where the ESM and ADOPT pretrained models, respectively, are stored. Any model that is not found locally will be downloaded from public repositories.
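
If these two names are read as environment variables (an assumption, since the text does not say how they are set), resolving them could look like the sketch below. The fallback paths are hypothetical placeholders and not the package defaults.

```python
import os
from pathlib import Path

def resolve_model_dirs(env=None):
    """Read ESM_MODELS_DIR and ADOPT_MODELS_DIR, falling back to placeholders."""
    env = os.environ if env is None else env
    # Both defaults below are illustrative placeholders, not ADOPT's real defaults
    esm_dir = Path(env.get("ESM_MODELS_DIR", "esm_models"))
    adopt_dir = Path(env.get("ADOPT_MODELS_DIR", "adopt_models"))
    return esm_dir, adopt_dir
```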

Scripts

The scripts directory contains:

  • inference script to predict, in bulk, the disorder of each residue of each protein sequence in a FASTA file with ADOPT. You need to specify:
    • NEW_PROT_FASTA_FILE_PATH defining your FASTA file path
    • NEW_PROT_RES_REPR_DIR_PATH defining where the residue level representations will be extracted
  • training script to train ADOPT. You need to specify:
    • TRAIN_STRATEGY defining the training strategy you want to use
  • MSA inference script, which also allows inference with the esm-msa model. The predicted Z scores will be written on the host (optional)
  • MSA training script, which also allows training with the esm-msa model. The trained models will be written in the ADOPT/models directory (optional)

Notebooks

The notebooks directory contains:

Compute residue level representations

In order to predict the Z score related to each residue in a protein sequence, we have to compute the residue level representations, extracted from the pretrained model.

In the ADOPT directory run:

```bash
python adopt/embedding.py <fasta_file_path> \
    <residue_level_representation_dir>
```

Where:

  • <fasta_file_path> defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder
  • <residue_level_representation_dir> defines the path where you want to save the residue level representations
  • --msa runs the MSA procedure to get esm-msa representations. We suggest you take a look at the MSA inference script as a quick example (optional)
  • -h shows the help message and exits

A subdirectory containing the residue level representations extracted from each available pre-trained model will be created under residue_level_representation_dir.

Note that in order to obtain the representations from the esm-msa model as well, the relevant MSAs have to be placed in the root directory /msas of the system where ADOPT is running. The MSAs can be created as described in the MSA setting section above.
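
Before extracting representations, it can help to check what the FASTA file contains. The following is a minimal, dependency-free sketch of a FASTA reader. `read_fasta` is a hypothetical helper and not how adopt/embedding.py parses its input; Biopython, listed in the prerequisites, offers a full parser.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Sequence lines may be wrapped; they are joined back into one string
example = [">Protein 18890", "SLQDGVRQSRASDKQTLLPNDQ", "LYQPLKDREDDQYSHLQGNQLRRN"]
records = list(read_fasta(example))
```

To read from disk, pass an open file handle, for example `list(read_fasta(open(path)))`.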

Predict intrinsic disorder with ADOPT

Once we have extracted the residue level representations we can predict the intrinsic disorder (Z score).

In the ADOPT directory run:

```bash
python adopt/inference.py <inference_fasta_file> \
    <inference_repr_dir> \
    <predicted_z_scores_file> \
    --train_strategy <training_strategy> \
    --model_type <model_type>
```

Where:

  • <inference_fasta_file> defines the FASTA file containing the proteins for which you want to compute the intrinsic disorder
  • <inference_repr_dir> defines the path where you've already saved the residue level representations
  • <predicted_z_scores_file> defines the path where you want the Z scores to be saved
  • --train_strategy selects one of the training strategies listed below
  • --model_type selects the pre-trained model to use. We suggest the esm-1b model
  • -h shows the help message and exits

The output is a .json file containing the Z scores for each residue of each protein in the FASTA file you provided.

| Training strategy | Pre-trained models |
|-------------------|-------------------|
| train_on_cleared_1325_test_on_117_residue_split | esm-1b, esm-1v, esm-msa and combined |
| train_on_1325_cv_residue_split | esm-1b, esm-1v and esm-msa |
| train_on_cleared_1325_cv_residue_split | esm-1b, esm-1v and esm-msa |
| train_on_cleared_1325_cv_sequence_split | esm-1b, esm-1v and esm-msa |
| train_on_total | esm-1b, esm-1v |
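
The table can be encoded as a small lookup so that an unsupported pairing of strategy and model is caught before inference.py is launched. This is a sketch under that assumption: `SUPPORTED` mirrors the table rows, while `inference_command` is a hypothetical helper that only assembles the argument list shown earlier in this section.

```python
SUPPORTED = {
    "train_on_cleared_1325_test_on_117_residue_split": {"esm-1b", "esm-1v", "esm-msa", "combined"},
    "train_on_1325_cv_residue_split": {"esm-1b", "esm-1v", "esm-msa"},
    "train_on_cleared_1325_cv_residue_split": {"esm-1b", "esm-1v", "esm-msa"},
    "train_on_cleared_1325_cv_sequence_split": {"esm-1b", "esm-1v", "esm-msa"},
    "train_on_total": {"esm-1b", "esm-1v"},
}

def inference_command(fasta, repr_dir, out_file, train_strategy, model_type):
    """Return the argv list for adopt/inference.py after validating the pairing."""
    if model_type not in SUPPORTED.get(train_strategy, set()):
        raise ValueError(f"{model_type!r} is not listed for {train_strategy!r}")
    return [
        "python", "adopt/inference.py", fasta, repr_dir, out_file,
        "--train_strategy", train_strategy,
        "--model_type", model_type,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from within the ADOPT directory.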

Train ADOPT disorder predictor

Once we have extracted the residue level representations of the protein for which we want to predict the intrinsic disorder (Z score), we can train the predictor.

NOTE: This step is not mandatory because we've already trained such models. You can find them in the models bucket.

In the ADOPT directory run:

```bash
python adopt/training.py <train_json_file_path> \
    <test_json_file_path> \
    <train_residue_level_representation_dir> \
    <test_residue_level_representation_dir> \
    --train_strategy <training_strategy>
```

Where:

  • <train_json_file_path> defines the JSON containing the proteins we want to use as training set
  • <test_json_file_path> defines the JSON containing the proteins we want to use as test set
  • <train_residue_level_representation_dir> defines the path where we saved the residue level representations of the proteins in the training set
  • <test_residue_level_representation_dir> defines the path where we saved the residue level representations of the proteins in the test set
  • --train_strategy selects one of the training strategies listed above
  • --msa runs the MSA procedure to get trained models fed with the esm-msa representations. We suggest you take a look at the MSA training script as a quick example (optional)
  • -h shows the help message and exits

Run benchmarks

Once we have extracted the residue level representations we can benchmark ADOPT against other methods.

In the ADOPT directory run:

```bash
python adopt/benchmarks.py <benchmark_data_path> \
    <train_json_file_path> \
    <test_json_file_path> \
    <train_residue_level_representation_dir> \
    <test_residue_level_representation_dir> \
    --train_strategy <training_strategy>
```

Where:

  • <benchmark_data_path> defines the directory containing the predictions of the method we want to benchmark against ADOPT
  • -h shows the help message and exits

AlphaFold2 benchmarks (optional)

We benchmarked ADOPT against AlphaFold2 by computing the Spearman correlations between actual Z-scores and predicted pLDDT5 scores, and between actual Z-scores and predicted SASA5 scores. Both sets of scores were obtained from AlphaFold2 for the task evaluated on the CheZoD “117” validation set, as described in the ADOPT paper.
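
Spearman correlation is the Pearson correlation of the ranks of two series. The sketch below is self-contained and gives tied values their average rank. The numbers are toy values, not results from the paper, and `spearman` is a hypothetical helper rather than the script used for the published comparison.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy data: a strictly increasing relation gives a coefficient of 1
rho = spearman([1.0, 4.2, 9.5, 13.0], [30.0, 55.0, 70.0, 92.0])
```

The function divides by zero if either series is constant, so real inputs should be checked for that case first.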

As a prerequisite, you must have Docker installed.

Run:

```bash
docker run ghcr.io/peptoneinc/adopt_alphafold2_comparison:1.0.2
```

Here is the script used to extract the correlations and here are the predictions obtained from Alphafold2.

Citations

If you use this work in your research, please cite the relevant paper:

```bibtex
@article{10.1093/nargab/lqad041,
    author = {Redl, Istvan and Fisicaro, Carlo and Dutton, Oliver and Hoffmann, Falk and Henderson, Louie and Owens, Benjamin M J and Heberling, Matthew and Paci, Emanuele and Tamiola, Kamil},
    title = "{ADOPT: intrinsic protein disorder prediction through deep bidirectional transformers}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {5},
    number = {2},
    year = {2023},
    month = {05},
    abstract = "{Intrinsically disordered proteins (IDPs) are important for a broad range of biological functions and are involved in many diseases. An understanding of intrinsic disorder is key to develop compounds that target IDPs. Experimental characterization of IDPs is hindered by the very fact that they are highly dynamic. Computational methods that predict disorder from the amino acid sequence have been proposed. Here, we present ADOPT (Attention DisOrder PredicTor), a new predictor of protein disorder. ADOPT is composed of a self-supervised encoder and a supervised disorder predictor. The former is based on a deep bidirectional transformer, which extracts dense residue-level representations from Facebook’s Evolutionary Scale Modeling library. The latter uses a database of nuclear magnetic resonance chemical shifts, constructed to ensure balanced amounts of disordered and ordered residues, as a training and a test dataset for protein disorder. ADOPT predicts whether a protein or a specific region is disordered with better performance than the best existing predictors and faster than most other proposed methods (a few seconds per sequence). We identify the features that are relevant for the prediction performance and show that good performance can already be gained with <100 features. ADOPT is available as a stand-alone package at https://github.com/PeptoneLtd/ADOPT and as a web server at https://adopt.peptone.io/.}",
    issn = {2631-9268},
    doi = {10.1093/nargab/lqad041},
    url = {https://doi.org/10.1093/nargab/lqad041},
    note = {lqad041},
    eprint = {https://academic.oup.com/nargab/article-pdf/5/2/lqad041/50150244/lqad041.pdf},
}
```

Licence

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

Update

A new version of ADOPT has been trained as described in Optimizing protein language models with Sentence Transformers, NeurIPS (2023). Code is available at https://github.com/PeptoneLtd/contrastive-finetuning-plms

Owner

  • Name: Peptone
  • Login: PeptoneLtd
  • Kind: organization
  • Email: hello@peptone.io
  • Location: London, United Kingdom

World's first, end-to-end Protein Engineering Operating System (PeOS).

Citation (CITATION.cff)

cff-version: 0.5.0
message: "If you use this software, please cite it as below."
authors:
  - given-names: "Carlo Fisicaro"
    family-names: "Fisicaro"
    email: "carlo@peptone.io"
    affiliation: "Peptone Ltd."
    orcid: "0000-0002-2029-7230"
title: "Attention based DisOrder PredicTor"
version: 0.5.0
doi: 
date-released: 
url: "https://github.com/PeptoneLtd/ADOPT"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 2
  • Push event: 1
Last Year
  • Issues event: 2
  • Watch event: 2
  • Push event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 29
  • Total pull requests: 12
  • Average time to close issues: 22 days
  • Average time to close pull requests: 7 days
  • Total issue authors: 6
  • Total pull request authors: 3
  • Average comments per issue: 0.34
  • Average comments per pull request: 0.67
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • CFisicaro (21)
  • oliverdutton (3)
  • ktamiola (1)
  • klaasdewaele (1)
  • agarrubio (1)
  • BharatRaviIyengar (1)
Pull Request Authors
  • CFisicaro (10)
  • fabio-peptone (1)
  • falkph (1)
Top Labels
Issue Labels
feature (14) enhancement (5) documentation (3) bug (2)
Pull Request Labels
feature (3) bug (1) enhancement (1)

Dependencies

requirements.txt pypi
  • bertviz *
  • biopython *
  • fair-esm *
  • kaleido *
  • onnxruntime *
  • plotly *
  • scikit-learn *
  • skl2onnx *
  • spacy *
.github/workflows/linter.yml actions
  • actions/checkout v2 composite
  • github/super-linter/slim v4 composite
setup.py pypi