discotope-3.0

Code for the DiscoTope-3.0 paper and model

https://github.com/magnushhoie/discotope-3.0

Keywords

b-cell-receptor epitopes epitopes-prediction

README.md

Overview

DiscoTope-3.0 predicts epitopes on input protein structures, using inverse folding representations from the ESM-IF1 model. The tool accepts both solved and predicted structures in the PDB format, and outputs per-residue epitope propensity scores in a CSV format.

Webserver

To try DiscoTope-3.0 without installing it, see the DTU Healthtech webserver: https://services.healthtech.dtu.dk/service.php?DiscoTope-3.0

Repo contents

  • data: Example input files, including test set
  • discotope3: Source code
  • output: DiscoTope-3.0 output examples

Quickstart guide

```bash
# Setup environment and install
conda create --name inverse python=3.9 -y
conda activate inverse
conda install -c pyg pyg -y
conda install -c conda-forge pip -y

git clone https://github.com/Magnushhoie/discotope3_web/
cd discotope3_web/
pip install .

# Unzip models to use
unzip models.zip

# 1. Predict single PDB (solved structure)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

# CPU only:
python discotope3/main.py --cpu_only --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb
```

Installation guide

We highly recommend using an Ubuntu OS and Conda (miniconda or anaconda) for installing required dependencies.

Predictions are faster using a GPU and the recommended versions of pytorch, pytorch-geometric and cudatoolkit, but these exact versions are not required.

For Linux & GPU with conda (recommended, ~2 mins)

```bash
# Setup environment with conda
conda create -n inverse python=3.9
conda activate inverse
conda install pytorch=1.11 cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg -c conda-forge
conda install pip

# install pip dependencies
pip install .
```

Linux & GPU with pip (~5 mins)

```bash
# install pip dependencies
pip install -r requirements_recommended.txt
pip install .
```

Recommended system requirements

Running DiscoTope-3.0

DiscoTope-3.0 can predict a single PDB, a folder or ZIP file of PDBs, or fetch PDBs by their IDs from RCSB or AlphaFoldDB and predict on them.

On a common workstation with a GPU, predictions take <1 second per PDB chain, plus ~15 seconds for loading the required libraries and model weights.

Set the --struc_type parameter to 'solved' for experimentally solved structures (default) or 'alphafold' for modelled structures.

Note that DiscoTope-3.0 splits PDB structures into single chains before prediction, unless --multichainmode is set.
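The single-chain splitting mentioned above can be pictured with a minimal Python sketch. This is not DiscoTope-3.0's own implementation: the function name `split_pdb_by_chain` is hypothetical, the chain B record is invented for illustration, and the code assumes standard fixed-column PDB records in which the chain ID sits in column 22.

```python
from collections import defaultdict


def split_pdb_by_chain(pdb_text):
    """Group ATOM/HETATM records by chain ID (column 22 of a PDB record)."""
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)  # index 21 == column 22
    return dict(chains)


# Two records from the 7c4s example plus one invented chain B record.
example = "\n".join([
    "ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N",
    "ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C",
    "ATOM      3  N   GLN B   1       1.000   2.000   3.000  1.00 20.00           N",
])

for chain_id, records in split_pdb_by_chain(example).items():
    print(chain_id, len(records))
```

Each value in the returned dictionary holds the records of one chain, which could then be written to its own single-chain PDB file.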

```bash
# Unzip models
unzip models.zip

# Now select one of multiple options:

# 1. Predict single PDB (solved)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

# 2. Predict AlphaFold structure
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_alphafold/7tdm_B.pdb --struc_type alphafold

# 3. Predict a folder of PDBs
python discotope3/main.py --pdb_dir data/example_pdbs_solved --out_dir output/example_pdbs_solved

# 4. Predict a ZIP file of PDBs
python discotope3/main.py --pdb_or_zip_file pdbs_in_zip_file.zip --out_dir output/pdbs_in_zip_file

# 5. Fetch PDBs from RCSB
python discotope3/main.py --list_file pdb_list_solved.txt --struc_type solved --out_dir output/pdb_list_solved

# 6. Fetch PDBs from Alphafolddb
python discotope3/main.py --list_file pdb_list_af2.txt --struc_type alphafold --out_dir output/pdb_list_af2

# Command-line options, as printed by --help:
#
# Predict B-cell epitope propensity on input protein PDB structures
#
# optional arguments:
#   -h, --help            show this help message and exit
#   -f PDB_OR_ZIP_FILE, --pdb_or_zip_file PDB_OR_ZIP_FILE
#                         Input file, either single PDB or compressed zip file with multiple PDBs
#   --list_file LIST_FILE
#                         File with PDB or Uniprot IDs, fetched from RCSB/AlphaFolddb
#   --struc_type STRUC_TYPE
#                         Structure type from file (solved | alphafold)
#   --pdb_dir PDB_DIR     Directory with AF2 PDBs
#   --out_dir OUT_DIR     Job output directory
#   --models_dir MODELS_DIR
#                         Path for .json files containing trained XGBoost ensemble
#   --calibrated_score_epi_threshold CALIBRATED_SCORE_EPI_THRESHOLD
#                         Calibrated-score threshold for epitopes [low 0.40, moderate (0.90), higher 1.50]
#   --no_calibrated_normalization
#                         Skip Calibrated-normalization of PDBs
#   --check_existing_embeddings CHECK_EXISTING_EMBEDDINGS
#                         Check for existing embeddings to load in pdb_dir
#   --cpu_only            Use CPU even if GPU is available (default uses GPU if available)
#   --max_gpu_pdb_length MAX_GPU_PDB_LENGTH
#                         Maximum PDB length to embed on GPU (1000), otherwise CPU
#   --multichainmode      Predicts entire complexes, unsupported and untested
#   --save_embeddings SAVE_EMBEDDINGS
#                         Save embeddings to pdb_dir
#   --webserver_mode      Flag for printing HTML output
#   -v VERBOSE, --verbose VERBOSE
#                         Verbose logging
```
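When predictions are launched from a script rather than typed by hand, it can help to build the argument list in one place. The sketch below is an illustration only: `build_command` and `VALID_STRUC_TYPES` are hypothetical names, and the only flags it uses are `--pdb_or_zip_file` and `--struc_type`, which are the spellings this README shows.

```python
import shlex

# The two structure types named in this README.
VALID_STRUC_TYPES = ("solved", "alphafold")


def build_command(pdb_path, struc_type="solved"):
    """Return the argument list for one single-structure prediction run."""
    if struc_type not in VALID_STRUC_TYPES:
        raise ValueError(f"struc_type must be one of {VALID_STRUC_TYPES}")
    return [
        "python", "discotope3/main.py",
        "--pdb_or_zip_file", str(pdb_path),
        "--struc_type", struc_type,
    ]


cmd = build_command("data/example_pdbs_solved/7c4s.pdb")
print(shlex.join(cmd))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Rejecting unknown structure types before launching avoids paying the library and model loading time for a run that would fail on a typo.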

DiscoTope-3.0 output

DiscoTope-3.0 splits input PDBs into single-chain PDB files, then predicts per-residue epitope propensity scores. Outputs are saved in both PDB and CSV format.

The CSV output files contain per-residue outputs, with the following columns:

  • PDB ID and chain name
  • Relative residue index (re-numbered from 1)
  • Amino-acid residue, 1-letter
  • DiscoTope-3.0 score (0.00 - 1.00)
  • Predicted epitope (True or False), based on calibrated_score_epi_threshold (default 0.90)
  • Relative surface accessibility (Shrake-Rupley, normalized using Sander scale)
  • AlphaFold pLDDT score (0-100, set to 100 for non-AlphaFold structures)
  • Chain length
  • A binary feature set to 0 for solved and 1 for AlphaFold structures

The PDB output files contain individual single chains with the B-factor column replaced with per-residue DiscoTope-3.0 scores (2nd right-most column). Note that the scores are multiplied by 100 as PDB files only allow 2 decimals of precision.

Example input PDB (see 7c4s.pdb):

```bash
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb
```

Example output CSV (see 7c4s_A_discotope3.csv):

```text
pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
7c4s_A,14,G,0.15186,0.80634,100,282,0
7c4s_A,15,Q,0.13953,0.45077,100,282,0
7c4s_A,16,E,0.23955,0.72919,100,282,0
```
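A CSV in this layout can be post-processed with the Python standard library. The following is a minimal sketch, assuming the column names shown in the example header; `residues_above` is a hypothetical helper, and the 0.20 cutoff on the raw score column is arbitrary and is not the tool's calibrated epitope threshold.

```python
import csv
import io

# The three example rows from this README.
CSV_TEXT = """pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
7c4s_A,14,G,0.15186,0.80634,100,282,0
7c4s_A,15,Q,0.13953,0.45077,100,282,0
7c4s_A,16,E,0.23955,0.72919,100,282,0
"""


def residues_above(csv_text, cutoff):
    """Return (res_id, residue, score) for rows whose score is >= cutoff."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hits = []
    for row in reader:
        score = float(row["DiscoTope-3.0_score"])
        if score >= cutoff:
            hits.append((int(row["res_id"]), row["residue"], score))
    return hits


print(residues_above(CSV_TEXT, cutoff=0.20))  # [(16, 'E', 0.23955)]
```

For a real output file, the same function can be given the file contents, for example via `open(path).read()`.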

Example output PDB (see 7c4s_A_discotope3.pdb). Note the DiscoTope-3.0 scores in the B-factor column:

```text
ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N
ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C
ATOM      3  C   GLY A  14     -14.287 -31.844  23.204  1.00 15.19           C
ATOM      4  O   GLY A  14     -13.284 -32.465  23.555  1.00 15.19           O
```
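Because the PDB output stores scores multiplied by 100, reading them back means dividing the B-factor field by 100. The sketch below shows one way to do that, assuming standard fixed-column ATOM records (B-factor in columns 61-66); `scores_from_pdb` is a hypothetical helper, not part of the package.

```python
def scores_from_pdb(pdb_text):
    """Map (chain, residue number) to the score stored in the B-factor column."""
    scores = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]                # column 22
        res_num = int(line[22:26])         # columns 23-26
        b_factor = float(line[60:66])      # columns 61-66
        # Undo the x100 scaling applied when the PDB file was written.
        scores[(chain_id, res_num)] = round(b_factor / 100, 4)
    return scores


# The four example records from this README, in fixed-column layout.
PDB_TEXT = "\n".join([
    "ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N",
    "ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C",
    "ATOM      3  C   GLY A  14     -14.287 -31.844  23.204  1.00 15.19           C",
    "ATOM      4  O   GLY A  14     -13.284 -32.465  23.555  1.00 15.19           O",
])

print(scores_from_pdb(PDB_TEXT))
```

All atoms of a residue carry the same value, so the dictionary keeps one score per residue. The recovered 0.1519 agrees with the CSV score of 0.15186 for residue 14 to the two decimals a PDB file can hold.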

Reproduce test-set predictions (AlphaFold2 structures)

```bash
# Unzip AlphaFold2 test set
unzip data/test_set_af2.zip -d data/

# Run predictions on PDB folder
python discotope3/main.py \
  --pdb_dir data/test_set_af2 \
  --struc_type alphafold \
  --out_dir output/test_set_af2
```

Troubleshooting

  • "No valid amino-acid backbone found" - DiscoTope-3.0 only predicts epitopes on amino acids, not on non-amino-acid entities such as heteroatoms (e.g. water, or solvents like dimethyl sulfoxide). Chains made up of such entities should not be specified as input.
  • PDBConstructionWarning regarding discontinuous chains - Common issue with some PDB files (experimental structures only) missing co-ordinates for some atoms. As long as no backbone co-ordinates (C, Ca, N) are missing, it does not impact predictions.
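To check whether an input file is affected by the backbone issue described above, the residues can be scanned for missing N, CA or C atoms. This is a rough sketch under stated assumptions: `residues_missing_backbone` is a hypothetical helper, it relies on fixed-column ATOM records, and the residue 15 records are invented to show a failing case. It only reports residues that have at least one ATOM record, so residues absent from the file altogether are not detected.

```python
from collections import defaultdict

BACKBONE = {"N", "CA", "C"}


def residues_missing_backbone(pdb_text):
    """Return {(chain, residue number): set of missing backbone atom names}."""
    seen = defaultdict(set)
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            key = (line[21], int(line[22:26]))   # chain ID, residue number
            seen[key].add(line[12:16].strip())   # atom name, columns 13-16
    return {key: BACKBONE - atoms
            for key, atoms in seen.items()
            if not BACKBONE <= atoms}


# Residue 14 is complete; residue 15 (invented coordinates) lacks its CA atom.
EXAMPLE = "\n".join([
    "ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N",
    "ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C",
    "ATOM      3  C   GLY A  14     -14.287 -31.844  23.204  1.00 15.19           C",
    "ATOM      4  O   GLY A  14     -13.284 -32.465  23.555  1.00 15.19           O",
    "ATOM      5  N   GLN A  15     -12.000 -31.000  22.000  1.00  0.00           N",
    "ATOM      6  C   GLN A  15     -11.000 -30.000  21.000  1.00  0.00           C",
])

print(residues_missing_backbone(EXAMPLE))  # {('A', 15): {'CA'}}
```

An empty result means every residue present in the file has all three backbone atoms.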

Installation gcc or g++ errors, missing torch-scatter build ...

```bash
# Make sure gcc and g++ versions are updated, pybind11 is available
# torch-scatter should be listed with 'conda list' or 'pip list'

# With conda:
conda install -c conda-forge pybind11 gcc cxx-compiler

# With apt-get
sudo apt-get install gcc g++
pip install pybind11
```

Citing this work

The code and data in this package are based on the DiscoTope-3.0 paper. If you use them, please cite:

```tex
@ARTICLE{discotope3,
  AUTHOR={Høie, Magnus Haraldson and Gade, Frederik Steensgaard and Johansen, Julie Maria and Würtzen, Charlotte and Winther, Ole and Nielsen, Morten and Marcatili, Paolo},
  TITLE={DiscoTope-3.0: improved B-cell epitope prediction using inverse folding latent representations},
  JOURNAL={Frontiers in Immunology},
  VOLUME={15},
  YEAR={2024},
  URL={https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2024.1322712},
  DOI={10.3389/fimmu.2024.1322712},
  ISSN={1664-3224},
}
```

License

This source code is licensed under the Creative Commons license found in the LICENSE file in the root directory of this source tree.

Owner

  • Name: Magnus Haraldson Høie
  • Login: Magnushhoie
  • Kind: user
  • Location: Copenhagen, Denmark
  • Company: Technical University of Denmark

PhD candidate, Immunoinformatics and Machine Learning at Technical University of Denmark

Citation (citation.bib)

@ARTICLE{discotope3,
AUTHOR={Høie, Magnus Haraldson  and Gade, Frederik Steensgaard  and Johansen, Julie Maria  and Würtzen, Charlotte  and Winther, Ole  and Nielsen, Morten  and Marcatili, Paolo },
TITLE={DiscoTope-3.0: improved B-cell epitope prediction using inverse folding latent representations},
JOURNAL={Frontiers in Immunology},
VOLUME={15},
YEAR={2024},
URL={https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2024.1322712},
DOI={10.3389/fimmu.2024.1322712},
ISSN={1664-3224},
ABSTRACT={<p>Accurate computational identification of B-cell epitopes is crucial for the development of vaccines, therapies, and diagnostic tools. However, current structure-based prediction methods face limitations due to the dependency on experimentally solved structures. Here, we introduce DiscoTope-3.0, a markedly improved B-cell epitope prediction tool that innovatively employs inverse folding structure representations and a positive-unlabelled learning strategy, and is adapted for both solved and predicted structures. Our tool demonstrates a considerable improvement in performance over existing methods, accurately predicting linear and conformational epitopes across multiple independent datasets. Most notably, DiscoTope-3.0 maintains high predictive performance across solved, relaxed and predicted structures, alleviating the need for experimental structures and extending the general applicability of accurate B-cell epitope prediction by 3 orders of magnitude. DiscoTope-3.0 is made widely accessible on two web servers, processing over 100 structures per submission, and as a downloadable package. In addition, the servers interface with RCSB and AlphaFoldDB, facilitating large-scale prediction across over 200 million cataloged proteins. DiscoTope-3.0 is available at: <ext-link ext-link-type="uri" xlink:href="https://services.healthtech.dtu.dk/service.php?DiscoTope-3.0" xmlns:xlink="http://www.w3.org/1999/xlink">https://services.healthtech.dtu.dk/service.php?DiscoTope-3.0</ext-link>.</p>}
}
