discotope-3.0
Code for the DiscoTope-3.0 paper and model
Repository
Basic Info
- Host: GitHub
- Owner: Magnushhoie
- License: other
- Language: Python
- Default Branch: master
- Homepage: https://services.healthtech.dtu.dk/service.php?DiscoTope-3.0
- Size: 12.1 MB
Statistics
- Stars: 5
- Watchers: 3
- Forks: 3
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Overview
DiscoTope-3.0 predicts epitopes on input protein structures, using inverse folding representations from the ESM-IF1 model. The tool accepts both solved and predicted structures in the PDB format, and outputs per-residue epitope propensity scores in a CSV format.
- Paper: 10.3389/fimmu.2024.1322712
- Datasets: https://services.healthtech.dtu.dk/service.php?DiscoTope-3.0
- Web server (DTU): https://services.healthtech.dtu.dk/service.php?DiscoTope-3.0
- Web server (BioLib): https://biolib.com/DTU/DiscoTope-3/
- Google Colab:
Webserver
To try DiscoTope-3.0 without installing it, use the DTU Healthtech web server: https://services.healthtech.dtu.dk/service.php?DiscoTope-3.0
Repo contents
- data: Example input files, including test set
- discotope3: Source code
- output: DiscoTope-3.0 output examples
Quickstart guide
```bash
# Set up the environment and install
conda create --name inverse python=3.9 -y
conda activate inverse
conda install -c pyg pyg -y
conda install -c conda-forge pip -y

git clone https://github.com/Magnushhoie/discotope3_web/
cd discotope3_web/
pip install .

# Unzip the models
unzip models.zip

# Predict a single PDB (solved structure)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

# CPU only
python discotope3/main.py --cpu_only --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb
```
Installation guide
We highly recommend using an Ubuntu OS and Conda (miniconda or anaconda) for installing required dependencies.
Predictions are faster using a GPU and the recommended versions of pytorch, pytorch-geometric and cudatoolkit, but these exact versions are not required.
For Linux & GPU with conda (recommended, ~2 mins)
```bash
# Set up the environment with conda
conda create -n inverse python=3.9
conda activate inverse
conda install pytorch=1.11 cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg -c conda-forge
conda install pip

# Install pip dependencies
pip install .
```
Linux & GPU with pip (~5 mins)
```bash
# Install pip dependencies
pip install -r requirements_recommended.txt
pip install .
```
Recommended system requirements
- GPU is optional. 16 GB RAM and a CPU with 2+ cores are recommended.
- Linux operating system (e.g. Ubuntu 18.04); also works on macOS
- Python 3.9
- Pytorch 1.11
- cudatoolkit 11.3
- Pytorch geometric 2.0.4
- Biopython
- Biotite
- pandas
- numpy
- py-xgboost-gpu
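To compare an existing environment against the recommended versions, a small check along these lines may help. This is a minimal sketch: the helper names are hypothetical, and the distribution names `torch` and `torch-geometric` are assumptions about how the packages are registered with pip, so adjust them to match your installation.

```python
from importlib import metadata

# Recommended versions taken from the list above.
# The distribution names are assumptions and may differ in your environment.
RECOMMENDED = {"torch": "1.11", "torch-geometric": "2.0.4"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_recommended(recommended=RECOMMENDED):
    """Map each package to (installed version, matches recommended prefix)."""
    report = {}
    for name, wanted in recommended.items():
        found = installed_version(name)
        # Simple prefix comparison; good enough for a quick sanity check.
        report[name] = (found, found is not None and found.startswith(wanted))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_recommended().items():
        print(f"{name}: installed={found} matches_recommended={ok}")
```

Since the exact versions are not required, a mismatch here is only informational.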
Running DiscoTope-3.0
DiscoTope-3.0 can predict a single PDB, a folder or ZIP file of PDBs, or fetch PDBs by ID from RCSB or AlphaFoldDB and predict them.
On a common workstation with a GPU, predictions take <1 second per PDB chain, plus ~15 seconds for loading the required libraries and model weights.
Set the --struc_type parameter to 'solved' for experimentally solved structures (default) or 'alphafold' for modelled structures.
Note that DiscoTope-3.0 splits PDB structures into single chains before prediction, unless --multichainmode is set.
```bash
# Unzip the models
unzip models.zip

# Now select one of the following options:

# 1. Predict a single PDB (solved)
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb

# 2. Predict an AlphaFold structure
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_alphafold/7tdm_B.pdb --struc_type alphafold

# 3. Predict a folder of PDBs
python discotope3/main.py --pdb_dir data/example_pdbs_solved --out_dir output/example_pdbs_solved

# 4. Predict a ZIP file of PDBs
python discotope3/main.py --pdb_or_zip_file pdbsinzipfile.zip --out_dir output/pdbsinzipfile

# 5. Fetch PDBs from RCSB
python discotope3/main.py --list_file pdblistsolved.txt --struc_type solved --out_dir output/pdblist_solved

# 6. Fetch PDBs from AlphaFoldDB
python discotope3/main.py --list_file pdblistaf2.txt --struc_type alphafold --out_dir output/pdblist_af2
```
```text
Predict B-cell epitope propensity on input protein PDB structures

optional arguments:
  -h, --help            show this help message and exit
  -f PDB_OR_ZIP_FILE, --pdb_or_zip_file PDB_OR_ZIP_FILE
                        Input file, either single PDB or compressed zip file with multiple PDBs
  --list_file LIST_FILE
                        File with PDB or Uniprot IDs, fetched from RCSB/AlphaFolddb
  --struc_type STRUC_TYPE
                        Structure type from file (solved | alphafold)
  --pdb_dir PDB_DIR     Directory with AF2 PDBs
  --out_dir OUT_DIR     Job output directory
  --models_dir MODELS_DIR
                        Path for .json files containing trained XGBoost ensemble
  --calibrated_score_epi_threshold CALIBRATED_SCORE_EPI_THRESHOLD
                        Calibrated-score threshold for epitopes [low 0.40, moderate (0.90), higher 1.50]
  --no_calibrated_normalization
                        Skip Calibrated-normalization of PDBs
  --check_existing_embeddings CHECK_EXISTING_EMBEDDINGS
                        Check for existing embeddings to load in pdb_dir
  --cpu_only            Use CPU even if GPU is available (default uses GPU if available)
  --max_gpu_pdb_length MAX_GPU_PDB_LENGTH
                        Maximum PDB length to embed on GPU (1000), otherwise CPU
  --multichainmode      Predicts entire complexes, unsupported and untested
  --save_embeddings SAVE_EMBEDDINGS
                        Save embeddings to pdb_dir
  --webserver_mode      Flag for printing HTML output
  -v VERBOSE, --verbose VERBOSE
                        Verbose logging
```
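When running many predictions from a script, it can be convenient to assemble the command line programmatically. The sketch below is an illustration only: `build_discotope_command` is a hypothetical helper, and the flag spellings follow the `--pdb_or_zip_file` form used in the example command in the output section, so confirm them with `python discotope3/main.py --help` before relying on it.

```python
import shlex


def build_discotope_command(pdb_path, struc_type="solved", out_dir=None, cpu_only=False):
    """Build the argument list for a single-file DiscoTope-3.0 run (hypothetical helper)."""
    if struc_type not in ("solved", "alphafold"):
        raise ValueError("struc_type must be 'solved' or 'alphafold'")
    cmd = [
        "python", "discotope3/main.py",
        "--pdb_or_zip_file", str(pdb_path),
        "--struc_type", struc_type,
    ]
    if out_dir is not None:
        cmd += ["--out_dir", str(out_dir)]
    if cpu_only:
        cmd.append("--cpu_only")
    return cmd


if __name__ == "__main__":
    cmd = build_discotope_command("data/example_pdbs_solved/7c4s.pdb", cpu_only=True)
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the prediction.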
DiscoTope-3.0 output
DiscoTope-3.0 splits input PDBs into single-chain PDB files, then predicts per-residue epitope propensity scores. Outputs are saved in both PDB and CSV format.
The CSV output files contain per-residue outputs, with the following column headers:
- PDB ID and chain name
- Relative residue index (re-numbered from 1)
- Amino-acid residue, 1-letter
- DiscoTope-3.0 score (0.00 - 1.00)
- Predicted epitope (True or False), based on calibrated_score_epi_threshold (default 0.90)
- Relative surface accessibility (Shrake-Rupley, normalized using Sander scale)
- AlphaFold pLDDT score (0-100, set to 100 for non-AlphaFold structures)
- Chain length
- A binary feature set to 0 for solved and 1 for AlphaFold structures
The PDB output files contain individual single chains with the B-factor column replaced with per-residue DiscoTope-3.0 scores (2nd right-most column). Note that the scores are multiplied by 100 as PDB files only allow 2 decimals of precision.
Example input PDB (see 7c4s.pdb):
```bash
python discotope3/main.py --pdb_or_zip_file data/example_pdbs_solved/7c4s.pdb
```
Example output CSV (see 7c4s_A_discotope3.csv):
```text
pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
7c4s_A,14,G,0.15186,0.80634,100,282,0
7c4s_A,15,Q,0.13953,0.45077,100,282,0
7c4s_A,16,E,0.23955,0.72919,100,282,0
```
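The CSV output can be post-processed with the Python standard library. The sketch below ranks residues by their DiscoTope-3.0 score, using the three example rows above as input. The function name `top_scoring_residues` is hypothetical, and the column names are taken from the example header; real output files may contain additional columns.

```python
import csv
import io


def top_scoring_residues(csv_text, n=2):
    """Return the n highest-scoring residues as (pdb, res_id, residue, score)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: float(r["DiscoTope-3.0_score"]), reverse=True)
    return [
        (r["pdb"], int(r["res_id"]), r["residue"], float(r["DiscoTope-3.0_score"]))
        for r in rows[:n]
    ]


# The three example rows shown above.
example_csv = """pdb,res_id,residue,DiscoTope-3.0_score,rsa,pLDDTs,length,alphafold_struc_flag
7c4s_A,14,G,0.15186,0.80634,100,282,0
7c4s_A,15,Q,0.13953,0.45077,100,282,0
7c4s_A,16,E,0.23955,0.72919,100,282,0
"""

if __name__ == "__main__":
    for pdb, res_id, residue, score in top_scoring_residues(example_csv):
        print(pdb, res_id, residue, score)
```

To read a file on disk instead, pass the result of `open(path).read()` as `csv_text`.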
Example output PDB (see 7c4s_A_discotope3.pdb):
(Note DiscoTope-3.0 scores in the B-factor column)
```text
ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N
ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C
ATOM      3  C   GLY A  14     -14.287 -31.844  23.204  1.00 15.19           C
ATOM      4  O   GLY A  14     -13.284 -32.465  23.555  1.00 15.19           O
```
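Because the scores in the PDB output are stored in the B-factor column multiplied by 100, they need to be divided by 100 to recover the original scale. A minimal sketch of that step, assuming standard fixed-width PDB ATOM records (B-factor in columns 61-66); the function name `read_discotope_scores` is hypothetical:

```python
def read_discotope_scores(pdb_lines):
    """Return {(chain, residue_number): score}, rescaled from the B-factor column."""
    scores = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain = line[21]              # column 22: chain identifier
        res_num = int(line[22:26])    # columns 23-26: residue number
        b_factor = float(line[60:66]) # columns 61-66: B-factor (score x 100)
        scores[(chain, res_num)] = round(b_factor / 100, 4)
    return scores


# Two fixed-width ATOM records from the example output.
example = [
    "ATOM      1  N   GLY A  14     -16.773 -32.069  23.105  1.00 15.19           N",
    "ATOM      2  CA  GLY A  14     -15.595 -32.029  23.955  1.00 15.19           C",
]

if __name__ == "__main__":
    print(read_discotope_scores(example))  # {('A', 14): 0.1519}
```

All atoms of a residue carry the same score, so the dictionary holds one entry per residue. Note that the rescaled value (0.1519) is rounded relative to the CSV score (0.15186), since the PDB format keeps only two decimals.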
Reproduce test-set predictions (AlphaFold2 structures)
```bash
# Unzip the AlphaFold2 test set
unzip data/test_set_af2.zip -d data/

# Run predictions on the PDB folder
python discotope3/main.py \
  --pdb_dir data/test_set_af2 \
  --struc_type alphafold \
  --out_dir output/test_set_af2
```
Troubleshooting
- "No valid amino-acid backbone found" - DiscoTope-3.0 only predicts epitopes on amino acids, not on non-amino-acid entities like heteroatoms (e.g. water, solvents like dimethyl sulfoxide). These chains should not be specified as input.
- PDBConstructionWarning regarding discontinuous chains - Common issue with some PDB files (experimental structures only) missing co-ordinates for some atoms. As long as no backbone co-ordinates (C, Ca, N) are missing, it does not impact predictions.
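To check whether a structure is missing any backbone atoms before running predictions, a short scan of the ATOM records is enough. The following is a minimal sketch assuming standard fixed-width PDB columns; both function names are hypothetical, and `make_atom` only builds dummy records for the demonstration.

```python
from collections import defaultdict

BACKBONE = {"N", "CA", "C"}


def residues_missing_backbone(pdb_lines):
    """Return {(chain, residue_number): missing backbone atom names}."""
    seen = defaultdict(set)
    for line in pdb_lines:
        if line.startswith("ATOM"):
            key = (line[21], int(line[22:26]))   # chain ID, residue number
            seen[key].add(line[12:16].strip())   # atom name, columns 13-16
    return {key: BACKBONE - atoms for key, atoms in seen.items() if not BACKBONE <= atoms}


def make_atom(serial, name, res_name, chain, res_num):
    """Build a fixed-width ATOM record with dummy coordinates (demo only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_num:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}"
    )


if __name__ == "__main__":
    lines = [
        make_atom(1, "N", "GLY", "A", 14),
        make_atom(2, "CA", "GLY", "A", 14),
        make_atom(3, "C", "GLY", "A", 14),
        make_atom(4, "N", "GLN", "A", 15),
        make_atom(5, "CA", "GLN", "A", 15),  # residue 15 lacks its C atom
    ]
    print(residues_missing_backbone(lines))  # {('A', 15): {'C'}}
```

An empty result means every residue has its N, CA and C atoms, in which case the warning should not affect predictions.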
Installation gcc or g++ errors, missing torch-scatter build ...
```bash
# Make sure the gcc and g++ versions are up to date and pybind11 is available
# torch-scatter should be listed by 'conda list' or 'pip list'

# With conda:
conda install -c conda-forge pybind11 gcc cxx-compiler

# With apt-get:
sudo apt-get install gcc g++
pip install pybind11
```
Citing this work
The code and data in this package are based on the DiscoTope-3.0 paper. If you use them, please cite:
```tex
@ARTICLE{discotope3,
AUTHOR={Høie, Magnus Haraldson and Gade, Frederik Steensgaard and Johansen, Julie Maria and Würtzen, Charlotte and Winther, Ole and Nielsen, Morten and Marcatili, Paolo},
TITLE={DiscoTope-3.0: improved B-cell epitope prediction using inverse folding latent representations},
JOURNAL={Frontiers in Immunology},
VOLUME={15},
YEAR={2024},
URL={https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2024.1322712},
DOI={10.3389/fimmu.2024.1322712},
ISSN={1664-3224},
}
```
License
This source code is licensed under the Creative Commons license found in the LICENSE file in the root directory of this source tree.
Owner
- Name: Magnus Haraldson Høie
- Login: Magnushhoie
- Kind: user
- Location: Copenhagen, Denmark
- Company: Technical University of Denmark
- Website: https://magnushhoie.github.io/
- Twitter: magnushoie
- Repositories: 7
- Profile: https://github.com/Magnushhoie
PhD candidate, Immunoinformatics and Machine Learning at Technical University of Denmark
Citation (citation.bib)
@ARTICLE{discotope3,
AUTHOR={Høie, Magnus Haraldson and Gade, Frederik Steensgaard and Johansen, Julie Maria and Würtzen, Charlotte and Winther, Ole and Nielsen, Morten and Marcatili, Paolo },
TITLE={DiscoTope-3.0: improved B-cell epitope prediction using inverse folding latent representations},
JOURNAL={Frontiers in Immunology},
VOLUME={15},
YEAR={2024},
URL={https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2024.1322712},
DOI={10.3389/fimmu.2024.1322712},
ISSN={1664-3224},
ABSTRACT={Accurate computational identification of B-cell epitopes is crucial for the development of vaccines, therapies, and diagnostic tools. However, current structure-based prediction methods face limitations due to the dependency on experimentally solved structures. Here, we introduce DiscoTope-3.0, a markedly improved B-cell epitope prediction tool that innovatively employs inverse folding structure representations and a positive-unlabelled learning strategy, and is adapted for both solved and predicted structures. Our tool demonstrates a considerable improvement in performance over existing methods, accurately predicting linear and conformational epitopes across multiple independent datasets. Most notably, DiscoTope-3.0 maintains high predictive performance across solved, relaxed and predicted structures, alleviating the need for experimental structures and extending the general applicability of accurate B-cell epitope prediction by 3 orders of magnitude. DiscoTope-3.0 is made widely accessible on two web servers, processing over 100 structures per submission, and as a downloadable package. In addition, the servers interface with RCSB and AlphaFoldDB, facilitating large-scale prediction across over 200 million cataloged proteins. DiscoTope-3.0 is available at: https://services.healthtech.dtu.dk/service.php?DiscoTope-3.0.}
}