panspecies-dti
Structure-aware PRotein ligand INTeraction (SPRINT) is an ultrafast deep learning framework for drug-target interaction prediction.
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 9
- Watchers: 4
- Forks: 4
- Open Issues: 7
- Releases: 0
Metadata Files
README.md
SPRINT
Code for the paper "Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT" and the MLSB 2024 paper "SPRINT: Ultrafast Drug-Target Interaction Prediction with Structure-Aware Protein Embeddings".
Structure-aware PRotein ligand INTeraction (SPRINT) is an ultrafast deep learning framework for drug-target interaction prediction and binding affinity prediction.
SPRINT can be used in a Google Colab notebook here:
All datasets are located in the data folder.
Overview
The protein and ligand are co-embedded in a shared space, enabling interaction prediction at the speed of a single dot product. Proteins are embedded with SaProt, followed by an attention-pooling layer and a small MLP. Ligands are embedded using Morgan fingerprints and a small MLP. The model is trained in a fully supervised manner to predict the interaction between proteins and ligands.
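To make the "single dot product" idea concrete, here is a minimal sketch that scores one protein against a few ligands once both live in the same space. It does not use the SPRINT encoders: the vectors are random stand-ins, and the names `DIM`, `protein_vec`, and `ligand_vecs` as well as the sigmoid applied to the dot product are assumptions made for illustration only.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    """Squash a raw score into (0, 1); an assumed choice, not confirmed by the README."""
    return 1.0 / (1.0 + math.exp(-x))


# Random stand-ins for embeddings that the protein and ligand encoders would produce.
rng = random.Random(0)
DIM = 8  # hypothetical embedding size
protein_vec = [rng.gauss(0, 1) for _ in range(DIM)]
ligand_vecs = {
    name: [rng.gauss(0, 1) for _ in range(DIM)]
    for name in ("lig_a", "lig_b", "lig_c")
}

# One dot product per protein-ligand pair, then rank ligands by score.
scores = {name: sigmoid(dot(protein_vec, vec)) for name, vec in ligand_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because each pair costs only one dot product, the expensive encoder passes can be done once per protein and once per ligand, and the embeddings reused for every comparison.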
Install
```
# Install from source
git clone https://github.com/abhinadduri/panspecies-dti.git
cd panspecies-dti
pip install -e .

# Or install directly from pip
pip install git+https://github.com/abhinadduri/panspecies-dti.git
```
If you want to use DDP for faster training, first follow the above installation instructions.
Then manually downgrade lightning to 2.0.8 via `pip install lightning==2.0.8`.
Download MERGED dataset
Script to download splits and data:
cd data/MERGED/huge_data/
bash download.sh
cd -
Reproducing the paper
Reproducing the drug-target interaction models in the paper.
DTI Prediction
The code below reproduces the DTI prediction on the DAVIS dataset.
```
# Reproducing ConPLex
ultrafast-train --exp-id DAVIS --config configs/conplex_config.yaml

# ConPLex-attn
ultrafast-train --exp-id DAVIS --config configs/saprot_agg_config.yaml --prot_proj agg

# SPRINT-sm
ultrafast-train --exp-id DAVIS --config configs/saprot_agg_config.yaml

# SPRINT
ultrafast-train --exp-id DAVIS --config configs/saprot_agg_config.yaml --model-size large
```
Other DTI dataset models can be reproduced by adding `--task` to the command line with: `biosnap`, `bindingdb`, `biosnapprot` (Unseen Targets), `biosnapmol` (Unseen Drugs), or `merged`.
Lit-PCBA
```
# SPRINT
ultrafast-train --exp-id LitPCBA --config configs/saprot_agg_config.yaml --epochs 15 --ship-model data/MERGED/huge_data/uniprotsexcludedat90.txt

# SPRINT-Average
ultrafast-train --exp-id LitPCBA --config configs/saprot_agg_config.yaml --prot-proj avg --epochs 15 --ship-model data/MERGED/huge_data/uniprotsexcludedat90.txt

# SPRINT-ProtBert
ultrafast-train --exp-id LitPCBA --config configs/saprot_agg_config.yaml --target-featurizer ProtBertFeaturizer --epochs 15 --ship-model data/MERGED/huge_data/uniprotsexcludedat90.txt
```
Adding `--eval-pcba` reports the performance on the Lit-PCBA dataset after each epoch of training.
TDC Leaderboard
```
# SPRINT
ultrafast-train --exp-id TDC --config configs/TDC_config.yaml

# SPRINT-ProtBert
ultrafast-train --exp-id TDC --config configs/TDC_config.yaml --target-featurizer ProtBertFeaturizer

# SPRINT-ESM2
ultrafast-train --exp-id TDC --config configs/TDC_config.yaml --target-featurizer ESM2Featurizer
```
Download pre-trained model
Links to download pre-trained models used for Lit-PCBA evaluation in Table 2 are in checkpoints/README.md.
Embed proteins and molecules
Embed a library of proteins or molecules by passing a CSV/TSV file (separator inferred) to --data-file. The file must contain a "SMILES" column for drug embedding or a "Target Sequence" column for target embedding.
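Before running an embedding job, it can help to confirm that the input file has the column the chosen molecule type needs. The sketch below is not part of SPRINT: `read_embed_column` is a hypothetical helper, and the tab-versus-comma check is an assumed stand-in for however the tool actually infers the separator.

```python
import csv
import io

# Column required for each molecule type, as stated in the README.
REQUIRED_COLUMN = {"drug": "SMILES", "target": "Target Sequence"}


def read_embed_column(text, moltype):
    """Return the values of the column needed for `moltype` from CSV/TSV text."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty data file")
    # Assumed heuristic: a tab in the header row means TSV, otherwise CSV.
    delimiter = "\t" if "\t" in lines[0] else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    column = REQUIRED_COLUMN[moltype]
    if column not in reader.fieldnames:
        raise ValueError(f"missing required column: {column!r}")
    return [row[column] for row in reader]


sample_tsv = "SMILES\tTarget Sequence\nCCO\tMKTAYIAK\nc1ccccc1\tMKTAYIAK\n"
drug_inputs = read_embed_column(sample_tsv, "drug")
print(drug_inputs)
```

For a file on disk, the same function can be called with the result of `open(path).read()`.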
If using a SaProt-trained checkpoint, the "Target Sequence" should be a structure-aware sequence with residue and structure tokens (e.g. "RaTcIqAvKvQqIwQdMfVd"). Structure-aware sequences can be generated by following the section "Generate SaProt sequence for a given protein structure". If no structure tokens are detected, a mask token is used for each residue's structure token.
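The structure-aware format interleaves an uppercase residue letter with a lowercase structure token. The following sketch shows how such a string can be taken apart, and how a plain amino acid sequence might be padded with mask tokens. The helper names are hypothetical, the lowercase check is a heuristic, and the `#` mask character is an assumption about the SaProt convention rather than something this README states.

```python
MASK = "#"  # assumed mask character for an unknown structure token


def has_structure_tokens(seq):
    """Heuristic: structure tokens are lowercase, residues are uppercase."""
    return any(ch.islower() for ch in seq)


def split_sa_sequence(seq):
    """Split 'RaTc...' into a list of (residue, structure token) pairs."""
    if len(seq) % 2 != 0:
        raise ValueError("structure-aware sequences must have even length")
    return [(seq[i], seq[i + 1]) for i in range(0, len(seq), 2)]


def to_structure_aware(seq):
    """Leave structure-aware input alone; otherwise add a mask after each residue."""
    if has_structure_tokens(seq) or MASK in seq:
        return seq
    return "".join(residue + MASK for residue in seq)


pairs = split_sa_sequence("RaTcIqAvKvQqIwQdMfVd")
residues = "".join(residue for residue, _ in pairs)
print(residues)                   # RTIAKQIQMV
print(to_structure_aware("RTI"))  # R#T#I#
```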
```
# Get target embeddings with pre-trained model
ultrafast-embed --data-file data/DAVIS/test_foldseek.csv \
    --checkpoint checkpoints/saprot.ckpt \
    --moltype target \
    --output-path results/DAVIS_test_target_embeddings.npy

# Get drug embeddings with pre-trained model
ultrafast-embed --data-file data/DAVIS/test_foldseek.csv \
    --checkpoint checkpoints/saprot.ckpt \
    --moltype drug \
    --output-path results/DAVIS_test_drug_embeddings.npy
```
Vector Database
The following section details the usage of a ChromaDB for ultrafast retrieval of DTIs. Note that creation of the database is a computationally costly preprocessing step, but it only needs to be done once for a given library.
Make a vector database of drugs
ultrafast-store --data-file data/DAVIS/test_foldseek.csv \
--embeddings results/DAVIS_test_drug_embeddings.npy \
--moltype drug \
--db_dir ./dbs \
--db_name davis_test_drug_embeddings
Report top-k accuracy by querying targets against the drug database
ultrafast-report --data-file data/DAVIS/test_foldseek.csv \
--embeddings results/DAVIS_test_target_embeddings.npy \
--moltype target \
--db_dir ./dbs \
--db_name davis_test_drug_embeddings \
--topk 100
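The README does not spell out how the reported top-k accuracy is defined. As one plausible reading, the sketch below counts a target as a success when at least one of its known binders appears among its k highest-ranked drugs. The function name, the input layout, and the definition itself are assumptions for illustration, not a description of what `ultrafast-report` computes.

```python
def topk_accuracy(ranked_hits, known_binders, k):
    """Fraction of targets with at least one known binder in their top-k hits.

    ranked_hits: target id -> list of drug ids, best match first.
    known_binders: target id -> set of drug ids known to bind.
    """
    if not ranked_hits:
        return 0.0
    successes = 0
    for target, drugs in ranked_hits.items():
        if set(drugs[:k]) & known_binders.get(target, set()):
            successes += 1
    return successes / len(ranked_hits)


# Toy retrieval results for two targets.
ranked_hits = {"T1": ["d3", "d1", "d2"], "T2": ["d2", "d3", "d1"]}
known_binders = {"T1": {"d1"}, "T2": {"d1"}}

for k in (1, 2, 3):
    print(k, topk_accuracy(ranked_hits, known_binders, k))
```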
Compute TopK Hits for a given Query
This section details finding the TopK hits without using ChromaDB. This is likely faster if you are only going to query a library a few times or if you can massively parallelize the TopK search.
We can compute the TopK hits for a set of targets against a database of drugs.
ultrafast-topk --library-embeddings results/DAVIS_test_drug_embeddings.npy \
--library-type drug --library-data data/DAVIS/test_foldseek.csv \
--query-embeddings results/DAVIS_test_target_embeddings.npy \
--query-data data/DAVIS/test_foldseek.csv \
-K 100
or we can compute the TopK hits for a set of drugs against a database of targets by swapping the library and query arguments and changing the --library-type.
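Without a vector database, a TopK search reduces to scoring the query against every library embedding and keeping the best K. The sketch below illustrates that brute-force approach with plain Python lists; it is not the `ultrafast-topk` implementation, and the `topk_hits` helper and toy vectors are hypothetical.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def topk_hits(query_vec, library, k):
    """Return the k (score, library id) pairs with the highest dot product."""
    scored = ((dot(query_vec, vec), lib_id) for lib_id, vec in library.items())
    # nlargest keeps only k candidates in memory while scanning the library.
    return heapq.nlargest(k, scored)


# Toy drug library and a single target query.
library = {
    "drug_a": [1.0, 0.0],
    "drug_b": [0.0, 1.0],
    "drug_c": [0.5, 0.5],
}
query = [1.0, 0.5]

hits = topk_hits(query, library, k=2)
print(hits)  # [(1.0, 'drug_a'), (0.75, 'drug_c')]
```

Swapping the roles of library and query, as the text describes, only changes which set of embeddings is scanned.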
If you are computing the TopK hits for a large database, it is often faster to break it up into smaller chunks and compute the TopK per chunk in a parallel fashion. The chunks can be combined at the end to get the final TopK hits for the entire database:
python utils/combine_chunks.py [directory containing the TopK per chunk] -K 100 -O [output file]
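Chunked search works because any item in the global TopK must also be in the TopK of its own chunk, so merging the per-chunk winners loses nothing. The sketch below shows that merge step on in-memory lists; the on-disk format read by `utils/combine_chunks.py` is not described here, so the `(score, id)` tuples and the `combine_chunks` function are assumptions.

```python
import heapq


def combine_chunks(chunk_results, k):
    """Merge per-chunk top-K lists of (score, id) pairs into the global top-K."""
    all_hits = (hit for chunk in chunk_results for hit in chunk)
    return heapq.nlargest(k, all_hits)


# Top-2 results from two library chunks, computed independently.
chunk_1 = [(0.9, "d1"), (0.4, "d2")]
chunk_2 = [(0.8, "d7"), (0.7, "d9")]

merged = combine_chunks([chunk_1, chunk_2], k=3)
print(merged)  # [(0.9, 'd1'), (0.8, 'd7'), (0.7, 'd9')]
```

Each chunk must itself be searched with at least the final K for the merged result to be exact.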
Generate SaProt sequence for a given protein structure
foldseek must be installed and available on your PATH.
The protein structure can be in PDB or mmCIF format. The script will generate the sequence of the protein structure and save it to the specified csv file, appending the output to any existing data.
python utils/structure_to_saprot.py -I [path to the protein structure] --chain [chain of protein] -O [path to the output file]
If the protein was NOT generated by AF2 or another tool that outputs a confidence score, add --no-plddt-mask to the command.
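The `--no-plddt-mask` flag suggests that, by default, structure tokens for low-confidence residues are masked. The sketch below shows what such masking could look like on an already generated structure-aware sequence. It is not the logic of `utils/structure_to_saprot.py`: the `#` mask character, the cutoff of 70, and the function name are all assumptions.

```python
MASK = "#"           # assumed mask character for a structure token
PLDDT_CUTOFF = 70.0  # assumed confidence threshold


def mask_low_confidence(sa_seq, plddt, cutoff=PLDDT_CUTOFF):
    """Replace the structure token of every residue whose pLDDT is below cutoff."""
    if len(sa_seq) != 2 * len(plddt):
        raise ValueError("expected one pLDDT value per residue")
    out = []
    for i, score in enumerate(plddt):
        residue, token = sa_seq[2 * i], sa_seq[2 * i + 1]
        out.append(residue + (token if score >= cutoff else MASK))
    return "".join(out)


masked = mask_low_confidence("RaTcIq", [91.2, 55.0, 70.0])
print(masked)  # RaT#Iq
```

Experimental structures carry no pLDDT values, which is consistent with the instruction to disable masking for them.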
Training from scratch
When training a SPRINT DTI Classification model from scratch, you need train/val/test CSV files with the following columns:
SMILES,Target Sequence,Label
where SMILES is the drug SMILES string, Target Sequence is the amino acid sequence of the target, and Label is a 0/1 value to indicate non-binding or binding, respectively.
CSV files should be placed in data/custom/
Models utilizing SaProt as the --target-featurizer must have structure-aware sequences in the Target Sequence column and the CSVs should be renamed to *_foldseek.csv where * is train/val/test.
Models can be trained using:
ultrafast-train --exp-id custom --task custom --config configs/saprot_agg_config.yaml --model-size large
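As a sketch of the expected input layout, the following writes train/val/test files with the three required columns and rejects labels other than 0 or 1. The rows are made-up placeholders, the files go to a temporary directory rather than data/custom/, and the `write_split` helper and the `train.csv`/`val.csv`/`test.csv` file names are assumptions based on the description above.

```python
import csv
import os
import tempfile

COLUMNS = ["SMILES", "Target Sequence", "Label"]


def write_split(path, rows):
    """Write (smiles, sequence, label) rows under the required header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(COLUMNS)
        for smiles, sequence, label in rows:
            if label not in (0, 1):
                raise ValueError(f"Label must be 0 or 1, got {label!r}")
            writer.writerow([smiles, sequence, label])


# Placeholder examples: one binding pair and one non-binding pair.
example_rows = [("CCO", "MKTAYIAK", 1), ("c1ccccc1", "MKTAYIAK", 0)]

out_dir = tempfile.mkdtemp()
for split in ("train", "val", "test"):
    write_split(os.path.join(out_dir, f"{split}.csv"), example_rows)

print(sorted(os.listdir(out_dir)))  # ['test.csv', 'train.csv', 'val.csv']
```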
Owner
- Name: Abhinav Adduri
- Login: abhinadduri
- Kind: user
- Repositories: 12
- Profile: https://github.com/abhinadduri
GitHub Events
Total
- Issues event: 4
- Watch event: 8
- Delete event: 13
- Issue comment event: 6
- Push event: 42
- Pull request review comment event: 2
- Pull request review event: 10
- Pull request event: 15
- Fork event: 5
- Create event: 8
Last Year
- Issues event: 4
- Watch event: 8
- Delete event: 13
- Issue comment event: 6
- Push event: 42
- Pull request review comment event: 2
- Pull request review event: 10
- Pull request event: 15
- Fork event: 5
- Create event: 8
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 3
- Total pull requests: 5
- Average time to close issues: 11 days
- Average time to close pull requests: about 2 hours
- Total issue authors: 3
- Total pull request authors: 1
- Average comments per issue: 0.33
- Average comments per pull request: 0.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 5
- Average time to close issues: 11 days
- Average time to close pull requests: about 2 hours
- Issue authors: 3
- Pull request authors: 1
- Average comments per issue: 0.33
- Average comments per pull request: 0.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- cnellington (2)
- dkoes (1)
- kthorn (1)
- vecos1990 (1)
Pull Request Authors
- drewnutt (20)
- abhinadduri (9)
- cnellington (3)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- PyTDC ==1.1.1
- chromadb ==0.5.5
- datamol ==0.12.5
- fair-esm ==2.0.0
- gdown ==5.2.0
- lightning ==2.4.0
- ml-pyxis @ git+https://github.com/vicolab/ml-pyxis.git@master
- molfeat ==0.10.1
- omegaconf ==2.3.0
- pandas ==2.1.4
- rdkit ==2023.9.5
- scikit_learn ==1.2.2
- torch ==2.4.1
- transformers ==4.43.4
- wandb ==0.17.8