panspecies-dti

Structure-aware PRotein ligand INTeraction (SPRINT) is an ultrafast deep learning framework for drug-target interaction prediction.

https://github.com/abhinadduri/panspecies-dti

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary

Last synced: 10 months ago

Repository

Structure-aware PRotein ligand INTeraction (SPRINT) is an ultrafast deep learning framework for drug-target interaction prediction.

Basic Info

Host: GitHub
Owner: abhinadduri
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 120 MB

Statistics

Stars: 9
Watchers: 4
Forks: 4
Open Issues: 7
Releases: 0

Created over 2 years ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

SPRINT

Code for the paper Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT and the MLSB 2024 paper SPRINT: Ultrafast Drug-Target Interaction Prediction with Structure-Aware Protein Embeddings.

Structure-aware PRotein ligand INTeraction (SPRINT) is an ultrafast deep learning framework for drug-target interaction prediction and binding affinity prediction.

SPRINT can be used in a Google Colab notebook here:

All datasets are located in the data folder.

Overview

The protein and ligand are co-embedded in a shared space, enabling interaction prediction at the speed of a single dot product. Proteins are embedded with SaProt, followed by an attention-pooling layer and a small MLP. Ligands are embedded using Morgan fingerprints and a small MLP. The model is trained in a fully supervised manner to predict the interaction between proteins and ligands.
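The scoring idea above can be illustrated with a minimal pure-Python sketch. It is not the SPRINT implementation: the learned MLPs are omitted, and `attention_pool`, `interaction_score`, the `query` vector, and the toy vectors are all hypothetical names and values chosen for illustration. It only shows how per-residue vectors can be collapsed into one protein vector with softmax attention weights and then compared to ligand vectors with a single dot product.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(residue_vecs, query):
    """Collapse per-residue vectors into one vector.

    Each residue is weighted by softmax(residue . query); `query` stands in
    for a learned parameter (an assumption made for this sketch).
    """
    weights = softmax([dot(r, query) for r in residue_vecs])
    dim = len(residue_vecs[0])
    return [sum(w * r[i] for w, r in zip(weights, residue_vecs)) for i in range(dim)]

def interaction_score(protein_vec, ligand_vec):
    """Score a protein-ligand pair in the shared space with one dot product."""
    return dot(protein_vec, ligand_vec)

# Toy data: three residues in a 2-D space; a zero query gives uniform attention.
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]
protein = attention_pool(residues, query)

# Hypothetical ligand embeddings that already live in the same space.
ligands = {"lig_a": [1.0, 0.0], "lig_b": [-1.0, 0.0]}
scores = {name: interaction_score(protein, vec) for name, vec in ligands.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because the protein vector is computed once and every ligand costs a single dot product, ligand embeddings can be precomputed and reused across many targets.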

Install

```
# Install from source
git clone https://github.com/abhinadduri/panspecies-dti.git
cd panspecies-dti
pip install -e .

# Or install directly from pip
pip install git+https://github.com/abhinadduri/panspecies-dti.git
```

If you want to use DDP for faster training, first follow the above installation instructions. Then manually downgrade lightning to 2.0.8 via `pip install lightning==2.0.8`.

Download MERGED dataset

Script to download splits and data:

```
cd data/MERGED/huge_data/
bash download.sh
cd -
```

Reproducing the paper

Reproducing the drug-target interaction models in the paper.

DTI Prediction

The code below reproduces the DTI prediction on the DAVIS dataset.

```
# Reproducing ConPLex
ultrafast-train --exp-id DAVIS --config configs/conplex_config.yaml

# ConPLex-attn
ultrafast-train --exp-id DAVIS --config configs/saprot_agg_config.yaml --prot_proj agg

# SPRINT-sm
ultrafast-train --exp-id DAVIS --config configs/saprot_agg_config.yaml

# SPRINT
ultrafast-train --exp-id DAVIS --config configs/saprot_agg_config.yaml --model-size large
```

Other DTI dataset models can be reproduced by adding `--task` to the command line with: `biosnap`, `bindingdb`, `biosnap_prot` (Unseen Targets), `biosnap_mol` (Unseen Drugs), or `merged`.

Lit-PCBA

```
# SPRINT
ultrafast-train --exp-id LitPCBA --config configs/saprot_agg_config.yaml --epochs 15 --ship-model data/MERGED/huge_data/uniprots_excluded_at_90.txt

# SPRINT-Average
ultrafast-train --exp-id LitPCBA --config configs/saprot_agg_config.yaml --prot-proj avg --epochs 15 --ship-model data/MERGED/huge_data/uniprots_excluded_at_90.txt

# SPRINT-ProtBert
ultrafast-train --exp-id LitPCBA --config configs/saprot_agg_config.yaml --target-featurizer ProtBertFeaturizer --epochs 15 --ship-model data/MERGED/huge_data/uniprots_excluded_at_90.txt
```

Adding `--eval-pcba` reports the performance on the Lit-PCBA dataset after each epoch of training.

TDC Leaderboard

```
# SPRINT
ultrafast-train --exp-id TDC --config configs/TDC_config.yaml

# SPRINT-ProtBert
ultrafast-train --exp-id TDC --config configs/TDC_config.yaml --target-featurizer ProtBertFeaturizer

# SPRINT-ESM2
ultrafast-train --exp-id TDC --config configs/TDC_config.yaml --target-featurizer ESM2Featurizer
```

Download pre-trained model

Links to download pre-trained models used for Lit-PCBA evaluation in Table 2 are in checkpoints/README.md.

Embed proteins and molecules

Embed a library of proteins/molecules, using --data-file: a CSV/TSV file (separator inferred). The --data-file to embed must contain a "SMILES" or "Target Sequence" column for drug or target embedding, respectively.
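Before embedding a large library, it can help to check that the input file has the expected column. The sketch below is an illustration using only the standard library; `read_rows`, `column_for`, and the simple separator heuristic are assumptions made here, not how `ultrafast-embed` parses its input.

```python
import csv
import io

# Column required for each --moltype value, as described above.
REQUIRED_COLUMN = {"drug": "SMILES", "target": "Target Sequence"}

def read_rows(text):
    """Parse CSV or TSV text, guessing the separator from the header line."""
    header = text.splitlines()[0]
    delimiter = "\t" if header.count("\t") > header.count(",") else ","
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def column_for(rows, moltype):
    """Return the values of the column required for `moltype`, or raise ValueError."""
    name = REQUIRED_COLUMN[moltype]
    if not rows or name not in rows[0]:
        raise ValueError(f"missing required column {name!r} for moltype {moltype!r}")
    return [row[name] for row in rows]

# Toy inputs: a CSV with both columns and a TSV with only the drug column.
csv_text = "SMILES,Target Sequence\nCCO,MKV\nc1ccccc1,MAA\n"
tsv_text = "SMILES\tLabel\nCCO\t1\n"

drugs = column_for(read_rows(csv_text), "drug")
targets = column_for(read_rows(csv_text), "target")
tsv_drugs = column_for(read_rows(tsv_text), "drug")
```

To check a file on disk, read it with `open(path, newline="").read()` and pass the text to `read_rows`.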

If using a SaProt-trained checkpoint, the "Target Sequence" should be a structure-aware sequence with residue and structure tokens (e.g. "RaTcIqAvKvQqIwQdMfVd"). Structure-aware sequences can be generated by following the section "Generate SaProt sequence for a given protein structure". If no structure tokens are detected, a mask token will be used for each residue's structure token.
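The structure-aware format can be sketched as alternating uppercase residue letters and lowercase structure letters. The helper below is a rough illustration of the fallback described above, not the repository's own detection logic. The function names are hypothetical, and the use of `#` as the mask character is an assumption made for this sketch.

```python
MASK_TOKEN = "#"  # assumption: placeholder character for an unknown structure token

def has_structure_tokens(seq):
    """True if seq alternates uppercase residues with lowercase (or masked) structure tokens."""
    if len(seq) == 0 or len(seq) % 2 != 0:
        return False
    residues, structures = seq[0::2], seq[1::2]
    residues_ok = residues.isalpha() and residues.isupper()
    structures_ok = all(c.islower() or c == MASK_TOKEN for c in structures)
    return residues_ok and structures_ok

def with_mask_tokens(seq):
    """Pair every residue of a plain amino-acid sequence with the mask token."""
    return "".join(residue + MASK_TOKEN for residue in seq)

def normalize(seq):
    """Leave structure-aware sequences untouched; mask the structure of plain ones."""
    return seq if has_structure_tokens(seq) else with_mask_tokens(seq)

structure_aware = normalize("RaTcIqAvKvQqIwQdMfVd")  # already structure-aware
masked = normalize("MKVL")                           # plain amino-acid sequence
```

A plain sequence is never mistaken for a structure-aware one here because its characters at odd positions are uppercase.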

```
# Get target embeddings with pre-trained model
ultrafast-embed --data-file data/DAVIS/test_foldseek.csv \
    --checkpoint checkpoints/saprot.ckpt \
    --moltype target \
    --output-path results/DAVIS_test_target_embeddings.npy

# Get drug embeddings with pre-trained model
ultrafast-embed --data-file data/DAVIS/test_foldseek.csv \
    --checkpoint checkpoints/saprot.ckpt \
    --moltype drug \
    --output-path results/DAVIS_test_drug_embeddings.npy
```

Vector Database

This section describes how to use ChromaDB for ultrafast retrieval of DTIs. Creating the database is a computationally costly preprocessing step, but it only needs to be done once for a given library.

Make a vector database of drugs

```
ultrafast-store --data-file data/DAVIS/test_foldseek.csv \
    --embeddings results/DAVIS_test_drug_embeddings.npy \
    --moltype drug \
    --db_dir ./dbs \
    --db_name davis_test_drug_embeddings
```

Report top-k accuracy by querying targets against the drug database

```
ultrafast-report --data-file data/DAVIS/test_foldseek.csv \
    --embeddings results/DAVIS_test_target_embeddings.npy \
    --moltype target \
    --db_dir ./dbs \
    --db_name davis_test_drug_embeddings \
    --topk 100
```

Compute TopK Hits for a given Query

This section details finding the TopK hits without using ChromaDB. This is likely faster if you are only going to query a library a few times or if you can massively parallelize the TopK search.

We can compute the TopK hits for a set of targets against a database of drugs:

```
ultrafast-topk --library-embeddings results/DAVIS_test_drug_embeddings.npy \
    --library-type drug \
    --library-data data/DAVIS/test_foldseek.csv \
    --query-embeddings results/DAVIS_test_target_embeddings.npy \
    --query-data data/DAVIS/test_foldseek.csv \
    -K 100
```

Alternatively, we can compute the TopK hits for a set of drugs against a database of targets by swapping the library and query arguments and changing the `--library-type`.

If you are computing the TopK hits for a large database, it is often faster to break it up into smaller chunks and compute the TopK per chunk in parallel. The chunks can be combined at the end to get the final TopK hits for the entire database:

```
python utils/combine_chunks.py [directory containing the TopK per chunk] -K 100 -O [output file]
```
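Why chunking is safe can be shown with a small pure-Python sketch: the global TopK is always contained in the union of the per-chunk TopK lists, so merging those lists recovers the exact result. This is only an illustration. `topk` and `merge_chunk_hits` are hypothetical helpers and do not reflect the internals of `ultrafast-topk` or `utils/combine_chunks.py`.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def topk(query, library, k):
    """Return the k best (score, library_id) pairs for one query, best first."""
    scored = ((dot(query, vec), lib_id) for lib_id, vec in library)
    return heapq.nlargest(k, scored)

def merge_chunk_hits(per_chunk_hits, k):
    """Merge per-chunk TopK lists into the TopK for the whole library."""
    return heapq.nlargest(k, (hit for hits in per_chunk_hits for hit in hits))

# Toy library of five drug embeddings and one target query (made-up values).
library = [
    ("d0", [1.0, 0.0]),
    ("d1", [0.5, 0.5]),
    ("d2", [0.0, 1.0]),
    ("d3", [-1.0, 0.0]),
    ("d4", [0.9, 0.1]),
]
query = [1.0, 0.2]
k = 2

# Split the library into chunks of two, score each chunk independently, then merge.
chunks = [library[i:i + 2] for i in range(0, len(library), 2)]
merged = merge_chunk_hits([topk(query, chunk, k) for chunk in chunks], k)
full = topk(query, library, k)
```

Each chunk only needs to keep its own K best hits, so the per-chunk work can run on separate machines and the merge step stays cheap.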

Generate SaProt sequence for a given protein structure

foldseek must be installed and available on your PATH.

The protein structure can be in PDB or mmCIF format. The script generates the sequence of the protein structure and saves it to the specified CSV file, appending the output to any existing data.

```
python utils/structure_to_saprot.py -I [path to the protein structure] --chain [chain of protein] -O [path to the output file]
```

If the protein was NOT generated by AF2 or another tool that outputs a confidence score, add `--no-plddt-mask` to the command.

Training from scratch

When training a SPRINT DTI Classification model from scratch, you need train/val/test CSV files with the following columns:

```
SMILES,Target Sequence,Label
```

where `SMILES` is the drug SMILES string, `Target Sequence` is the amino acid sequence of the target, and `Label` is a 0/1 value to indicate non-binding or binding, respectively.

CSV files should be placed in data/custom/

Models utilizing SaProt as the --target-featurizer must have structure-aware sequences in the Target Sequence column and the CSVs should be renamed to *_foldseek.csv where * is train/val/test.
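As a rough illustration of the expected file layout, the sketch below validates rows and serializes them with the required header using only the standard library. The helper names and toy rows are hypothetical, and the plain `train.csv`/`val.csv`/`test.csv` names for non-SaProt models are an assumption inferred from the renaming rule above.

```python
import csv
import io

COLUMNS = ["SMILES", "Target Sequence", "Label"]

def validate_rows(rows):
    """Raise ValueError if a row lacks a required column or has a non-binary label."""
    for i, row in enumerate(rows):
        missing = [c for c in COLUMNS if c not in row or row[c] in (None, "")]
        if missing:
            raise ValueError(f"row {i}: missing {missing}")
        if str(row["Label"]) not in ("0", "1"):
            raise ValueError(f"row {i}: Label must be 0 or 1, got {row['Label']!r}")
    return rows

def to_csv(rows):
    """Serialize rows to CSV text with the header the training files need."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def split_filename(split, structure_aware):
    """File name for a split; structure-aware (SaProt) data uses the _foldseek suffix."""
    return f"{split}_foldseek.csv" if structure_aware else f"{split}.csv"

# Toy rows with made-up structure-aware sequences.
rows = [
    {"SMILES": "CCO", "Target Sequence": "MaKvVd", "Label": 1},
    {"SMILES": "c1ccccc1", "Target Sequence": "MaAdLp", "Label": 0},
]
csv_text = to_csv(validate_rows(rows))
train_name = split_filename("train", structure_aware=True)
```

Writing `csv_text` to `data/custom/` under the name returned by `split_filename` for each of train, val, and test gives the layout described above.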

Models can be trained using:

```
ultrafast-train --exp-id custom --task custom --config configs/saprot_agg_config.yaml --model-size large
```

Owner

Name: Abhinav Adduri
Login: abhinadduri
Kind: user

Repositories: 12
Profile: https://github.com/abhinadduri

GitHub Events

Total

Issues event: 4
Watch event: 8
Delete event: 13
Issue comment event: 6
Push event: 42
Pull request review comment event: 2
Pull request review event: 10
Pull request event: 15
Fork event: 5
Create event: 8

Last Year

Issues event: 4
Watch event: 8
Delete event: 13
Issue comment event: 6
Push event: 42
Pull request review comment event: 2
Pull request review event: 10
Pull request event: 15
Fork event: 5
Create event: 8

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 3
Total pull requests: 5
Average time to close issues: 11 days
Average time to close pull requests: about 2 hours
Total issue authors: 3
Total pull request authors: 1
Average comments per issue: 0.33
Average comments per pull request: 0.0
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 5
Average time to close issues: 11 days
Average time to close pull requests: about 2 hours
Issue authors: 3
Pull request authors: 1
Average comments per issue: 0.33
Average comments per pull request: 0.0
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

cnellington (2)
dkoes (1)
kthorn (1)
vecos1990 (1)

Pull Request Authors

drewnutt (20)
abhinadduri (9)
cnellington (3)

Dependencies

.github/workflows/unit-test.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

pyproject.toml pypi

PyTDC ==1.1.1
chromadb ==0.5.5
datamol ==0.12.5
fair-esm ==2.0.0
gdown ==5.2.0
lightning ==2.4.0
ml-pyxis @ git+https://github.com/vicolab/ml-pyxis.git@master
molfeat ==0.10.1
omegaconf ==2.3.0
pandas ==2.1.4
rdkit ==2023.9.5
scikit_learn ==1.2.2
torch ==2.4.1
transformers ==4.43.4
wandb ==0.17.8