https://github.com/broadinstitute/del-ml-refactor
Science Score: 39.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: broadinstitute
- Language: Jupyter Notebook
- Default Branch: main
- Size: 292 MB
Statistics
- Stars: 11
- Watchers: 2
- Forks: 3
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
DEL+ML paradigm for finding actionable discovery: a cross-DEL and cross-ML model assessment
Published paper: Iqbal S, Jiang W, Hansen E, Aristotelous T, Liu S, Reidenbach A, et al. Evaluation of DNA encoded library and machine learning model combinations for hit discovery. NPJ Drug Discovery (accepted). Feb 2025.
Preprint: Iqbal S, Jiang W, Hansen E, Aristotelous T, Liu S, Reidenbach A, et al. DEL+ML paradigm for actionable hit discovery: a cross DEL and cross ML model assessment. ChemRxiv. 2024; doi:10.26434/chemrxiv-2024-2xrx4. This content is a preprint and has not been peer-reviewed.
This repository contains the pretrained models and the prediction scripts described in the paper. If you find our work useful in your research or if you use parts of this code, please cite our paper. Contact: sumaiya@broadinstitute.org.
Pre-requisites:
- Linux (Tested on Ubuntu 22.04)
- NVIDIA GPU (Tested on NVIDIA RTX A6000 with CUDA version 12.1)
- Python (3.10)
- TensorFlow (2.14)
- chemprop (1.6.1)
- RDKit (2023.9.2)
- PyTorch (2.1.1)
Please refer to the installation guide for how to set up the working environment.
Raw data:
- HitGen OpenDEL screening results: raw counts and effect size ("data/HitGen/raw/DELscreeningresultHitGenrawcountseffectsizepart_.csv.zip")
  - Column names:
    - A: CK1-alpha
    - A-inh: CK1-alpha+inhibitor
    - D: CK1-delta
    - D-inh: CK1-delta+inhibitor
    - blank: no protein
Step 0: Data preparation
Prepare your data in the same format as example/compound.csv. In summary, you can include any compound metadata, but there must be a column named SMILES. We use compound.csv as the example to demonstrate the usage of the other scripts.
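Before running the later steps, it can help to confirm that an input file satisfies the one stated requirement, a column named SMILES. The following is a minimal sketch using only the Python standard library; the function name check_compound_csv and the compound_id column in the sample are hypothetical, not part of this repository.

```python
import csv
import io


def check_compound_csv(handle, smiles_column="SMILES"):
    """Return (row_count, indices_of_rows_with_empty_smiles) for an open CSV."""
    reader = csv.DictReader(handle)
    if smiles_column not in (reader.fieldnames or []):
        raise ValueError(
            f"missing required column {smiles_column!r}; found {reader.fieldnames}"
        )
    row_count = 0
    empty_rows = []
    # Data rows are numbered from 1, not counting the header.
    for index, row in enumerate(reader, start=1):
        row_count += 1
        if not (row[smiles_column] or "").strip():
            empty_rows.append(index)
    return row_count, empty_rows


# Hypothetical sample: any metadata columns are allowed next to SMILES.
sample = io.StringIO(
    "compound_id,SMILES\n"
    "cpd-1,CCO\n"
    "cpd-2,\n"
    "cpd-3,c1ccccc1\n"
)
result = check_compound_csv(sample)
print(result)
```

For a file on disk, the same function can be called as check_compound_csv(open(path, newline="")).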
Step 1: Feature extraction
In our paper, we use Morgan fingerprints from RDKit with nbits=2048, radius=2, useChirality=True. To extract them, run
python feature_extractor.py --input_file ./example/compound.csv --save_path ./example/ --experiment compound_feature
Feel free to customize the --save_path and --experiment flags to suit your needs. For simplicity, we save the extracted features to the same folder and name them compound_feature.
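To illustrate what these fingerprint settings mean, here is a minimal sketch that computes one Morgan fingerprint directly with RDKit. It assumes the settings map onto RDKit's GetMorganFingerprintAsBitVect (radius=2, nBits=2048, useChirality=True); it is not the code in feature_extractor.py, and the function name morgan_fingerprint is hypothetical.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def morgan_fingerprint(smiles, radius=2, n_bits=2048):
    """Return a 0/1 numpy vector for one SMILES, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    bitvect = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, useChirality=True
    )
    features = np.zeros((n_bits,), dtype=np.uint8)
    # Copy the RDKit bit vector into the numpy array.
    DataStructs.ConvertToNumpyArray(bitvect, features)
    return features


fingerprint = morgan_fingerprint("CCO")
print(fingerprint.shape, int(fingerprint.sum()))
```

Returning None for unparseable SMILES lets the caller decide whether to skip or report such rows.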
Step 2: Binder/Non-binder Prediction
We release the two best-performing model types (MLP and GNN) for each DEL library. Run the command for the model type you want:
Multi-layer perceptron (MLP)
python prediction.py --input_file ./example/compound_feature.h5 --save_path ./example/ --experiment compound_pred_mlp --checkpoint ./data/HitGen/models/CK1a/MLP.keras
The above command uses the MLP model pretrained on HitGen CK1a molecules to predict how likely the molecules in example/compound.csv are to be binders.
We do not use the GPU by default because we observed no clear speed-up. We believe the overhead of moving data from CPU to GPU outweighs the gain for a very small model such as ours. If you want to run on the GPU, add the --use_gpu flag:
python prediction.py --input_file ./example/compound_feature.h5 --save_path ./example/ --experiment compound_pred_mlp --checkpoint ./data/HitGen/models/CK1a/MLP.keras --use_gpu
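When comparing CPU and GPU runs or scripting several predictions, it may be convenient to assemble these commands programmatically. The sketch below only builds the argument list from the flags shown above; the helper name build_prediction_command is hypothetical and nothing is executed.

```python
import shlex


def build_prediction_command(input_file, save_path, experiment, checkpoint,
                             use_gpu=False):
    """Assemble the prediction.py argument list from the documented flags."""
    command = [
        "python", "prediction.py",
        "--input_file", input_file,
        "--save_path", save_path,
        "--experiment", experiment,
        "--checkpoint", checkpoint,
    ]
    if use_gpu:
        command.append("--use_gpu")
    return command


cpu_command = build_prediction_command(
    "./example/compound_feature.h5",
    "./example/",
    "compound_pred_mlp",
    "./data/HitGen/models/CK1a/MLP.keras",
)
gpu_command = cpu_command + ["--use_gpu"]
print(shlex.join(cpu_command))
print(shlex.join(gpu_command))
```

Either list can be passed to subprocess.run(command, check=True) from the repository root to launch the run.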
Graph neural network (GNN)
We use the graph neural network (GNN) implemented in chemprop. Following the chemprop instructions, run
chemprop_predict --smiles_columns SMILES --test_path ./example/compound.csv --checkpoint_path data/HitGen/models/CK1a/chemprop.pt --preds_path ./example/compound_prediction_chemprop.csv
The above command uses the GNN model pretrained on HitGen CK1a molecules to predict how likely the molecules in example/compound.csv are to be binders.
Output
Here is the output format of the predictions:
SMILES,prediction:
CCN1CCCC1Cn1cnc2c3ccc(OC)cc3nc-2c1O,0.678
CCC1(c2ccccc2)CC(=O)C(C2CC(c3ccc(OCc4ccc(C(F)(F)F)cc4)cc3)Cc3ccccc32)=C(O)O1,0.111
CCOC(=O)C(C)(OC(C)=O)c1cc(C)c(/N=C/N(C)C)c(C)c1,0.281
Cc1cc(C)n2s/c(=N\C(=O)C(c3ccc(Cl)cc3)C(C)C)nc2n1,0.112
OC1=C(Cl)/C(=N\Cc2ccccc2)C(O)O1,0.342
The output is a .csv file that contains the model's prediction score for each molecule.
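A common next step is to rank molecules by score and keep those above a cutoff. The following is a minimal sketch, assuming the two-column layout shown above (SMILES first, score second); the function name rank_hits and the 0.3 threshold are illustrative choices, not values from the paper.

```python
import csv
import io


def rank_hits(handle, threshold):
    """Return (SMILES, score) pairs with score >= threshold, highest first."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    # Columns are read by position: SMILES first, score second.
    scored = [(row[0], float(row[1])) for row in reader if row]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Three rows copied from the example output; raw string keeps the backslash.
sample = io.StringIO(r"""SMILES,prediction
CCN1CCCC1Cn1cnc2c3ccc(OC)cc3nc-2c1O,0.678
CCOC(=O)C(C)(OC(C)=O)c1cc(C)c(/N=C/N(C)C)c(C)c1,0.281
OC1=C(Cl)/C(=N\Cc2ccccc2)C(O)O1,0.342
""")
hits = rank_hits(sample, threshold=0.3)
for smiles, score in hits:
    print(f"{score:.3f}  {smiles}")
```

If a prediction file carries extra metadata columns, the column positions used here would need adjusting.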
t-SNE visualization
After Step 1, you can run the following script to visualize the high-dimensional data in 2D space:
python tsne.py --input_file ./example/compound.h5 --save_path ./example/ --experiment compound --perplexity 2
By default, we set n_jobs=-1 (i.e., all CPUs on the machine are used). Note that perplexity=2 in the above command is chosen only for this small example dataset; the default value used in the script is perplexity=30. You may need to tune this parameter to fit your dataset by running:
python tsne.py --input_file ./example/compound.h5 --save_path ./example/ --experiment compound --perplexity YOUR_VALUE
For best practices and further discussion of t-SNE, we recommend this blog post and this video.
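One way to tune perplexity is to sweep a few values and compare the resulting plots. The sketch below assumes tsne.py inherits scikit-learn's rule that perplexity must be smaller than the number of samples, which would explain why the tiny example uses 2. The candidate values, the helper names, and the per-run experiment suffix are hypothetical.

```python
def candidate_perplexities(n_samples, candidates=(2, 5, 10, 30, 50)):
    """Keep only the candidate values that are valid for this dataset size."""
    return [value for value in candidates if value < n_samples]


def tsne_command(perplexity, input_file="./example/compound.h5",
                 save_path="./example/", experiment="compound"):
    """Build one tsne.py command; the suffix keeps runs from sharing a name."""
    return (
        f"python tsne.py --input_file {input_file} --save_path {save_path} "
        f"--experiment {experiment}_p{perplexity} --perplexity {perplexity}"
    )


small = candidate_perplexities(5)
large = candidate_perplexities(1000)
for value in large:
    print(tsne_command(value))
```

For a five-molecule dataset only the value 2 survives the filter, while a dataset of 1,000 molecules admits the whole range.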
Owner
- Name: Broad Institute
- Login: broadinstitute
- Kind: organization
- Location: Cambridge, MA
- Website: http://www.broadinstitute.org/
- Twitter: broadinstitute
- Repositories: 1,083
- Profile: https://github.com/broadinstitute
Broad Institute of MIT and Harvard
GitHub Events
Total
- Watch event: 6
- Push event: 12
- Fork event: 3
Last Year
- Watch event: 6
- Push event: 12
- Fork event: 3
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0