https://github.com/broadinstitute/del-ml-refactor

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: broadinstitute
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 292 MB
Statistics
  • Stars: 11
  • Watchers: 2
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed 11 months ago
Metadata Files
Readme

README.md

DEL+ML paradigm for actionable hit discovery: a cross-DEL and cross-ML model assessment

Published paper: Iqbal S, Jiang W, Hansen E, Aristotelous T, Liu S, Reidenbach A, et al. Evaluation of DNA encoded library and machine learning model combinations for hit discovery. NPJ Drug Discovery (accepted). Feb 2025.

Preprint: Iqbal S, Jiang W, Hansen E, Aristotelous T, Liu S, Reidenbach A, et al. DEL+ML paradigm for actionable hit discovery: a cross DEL and cross ML model assessment. ChemRxiv. 2024; doi:10.26434/chemrxiv-2024-2xrx4. This content is a preprint and has not been peer-reviewed.

This repository contains the pretrained models and the prediction scripts described in the paper. If you find our work useful in your research, or if you use parts of this code, please cite our paper. Contact: sumaiya@broadinstitute.org.

Pre-requisites:

  • Linux (Tested on Ubuntu 22.04)
  • NVIDIA GPU (Tested on NVIDIA RTX A6000 with CUDA version 12.1)
  • Python (3.10)
  • TensorFlow (2.14)
  • chemprop (1.6.1)
  • RDKit (2023.9.2)
  • PyTorch (2.1.1)

Please refer to the installation guide for how to set up the working environment.
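As a quick sanity check after installation, the tested versions listed above can be compared with what is actually installed. The snippet below is a minimal sketch and not part of the repository: the distribution names (`tensorflow`, `chemprop`, `rdkit`, `torch`) are assumptions about how the packages were installed, and packages installed through conda may not always expose metadata under these names.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions the README reports testing with. The distribution names used as
# keys are assumptions; adjust them to match your installation.
TESTED = {
    "tensorflow": "2.14",
    "chemprop": "1.6.1",
    "rdkit": "2023.9.2",
    "torch": "2.1.1",
}


def check_environment(tested=TESTED):
    """Return {package: (installed_version_or_None, tested_version, matches)}."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Simple prefix match, so "2.1.1+cu121" counts as matching "2.1.1".
        matches = installed is not None and installed.startswith(wanted)
        report[name] = (installed, wanted, matches)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_environment().items():
        status = "ok" if ok else "differs from tested version"
        print(f"{name}: installed={installed} tested={wanted} ({status})")
```

A mismatch does not necessarily mean the scripts will fail; it only flags a difference from the configuration the authors tested.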

Raw data:

  • HitGen OpenDEL screening results: raw counts and effect size ("data/HitGen/raw/DELscreeningresultHitGenrawcountseffectsizepart_.csv.zip")
  • Column names: A = CK1-alpha, A-inh = CK1-alpha + inhibitor, D = CK1-delta, D-inh = CK1-delta + inhibitor, blank = no protein

Step 0: Data preparation

Prepare your data in the same format as example/compound.csv. In short, you can include any compound metadata, but there must be a column named SMILES. We use compound.csv as the running example to demonstrate the other scripts.
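Before running the later steps, it can help to confirm that an input file meets the one stated requirement, a column named SMILES. The helper below is a hypothetical sketch using only the standard library; the function name and the choice to reject empty SMILES cells are assumptions, not behavior of the repository's scripts.

```python
import csv


def check_compound_csv(path, smiles_column="SMILES"):
    """Return the number of data rows, or raise ValueError if the file is unusable.

    Checks only that the SMILES column exists and has no empty cells;
    it does not validate the chemistry of the SMILES strings.
    """
    empty_rows = []
    n_rows = 0
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if smiles_column not in (reader.fieldnames or []):
            raise ValueError(
                f"{path}: no column named {smiles_column!r}; "
                f"found {reader.fieldnames}"
            )
        for n_rows, row in enumerate(reader, start=1):
            if not (row.get(smiles_column) or "").strip():
                empty_rows.append(n_rows)
    if empty_rows:
        raise ValueError(f"{path}: empty SMILES in data rows {empty_rows}")
    return n_rows
```

For example, `check_compound_csv("./example/compound.csv")` would return the number of compounds in the example file if it is well formed.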

Step 1: Feature extraction

In our paper, we use Morgan fingerprints from RDKit with nBits=2048, radius=2, useChirality=True. Simply run

    python feature_extractor.py --input_file ./example/compound.csv --save_path ./example/ --experiment compound_feature

Feel free to customize the --save_path and --experiment flags to suit your needs. For simplicity, we save the extracted features to the same folder under the name compound_feature.
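For readers who want to see what these fingerprint parameters mean in code, here is a minimal sketch that computes one fingerprint with RDKit directly. It is not taken from feature_extractor.py; whether the script uses this exact RDKit call is an assumption, and the helper name `morgan_bits` is hypothetical. Only the parameter values (2048 bits, radius 2, chirality on) come from the description above.

```python
def morgan_bits(smiles, radius=2, n_bits=2048, use_chirality=True):
    """Return the Morgan fingerprint of one SMILES string as a list of 0/1 ints."""
    # Imported inside the function so the helper can be defined on machines
    # where RDKit is not installed; calling it there raises ImportError.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    fingerprint = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, useChirality=use_chirality
    )
    # ToBitString() yields a string of '0' and '1' characters, one per bit.
    return [int(bit) for bit in fingerprint.ToBitString()]
```

With RDKit installed, `morgan_bits("CCO")` returns a list of length 2048 in which only a few positions are set to 1. Newer RDKit releases may emit a deprecation warning for this call in favor of the fingerprint generator API.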

Step 2: Binder/Non-binder Prediction

We released the two best model types (MLP and GNN) for each DEL library. Run the command for the model type you want to use.

Multi-layer perceptron (MLP)

    python prediction.py --input_file ./example/compound_feature.h5 --save_path ./example/ --experiment compound_pred_mlp --checkpoint ./data/HitGen/models/CK1a/MLP.keras

The above command uses the MLP model pretrained on HitGen CK1a molecules to predict how likely the molecules in example/compound.csv are to be binders.

We do not use the GPU by default because we observed no clear speed-up from it. We believe the overhead of moving data from CPU to GPU outweighs the speed-up for a very small model such as ours. If you want to run on GPU, add the --use_gpu flag:

    python prediction.py --input_file ./example/compound_feature.h5 --save_path ./example/ --experiment compound_pred_mlp --checkpoint ./data/HitGen/models/CK1a/MLP.keras --use_gpu

Graph neural network (GNN)

We use the graph neural network (GNN) implemented in chemprop. Following the chemprop instructions, run

    chemprop_predict --smiles_columns SMILES --test_path ./example/compound.csv --checkpoint_path data/HitGen/models/CK1a/chemprop.pt --preds_path ./example/compound_prediction_chemprop.csv

The above command uses the GNN model pretrained on HitGen CK1a molecules to predict how likely the molecules in example/compound.csv are to be binders.

Output

The prediction output is a .csv file that contains the model's prediction score for each molecule, in the following format:

    SMILES,prediction
    CCN1CCCC1Cn1cnc2c3ccc(OC)cc3nc-2c1O,0.678
    CCC1(c2ccccc2)CC(=O)C(C2CC(c3ccc(OCc4ccc(C(F)(F)F)cc4)cc3)Cc3ccccc32)=C(O)O1,0.111
    CCOC(=O)C(C)(OC(C)=O)c1cc(C)c(/N=C/N(C)C)c(C)c1,0.281
    Cc1cc(C)n2s/c(=N\C(=O)C(c3ccc(Cl)cc3)C(C)C)nc2n1,0.112
    OC1=C(Cl)/C(=N\Cc2ccccc2)C(O)O1,0.342
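A common next step is to rank molecules by their prediction score. The sketch below is an illustration only: it assumes the output file has the two columns SMILES and prediction shown in the sample, and the function name `rank_predictions` is hypothetical. A chemprop predictions file may name its score column differently, so the column name is a parameter.

```python
import csv
import io

# The five sample rows from the README, used here as stand-in data.
SAMPLE = r"""SMILES,prediction
CCN1CCCC1Cn1cnc2c3ccc(OC)cc3nc-2c1O,0.678
CCC1(c2ccccc2)CC(=O)C(C2CC(c3ccc(OCc4ccc(C(F)(F)F)cc4)cc3)Cc3ccccc32)=C(O)O1,0.111
CCOC(=O)C(C)(OC(C)=O)c1cc(C)c(/N=C/N(C)C)c(C)c1,0.281
Cc1cc(C)n2s/c(=N\C(=O)C(c3ccc(Cl)cc3)C(C)C)nc2n1,0.112
OC1=C(Cl)/C(=N\Cc2ccccc2)C(O)O1,0.342
"""


def rank_predictions(handle, score_column="prediction", top_k=None):
    """Read a predictions CSV and return (SMILES, score) pairs, highest score first."""
    rows = [
        (row["SMILES"], float(row[score_column]))
        for row in csv.DictReader(handle)
    ]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows if top_k is None else rows[:top_k]


if __name__ == "__main__":
    for smiles, score in rank_predictions(io.StringIO(SAMPLE), top_k=3):
        print(f"{score:.3f}  {smiles}")
```

To rank a real output file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`.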

t-SNE visualization

After Step 1, you can run the following script to visualize the high-dimensional data in 2D space:

    python tsne.py --input_file ./example/compound.h5 --save_path ./example/ --experiment compound --perplexity 2

By default, we set n_jobs=-1 (i.e., use all CPUs on the machine). Note that perplexity=2 in the command above is chosen only for this small example dataset; the default value used in the script is perplexity=30. You may need to tune this parameter to fit your dataset by running:

    python tsne.py --input_file ./example/compound.h5 --save_path ./example/ --experiment compound --perplexity YOUR_VALUE
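One way to tune perplexity is to generate one command per candidate value and compare the resulting plots. The sketch below only builds the command strings from the flags shown above; it does not run anything. The candidate values, the `_perp<N>` experiment suffix, and the function name are hypothetical. It also skips values that are not smaller than the number of compounds, which scikit-learn's TSNE requires; whether tsne.py wraps scikit-learn is an assumption.

```python
import shlex


def tsne_commands(
    n_compounds,
    candidates=(5, 30, 50, 100),
    input_file="./example/compound.h5",
    save_path="./example/",
    experiment="compound",
):
    """Return one tsne.py command line per usable perplexity value."""
    commands = []
    for perplexity in candidates:
        # Skip values that are too large for the dataset (assumed constraint).
        if perplexity >= n_compounds:
            continue
        argv = [
            "python", "tsne.py",
            "--input_file", input_file,
            "--save_path", save_path,
            # Hypothetical naming so each run writes a separate output.
            "--experiment", f"{experiment}_perp{perplexity}",
            "--perplexity", str(perplexity),
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for command in tsne_commands(n_compounds=1000):
        print(command)
```

Printing the commands rather than executing them keeps the sketch side-effect free; they can be pasted into a shell or passed to a job scheduler.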

For best practices on using t-SNE and further discussion of the method, we recommend that users read this blog post and watch this video.

Owner

  • Name: Broad Institute
  • Login: broadinstitute
  • Kind: organization
  • Location: Cambridge, MA

Broad Institute of MIT and Harvard

GitHub Events

Total
  • Watch event: 6
  • Push event: 12
  • Fork event: 3
Last Year
  • Watch event: 6
  • Push event: 12
  • Fork event: 3

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0