https://github.com/aspuru-guzik-group/dionysus
For analysis of calibration, performance, and generalizability of probabilistic models on small molecular datasets. Paper on RSC Digital Discovery: https://pubs.rsc.org/en/content/articlehtml/2023/dd/d2dd00146b
Science Score: 20.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: rsc.org)
- ✓ Committers with academic emails (1 of 1 committers, 100.0%, from academic institutions)
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 10.0%, to scientific vocabulary)
Keywords
Repository
Basic Info
Statistics
- Stars: 20
- Watchers: 8
- Forks: 2
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
DIONYSUS: Calibration and generalizability of probabilistic models on low-data chemical datasets
This package is accompanied by this paper: Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS.
Authors: Gary Tom, Riley J. Hickman, Aniket Zinzuwadia, Afshan Mohajeri, Benjamin Sanchez-Lengeling, Alan Aspuru-Guzik. (2022)
Preliminary
DIONYSUS requires the following:
- Python >= 3.8
- rdkit >= 2021.03.3
- tensorflow == 2.9.0
- tensorflow-probability == 0.17.0
- tf-models-official == 2.7.1
- graph_nets == 1.1.0
- sonnet == 2.0.0
- ngboost == 0.3.12
- gpflow == 2.5.2
- scikit-learn == 1.0.2
- scikit-multilearn == 0.2.0
- mordred == 1.2.0
- dask == 2022.6.1
- umap-learn == 0.5.1
- cairosvg == 2.5.2
- pandas == 1.4.1
- seaborn == 0.11.2
- hdbscan == 0.8.27
Datasets
Datasets used are in the data/ directory. To add a new dataset, place its csv file in data/raw/ and add a corresponding entry to data/dataset_registry_raw.csv. The following fields are required:
- name: name of the dataset
- shortname: name that will be referenced in the scripts
- filename: the csv file name ({shortname}.csv)
- task: either regression or binary (classification)
- smiles_column: name of the column containing the SMILES representations
- target_columns: name of the column containing the target of interest
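For illustration, a registry entry can be assembled with Python's csv module. Only the field names above come from the registry; the dataset name, shortname, and column values below are invented examples.

```python
import csv
import io

# Hypothetical new entry; only the field names are taken from
# data/dataset_registry_raw.csv, the values are made up.
row = {
    "name": "My solubility set",   # full dataset name (example)
    "shortname": "mydata",         # referenced as {shortname} in the scripts
    "filename": "mydata.csv",      # must match {shortname}.csv
    "task": "regression",          # or "binary" for classification
    "smiles_column": "smiles",
    "target_columns": "logS",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```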
Usage
Preprocessing
Scripts to preprocess the datasets are in the scripts/ directory. Please run these prior to any of the experiments below.
```bash
cd scripts/

# prepare molecules and analysis directory in data/{shortname}
# canonicalize all smiles
# remove all invalid and duplicate smiles
# remove all smiles with salts, ions or fragments
python make_qsar_ready.py --dataset={shortname}

# create splits and features:
# create all features used (mfp, mordred, graphtuple, graphembed)
# create the train/val/test splits used
python make_features.py --dataset={shortname}
```
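As a rough illustration of the filtering steps listed above, here is a minimal stand-in using only the standard library. The actual preprocessing relies on RDKit for canonicalization and validity checks, which this sketch does not attempt; it only drops empty strings, duplicates, and multi-fragment SMILES.

```python
def clean_smiles(smiles_list):
    """Sketch of the QSAR-ready filtering: drop empty and duplicate entries,
    and drop multi-fragment SMILES (salts/ions contain a '.' separator).
    Real canonicalization and validity checks require RDKit."""
    seen, kept = set(), []
    for smi in smiles_list:
        if not smi or "." in smi:   # salts, ions, fragments
            continue
        if smi in seen:             # duplicates
            continue
        seen.add(smi)
        kept.append(smi)
    return kept

# duplicate "CCO" and the NaCl salt are removed
print(clean_smiles(["CCO", "CCO", "[Na+].[Cl-]", "c1ccccc1"]))
```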
Experiment 1: Supervised learning
Scripts to run the experiments are in the scripts/ directory.
```bash
cd scripts/

# make all predictions for the specified feature/model
python make_predictions.py --dataset={shortname} --model={modelname} --feature={featurename}

# create evaluations/figures for all available prediction data
python make_evaluations.py --dataset={shortname}
```
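As an illustration of the kind of calibration check such an evaluation can include (a hypothetical sketch, not the package's implementation), the observed coverage of Gaussian credible intervals can be computed with the standard library alone:

```python
import math

def _phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_calibration(y_true, mu, sigma, levels=(0.5, 0.8, 0.95)):
    """Observed coverage of symmetric credible intervals for Gaussian
    predictions: y lies in the p-interval iff |Phi((y-mu)/sigma) - 0.5| <= p/2.
    A well-calibrated model has observed coverage close to p at every level."""
    coverage = {}
    for p in levels:
        hits = sum(
            abs(_phi((y - m) / s) - 0.5) <= p / 2.0
            for y, m, s in zip(y_true, mu, sigma)
        )
        coverage[p] = hits / len(y_true)
    return coverage
```

Comparing observed coverage against the nominal level p at several levels gives a simple calibration curve for a probabilistic regressor.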
Experiment 2: Bayesian optimization
Scripts are found in the bayes_opt/ directory.
All results will be contained in data/{shortname}/bayesian_optimization.
```bash
cd bayes_opt/

# run the bayesian optimization campaign
python make_bo.py --dataset={shortname} --num_restarts=30 --budget=250 --goal=minimize --frac_init_design=0.05

# create the traces and evaluation files;
# also outputs files with the fraction of hits calculated
python make_bo_traces.py --dataset={shortname}
```
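The "fraction of hits" statistic can be sketched as follows: treat the dataset's top fraction of candidates as "hits" and track how many the optimization trace has found per iteration. This is a hypothetical stand-in for illustration, not the repository's implementation; the function name and signature are invented.

```python
def fraction_of_hits(trace, all_values, top_frac=0.05, goal="minimize"):
    """For each step of an optimization trace, the fraction of the dataset's
    top `top_frac` candidates ("hits") evaluated so far. Illustrative only."""
    k = max(1, int(top_frac * len(all_values)))
    ordered = sorted(all_values, reverse=(goal == "maximize"))
    hits = set(ordered[:k])          # the best k values in the dataset
    found, fracs = set(), []
    for v in trace:
        if v in hits:
            found.add(v)
        fracs.append(len(found) / k)
    return fracs
```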
Experiment 3: Cluster splits and generalizability
Similar to the first experiment, the scripts are found in the scripts/ directory. Once the datasets are cleaned and the features are made, you can create the cluster splits and run them for a specified feature/model. In the manuscript, we only do this for mordred features with GPs.
All results will be contained in data/{shortname}/generalization.
```bash
cd scripts/

# create the cluster splits
python make_generalization_splits.py --dataset={shortname}

# create predictions for all cluster splits
python make_generalization_predictions.py --dataset={shortname}

# evaluate the predictions on each cluster split, and generate plots
python make_generalization_evaluations.py --dataset={shortname}
```
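A cluster-based generalization split can be sketched as leave-one-cluster-out: each cluster in turn becomes the test set while the remaining clusters form the training set. This simplified stand-in assumes cluster labels are already computed (the dependency list suggests UMAP + HDBSCAN for that step, which is not reproduced here).

```python
def cluster_splits(cluster_labels):
    """Leave-one-cluster-out splits from a list of per-sample cluster labels:
    returns (train_indices, test_indices) pairs, one per distinct cluster.
    Simplified illustration; not the repository's actual splitting code."""
    splits = []
    for c in sorted(set(cluster_labels)):
        test = [i for i, lab in enumerate(cluster_labels) if lab == c]
        train = [i for i, lab in enumerate(cluster_labels) if lab != c]
        splits.append((train, test))
    return splits
```

Evaluating a model on each held-out cluster probes how well it generalizes to chemically distinct regions of the dataset.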
Proposed structure of repository (TODO)
- mol_data: dataset preprocessor and loader. Can be extended to other datasets and new features.
- dionysus: model-agnostic evaluation scripts and library. Requires only predictions.
- dionysus_addons: models used in the paper, included for reproducibility.
...
Owner
- Name: Aspuru-Guzik group repo
- Login: aspuru-guzik-group
- Kind: organization
- Website: http://aspuru.chem.harvard.edu/
- Repositories: 30
- Profile: https://github.com/aspuru-guzik-group
GitHub Events
Total
- Watch event: 4
Last Year
- Watch event: 4
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Gary Tom | g****m@m****a | 15 |