https://github.com/aspuru-guzik-group/dionysus
For analysis of calibration, performance, and generalizability of probabilistic models on small molecular datasets. Paper on RSC Digital Discovery: https://pubs.rsc.org/en/content/articlehtml/2023/dd/d2dd00146b
Science Score: 20.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: rsc.org)
- ✓ Committers with academic emails (1 of 1 committers, 100.0%, from academic institutions)
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 10.0%, to scientific vocabulary)
Keywords
Repository
Basic Info
Statistics
- Stars: 20
- Watchers: 8
- Forks: 2
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
DIONYSUS: Calibration and generalizability of probabilistic models on low-data chemical datasets
This package is accompanied by this paper: Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS.
Authors: Gary Tom, Riley J. Hickman, Aniket Zinzuwadia, Afshan Mohajeri, Benjamin Sanchez-Lengeling, Alan Aspuru-Guzik. (2022)
Preliminary
DIONYSUS requires the following:
- Python >= 3.8
- rdkit >= 2021.03.3
- tensorflow == 2.9.0
- tensorflow-probability == 0.17.0
- tf-models-official == 2.7.1
- graph_nets == 1.1.0
- sonnet == 2.0.0
- ngboost == 0.3.12
- gpflow == 2.5.2
- scikit-learn == 1.0.2
- scikit-multilearn == 0.2.0
- mordred == 1.2.0
- dask == 2022.6.1
- umap-learn == 0.5.1
- cairosvg == 2.5.2
- pandas == 1.4.1
- seaborn == 0.11.2
- hdbscan == 0.8.27
Datasets
Datasets used are in the data/ directory. To add a new dataset, place its csv file in data/raw/ and add a corresponding entry to data/dataset_registry_raw.csv. The following fields are required:
- name: name of the dataset
- shortname: name that will be referenced in the scripts
- filename: the csv file name ({shortname}.csv)
- task: either regression or binary (classification)
- smiles_column: name of the column containing the SMILES representations
- target_columns: name of the column containing the target of interest
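For illustration, a registry entry can be assembled with Python's csv module. Only the field names above come from the registry; the dataset name, shortname, and column values below are invented examples.

```python
import csv
import io

# Hypothetical new entry; only the field names are taken from
# data/dataset_registry_raw.csv, the values are made up.
row = {
    "name": "My solubility set",   # full dataset name (example)
    "shortname": "mydata",         # referenced as {shortname} in the scripts
    "filename": "mydata.csv",      # must match {shortname}.csv
    "task": "regression",          # or "binary" for classification
    "smiles_column": "smiles",
    "target_columns": "logS",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```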
Usage
Preprocessing
Scripts to preprocess the datasets are in the scripts/ directory. Please run these prior to any of the experiments below.
```bash
cd scripts/

# prepare molecules and analysis directory in data/{shortname}
# canonicalize all smiles
# remove all invalid and duplicate smiles
# remove all smiles with salts, ions or fragments
python make_qsar_ready.py --dataset={shortname}

# create splits and features:
# create all features used (mfp, mordred, graphtuple, graphembed)
# create the train/val/test splits used
python make_features.py --dataset={shortname}
```
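As a rough illustration of the filtering steps listed above, here is a minimal stand-in using only the standard library. The actual preprocessing relies on RDKit for canonicalization and validity checks, which this sketch does not attempt; it only drops empty strings, duplicates, and multi-fragment SMILES.

```python
def clean_smiles(smiles_list):
    """Sketch of the QSAR-ready filtering: drop empty and duplicate entries,
    and drop multi-fragment SMILES (salts/ions contain a '.' separator).
    Real canonicalization and validity checks require RDKit."""
    seen, kept = set(), []
    for smi in smiles_list:
        if not smi or "." in smi:   # salts, ions, fragments
            continue
        if smi in seen:             # duplicates
            continue
        seen.add(smi)
        kept.append(smi)
    return kept

# duplicate "CCO" and the NaCl salt are removed
print(clean_smiles(["CCO", "CCO", "[Na+].[Cl-]", "c1ccccc1"]))
```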
Experiment 1: Supervised learning
Scripts to run the experiments are in the scripts/ directory.
```bash
cd scripts/

# make all predictions for the specified feature/model
python make_predictions.py --dataset={shortname} --model={modelname} --feature={featurename}

# create evaluations/figures for all available prediction data
python make_evaluations.py --dataset={shortname}
```
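As an illustration of the kind of calibration check such an evaluation can include (a hypothetical sketch, not the package's implementation), the observed coverage of Gaussian credible intervals can be computed with the standard library alone:

```python
import math

def _phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_calibration(y_true, mu, sigma, levels=(0.5, 0.8, 0.95)):
    """Observed coverage of symmetric credible intervals for Gaussian
    predictions: y lies in the p-interval iff |Phi((y-mu)/sigma) - 0.5| <= p/2.
    A well-calibrated model has observed coverage close to p at every level."""
    coverage = {}
    for p in levels:
        hits = sum(
            abs(_phi((y - m) / s) - 0.5) <= p / 2.0
            for y, m, s in zip(y_true, mu, sigma)
        )
        coverage[p] = hits / len(y_true)
    return coverage
```

Comparing observed coverage against the nominal level p at several levels gives a simple calibration curve for a probabilistic regressor.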
Experiment 2: Bayesian optimization
Scripts are found in the bayes_opt/ directory.
All results will be contained in data/{shortname}/bayesian_optimization.
```bash
cd bayes_opt/

# run the bayesian optimization campaign
python make_bo.py --dataset={shortname} --num_restarts=30 --budget=250 --goal=minimize --frac_init_design=0.05

# create the traces and evaluation files;
# also outputs files with the fraction of hits calculated
python make_bo_traces.py --dataset={shortname}
```
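The "fraction of hits" statistic can be sketched as follows: treat the dataset's top fraction of candidates as "hits" and track how many the optimization trace has found per iteration. This is a hypothetical stand-in for illustration, not the repository's implementation; the function name and signature are invented.

```python
def fraction_of_hits(trace, all_values, top_frac=0.05, goal="minimize"):
    """For each step of an optimization trace, the fraction of the dataset's
    top `top_frac` candidates ("hits") evaluated so far. Illustrative only."""
    k = max(1, int(top_frac * len(all_values)))
    ordered = sorted(all_values, reverse=(goal == "maximize"))
    hits = set(ordered[:k])          # the best k values in the dataset
    found, fracs = set(), []
    for v in trace:
        if v in hits:
            found.add(v)
        fracs.append(len(found) / k)
    return fracs
```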
Experiment 3: Cluster splits and generalizability
Similar to the first experiment, the scripts are found in the scripts/ directory. Once the datasets are cleaned and the features are made, you can create the cluster splits and run them for a specified feature/model. In the manuscript, we only do this for mordred features with GPs.
All results will be contained in data/{shortname}/generalization.
```bash
cd scripts/

# create the cluster splits
python make_generalization_splits.py --dataset={shortname}

# create predictions for all cluster splits
python make_generalization_predictions.py --dataset={shortname}

# evaluate the predictions on each cluster split, and generate plots
python make_generalization_evaluations.py --dataset={shortname}
```
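A cluster-based generalization split can be sketched as leave-one-cluster-out: each cluster in turn becomes the test set while the remaining clusters form the training set. This simplified stand-in assumes cluster labels are already computed (the dependency list suggests UMAP + HDBSCAN for that step, which is not reproduced here).

```python
def cluster_splits(cluster_labels):
    """Leave-one-cluster-out splits from a list of per-sample cluster labels:
    returns (train_indices, test_indices) pairs, one per distinct cluster.
    Simplified illustration; not the repository's actual splitting code."""
    splits = []
    for c in sorted(set(cluster_labels)):
        test = [i for i, lab in enumerate(cluster_labels) if lab == c]
        train = [i for i, lab in enumerate(cluster_labels) if lab != c]
        splits.append((train, test))
    return splits
```

Evaluating a model on each held-out cluster probes how well it generalizes to chemically distinct regions of the dataset.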
Proposed structure of repository (TODO)
- mol_data: dataset preprocessor and loader. Can be extended to other datasets and new features.
- dionysus: model-agnostic evaluation scripts and library. Requires only predictions.
- dionysus_addons: models used in the paper, included for reproducibility.
...
Owner
- Name: Aspuru-Guzik group repo
- Login: aspuru-guzik-group
- Kind: organization
- Website: http://aspuru.chem.harvard.edu/
- Repositories: 30
- Profile: https://github.com/aspuru-guzik-group
GitHub Events
Total
- Watch event: 4
Last Year
- Watch event: 4
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Gary Tom | g****m@m****a | 15 |