https://github.com/aspuru-guzik-group/dionysus

For analysis of calibration, performance, and generalizability of probabilistic models on small molecular datasets. Paper on RSC Digital Discovery: https://pubs.rsc.org/en/content/articlehtml/2023/dd/d2dd00146b


Science Score: 20.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: rsc.org
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary

Keywords

biology calibration cheminformatics chemistry probabilistic-models
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: aspuru-guzik-group
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 189 MB
Statistics
  • Stars: 20
  • Watchers: 8
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Topics
biology calibration cheminformatics chemistry probabilistic-models
Created over 3 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License

README.md

DIONYSUS: Calibration and generalizability of probabilistic models on low-data chemical datasets

This package is accompanied by this paper: Calibration and generalizability of probabilistic models on low-data chemical datasets with DIONYSUS.

Authors: Gary Tom, Riley J. Hickman, Aniket Zinzuwadia, Afshan Mohajeri, Benjamin Sanchez-Lengeling, Alan Aspuru-Guzik. (2022)

Preliminary

DIONYSUS requires the following:

  • Python >= 3.8
  • rdkit >= 2021.03.3
  • tensorflow == 2.9.0
  • tensorflow-probability == 0.17.0
  • tf-models-official == 2.7.1
  • graph_nets == 1.1.0
  • sonnet == 2.0.0
  • ngboost == 0.3.12
  • gpflow == 2.5.2
  • scikit-learn == 1.0.2
  • scikit-multilearn == 0.2.0
  • mordred == 1.2.0
  • dask == 2022.6.1
  • umap-learn == 0.5.1
  • cairosvg == 2.5.2
  • pandas == 1.4.1
  • seaborn == 0.11.2
  • hdbscan == 0.8.27

Datasets

The datasets used are in the data/ directory. To add a new dataset, place its csv file in data/raw/ and add the corresponding entry to data/dataset_registry_raw.csv. The following information is required:

  • name: name of dataset
  • shortname: name that will be referenced in the scripts
  • filename: the csv file name ({shortname}.csv)
  • task: either regression or binary (classification)
  • smiles_column: name of column containing the smiles representations
  • target_columns: name(s) of the column(s) containing the target(s) of interest
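With those fields, a registry row in data/dataset_registry_raw.csv might look like the following (the dataset name and column values are illustrative, not taken from the repo):

```
name,shortname,filename,task,smiles_column,target_columns
Aqueous solubility,sol,sol.csv,regression,smiles,logS
```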

Usage

Preprocessing

Scripts to preprocess the datasets are in the scripts/ directory. Run these before any of the experiments below.

```bash
cd scripts/

# prepare molecules and analysis directory in data/{shortname}
# canonicalize all smiles
# removes all invalid and duplicate smiles
# removes all smiles with salts, ions or fragments
python make_qsar_ready.py --dataset={shortname}

# create splits and features
# create all features used (mfp, mordred, graphtuple, graphembed)
# create the train/val/test splits used
python make_features.py --dataset={shortname}
```
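Conceptually, the cleaning step keeps one canonical copy of each valid, single-fragment SMILES. A minimal sketch of that logic, with stand-in `canonicalize` and `is_valid` callables (the actual scripts use RDKit's MolFromSmiles/MolToSmiles for both):

```python
def clean_smiles(smiles_list, canonicalize, is_valid):
    """Keep one canonical copy of each valid SMILES, preserving order."""
    seen = set()
    kept = []
    for smi in smiles_list:
        if not is_valid(smi):
            continue  # drop unparsable entries
        can = canonicalize(smi)
        if "." in can:
            continue  # "." separates disconnected fragments: salts, ions, mixtures
        if can not in seen:
            seen.add(can)
            kept.append(can)
    return kept

# Toy stand-ins: three spellings of ethanol, one invalid string, one salt.
demo = clean_smiles(
    ["CCO", "OCC", "C(C)O", "xyz", "[Na+].[Cl-]"],
    canonicalize=lambda s: "CCO" if s in ("CCO", "OCC", "C(C)O") else s,
    is_valid=lambda s: s != "xyz",
)
# demo == ["CCO"]
```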

Experiment 1: Supervised learning

Scripts to run the experiments are in the scripts/ directory.

```bash
cd scripts/

# make all predictions for specified feature/model
python make_predictions.py --dataset={shortname} --model={modelname} --feature={featurename}

# create evaluations/figures for all available prediction data
python make_evaluations.py --dataset={shortname}
```
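The evaluation step scores uncertainty quality as well as accuracy. One standard calibration check for Gaussian predictive distributions (illustrative; not necessarily the script's exact metric set): for a confidence level q, a calibrated model should see roughly a q-fraction of targets fall inside each prediction's central q-interval.

```python
from statistics import NormalDist

def coverage(means, stds, targets, q):
    """Observed fraction of targets inside the central q-interval
    of each predictive Gaussian N(mean, std**2)."""
    z = NormalDist().inv_cdf(0.5 + q / 2.0)  # interval half-width, in std units
    hits = sum(abs(t - m) <= z * s for m, s, t in zip(means, stds, targets))
    return hits / len(targets)

# Toy case: every residual is well inside the 90% interval (z ~ 1.645)
print(coverage([0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [0.1, 0.9, 2.2], q=0.9))  # 1.0
```

Sweeping q from 0 to 1 and plotting observed coverage against q gives a calibration curve; a well-calibrated model tracks the diagonal.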

Experiment 2: Bayesian optimization

Scripts are found in the bayes_opt/ directory.

All results will be contained in data/{shortname}/bayesian_optimization.

```bash
cd bayes_opt/

# run the bayesian optimization campaign
python make_bo.py --dataset={shortname} --num_restarts=30 --budget=250 --goal=minimize --frac_init_design=0.05

# create the traces and evaluation files
# also outputs files with the fraction of hits calculated
python make_bo_traces.py --dataset={shortname}
```
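Two quantities produced here are the optimization trace (running best value per iteration) and the fraction of hits (share of known top candidates found during the campaign). They can be sketched with illustrative helpers (not the scripts' actual API):

```python
def best_so_far(observations, goal="minimize"):
    """Running best objective value after each BO iteration (the trace)."""
    better = min if goal == "minimize" else max
    trace = []
    for y in observations:
        trace.append(y if not trace else better(trace[-1], y))
    return trace

def fraction_of_hits(sampled, hit_set):
    """Share of the known top candidates that the campaign sampled."""
    return len(hit_set & set(sampled)) / len(hit_set)

print(best_so_far([3.0, 5.0, 1.0, 2.0]))         # [3.0, 3.0, 1.0, 1.0]
print(fraction_of_hits(["a", "c"], {"a", "b"}))  # 0.5
```

Averaging traces over the 30 restarts gives the mean trace and its spread for each model/feature combination.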

Experiment 3: Cluster splits and generalizability

Similar to the first experiment, the scripts are in the scripts/ directory. Once the datasets are cleaned and the features are made, you can create the cluster splits and run a specified feature/model combination. In the manuscript, we do this only for mordred/GPs.

All results will be contained in data/{shortname}/generalization.

```bash
cd scripts/

# create the cluster splits
python make_generalization_splits.py --dataset={shortname}

# make predictions for each cluster split
python make_generalization_predictions.py --dataset={shortname}

# evaluate the predictions on each cluster split, and generate plots
python make_generalization_evaluations.py --dataset={shortname}
```

Proposed structure of repository (TODO)

Structure of repository:

  • mol_data: dataset preprocessor and loader. Can be extended to other datasets and new features.
  • dionysus: model-agnostic evaluation script and library. Just requires predictions.
  • dionysus_addons: models used in the paper, here for reproducibility.

...

Owner

  • Name: Aspuru-Guzik group repo
  • Login: aspuru-guzik-group
  • Kind: organization

GitHub Events

Total
  • Watch event: 4
Last Year
  • Watch event: 4

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 15
  • Total Committers: 1
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Gary Tom g****m@m****a 15
Committer Domains (Top 20 + Academic)