bio-diffusion

A geometry-complete diffusion generative model (GCDM) for 3D molecule generation and optimization. (Nature CommsChem)

https://github.com/bioinfomachinelearning/bio-diffusion

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, nature.com, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.0%) to scientific vocabulary

Keywords

computational-biology computational-chemistry deep-learning generative-model graph-neural-networks machine-learning
Last synced: 6 months ago

Repository

A geometry-complete diffusion generative model (GCDM) for 3D molecule generation and optimization. (Nature CommsChem)

Basic Info
  • Host: GitHub
  • Owner: BioinfoMachineLearning
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 42.8 MB
Statistics
  • Stars: 202
  • Watchers: 3
  • Forks: 27
  • Open Issues: 1
  • Releases: 1
Topics
computational-biology computational-chemistry deep-learning generative-model graph-neural-networks machine-learning
Created about 3 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

# Bio-Diffusion

PyTorch Lightning · Config: Hydra

[![Paper](http://img.shields.io/badge/arXiv-2302.04313-B31B1B.svg)](https://arxiv.org/abs/2302.04313) [![Datasets DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7881981.svg)](https://doi.org/10.5281/zenodo.7881981) [![Checkpoints DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13375913.svg)](https://doi.org/10.5281/zenodo.13375913)

![Bio-Diffusion.png](./img/Bio-Diffusion.png)

Description

This is the official codebase of the paper

Geometry-Complete Diffusion for 3D Molecule Generation and Optimization, Nature CommsChem

[arXiv](https://arxiv.org/abs/2302.04313) [Nature CommsChem]

![Animation of diffusion model-generated 3D molecules visualized successively](img/GCDM_Sampled_Molecule_Trajectory.gif)

System requirements

OS requirements

This package supports Linux. It has been tested on the following Linux system: AlmaLinux release 8.9 (Midnight Oncilla).

Python dependencies

This package is developed and tested under Python 3.9.x. The primary Python packages and their versions are as follows; for more details, please refer to the environment.yaml file.

```
hydra-core=1.2.0
matplotlib-base=3.4.3
numpy=1.23.1
pyg=2.2.0=py39_torch_1.12.0_cu116
python=3.9.15
pytorch=1.12.1=py3.9_cuda11.6_cudnn8.3.2_0
pytorch-cluster=1.6.0=py39_torch_1.12.0_cu116
pytorch-scatter=2.1.0=py39_torch_1.12.0_cu116
pytorch-sparse=0.6.16=py39_torch_1.12.0_cu116
pytorch-lightning=1.7.7
scikit-learn=1.1.2
torchmetrics=0.10.2
```

Installation guide

Install mamba (~500 MB: ~1 minute)

```bash
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh  # (optionally) remove installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result
```

Install dependencies (~15 GB: ~10 minutes)

```bash
# clone project
git clone https://github.com/BioinfoMachineLearning/bio-diffusion
cd bio-diffusion

# create conda environment
mamba env create -f environment.yaml
conda activate bio-diffusion  # note: one still needs to use conda to (de)activate environments

# install local project as package
pip3 install -e .
```
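
After installation, a quick sanity check can confirm that the expected package versions from the dependency list above actually resolved in the environment (a minimal sketch that only assumes the packages listed in environment.yaml are importable):

```python
# check_env.py -- sanity-check the resolved bio-diffusion environment (sketch)
import torch
import torch_geometric
import pytorch_lightning as pl

print(f"PyTorch: {torch.__version__}")                      # expected 1.12.1
print(f"PyTorch Geometric: {torch_geometric.__version__}")  # expected 2.2.0
print(f"PyTorch Lightning: {pl.__version__}")               # expected 1.7.7
print(f"CUDA available: {torch.cuda.is_available()}")       # should be True on a GPU machine
```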

Download data (~100 GB extracted: ~4 hours)

```bash
# fetch, extract, and clean up preprocessed data
wget https://zenodo.org/record/7881981/files/EDM.tar.gz
tar -xzf EDM.tar.gz
rm EDM.tar.gz
```

Download checkpoints (~5 GB extracted: ~5 minutes)

Note: Make sure to be located in the project's root directory beforehand (e.g., ~/bio-diffusion/)

```bash
# fetch and extract model checkpoints directory
wget https://zenodo.org/record/13375913/files/GCDMCheckpoints.tar.gz
tar -xzf GCDMCheckpoints.tar.gz
rm GCDMCheckpoints.tar.gz
```

**Note**: EGNN molecular property prediction checkpoints are also included within `GCDMCheckpoints.tar.gz`, where three checkpoints per property were trained with random seeds (18 in total). Also included in this Zenodo model checkpoints record are trained GeoLDM (Xu et al. 2023) checkpoint files used to produce the benchmarking results in the accompanying GCDM manuscript.
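
Before running the demos below, it can help to verify that the checkpoints extracted to the paths the example commands expect (a minimal sketch; the two paths are taken from the demo commands in this README):

```python
# verify_checkpoints.py -- confirm demo checkpoints extracted where expected (sketch)
from pathlib import Path

expected = [
    "checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt",
    "checkpoints/GEOM/Unconditional/36hq94x5_model_1_epoch_76-EMA.ckpt",
]
for path in expected:
    status = "found" if Path(path).is_file() else "MISSING"
    print(f"{status}: {path}")
```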

Demo

Generate new unconditional 3D molecules (QM9)

Unconditionally generate small molecules similar to those contained within the QM9 dataset (~5 minutes)

```bash
python3 src/mol_gen_sample.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" num_samples=250 num_nodes=19 all_frags=true sanitize=false relax=false num_resamplings=1 jump_length=1 num_timesteps=1000 output_dir="./" seed=123
```

NOTE: Output .sdf files will be stored in the current working directory by default. Specify this using output_dir. Run python3 src/mol_gen_sample.py --help to view an exhaustive list of available input arguments.

CONSIDER: Running bust MY_GENERATED_MOLS.sdf to determine which of the generated molecules are valid according to the PoseBusters software suite (~3 minutes).
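
As a lighter-weight first pass before (or alongside) PoseBusters, one can check how many of the generated molecules parse and sanitize cleanly with RDKit (a minimal sketch; `MY_GENERATED_MOLS.sdf` is a placeholder for whichever .sdf file was written to `output_dir`):

```python
# quick_validity_check.py -- first-pass RDKit validity check of generated molecules (sketch)
from rdkit import Chem

supplier = Chem.SDMolSupplier("MY_GENERATED_MOLS.sdf", sanitize=True, removeHs=False)
mols = list(supplier)                       # entries that fail parsing/sanitization come back as None
valid = [m for m in mols if m is not None]
print(f"{len(valid)}/{len(mols)} generated molecules parsed and sanitized successfully")
```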

Generate new property-conditional 3D molecules (QM9)

Property-conditionally generate small molecules similar to those contained within the QM9 dataset (~10 minutes)

```bash
# alpha
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/alpha_model_epoch_1619-EMA.ckpt" property=alpha iterations=100 batch_size=100 sweep_property_values=true num_sweeps=10 output_dir="./" seed=123

# gap
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/gap_model_epoch_1659-EMA.ckpt" property=gap iterations=100 batch_size=100 sweep_property_values=true num_sweeps=10 output_dir="./" seed=123

# homo
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/homo_model_epoch_1879-EMA.ckpt" property=homo iterations=100 batch_size=100 sweep_property_values=true num_sweeps=10 output_dir="./" seed=123

# lumo
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/lumo_model_epoch_1619-EMA.ckpt" property=lumo iterations=100 batch_size=100 sweep_property_values=true num_sweeps=10 output_dir="./" seed=123

# mu
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/mu_model_epoch_1859-EMA.ckpt" property=mu iterations=100 batch_size=100 sweep_property_values=true num_sweeps=10 output_dir="./" seed=123

# Cv
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/Cv_model_epoch_1539-EMA.ckpt" property=Cv iterations=100 batch_size=100 sweep_property_values=true num_sweeps=10 output_dir="./" seed=123
```

NOTE: Output .sdf files will be stored in the current working directory by default. Specify this using output_dir. Run python3 src/mol_gen_eval_conditional_qm9.py --help to view an exhaustive list of available input arguments.

CONSIDER: Running bust MY_GENERATED_MOLS.sdf to determine which of the generated molecules are valid according to the PoseBusters software suite (~3 minutes).

Generate new unconditional 3D molecules (GEOM-Drugs)

Unconditionally generate drug-size molecules similar to those contained within the GEOM-Drugs dataset (~15 minutes)

```bash
python3 src/mol_gen_sample.py datamodule=edm_geom model=geom_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path="checkpoints/GEOM/Unconditional/36hq94x5_model_1_epoch_76-EMA.ckpt" num_samples=250 num_nodes=44 all_frags=true sanitize=false relax=false num_resamplings=1 jump_length=1 num_timesteps=1000 output_dir="./" seed=123
```

NOTE: Output .sdf files will be stored in the current working directory by default. Specify this using output_dir. Run python3 src/mol_gen_sample.py --help to view an exhaustive list of available input arguments.

CONSIDER: Running bust MY_GENERATED_MOLS.sdf to determine which of the generated molecules are valid according to the PoseBusters software suite (~3 minutes).

Optimize 3D molecules for molecular stability and various molecular properties (QM9)

```bash
# e.g., unconditionally generate a batch of samples to property-optimize
# NOTE: alpha is listed here, but it will not be referenced for the (initial) unconditional molecule generation
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/alpha_model_epoch_1619-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_alpha_seed_1" num_samples=1000 sampling_output_dir="./mols_to_optimize/" property=alpha iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=true use_pregenerated_molecules=false

# optimize generated samples for specific molecular properties, where alpha is used in this example
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/alpha_model_epoch_1619-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_alpha_seed_1" num_samples=1000 sampling_output_dir="./mols_to_optimize/" property=alpha iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=false use_pregenerated_molecules=true save_molecules=true
```

NOTE: Output .sdf files will be stored under ./outputs/. Run python3 src/mol_gen_eval_optimization_qm9.py --help to view an exhaustive list of available input arguments.

CONSIDER: Running bust MY_GENERATED_MOLS.sdf to determine which of the generated molecules are valid according to the PoseBusters software suite (~3 minutes).

Instructions for use

How to train new models

Train model with default configuration

```bash
# train on CPU
python src/train.py trainer=cpu

# train on GPU
python src/train.py trainer=gpu
```

Train model with chosen experiment configuration from configs/experiment/

```bash
python src/train.py experiment=experiment_name.yaml
```

Train a model for unconditional small molecule generation with the QM9 dataset (QM9)

```bash
python3 src/train.py experiment=qm9_mol_gen_ddpm.yaml
```

Train a model for property-conditional small molecule generation with the QM9 dataset (QM9)

```bash
# choose a value for model.module_cfg.conditioning from the properties [alpha, gap, homo, lumo, mu, Cv]
python3 src/train.py experiment=qm9_mol_gen_conditional_ddpm.yaml model.module_cfg.conditioning=[alpha]
```

Train a model for unconditional drug-size molecule generation with the GEOM-Drugs dataset (GEOM-Drugs)

```bash
python3 src/train.py experiment=geom_mol_gen_ddpm.yaml
```

Note: You can override any parameter from the command line like this:

```bash
python src/train.py trainer.max_epochs=20 datamodule.dataloader_cfg.batch_size=64
```
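
Because the project is driven by Hydra configs, overrides can also be previewed without launching a run by composing the configuration in Python. This is a sketch using Hydra's compose API; the `configs` directory and `train.yaml` config name are assumptions about this repository's layout and may need adjusting:

```python
# inspect_config.py -- preview how Hydra resolves command-line overrides (sketch)
from hydra import compose, initialize
from omegaconf import OmegaConf

# NOTE: "configs" and "train.yaml" are assumed names for this repo's Hydra layout
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="train.yaml",
        overrides=["trainer.max_epochs=20", "datamodule.dataloader_cfg.batch_size=64"],
    )

print(OmegaConf.to_yaml(cfg.trainer))  # show the merged trainer settings after overrides
```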

How to reproduce paper results

Reproduce paper results for unconditional small molecule generation with the QM9 dataset (QM9 Unconditional: ~2 hrs)

```bash
# note: trainer.devices=[0] selects the CUDA device available at index 0 - customize as needed using e.g., nvidia-smi
python3 src/mol_gen_eval.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false num_samples=10000 sampling_batch_size=100 num_test_passes=5 save_molecules=True output_dir=output/QM9/Unconditional/gcdm_model_1/

# ... repeat 5 times in total ...

python3 src/mol_gen_eval.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false num_samples=10000 sampling_batch_size=100 num_test_passes=5 save_molecules=True output_dir=output/QM9/Unconditional/gcdm_model_5/
```

NOTE: Refer to src/analysis/inference_analysis.py and src/analysis/molecule_analysis.py to manually enter and analyze the unconditional results reported by the commands above. Also keep in mind that molecule_analysis.py, in contrast to the rest of the codebase, uses OpenBabel to infer bonds for the XYZ files saved by mol_gen_eval.py. This distinction for bond inference considerably impacts the performance of each method as measured by this script.
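
For reference, the OpenBabel-based bond inference mentioned above can be reproduced on a single saved XYZ file via the `pybel` bindings (a minimal sketch, assuming the OpenBabel Python bindings are installed; `generated_molecule.xyz` is a hypothetical filename):

```python
# infer_bonds.py -- OpenBabel bond inference for an XYZ file, as described for molecule_analysis.py (sketch)
from openbabel import pybel

# reading an XYZ file makes OpenBabel perceive bonds from the 3D coordinates alone
mol = next(pybel.readfile("xyz", "generated_molecule.xyz"))
mol.write("sdf", "generated_molecule_with_bonds.sdf", overwrite=True)
print(f"{len(mol.atoms)} atoms; inferred SMILES: {mol.write('smi').strip()}")
```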

Reproduce paper results for property-conditional small molecule generation with the QM9 dataset (QM9 Conditional: ~12 hrs)

```bash
# alpha (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_alpha_seed_$SEED", where SEED=[1, 64, 83])
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/alpha_model_epoch_1619-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_alpha_seed_N" property=alpha iterations=100 batch_size=100 save_molecules=True output_dir=output/QM9/Conditional/gcdm_model_1_alpha/

# gap (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_gap_$SEED", where SEED=[1, 471, 43149])
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/gap_model_epoch_1659-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_gap_seed_N" property=gap iterations=100 batch_size=100 save_molecules=True output_dir=output/QM9/Conditional/gcdm_model_1_gap/

# homo (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_homo_$SEED", where SEED=[1, 4, 14])
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/homo_model_epoch_1879-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_homo_seed_N" property=homo iterations=100 batch_size=100 save_molecules=True output_dir=output/QM9/Conditional/gcdm_model_1_homo/

# lumo (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_lumo_$SEED", where SEED=[1, 427, 745])
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/lumo_model_epoch_1619-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_lumo_seed_N" property=lumo iterations=100 batch_size=100 save_molecules=True output_dir=output/QM9/Conditional/gcdm_model_1_lumo/

# mu (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_mu_$SEED", where SEED=[1, 39, 86])
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/mu_model_epoch_1859-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_mu_seed_N" property=mu iterations=100 batch_size=100 save_molecules=True output_dir=output/QM9/Conditional/gcdm_model_1_mu/

# Cv (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_Cv_$SEED", where SEED=[1, 8, 89])
python3 src/mol_gen_eval_conditional_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false generator_model_filepath="checkpoints/QM9/Conditional/Cv_model_epoch_1539-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_Cv_seed_N" property=Cv iterations=100 batch_size=100 save_molecules=True output_dir=output/QM9/Conditional/gcdm_model_1_Cv/
```

NOTE: Refer to src/analysis/inference_analysis.py, src/analysis/molecule_analysis.py, and src/analysis/qm_analysis.py to manually enter and analyze the property-conditional results reported by the commands above.

Reproduce paper results for unconditional drug-size molecule generation with the GEOM-Drugs dataset (GEOM-Drugs Unconditional: ~24 hrs)

```bash
python3 src/mol_gen_eval.py datamodule=edm_geom model=geom_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path="checkpoints/GEOM/Unconditional/36hq94x5_model_1_epoch_76-EMA.ckpt" datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false num_samples=10000 sampling_batch_size=100 num_test_passes=5 save_molecules=True output_dir=output/GEOM/Unconditional/gcdm_model_1/

# ... repeat 5 times in total ...

python3 src/mol_gen_eval.py datamodule=edm_geom model=geom_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] ckpt_path="checkpoints/GEOM/Unconditional/36hq94x5_model_1_epoch_76-EMA.ckpt" datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false num_samples=10000 sampling_batch_size=100 num_test_passes=5 save_molecules=True output_dir=output/GEOM/Unconditional/gcdm_model_5/
```

NOTE: Refer to src/analysis/inference_analysis.py, src/analysis/molecule_analysis.py, src/analysis/qm_analysis.py, and src/analysis/bust_analysis.py to manually enter and analyze the unconditional results reported by the commands above.

Reproduce paper results for property-specific small molecule optimization with the QM9 dataset (QM9 Guided: ~12 hrs)

```bash
# unconditionally generate a batch of samples to property-optimize
# NOTE: alpha is listed here, but it will not be referenced for the (initial) unconditional molecule generation
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/alpha_model_epoch_1619-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_alpha_seed_1" num_samples=1000 sampling_output_dir="./optim_mols/" property=alpha iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=true use_pregenerated_molecules=false

# optimize generated samples for specific molecular properties

# alpha (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_alpha_seed_$SEED", where SEED=[1, 64, 83])
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/alpha_model_epoch_1619-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_alpha_seed_N" num_samples=1000 sampling_output_dir="./optim_mols/" property=alpha iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=false use_pregenerated_molecules=true

# gap (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_gap_$SEED", where SEED=[1, 471, 43149])
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/gap_model_epoch_1659-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_gap_seed_N" num_samples=1000 sampling_output_dir="./optim_mols/" property=gap iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=false use_pregenerated_molecules=true

# homo (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_homo_$SEED", where SEED=[1, 4, 14])
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/homo_model_epoch_1879-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_homo_seed_N" num_samples=1000 sampling_output_dir="./optim_mols/" property=homo iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=false use_pregenerated_molecules=true

# lumo (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_lumo_$SEED", where SEED=[1, 427, 745])
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/lumo_model_epoch_1619-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_lumo_seed_N" num_samples=1000 sampling_output_dir="./optim_mols/" property=lumo iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=false use_pregenerated_molecules=true

# mu (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_mu_$SEED", where SEED=[1, 39, 86])
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/mu_model_epoch_1859-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_mu_seed_N" num_samples=1000 sampling_output_dir="./optim_mols/" property=mu iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=false use_pregenerated_molecules=true

# Cv (repeat for classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_Cv_$SEED", where SEED=[1, 8, 89])
python3 src/mol_gen_eval_optimization_qm9.py datamodule=edm_qm9 model=qm9_mol_gen_ddpm logger=csv trainer.accelerator=gpu trainer.devices=[0] datamodule.dataloader_cfg.num_workers=1 model.diffusion_cfg.sample_during_training=false unconditional_generator_model_filepath="checkpoints/QM9/Unconditional/model_1_epoch_979-EMA.ckpt" conditional_generator_model_filepath="checkpoints/QM9/Conditional/Cv_model_epoch_1539-EMA.ckpt" classifier_model_dir="checkpoints/QM9/Property_Classifiers/exp_class_Cv_seed_N" num_samples=1000 sampling_output_dir="./optim_mols/" property=Cv iterations=10 num_optimization_timesteps=100 return_frames=1 generate_molecules_only=false use_pregenerated_molecules=true
```

NOTE: Refer to src/analysis/optimization_analysis.py to manually enter and plot the optimization results reported by the commands above.

Reproduce paper results for protein-conditional small molecule generation with the Binding MOAD and CrossDocked datasets (Binding MOAD & CrossDocked: ~5 days)

Please refer to the following dedicated GitHub repository for further details: https://github.com/BioinfoMachineLearning/GCDM-SBDD.

Docker

To run this project in a Docker container, you can use the following commands:

```bash
# Build the image
docker build -t bio-diffusion .

# Run the container (with GPUs and mounting the current directory)
docker run -it --gpus all -v .:/mnt --name bio-diffusion bio-diffusion
```

__Note:__ You will still need to download the checkpoints and data as described in the installation guide. Then, update the Python commands to point to the desired local location of your files (e.g., `/mnt/checkpoints` and `/mnt/outputs`) once in the container.

Acknowledgements

Bio-Diffusion builds upon the source code and data from the following projects:

We thank all their contributors and maintainers!

License

This project is covered under the MIT License.

Citation

If you use the code or data associated with this package or otherwise find this work useful, please cite:

```bibtex
@article{morehead2024geometry,
  title={Geometry-complete diffusion for 3D molecule generation and optimization},
  author={Morehead, Alex and Cheng, Jianlin},
  journal={Communications Chemistry},
  volume={7},
  number={1},
  pages={150},
  year={2024},
  publisher={Nature Publishing Group UK London}
}
```

Owner

  • Name: BioinfoMachineLearning
  • Login: BioinfoMachineLearning
  • Kind: organization

Citation (citation.bib)

@article{morehead2024geometry,
  title={Geometry-complete diffusion for 3D molecule generation and optimization},
  author={Morehead, Alex and Cheng, Jianlin},
  journal={Communications Chemistry},
  volume={7},
  number={1},
  pages={150},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

GitHub Events

Total
  • Issues event: 7
  • Watch event: 36
  • Issue comment event: 12
  • Push event: 1
  • Fork event: 4
Last Year
  • Issues event: 7
  • Watch event: 36
  • Issue comment event: 12
  • Push event: 1
  • Fork event: 4

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 1
  • Average time to close issues: 1 day
  • Average time to close pull requests: 6 months
  • Total issue authors: 4
  • Total pull request authors: 1
  • Average comments per issue: 0.75
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 1
  • Average time to close issues: 1 day
  • Average time to close pull requests: 6 months
  • Issue authors: 4
  • Pull request authors: 1
  • Average comments per issue: 0.75
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • 18hfliu (4)
  • charlotte0104 (2)
  • cengc13 (1)
  • chengfengke (1)
  • Daisuke239 (1)
  • lfs119 (1)
Pull Request Authors
  • colbyford (2)
  • hotwa (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 3
  • Total downloads: unknown
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 3
proxy.golang.org: github.com/BioinfoMachineLearning/Bio-Diffusion
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.7%
Dependent repos count: 5.9%
Last synced: 6 months ago
proxy.golang.org: github.com/bioinfomachinelearning/bio-diffusion
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.4%
Average: 6.6%
Dependent repos count: 6.9%
Last synced: 6 months ago
proxy.golang.org: github.com/BioinfoMachineLearning/bio-diffusion
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.4%
Average: 6.6%
Dependent repos count: 6.9%
Last synced: 6 months ago