https://github.com/braceal/hgraph2graph

Hierarchical Generation of Molecular Graphs using Structural Motifs

https://github.com/braceal/hgraph2graph

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary
Last synced: 7 months ago · JSON representation

Repository

Hierarchical Generation of Molecular Graphs using Structural Motifs

Basic Info
  • Host: GitHub
  • Owner: braceal
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 150 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of wengong-jin/hgraph2graph
Created almost 4 years ago · Last pushed almost 4 years ago

https://github.com/braceal/hgraph2graph/blob/master/

# Hierarchical Generation of Molecular Graphs using Structural Motifs

Our paper is at https://arxiv.org/pdf/2002.03230.pdf

## Installation
First install the dependencies via conda:
 * PyTorch >= 1.0.0
 * networkx
 * RDKit >= 2019.03
 * numpy
 * Python >= 3.6

For development on Mac
```
conda create -p ./env
conda activate ./env
conda install pip
./env/bin/pip install -r requirements/dev.txt
./env/bin/pip install -e .
./env/bin/pip install torch
./env/bin/pip install rdkit-pypi
./env/bin/pip install networkx
```

And then run `pip install .`. Additional dependency for property-guided finetuning:
 * Chemprop >= 1.2.0


## Data Format
* For graph generation, each line of a training file is a SMILES string of a molecule
* For graph translation, each line of a training file is a pair of molecules (molA, molB) that are similar to each other but molB has better chemical properties. Please see `data/qed/train_pairs.txt`. The test file is a list of molecules to be optimized. Please see `data/qed/test.txt`.

## Molecule generation pretraining procedure
We can train a molecular language model on a large corpus of unlabeled molecules. We have uploaded a model checkpoint pre-trained on ChEMBL dataset in `ckpt/chembl-pretrained/model.ckpt`. If you wish to train your own language model, please follow the steps below:

1. Extract substructure vocabulary from a given set of molecules:
```
python get_vocab.py --ncpu 16 < data/chembl/all.txt > vocab.txt
```

2. Preprocess training data:
```
python preprocess.py --train data/chembl/all.txt --vocab data/chembl/all.txt --ncpu 16 --mode single
mkdir train_processed
mv tensor* train_processed/
```

3. Train graph generation model
```
mkdir ckpt/chembl-pretrained
python train_generator.py --train train_processed/ --vocab data/chembl/vocab.txt --save_dir ckpt/chembl-pretrained
```

4. Sample molecules from a model checkpoint
```
python generate.py --vocab data/chembl/vocab.txt --model ckpt/chembl-pretrained/model.ckpt --nsamples 1000
```

## Property-guided molcule generation procedure (a.k.a. finetuning)
The following script loads a trained Chemprop model and finetunes a pre-trained molecule language model to generate molecules with specific chemical properties.
```
mkdir ckpt/finetune
python finetune_generator.py --train ${ACTIVE_MOLECULES} --vocab data/chembl/vocab.txt --generative_model ckpt/chembl-pretrained/model.ckpt --chemprop_model ${YOUR_PROPERTY_PREDICTOR} --min_similarity 0.1 --max_similarity 0.5 --nsample 10000 --epoch 10 --threshold 0.5 --save_dir ckpt/finetune
```
Here `${ACTIVE_MOLECULES}` should contain a list of experimentally verified active molecules. 

`${YOUR_PROPERTY_PREDICTOR}` should be a directory containing saved chemprop model checkpoint. 

`--max_similarity 0.5` means any novel molecule should have nearest neighbor similarity lower than 0.5 to any known active molecules in ${ACTIVE_MOLECULES}` file.

`--nsample 10000` means to sample 10000 molecules in each epoch. 

`--threshold 0.5` is the activity threshold. A molecule is considered as active if its predicted chemprop score is greater than 0.5.

In each epoch, generated active molecules are saved in `ckpt/finetune/good_molecules.${epoch}`. All the novel active molecules are saved in `ckpt/finetune/new_molecules.${epoch}`

## Molecule translation training procedure
Molecule translation is often useful for lead optimization (i.e., modifying a given molecule to improve its properties)

1. Extract substructure vocabulary from a given set of molecules:
```
python get_vocab.py --ncpu 16 < data/qed/mols.txt > vocab.txt
```
Please replace `data/qed/mols.txt` with your molecules.

2. Preprocess training data:
```
python preprocess.py --train data/qed/train_pairs.txt --vocab data/qed/vocab.txt --ncpu 16
mkdir train_processed
mv tensor* train_processed/
```

3. Train the model:
```
mkdir ckpt/translation
python train_translator.py --train train_processed/ --vocab data/qed/vocab.txt --save_dir ckpt/translation
```

4. Make prediction on your lead compounds (you can use any model checkpoint, here we use model.5 for illustration)
```
python translate.py --test data/qed/valid.txt --vocab data/qed/vocab.txt --model ckpt/translation/model.5 --num_decode 20 > results.csv
```

## Polymer generation
The polymer generation code is in the `polymer/` folder. The polymer generation code is similar to `train_generator.py`, but the substructures are tailored for polymers. 
For generating regular drug like molecules, we recommend to use `train_generator.py` in the root directory.

Owner

  • Name: Alex Brace
  • Login: braceal
  • Kind: user
  • Company: University of Chicago

GitHub Events

Total
Last Year