minimal_vae_on_selfies

A minimal implementation of a VAE on molecules, encoded as SELFIES

https://github.com/miguelgondu/minimal_vae_on_selfies


Repository

Basic Info
  • Host: GitHub
  • Owner: miguelgondu
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 2.13 MB
Statistics
  • Stars: 8
  • Watchers: 3
  • Forks: 2
  • Open Issues: 1
  • Releases: 0
Created about 3 years ago · Last pushed over 2 years ago

readme.md

Minimal VAE on SELFIES

[Figure: four molecules sampled from the latent space.]

SELFIES are a robust, discrete representation of molecules^1 and a successor of sorts to SMILES^2. This repository contains a minimal working example of an MLP variational autoencoder (VAE) that can be trained on SELFIES of at most 20 tokens. It covers how to download and build a database of SELFIES strings, how to process these strings as categorical data, the training loops, and some samples from the latent space (see above).

Prerequisites

This code was written with Python 3.9 in mind. If you are using Conda, try

```sh
conda create -n minimal-vae-on-selfies python=3.9
```

followed by

```sh
pip install -r requirements.txt
```

You should add the src/ folder to your PYTHONPATH. If you're using VSCode for running and debugging, this can be done by adding this key-value pair to your launch.json:

```json
"env": {
    "PYTHONPATH": "${workspaceFolder}${pathSeparator}src:${env:PYTHONPATH}"
}
```

Data preprocessing

In src/data_preprocessing you can find files for downloading the dataset (PubChem's CID-SMILES, hosted at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-SMILES.gz) and for processing it into a small dataset of 50000 SELFIES strings in data/processed/TINY-CID-SELFIES-20, which is already available in the repo.

If you want access to the other datasets (for, say, a larger training run), you can run the following scripts. Warning: you will need plenty of disk space, since the uncompressed CID-SMILES file alone takes 8 GB.

  • download_dataset.py lets you download the dataset and decompress it.
  • smiles_to_strings.py reads the file at data/raw/CID-SMILES in chunks and progressively translates the SMILES to SELFIES using The Matter Lab's translator.^3 Each processed chunk is appended to the end of a CID-SELFIES file in data/processed/CID-SELFIES.
  • small_selfies.py filters out all the SELFIES in the dataset that are longer than 20 tokens, writing the result to data/processed/SMALL-CID-SELFIES-20. Finally, a subset of only 50000 of these SELFIES (each at most 20 tokens long) is stored in data/processed/TINY-CID-SELFIES-20, which is the file used for training. However, the models and training pipeline are built in such a way that training on the entire SMALL-CID-SELFIES-20 should be feasible with a little bit of work.[^4]
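
The last filtering step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code in small_selfies.py: the helper names (`count_tokens`, `filter_short`, `subsample`) are hypothetical, and drawing the 50000-element subset uniformly at random with a fixed seed is an assumption, since the repository's script may select the subset differently.

```python
import random
import re
from typing import Iterable, List


def count_tokens(selfie: str) -> int:
    """Counts the bracketed tokens (e.g. [C] or [=O]) in a SELFIES string."""
    return len(re.findall(r"\[.*?\]", selfie))


def filter_short(selfies: Iterable[str], max_tokens: int = 20) -> List[str]:
    """Keeps only the SELFIES strings with at most `max_tokens` tokens."""
    return [s for s in selfies if count_tokens(s) <= max_tokens]


def subsample(selfies: List[str], size: int = 50_000, seed: int = 0) -> List[str]:
    """Draws a reproducible random subset of at most `size` strings (assumed strategy)."""
    rng = random.Random(seed)
    return rng.sample(selfies, k=min(size, len(selfies)))


example = ["[C][C][O]", "[C]" * 25, "[N][=C][O]"]
small = filter_short(example)    # drops the 25-token string
tiny = subsample(small, size=2)  # here: both remaining strings, in random order
```

For the full CID-SELFIES file, the same filter would have to be applied chunk by chunk rather than on an in-memory list.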

This figure shows the size of these datasets in log-scale.

Dataset sizes

We can also take a look at the distribution of sequence lengths. The following plot shows that 99% of the sequences have fewer than 221 tokens. The plot is truncated at 320 tokens, but the maximum sequence length is 1840.

Lengths of sequences
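
A length statistic like the 99% threshold mentioned above can be computed along these lines. This is a rough sketch with hypothetical helper names (`token_lengths`, `length_at_quantile`); the repository may compute it differently, for instance with numpy or pandas.

```python
import math
import re
from typing import List


def token_lengths(selfies: List[str]) -> List[int]:
    """Number of bracketed tokens in each SELFIES string."""
    return [len(re.findall(r"\[.*?\]", s)) for s in selfies]


def length_at_quantile(lengths: List[int], q: float) -> int:
    """Smallest length L such that at least a fraction q of the sequences have length <= L."""
    ordered = sorted(lengths)
    index = max(math.ceil(q * len(ordered)) - 1, 0)
    return ordered[index]


lengths = token_lengths(["[C]", "[C][C]", "[C][C][O]", "[C]" * 40])
median_length = length_at_quantile(lengths, 0.5)
longest = length_at_quantile(lengths, 1.0)
```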

Ideally, we would also include other datasets like ZINC20^5 or GuacaMol^6. This is included as a TODO at the end of this README.

Tokenizing

The tokenizing is done by hand. The script src/tokenizing/compute_tokens.py takes a database name (like TINY-CID-SELFIES-20) and computes a tokens_dict: Dict[str, int] and saves it inside data/processed/tokens_TINY-CID-SELFIES-20.json. We define the tokens as all the substrings in a SELFIES between square brackets:

```python
# From src/utils/tokens.py
import re
from typing import List


def from_selfie_to_tokens(selfie: str) -> List[str]:
    """
    Given a selfie string, returns a list of all
    occurrences of [.*?] in the string (i.e. whatever
    is between square brackets).
    """
    # Non-greedy match, so that each bracketed token is returned separately.
    return list(re.findall(r"\[.*?\]", selfie))
```

This doesn't scale. Smarter tokenizers are available out there. For example, check MolGen.
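
As a rough illustration of the by-hand approach, a token dictionary could be built as follows. The function name `build_tokens_dict`, the use of `[nop]` (SELFIES' no-operation symbol) as a padding token at index 0, and the alphabetical ordering of the remaining tokens are all assumptions made for this sketch; compute_tokens.py may order and pad differently.

```python
import re
from typing import Dict, List

# Assumption: [nop] is used as the padding token and always gets index 0.
PAD_TOKEN = "[nop]"


def build_tokens_dict(selfies: List[str]) -> Dict[str, int]:
    """Maps every token seen in `selfies` (plus the padding token) to an integer index."""
    tokens = set()
    for selfie in selfies:
        tokens.update(re.findall(r"\[.*?\]", selfie))
    # Sort for reproducible indices, keeping the padding token first.
    ordered = [PAD_TOKEN] + sorted(tokens - {PAD_TOKEN})
    return {token: index for index, token in enumerate(ordered)}


tokens_dict = build_tokens_dict(["[C][C][O]", "[C][=O]"])
```

The whole dataset is scanned in memory here, which is one reason this style of tokenizing becomes impractical on the larger datasets.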

Model's definition

We implement a simple MLP Variational Autoencoder.

```python
# Inside the model's definition in src/models/vae.py

# Define the input length: length of a given SELFIES
# (always padded to be {max_length}), times the number of tokens
self.input_length = max_token_length * len(self.tokens_dict)

# Define the model
self.encoder = nn.Sequential(
    nn.Linear(self.input_length, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
)
self.encoder_mu = nn.Linear(256, latent_dim)
self.encoder_log_var = nn.Linear(256, latent_dim)

# The decoder, which outputs the logits of the categorical
# distribution over the vocabulary.
self.decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, self.input_length),
)
```
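
To make the input length concrete, here is a minimal sketch in plain Python (no PyTorch) of how a padded SELFIES string can become a flat one-hot vector of length `max_token_length * len(tokens_dict)`. The function name `one_hot_flat`, the `[nop]` padding token, and the position-major flattening order are assumptions for illustration; the repository's own encoding may be laid out differently.

```python
import re
from typing import Dict, List


def one_hot_flat(
    selfie: str,
    tokens_dict: Dict[str, int],
    max_token_length: int,
    pad_token: str = "[nop]",  # assumed padding token
) -> List[float]:
    """Pads `selfie` to `max_token_length` tokens and returns its flattened one-hot encoding."""
    tokens = re.findall(r"\[.*?\]", selfie)
    if len(tokens) > max_token_length:
        raise ValueError("SELFIES string is longer than max_token_length")
    tokens = tokens + [pad_token] * (max_token_length - len(tokens))

    vocab_size = len(tokens_dict)
    flat = [0.0] * (max_token_length * vocab_size)
    for position, token in enumerate(tokens):
        # Each position owns a contiguous block of `vocab_size` entries.
        flat[position * vocab_size + tokens_dict[token]] = 1.0
    return flat


toy_tokens_dict = {"[nop]": 0, "[C]": 1, "[O]": 2}
x = one_hot_flat("[C][O]", toy_tokens_dict, max_token_length=3)
```

With 3 positions and 3 tokens, `x` has 9 entries, exactly one of which is set per position. The decoder's output has the same length and can be read block by block as per-position logits over the vocabulary.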

After running src/training/training_models.py, a trained model is saved in data/trained_models/VAESelfies_TINY-CID-SELFIES-20.pt (download it here).

This model is nowhere close to state-of-the-art. The goal of this code is to have a toy latent space to run experiments in, or to have a starting point for your new SOTA models!

Some random samples

The script in src/exploration/explore_latent_space.py draws random samples from the trained model's latent space. Some of them are shown at the top of this README.
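
The decoding side of such sampling can be sketched without PyTorch. The snippet below assumes that latent codes are drawn from a standard normal prior and that the decoder's flat logits are turned into tokens by taking the most likely token at each position; both are assumptions for this sketch, as are the helper names `sample_latent` and `logits_to_selfies`. In the real script, the logits would come from the trained decoder applied to the latent code.

```python
import random
from typing import Dict, List


def sample_latent(latent_dim: int, rng: random.Random) -> List[float]:
    """Draws a latent code z from a standard normal prior (assumed)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


def logits_to_selfies(
    flat_logits: List[float],
    tokens_dict: Dict[str, int],
    max_token_length: int,
    pad_token: str = "[nop]",  # assumed padding token
) -> str:
    """Picks the highest-logit token at each position and joins them, dropping padding."""
    vocab_size = len(tokens_dict)
    index_to_token = {index: token for token, index in tokens_dict.items()}
    tokens = []
    for position in range(max_token_length):
        row = flat_logits[position * vocab_size : (position + 1) * vocab_size]
        best = max(range(vocab_size), key=row.__getitem__)
        tokens.append(index_to_token[best])
    return "".join(token for token in tokens if token != pad_token)


toy_tokens_dict = {"[nop]": 0, "[C]": 1, "[O]": 2}
z = sample_latent(latent_dim=2, rng=random.Random(0))
# Stand-in for decoder(z): hand-written logits for 3 positions x 3 tokens.
toy_logits = [0.1, 2.0, -1.0, 0.0, 0.5, 3.0, 4.0, 0.0, 0.0]
selfie = logits_to_selfies(toy_logits, toy_tokens_dict, max_token_length=3)
```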

Some TO-DOs

  • [x] ~~Latent space optimization of MolSkill and QED using Evolutionary Strategies.~~
  • [x] ~~Assess the lengths of all molecules in CID-SELFIES~~
  • [ ] Scalable tokenizing using MolGen's tokenizer.
  • [ ] Training on all of SMALL-CID-SELFIES-20.
  • [ ] Latent space optimization of MolSkill and QED using Bayesian Optimization.
  • [ ] Include other datasets, like GuacaMol or ZINC20.
  • [ ] Better models, like an autoregressive VAE using LSTMs, or a transformer.

Cite this repository!

If you find this code useful, feel free to cite it!

```bibtex
@software{Gonzalez-Duque:VAESelfies:2023,
    author = {González-Duque, Miguel},
    title = {{Minimal implementation of a Variational Autoencoder on SELFIES representations of molecules}},
    url = {https://github.com/miguelgondu/minimal_VAE_on_selfies},
    version = {0.1},
    date = {2023-05-02}
}
```

[^4]: We would need a smarter tokenizer, like the one provided by MolGen. We'd also need to do better with the memory management.

Owner

  • Name: Miguel González Duque
  • Login: miguelgondu
  • Kind: user

Mathematician with interests in optimization, probabilistic modeling, and differential geometry.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "González-Duque"
  given-names: "Miguel"
title: "Minimal implementation of a Variational Autoencoder on SELFIES representations of molecules"
version: 0.1
date-released: 2023-05-02
url: "https://github.com/miguelgondu/minimal_VAE_on_selfies"

Dependencies

requirements.txt pypi
  • CairoSVG ==2.7.0
  • botorch ==0.8.4
  • evotorch ==0.4.1
  • gpytorch ==1.10
  • matplotlib ==3.7.1
  • numpy ==1.24.3
  • pandas ==2.0.1
  • rdkit ==2022.9.5
  • seaborn ==0.12.2
  • selfies ==2.1.1
  • torch ==2.0.0
requirements-dev.txt pypi
  • black * development
  • pytest * development