minimal_vae_on_selfies
A minimal implementation of a VAE on molecules, encoded as SELFIES
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to pubmed.ncbi, ncbi.nlm.nih.gov, acs.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: miguelgondu
- License: mit
- Language: Python
- Default Branch: main
- Size: 2.13 MB
Statistics
- Stars: 8
- Watchers: 3
- Forks: 2
- Open Issues: 1
- Releases: 0
Metadata Files
readme.md
Minimal VAE on SELFIES

SELFIES are a robust, discrete representation of molecules^1, and a sort of successor to SMILES^2. This repository contains a minimal working example of an MLP variational autoencoder that can be trained on SELFIES of at most 20 tokens. It covers how to download a database of SELFIES strings, how to process them as categorical data, the training loop, and some samples from the latent space (see above).
Prerequisites
This code was written with Python 3.9 in mind. If you are using Conda, create an environment with
```sh
conda create -n minimal-vae-on-selfies python=3.9
```
and then, with that environment active, install the dependencies with
```sh
pip install -r requirements.txt
```
You should add the `src/` directory to your `PYTHONPATH`. If you're using VSCode for running and debugging, this can be done by adding this key-value pair to your `launch.json`:
```json
"env": {
    "PYTHONPATH": "${workspaceFolder}${pathSeparator}src:${env:PYTHONPATH}"
}
```
Data preprocessing
In src/data_preprocessing you can find files for downloading the dataset (which is PubChem's CID-SMILES saved at ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-SMILES.gz) and processing it to a small dataset of 50000 SELFIES strings in data/processed/TINY-CID-SELFIES-20, which is already available in the repo.
If you want access to the other datasets (for, say, a larger training run), you can run the following scripts. Warning: you will need plenty of disk space, since the uncompressed CID-SMILES file alone takes up about 8 GB.
- `download_dataset.py` lets you download the dataset and decompress it.
- `smiles_to_strings.py` reads the file at `data/raw/CID-SMILES` in chunks and progressively translates the SMILES to SELFIES using the Matter Lab's translator.^3 Each processed chunk is appended to the end of the file `data/processed/CID-SELFIES`.
- `small_selfies.py` filters out all the SELFIES in the dataset that are longer than 20 tokens, writing the rest to `data/processed/SMALL-CID-SELFIES-20`.

Finally, a subset of only 50000 of these SELFIES (each at most 20 tokens long) is stored in `data/processed/TINY-CID-SELFIES-20`, which is the file used for training. However, the models and training pipeline are built in such a way that training on the entire `SMALL-CID-SELFIES-20` should be feasible with a little bit of work.[^4]
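The filtering and subsampling steps described above can be sketched in a few lines. This is a minimal illustration, not the code in `small_selfies.py`: the function names `count_tokens`, `filter_small` and `subsample`, the fixed seed, and the in-memory lists are all assumptions made for the example (the real scripts work on files, in chunks).

```python
import random
import re
from typing import Iterable, List


def count_tokens(selfie: str) -> int:
    """Counts SELFIES tokens, i.e. bracketed substrings like [C] or [=O]."""
    return len(re.findall(r"\[.*?\]", selfie))


def filter_small(selfies: Iterable[str], max_tokens: int = 20) -> List[str]:
    """Keeps only the SELFIES with at most `max_tokens` tokens."""
    return [s for s in selfies if count_tokens(s) <= max_tokens]


def subsample(selfies: List[str], n: int, seed: int = 0) -> List[str]:
    """Draws up to `n` strings without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(selfies, min(n, len(selfies)))


# Toy data: two short strings and one with 25 tokens, which gets dropped.
toy = ["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]", "[C]" * 25]
small = filter_small(toy, max_tokens=20)
tiny = subsample(small, n=1)
```

For the full dataset, the same filter would have to be applied chunk by chunk rather than on a list held in memory.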
This figure shows the size of these datasets in log-scale.

We can also take a look at the distribution of sequence lengths. The following plot shows that 99% of the sequences are shorter than 221 tokens. The plot is truncated at 320 tokens, but the maximum sequence length is 1840.
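A cutoff like the 99% figure quoted above is an empirical quantile of the token lengths. As a hedged sketch (the helper `empirical_quantile` and the toy lengths are made up for illustration, and the repository may compute this differently), the nearest-rank definition looks like this:

```python
import math
from typing import List


def empirical_quantile(lengths: List[int], q: float) -> int:
    """Smallest observed length L such that at least a fraction q
    of the sequences have length <= L (nearest-rank definition)."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths)
    # 1-indexed rank of the quantile, clamped so that q = 0 still works.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Toy example: lengths 1, 2, ..., 100.
toy_lengths = list(range(1, 101))
cutoff = empirical_quantile(toy_lengths, 0.99)
```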

Ideally, we would also include other datasets like ZINC20^5 or GuacaMol^6. This is listed as a TODO at the end of this README.
Tokenizing
The tokenizing is done by hand. The script src/tokenizing/compute_tokens.py takes a database name (like TINY-CID-SELFIES-20) and computes a tokens_dict: Dict[str, int] and saves it inside data/processed/tokens_TINY-CID-SELFIES-20.json. We define the tokens as all the substrings in a SELFIES between square brackets:
```python
# From src/utils/tokens.py
import re
from typing import List


def from_selfie_to_tokens(selfie: str) -> List[str]:
    r"""
    Given a selfie string, returns a list of all
    occurrences of \[.*?\] in the string (i.e. whatever
    is between square brackets).
    """
    return list(re.findall(r"\[.*?\]", selfie))
```
This doesn't scale. Smarter tokenizers are available out there. For example, check MolGen.
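To make the tokenizing step concrete, here is a minimal sketch of how a token dictionary could be built from a list of SELFIES and serialized to JSON. The function `compute_tokens_dict`, the `[nop]` padding token, and the choice of giving it index 0 are assumptions of this example, not necessarily what `compute_tokens.py` does.

```python
import json
import re
from typing import Dict, Iterable, List


def from_selfie_to_tokens(selfie: str) -> List[str]:
    """Splits a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[.*?\]", selfie)


def compute_tokens_dict(
    selfies: Iterable[str], pad_token: str = "[nop]"
) -> Dict[str, int]:
    """Maps every token seen in `selfies` to an integer index.

    The padding token gets index 0 (an assumption of this sketch);
    the remaining tokens are sorted so that the mapping is deterministic.
    """
    vocabulary = {token for s in selfies for token in from_selfie_to_tokens(s)}
    vocabulary.discard(pad_token)
    ordered = [pad_token] + sorted(vocabulary)
    return {token: index for index, token in enumerate(ordered)}


tokens_dict = compute_tokens_dict(["[C][C][O]", "[C][=O]"])
# The JSON text that would be written to a tokens_*.json file.
serialized = json.dumps(tokens_dict, indent=2)
```

Sorting the vocabulary keeps the indices stable across runs, which matters because a trained model is only usable with the exact dictionary it was trained on.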
Model's definition
We implement a simple MLP Variational Autoencoder.
```python
# Inside the model's definition in src/models/vae.py

# Define the input length: length of a given SELFIES
# (always padded to be {max_length}), times the number of tokens
self.input_length = max_token_length * len(self.tokens_dict)

# Define the model
self.encoder = nn.Sequential(
    nn.Linear(self.input_length, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
)
self.encoder_mu = nn.Linear(256, latent_dim)
self.encoder_log_var = nn.Linear(256, latent_dim)

# The decoder, which outputs the logits of the categorical
# distribution over the vocabulary.
self.decoder = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, self.input_length),
)
```
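The input length defined above is the size of a flattened one-hot encoding: one block of vocabulary-sized entries per token position. The following pure-Python sketch shows that encoding under some assumptions: the helper `one_hot_flat`, the `[nop]` padding token and the row-major flattening order are illustrative choices, and the repository itself works with tensors rather than lists.

```python
import re
from typing import Dict, List


def one_hot_flat(
    selfie: str,
    tokens_dict: Dict[str, int],
    max_token_length: int = 20,
    pad_token: str = "[nop]",
) -> List[float]:
    """Pads a SELFIES to `max_token_length` tokens and returns its
    one-hot encoding, flattened position by position."""
    tokens = re.findall(r"\[.*?\]", selfie)
    if len(tokens) > max_token_length:
        raise ValueError("SELFIES is longer than max_token_length")
    tokens = tokens + [pad_token] * (max_token_length - len(tokens))

    n_tokens = len(tokens_dict)
    flat = [0.0] * (max_token_length * n_tokens)
    for position, token in enumerate(tokens):
        # Entry (position, token index) of the one-hot matrix, row-major.
        flat[position * n_tokens + tokens_dict[token]] = 1.0
    return flat


toy_tokens_dict = {"[nop]": 0, "[C]": 1, "[O]": 2}
x = one_hot_flat("[C][C][O]", toy_tokens_dict, max_token_length=5)
```

With 5 positions and 3 tokens, the vector has 15 entries, exactly one of which is set per position.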
After running src/training/training_models.py, a trained model is saved in data/trained_models/VAESelfies_TINY-CID-SELFIES-20.pt (download it here).
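This README does not state the training objective. As an assumption, a VAE of this shape would typically be trained by maximizing the evidence lower bound (ELBO), with a categorical likelihood over the tokens at each of the $L$ positions (given by the softmax of the decoder's logits) and a standard normal prior on the $d$-dimensional latent code. Here $\mu$ and $\log\sigma^2$ stand for the outputs of the two linear heads of the encoder:

```latex
\mathcal{L}(x)
  = \mathbb{E}_{q(z \mid x)}\left[\sum_{i=1}^{L} \log p(x_i \mid z)\right]
  - \mathrm{KL}\left(q(z \mid x) \,\|\, \mathcal{N}(0, I)\right),
\qquad
\mathrm{KL}\left(q(z \mid x) \,\|\, \mathcal{N}(0, I)\right)
  = -\frac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```

The first term is the negative cross-entropy between the one-hot input and the decoder's output, and the second is the closed-form KL divergence between a diagonal Gaussian and the standard normal.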
This model is nowhere close to state-of-the-art. The goal of this code is to have a toy latent space to run experiments in, or to have a starting point for your new SOTA models!
Some random samples
The script in src/exploration/explore_latent_space.py draws random samples from the trained model's latent space. Some of those samples are shown at the beginning of this README.
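Turning a decoded sample back into a SELFIES string amounts to picking a token at each position from the decoder's logits. The sketch below takes the most likely token per position and drops padding. The helper `logits_to_selfie`, the `[nop]` padding token and the toy logits are assumptions for illustration; the exploration script may sample from the categorical distribution instead of taking the argmax.

```python
from typing import Dict, List, Sequence


def logits_to_selfie(
    logits: Sequence[Sequence[float]],
    tokens_dict: Dict[str, int],
    pad_token: str = "[nop]",
) -> str:
    """Given one row of logits per token position, returns the SELFIES
    made of the most likely token at each position, without padding."""
    inverse = {index: token for token, index in tokens_dict.items()}
    tokens: List[str] = []
    for position_logits in logits:
        # Index of the largest logit at this position (argmax).
        best = max(range(len(position_logits)), key=position_logits.__getitem__)
        tokens.append(inverse[best])
    return "".join(token for token in tokens if token != pad_token)


toy_tokens_dict = {"[nop]": 0, "[C]": 1, "[O]": 2}
toy_logits = [
    [0.1, 2.0, -1.0],  # most likely: [C]
    [0.0, 0.5, 3.0],   # most likely: [O]
    [4.0, 0.0, 0.0],   # most likely: [nop] (padding)
]
selfie = logits_to_selfie(toy_logits, toy_tokens_dict)
```

In the model, the flat decoder output would first be reshaped to one row per position before a step like this is applied.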
Some TO-DOs
- [x] ~~Latent space optimization of MolSkill and QED using Evolutionary Strategies.~~
- [x] ~~Assess the lengths of all molecules in CID-SELFIES~~
- [ ] Scalable tokenizing using MolGen's tokenizer.
- [ ] Training on all of SMALL-CID-SELFIES-20.
- [ ] Latent space optimization of MolSkill and QED using Bayesian Optimization.
- [ ] Include other datasets, like GuacaMol or ZINC20.
- [ ] Better models, like an autoregressive VAE using LSTMs, or a transformer.
Cite this repository!
If you find this code useful, feel free to cite it!
```bibtex
@software{Gonzalez-Duque:VAESelfies:2023,
    author = {González-Duque, Miguel},
    title = {{Minimal implementation of a Variational Autoencoder on SELFIES representations of molecules}},
    url = {https://github.com/miguelgondu/minimal_VAE_on_selfies},
    version = {0.1},
    date = {2023-05-02}
}
```
[^4]: We would need a smarter tokenizer, like the one provided by MolGen. We'd also need to do better with the memory management.
Owner
- Name: Miguel González Duque
- Login: miguelgondu
- Kind: user
- Website: miguelgondu.com/about
- Repositories: 58
- Profile: https://github.com/miguelgondu
Mathematician with interests in optimization, probabilistic modeling, and differential geometry.
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "González-Duque"
    given-names: "Miguel"
title: "Minimal implementation of a Variational Autoencoder on SELFIES representations of molecules"
version: 0.1
date-released: 2023-05-02
url: "https://github.com/miguelgondu/minimal_VAE_on_selfies"
```
Dependencies
- CairoSVG ==2.7.0
- botorch ==0.8.4
- evotorch ==0.4.1
- gpytorch ==1.10
- matplotlib ==3.7.1
- numpy ==1.24.3
- pandas ==2.0.1
- rdkit ==2022.9.5
- seaborn ==0.12.2
- selfies ==2.1.1
- torch ==2.0.0
- black * development
- pytest * development