molecule-signature-paper

Code supporting the paper 'Getting Molecules from their Fingerprints: Generative Models vs. Deterministic Enumeration'

https://github.com/brsynth/molecule-signature-paper

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.8%) to scientific vocabulary

Keywords

chemistry chemoinformatics deterministic-enumeration fingerprint generative-model molecular-generation signature

Repository


Basic Info
  • Host: GitHub
  • Owner: brsynth
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 288 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 7
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme Changelog License Code of conduct Citation

README.md

Supporting content for the Molecule Signature paper


This repository contains code to support the Molecule Signature publication. See citation for details.

1. Repository structure

```text
.
├── data       < placeholder for data files >
├── notebooks  < supporting jupyter notebooks >
└── src        < source code for data preparation and modeling >
    └── paper
        ├── dataset
        └── learning
```

1.1. Datasets

The data directory is where the data files required by the code should be placed. The emolecules and metanetx subdirectories are created at execution time. See the data organization README for details.

1.2. Supporting Notebooks

The notebooks directory contains Jupyter notebooks that support figures and tables from the paper. The handy.py file contains utility functions that are used in some of the notebooks.

1.3. Source code

The src directory contains the source code for the paper. The code is organized in two main directories: dataset for preparing datasets and learning for training and using the generative model. See Usage for details on how to run the code.

2. Installation

The following steps will set up a signature-paper conda environment.

  1. Install Conda:

    The conda package manager is required. If you do not have it installed, download it from the official Conda website and follow the installation instructions given there. For example, on Windows, download the installer and run it. On macOS and Linux, you might use a command like:

    ```bash
    bash ~/Downloads/Miniconda3-latest-Linux-x86_64.sh
    ```

    Follow the prompts on the installer to complete the installation.

  2. Install dependencies:

    ```bash
    conda env create -f recipes/environment.yaml
    conda activate signature-paper
    pip install --no-deps -e .
    ```

  3. Download data:

    Precomputed alphabets, trained generative models and most important datasets are available as a Zenodo archive: https://zenodo.org/records/14760992. Extract the files and place them in the data directory.

  4. Optionally (for dev): install the signature package from source

    Installing the signature code from source is optional but may be useful for development. It lets you change the signature code and see the effects in the paper code without reinstalling the package.

    ```bash
    conda activate signature-paper

    # Remove the packaged version
    conda remove molecule-signature

    # Set up the source version
    git clone git@github.com:brsynth/molecule-signature.git lib/molecule-signature
    pushd lib/molecule-signature
    pip install --no-deps -e .
    popd
    ```
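The data download in step 3 can also be scripted. The following is a minimal sketch, not part of the repository, that unpacks an already-downloaded archive into the data directory using only the Python standard library. The helper name `extract_into_data` and the archive name in the usage note are hypothetical; the actual file names and layout of the Zenodo record may differ.

```python
import shutil
from pathlib import Path


def extract_into_data(archive: Path, data_dir: Path = Path("data")) -> list[Path]:
    """Unpack a downloaded archive into data_dir.

    Hypothetical helper: the archive name and its internal layout are
    assumptions, not taken from the Zenodo record.
    Returns every path present under data_dir after unpacking.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    # shutil.unpack_archive infers the format (zip, tar, tar.gz, ...) from the suffix.
    shutil.unpack_archive(str(archive), str(data_dir))
    return sorted(data_dir.rglob("*"))
```

For example, `extract_into_data(Path("~/Downloads/archive.zip").expanduser())` would unpack a hypothetical `archive.zip` into `./data`.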

3. Usage

3.1. Preparing datasets

The src/paper/dataset module contains code to prepare datasets for training and evaluation. The prepare command will create the datasets in the data directory.

  • Getting help

    ```bash
    python -u src/paper/dataset/prepare.py --help
    python -u src/paper/dataset/tokens.py --help
    ```

  • eMolecules dataset

    Prepare the emolecules dataset. Caution: the emolecules dataset is large; downloading and processing it may take a long time and requires substantial disk space and RAM.

    ```bash
    # Datasets
    python src/paper/dataset/prepare.py all --db emolecules --workers 10 --show_progress --split_method train_valid_split

    # Tokenizers
    python src/paper/dataset/tokens.py --db emolecules --token_min_presence 0.0001
    ```

  • MetaNetX dataset

    ```bash
    # Datasets
    python src/paper/dataset/prepare.py all --db metanetx --workers 10 --show_progress

    # Tokenizers
    python src/paper/dataset/tokens.py --db metanetx --token_min_presence 0.0001
    ```
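When both databases need to be prepared, the commands above can be chained from Python. The sketch below is one possible way to do so and is not part of the repository: the function names are hypothetical, and only the script paths and the `--db`, `--workers` and `--show_progress` options come from the commands above. Any other option (for example the split method or the token presence threshold) has to be supplied through `extra_args`.

```python
import subprocess
import sys
from typing import Sequence


def prepare_command(db: str, workers: int = 10, extra_args: Sequence[str] = ()) -> list[str]:
    """Build the dataset preparation command for one database."""
    return [sys.executable, "src/paper/dataset/prepare.py", "all",
            "--db", db, "--workers", str(workers), "--show_progress", *extra_args]


def tokens_command(db: str, extra_args: Sequence[str] = ()) -> list[str]:
    """Build the tokenizer command; extra_args are passed through unchanged."""
    return [sys.executable, "src/paper/dataset/tokens.py", "--db", db, *extra_args]


def run_pipeline(dbs: Sequence[str], dry_run: bool = True) -> list[list[str]]:
    """Prepare datasets, then tokenizers, for each database in order."""
    commands = []
    for db in dbs:
        commands.append(prepare_command(db))
        commands.append(tokens_command(db))
    if not dry_run:
        for cmd in commands:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return commands
```

With the default `dry_run=True`, `run_pipeline(["emolecules", "metanetx"])` only returns the command lists, which makes it easy to inspect them before launching a long run.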

3.2. Deterministic enumeration

Explanations and examples of how to use the deterministic enumeration code are provided in the notebooks folder.

3.3. Train generative models

Caution: settings may need to be adjusted depending on the available resources (e.g., GPU memory, disk space) and the HPC environment.

  • Getting help

    ```bash
    python -u src/paper/learning/train.py --help
    ```

  • Pre-train model (from eMolecules datasets)

    ```bash
    python src/paper/learning/train.py \
        --db emolecules \
        --source ECFP \
        --target SMILES \
        --dataloader_num_workers 3 \
        --enable_mixed_precision True \
        --ddp_num_nodes 1 \
        --split_method train_valid_split \
        --train_fold_index 0 \
        --data_max_rows -1 \
        --out_dir ${WORK}/hpc \
        --out_sub_dir NULL \
        --model_dim 512 \
        --model_num_encoder_layers 3 \
        --model_num_decoder_layers 3 \
        --scheduler_method plateau \
        --scheduler_lr 0.0001 \
        --scheduler_plateau_patience 1 \
        --scheduler_plateau_factor 0.1 \
        --early_stopping_patience 5 \
        --early_stopping_min_delta 0.0001 \
        --train_max_epochs 200 \
        --train_batch_size 128 \
        --train_accumulate_grad_batches 1 \
        --train_val_check_interval 1 \
        --train_seed 42 \
        --finetune false \
        --finetune_lr 0.0001 \
        --finetune_freeze_encoder false \
        --finetune_checkpoint None
    ```

  • Fine-tune model (from MetaNetX datasets, fold 0)

    ```bash
    python src/paper/learning/train.py \
        --db metanetx \
        --source ECFP \
        --target SMILES \
        --dataloader_num_workers 3 \
        --enable_mixed_precision True \
        --ddp_num_nodes 1 \
        --split_method kfold \
        --train_fold_index 0 \
        --data_max_rows -1 \
        --out_dir ${WORK}/hpc/llogs \
        --out_sub_dir FOLD_0 \
        --model_dim 512 \
        --model_num_encoder_layers 3 \
        --model_num_decoder_layers 3 \
        --scheduler_method plateau \
        --scheduler_lr 0.0001 \
        --scheduler_plateau_patience 2 \
        --scheduler_plateau_factor 0.1 \
        --early_stopping_patience 6 \
        --early_stopping_min_delta 0.0001 \
        --train_max_epochs 200 \
        --train_batch_size 128 \
        --train_accumulate_grad_batches 1 \
        --train_val_check_interval 1 \
        --train_seed 42 \
        --finetune true \
        --finetune_lr 0.0001 \
        --finetune_freeze_encoder true \
        --finetune_checkpoint <path_to_pretrain_model_checkpoint>
    ```
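The fine-tuning command above targets fold 0. Assuming the other folds are trained with the same command and only `--train_fold_index` and `--out_sub_dir` change (following the `FOLD_0` naming shown above), the per-fold arguments could be generated as in the sketch below. The helper name `fold_overrides` and the number of folds are hypothetical; the source does not state how many folds the kfold split uses.

```python
def fold_overrides(n_folds: int) -> list[list[str]]:
    """Return the arguments that differ between per-fold fine-tuning runs."""
    if n_folds < 1:
        raise ValueError("n_folds must be at least 1")
    return [
        ["--train_fold_index", str(i), "--out_sub_dir", f"FOLD_{i}"]
        for i in range(n_folds)
    ]


# n_folds=5 is a hypothetical value; use the number of folds of your own split.
for args in fold_overrides(5):
    print(" ".join(args))
```

Each printed line can replace the corresponding two options in the fine-tuning command, leaving all other settings unchanged.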

3.4. Predict molecules with generative models

The src/paper/learning/predict.py script generates molecules from the trained models. It requires a trained model checkpoint and a tokenizer, both of which can be downloaded from the Zenodo archive (see Installation).

  • Generate molecules from the fine-tuned model

    ```bash
    python src/paper/learning/predict.py \
        --model_path "data/models/finetuned.ckpt" \
        --model_source_tokenizer "data/tokens/ECFP.model" \
        --model_target_tokenizer "data/tokens/SMILES.model" \
        --pred_mode "beam"
    ```

  • Getting help

    ```bash
    python -u src/paper/learning/predict.py --help
    ```
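Because the prediction script depends on files that must first be downloaded, it can help to check for them before launching it. The sketch below is an illustrative wrapper, not part of the repository: the function name `predict_command` is hypothetical, while the script path and option names are the ones used in the command above.

```python
import sys
from pathlib import Path


def predict_command(model: Path, source_tok: Path, target_tok: Path,
                    mode: str = "beam") -> list[str]:
    """Build the predict.py command after checking that every input file exists."""
    missing = [str(p) for p in (model, source_tok, target_tok) if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))
    return [
        sys.executable, "src/paper/learning/predict.py",
        "--model_path", str(model),
        "--model_source_tokenizer", str(source_tok),
        "--model_target_tokenizer", str(target_tok),
        "--pred_mode", mode,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.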

4. Reproduce results (notebooks)

Deterministic enumeration

Generative models

  • Accuracies from the generative model (Tables 1 and 2) are computed using the 4.generation_recovery.ipynb notebook. This notebook also provides examples of how to use the predict.py script to generate molecules and evaluate their accuracy.

  • Accuracies from the cross comparisons between generative models (the present model and MolForge models) and test datasets are computed using the 5.generative_molforge.ipynb notebook.
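As a rough illustration of what such an accuracy can measure, the sketch below computes a top-k accuracy: the fraction of queries whose reference SMILES appears among the first k predicted candidates. This definition is an assumption made for illustration; the notebooks may define and compute accuracy differently, for instance after canonicalizing SMILES, which this sketch does not do. The function name `top_k_accuracy` and the example molecules are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(references: Sequence[str],
                   predictions: Sequence[Sequence[str]],
                   k: int = 1) -> float:
    """Fraction of references found among their first k predicted candidates.

    Strings are compared exactly, so inputs are assumed to be canonical SMILES.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        return 0.0
    hits = sum(ref in preds[:k] for ref, preds in zip(references, predictions))
    return hits / len(references)


# Toy data: the second reference is only recovered as the second candidate.
refs = ["CCO", "c1ccccc1"]
preds = [["CCO", "CCN"], ["C1CCCCC1", "c1ccccc1"]]
print(top_k_accuracy(refs, preds, k=1))  # 0.5
print(top_k_accuracy(refs, preds, k=2))  # 1.0
```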

Analyses

5. Citation

Meyer, P., Duigou, T., Gricourt, G., & Faulon, J.-L. Reverse Engineering Molecules from Fingerprints through Deterministic Enumeration and Generative Models. In preparation.

Owner

  • Name: BioRetroSynth
  • Login: brsynth
  • Kind: organization

Our group is interested in synthetic biology and systems metabolic engineering in whole-cell and cell-free systems.

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this code, please cite it as below.
authors:
  - family-names: "Duigou"
    given-names: "Thomas"
    orcid: "https://orcid.org/0000-0002-2649-2950"
  - family-names: "Meyer"
    given-names: "Philippe"
    orcid: "https://orcid.org/0000-0002-0618-2947"
  - family-names: "Gricourt"
    given-names: "Guillaume"
    orcid: "https://orcid.org/0000-0003-0143-5535"
  - family-names: "Faulon"
    given-names: "Jean-Loup"
    orcid: "https://orcid.org/0000-0003-4274-2953"
title: "Supporting content for the 'Reverse Engineering Molecules from Fingerprints through Deterministic Enumeration and Generative Models' paper."
version: 3.0.0
doi: TO DEFINE
date-released: 2025-01-31
url: "https://github.com/brsynth/molecule-signature-paper"
preferred-citation:
  type: article
  authors:
  - family-names: "Meyer"
    given-names: "Philippe"
    orcid: "https://orcid.org/0000-0002-0618-2947"
  - family-names: "Duigou"
    given-names: "Thomas"
    orcid: "https://orcid.org/0000-0002-2649-2950"
  - family-names: "Gricourt"
    given-names: "Guillaume"
    orcid: "https://orcid.org/0000-0003-0143-5535"
  - family-names: "Faulon"
    given-names: "Jean-Loup"
    orcid: "https://orcid.org/0000-0003-4274-2953"
  doi:
  journal: in preparation
  month:
  start:
  end:
  title: "Reverse Engineering Molecules from Fingerprints through Deterministic Enumeration and Generative Models."
  issue:
  volume:
  year:

GitHub Events

Total
  • Release event: 2
  • Delete event: 5
  • Public event: 1
  • Push event: 25
  • Pull request event: 1
  • Create event: 2
Last Year
  • Release event: 2
  • Delete event: 5
  • Public event: 1
  • Push event: 25
  • Pull request event: 1
  • Create event: 2

Dependencies

.github/workflows/release.yml actions
  • actions/checkout v4 composite
  • softprops/action-gh-release v2 composite
pyproject.toml pypi
environment.yaml pypi
  • scikit-learn *