structure-seer

The implementation, training and evaluation of a Structure Seer machine learning model designed for reconstruction of adjacency of a molecular graph from the labelling of its nodes.

https://github.com/quantori/structure-seer

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.9%) to scientific vocabulary

Keywords

cheminformatics graph graph-convolutional-network machine-learning ml molecular-graph molecular-graph-learning molecule molecule-generation nmr-data nmr-spectroscopy
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: quantori
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 43.2 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created about 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme · License · Code of conduct · Citation

README.md

DOI:10.1039/D3DD00178D

PyTorch

Structure Seer

The implementation, training and evaluation of the Structure Seer model, designed to reconstruct the adjacency of a molecular graph from the labelling of its nodes. The model architecture is described in detail in: Structure Seer - a machine learning model for chemical structure elucidation from a node labelling of a molecular graph, Digital Discovery, 2023.

Datasets

The repository does not contain the initial datasets used for training.
  • Small example datasets for detailed model evaluation are provided in ./example_dataset
  • Model weights trained on the QM9 and PubChem datasets are stored in ./weights

Abstract

The repository contains the implementation of a novel graph-convolution-based machine learning model designed to give a quantitative, probabilistic prediction of atom connectivity from the elemental composition of the molecule and a list of atom-attributed isotropic shielding constants. The approach has significant potential for scalability, because the model can learn from the vast amount of available information on known chemical structures. The architecture reconstructs the structure directly by predicting the molecular graph adjacency solely from the labelling of its nodes, which could allow it to handle molecules of any size and composition (given an appropriate training dataset) without a significant increase in the computational resources required.

Key approaches

Unification of adjacency matrix representation

The primary challenge in generating the adjacency matrix is that it is not invariant for a given graph: a graph with G nodes can be described by up to G! adjacency matrices, one for each ordering of its nodes. To tackle this issue, the adjacency matrix representation needs to be unified. Typically, in the machine-readable representation of a molecule, the atoms are stored in depth-first traversal order. While this order carries information about the stored structure, the order itself cannot easily be reconstructed when only the elemental composition of the molecule and the isotropic shielding constant of each atom are known. Since the shielding constant provides a unique characterization of an atom's chemical environment, it can be used together with the element information to standardize the representation of the adjacency matrix.
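
The idea of unifying the representation can be illustrated with a minimal sketch. It assumes that atoms are ordered by atomic number first and by shielding constant second; the ordering actually used in the repository may differ. The function name `unify_labelling` and the shielding values are hypothetical and chosen only for illustration.

```python
def unify_labelling(elements, shieldings, adjacency):
    """Sort atoms by (atomic number, shielding) and permute the adjacency matrix to match."""
    n = len(elements)
    # Assumed ordering key: atomic number first, then shielding constant.
    order = sorted(range(n), key=lambda i: (elements[i], shieldings[i]))
    sorted_elements = [elements[i] for i in order]
    sorted_shieldings = [shieldings[i] for i in order]
    # Apply the same permutation to both rows and columns.
    sorted_adjacency = [[adjacency[i][j] for j in order] for i in order]
    return sorted_elements, sorted_shieldings, sorted_adjacency


# Toy formaldehyde-like molecule; the shielding values are made up.
elements = [6, 8, 1, 1]  # C, O, H, H
shieldings = [10.0, -300.0, 22.0, 21.5]
adjacency = [
    [0, 2, 1, 1],
    [2, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

# The same molecule with its atoms listed in a different order.
perm = [2, 0, 3, 1]
elements_p = [elements[i] for i in perm]
shieldings_p = [shieldings[i] for i in perm]
adjacency_p = [[adjacency[i][j] for j in perm] for i in perm]

unified_a = unify_labelling(elements, shieldings, adjacency)
unified_b = unify_labelling(elements_p, shieldings_p, adjacency_p)
print(unified_a == unified_b)  # both orderings give the same unified form
```

Because the sort key depends only on the node labels, any initial atom ordering maps to the same matrix as long as no two atoms of the same element share a shielding value.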

Generic adjacency matrix

The architecture of the Structure Seer model bears similarities to other GCN-based models used for diverse tasks involving molecular graphs. However, its distinctive design is centred around encoding the molecule solely based on node labelling, which allows for the generation of the complete adjacency matrix. This feature makes the considered architecture applicable to a broad range of atom adjacency reconstruction tasks.
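
As a rough illustration of what generating a complete adjacency matrix can involve, the sketch below decodes per-pair bond probabilities into a symmetric matrix of bond orders. The input layout (`pair_probs[i][j][k]` as the probability that atoms i and j share a bond of order k, with k = 0 meaning no bond) is an assumption made for this example and is not taken from the repository; `decode_adjacency` is a hypothetical helper.

```python
def decode_adjacency(pair_probs):
    """Turn per-pair bond-order probabilities into a symmetric adjacency matrix."""
    n = len(pair_probs)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Average both directions so the decoded matrix is symmetric.
            averaged = [(p + q) / 2 for p, q in zip(pair_probs[i][j], pair_probs[j][i])]
            # Pick the most probable bond order for this atom pair.
            bond_order = max(range(len(averaged)), key=averaged.__getitem__)
            adjacency[i][j] = adjacency[j][i] = bond_order
    return adjacency


# Three atoms, three classes (no bond, single, double); values are illustrative.
no_bond, single, double = [0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]
diag = [1.0, 0.0, 0.0]  # diagonal entries are ignored by the decoder
pair_probs = [
    [diag, single, no_bond],
    [single, diag, double],
    [no_bond, double, diag],
]
decoded = decode_adjacency(pair_probs)
print(decoded)  # [[0, 1, 0], [1, 0, 2], [0, 2, 0]]
```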

Training

The training procedure is described in the Jupyter notebook ./training.ipynb. Customize the procedure by adjusting the global variables in the second code cell. The source code of the main training function is in ./training/train_model.py.

To train the model in Google Colab, extract the repository to Google Drive under ./MyDrive.
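
A Colab session typically needs Google Drive mounted before the notebook can see the repository. The sketch below shows one way to do this; the folder name `structure-seer` under MyDrive is an assumption, so adjust `REPO_DIR` to wherever the repository was extracted.

```python
import os
import sys

# Hypothetical location: assumes the repository was extracted to MyDrive/structure-seer.
REPO_DIR = "/content/drive/MyDrive/structure-seer"


def in_colab() -> bool:
    """Return True when running inside a Google Colab runtime."""
    return "google.colab" in sys.modules


if in_colab():
    from google.colab import drive

    drive.mount("/content/drive")  # prompts for authorization on first use
    os.chdir(REPO_DIR)             # make relative paths such as ./weights resolve
    sys.path.insert(0, REPO_DIR)   # allow imports from the repository
```

Outside Colab the guarded block is skipped, so the same cell can be left in place when the notebook runs locally.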

Evaluation

To evaluate the model, use ./model_evaluation.ipynb with the pretrained model weights. Small example datasets for detailed model evaluation are provided in ./example_dataset.
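
One simple way to compare a predicted adjacency matrix with a reference one is to count matching atom pairs. The sketch below is illustrative only: `adjacency_agreement` is a hypothetical helper and is not necessarily one of the metrics computed in ./model_evaluation.ipynb.

```python
def adjacency_agreement(predicted, reference):
    """Return (fraction of atom pairs with matching entries, exact-match flag)."""
    n = len(reference)
    # Only the upper triangle is needed because adjacency matrices are symmetric.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0, True
    matches = sum(predicted[i][j] == reference[i][j] for i, j in pairs)
    return matches / len(pairs), matches == len(pairs)


reference = [[0, 1, 0], [1, 0, 2], [0, 2, 0]]
predicted = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # one bond order is wrong
pair_accuracy, exact_match = adjacency_agreement(predicted, reference)
print(pair_accuracy, exact_match)
```

Here two of the three atom pairs agree, so the pair accuracy is 2/3 and the exact-match flag is False.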

Code examples

Explore model usage and functionality in ./structure_seer_code_examples.ipynb, which includes illustrative examples.

Owner

  • Name: Quantori
  • Login: quantori
  • Kind: organization
  • Email: contact@quantori.com
  • Location: United States of America

Quantori is the premier digital IT services partner for life science and healthcare companies around the world.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Sapegin"
    given-names: "Denis A"
    orcid: "https://orcid.org/0000-0002-1446-6288"
title: "Structure Seer"
preferred-citation:
  type: article
  authors:
    - family-names: "Sapegin"
      given-names: "Denis Andzheevich"
      orcid: "https://orcid.org/0000-0002-1446-6288"
    - family-names: "Bear"
      given-names: "Joseph C"
  title: "Structure Seer – a machine learning model for chemical structure elucidation from node labelling of a molecular graph"
  doi: 10.1039/D3DD00178D
  journal: "Digital Discovery"
  year: 2024

Committers

Last synced: about 1 year ago

All Time
  • Total Commits: 23
  • Total Committers: 2
  • Avg Commits per committer: 11.5
  • Development Distribution Score (DDS): 0.043
Past Year
  • Commits: 17
  • Committers: 1
  • Avg Commits per committer: 17.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Membrizard (d****n@q****m): 22 commits
  • Roman Maksimov (6****v): 1 commit

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 month
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Pull Request Authors
  • Membrizard (1)