umda

Unsupervised molecule discovery in astrophysics project

https://github.com/laserkelvin/umda

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Unsupervised molecule discovery in astrophysics project

Basic Info
  • Host: GitHub
  • Owner: laserkelvin
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 250 MB
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created over 5 years ago · Last pushed over 3 years ago
Metadata Files
README · License · Citation

README.md

Unsupervised Molecule Discovery in Astrophysics

umap-image

Applying cheminformatics to astrochemistry

This repository includes the notebooks and codebase for developing machine learning pipelines that apply cheminformatics concepts to predicting astrochemical properties.

The current focus is on predicting molecular column densities from astronomical observations, but the approach can potentially be applied to laboratory data, as well as to studying chemical networks. As it stands, the code has been tested on up to four million molecules on a Dell XPS 15 (32 GB RAM, 6-core i7-9750H) without much difficulty, thanks to frameworks like dask that abstract away much of the parallelization and out-of-memory operation.

As a point of clarification: "unsupervised" in the title refers to the fact that the molecule feature vectors are learned using mol2vec, which is unsupervised. The act of predicting column densities requires training a supervised model. I think the former is more exciting in terms of development than the latter.

If you use the list of recommendations generated from this work as part of your own observations or research, please cite the Zenodo repository DOI and the paper once it is published. In the meantime, please cite this repository.

Installation

Currently, the codebase is not quite ready for public consumption: while the API more or less works as intended, there's still a bit of fussing around with model training and deployment. If you would like to contribute to this aspect, please raise an issue in this repository!

The `environment` recipe in the Makefile should recreate the software environment needed for umda to work. Simply run `make environment` to set everything up automatically; you can then activate the environment with `conda activate umda`.

Instructions

Currently the user API is underdeveloped, so if you would like to run your own predictions the process is somewhat manual. As part of the repository, we've included a pretrained embedding model, as well as a host of regressors stored as pickles dumped with joblib.

Here is an example of the bare minimum code needed to load the models and predict the column densities of benzene and formaldehyde with the gradient boosting regressor:

```python
from joblib import load
import numpy as np

from umda.data import load_pipeline

# load a wrapper class for generating embeddings
embedder = load_pipeline()
regressors = load("models/regressors.pkl")

# get the gradient boosting regressor
regressor = regressors.get("gbr")

smiles = ["c1ccccc1", "C=O"]
vecs = np.vstack([embedder.vectorize(smi) for smi in smiles])
regressor.predict(vecs)
```

The full workflow

The pieces of this project are modular, comprising a word2vec embedder model and any given regressor, and the workflow involves putting together these pieces.

The embedding model

  1. Collect all the SMILES strings you have and put them into a single .smi file; scripts/pool_smiles.py gives an example of this.
  2. Train the mol2vec model using these SMILES; scripts/make_mol2vec.py shows how this is done.
  3. Set up an embedding pipeline: we want to transform SMILES to vectors and, optionally, perform dimensionality reduction. The script scripts/embedding_pipeline.py will do this, and serialize a pretrained convenience class, EmbeddingModel.
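The shape of step 3 can be sketched with a scikit-learn `Pipeline`. The stand-in vectorizer below (a random 300-dimensional embedding, mol2vec's usual size) is purely illustrative — the real pipeline uses the trained mol2vec model and the repository's `EmbeddingModel` class — but the composition with optional dimensionality reduction and the joblib serialization round-trip work the same way:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(42)

# Stand-in for the mol2vec step: map each SMILES string to a fixed
# 300-dimensional vector. A real pipeline would look up the learned
# substructure vectors for each molecule instead.
def fake_vectorize(smiles_list):
    return np.vstack([rng.standard_normal(300) for _ in smiles_list])

pipeline = Pipeline([
    ("embed", FunctionTransformer(fake_vectorize)),
    ("reduce", PCA(n_components=2)),  # optional dimensionality reduction
])

vecs = pipeline.fit_transform(["c1ccccc1", "C=O", "C#N", "CCO"])
print(vecs.shape)  # (4, 2)

# Serialize the fitted pipeline for later reuse, as the scripts do.
path = os.path.join(tempfile.gettempdir(), "embedding_pipeline.pkl")
dump(pipeline, path)
restored = load(path)
```

Keeping the embedder and reducer in one serialized pipeline means downstream code only needs a single `transform` call to go from SMILES to feature vectors.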

The regressors

With an embedding pipeline in hand, the next step is to train a regressor to predict whatever astrochemical property you desire. I advise setting up a .csv file or other machine-readable format that holds all of the molecules and column densities. As part of the regression pipeline, one may also optionally want to perform feature preprocessing, and I recommend setting up a composable sklearn.pipeline model. Most of this is done in notebooks/estimator_training, which calls on functions in the umda.data and umda.training modules.
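A composable regression pipeline of the kind described above might look like the sketch below. The synthetic arrays stand in for precomputed embeddings and measured column densities (a real run would load these from the .csv), and the scaler-plus-regressor layout is one reasonable choice, not the repository's exact configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-ins: 200 molecule embeddings (70-dim) and their column densities.
X = rng.standard_normal((200, 70))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Composable pipeline: feature preprocessing, then the regressor.
model = Pipeline([
    ("scale", StandardScaler()),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)
print(f"held-out R^2: {r2:.2f}")
```

Because preprocessing lives inside the pipeline, the fitted scaler is applied consistently at prediction time and the whole model can be pickled as one object, matching how the repository's regressors are shipped.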

Generating recommendations

This step doesn't strictly need trained regressors, but generally you'd be interested in predicting the recommendations' abundances anyway. The script scripts/tmc1_recommendations.py shows how to do this: essentially, you compute the pairwise distance between every molecule in your source (TMC-1) and those in your precomputed database of embeddings, and return the closest unique matches. The last step in this script grabs a regressor and predicts the recommendations' column densities. You'll likely need to filter the list to exclude molecules you don't think are plausible: this is still a pitfall, because the distance metric is a reduction of comparisons in high dimensions, and in particular you are likely to end up with things like small diatomic heavy metals (because they are structurally similar to species like CH and CN). Coming up with a semantic model for recommendation wouldn't be too difficult, and is left to the reader.
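The pairwise-distance step can be sketched as follows. The random arrays stand in for the source (TMC-1) embeddings and the precomputed database, and cosine distance is an assumption — the actual script may use a different metric:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(1)

# Stand-ins: embeddings for molecules detected in the source, and for
# the full precomputed database of candidate molecules.
source_vecs = rng.standard_normal((5, 64))
database_vecs = rng.standard_normal((1000, 64))

# Rows index source molecules, columns index database entries.
dists = pairwise_distances(source_vecs, database_vecs, metric="cosine")

# For each source molecule, take its closest database entry, then keep
# the unique set as the recommendation candidates.
nearest = dists.argmin(axis=1)
recommendations = np.unique(nearest)
print(f"{recommendations.size} unique candidates from {len(source_vecs)} source molecules")
```

The candidate indices would then be mapped back to molecules, filtered for plausibility, and fed to a trained regressor for column-density predictions.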


Project based on the cookiecutter data science project template. #cookiecutterdatascience

This version of the cookiecutter template is modified by Kelvin Lee.

Owner

  • Name: Kelvin Lee
  • Login: laserkelvin
  • Kind: user
  • Location: Hillsboro, OR
  • Company: Intel Corporation

HPC AI/ML engineer at Intel AXG. Previously postdoc at MIT, CfA/SAO. Interests in astrochemistry and physical chemistry.

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • laserkelvin (1)
Top Labels
Issue Labels
Pull Request Labels