Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.7%) to scientific vocabulary
Repository
Unsupervised molecule discovery in astrophysics project
Basic Info
- Host: GitHub
- Owner: laserkelvin
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Size: 250 MB
Statistics
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Unsupervised Molecule Discovery in Astrophysics

Applying cheminformatics to astrochemistry
This repository includes notebooks and codebase for developing machine learning pipelines that apply cheminformatics concepts to predicting astrochemical properties.
The current focus is on molecular column densities in astronomical observations,
but the approach can potentially be applied to laboratory data, as well as to
studying chemical networks. As it stands, the code has been tested to work for up
to four million molecules on a Dell XPS 15 (32 GB RAM, 6-core i7-9750H) without
much difficulty, thanks to frameworks like dask that abstract away a large amount
of the parallelization and out-of-memory operations.
As a point of clarification: "unsupervised" in the title refers to the fact that the
molecule feature vectors are learned using mol2vec, which is unsupervised. The
act of predicting column densities requires training a supervised model. I think
the former is more exciting in terms of development than the latter.
If you use the list of recommendations generated from this work as part of your
own observations or work, please cite the Zenodo repository and the paper once it comes out.
In the meantime, please cite this repository.
Installation
Currently, the codebase is not quite ready for public consumption: while the API more or less works as intended, there's still a bit of fussing around with model training and deploying. If you would like to contribute to this aspect, please raise an issue in this repository!
The Makefile environment recipe should recreate the software environment
needed for umda to work. Simply run make environment to set everything
up automatically; you can then activate the environment with conda activate umda.
Instructions
Currently the user API is underdeveloped, so running your own predictions is
somewhat manual. As part of the repository, we've included a pretrained
embedding model, as well as a host of regressors stored as pickles
dumped using joblib.
Here is an example of the bare minimum code one needs to run the model and predict the column density of benzene and formaldehyde using a gradient boosting regressor:
```python
from joblib import load
import numpy as np

from umda.data import load_pipeline

# load a wrapper class for generating embeddings
embedder = load_pipeline()
# load the dictionary of pretrained regressors
regressors = load("models/regressors.pkl")
# get the gradient boosting regressor
regressor = regressors.get("gbr")

# SMILES strings for benzene (Kekulé form) and formaldehyde
smiles = ["C1=CC=CC=C1", "C=O"]
# stack the per-molecule embeddings into a 2D array, one row per molecule
vecs = np.vstack([embedder.vectorize(smi) for smi in smiles])
regressor.predict(vecs)
```
The full workflow
The pieces of this project are modular, comprising a word2vec embedder model
and any given regressor, and the workflow involves putting together these pieces.
The embedding model
- Collect all the SMILES strings you have, and put them into a single .smi file. The script scripts/pool_smiles.py gives an example of this.
- Train the mol2vec model using these SMILES. The script scripts/make_mol2vec.py shows how this is done.
- Set up an embedding pipeline: we want to transform SMILES to vectors and, optionally, perform dimensionality reduction. The script scripts/embedding_pipeline.py will do this, and serialize a pretrained convenience class, EmbeddingModel.
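As a rough illustration of the first step, the sketch below pools SMILES strings from several sources into one deduplicated .smi file. The helper name pool_smiles and the one-SMILES-per-line layout are assumptions made for this example; this is not the code in scripts/pool_smiles.py.

```python
from pathlib import Path
from typing import Iterable, List


def pool_smiles(sources: Iterable[Iterable[str]], output: Path) -> List[str]:
    """Hypothetical helper: merge SMILES from several iterables of lines,
    drop blanks and exact duplicates, and write one SMILES per line."""
    seen = set()
    pooled = []
    for source in sources:
        for line in source:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            smi = parts[0]  # .smi lines may carry a name after the SMILES
            if smi not in seen:
                seen.add(smi)
                pooled.append(smi)
    output.write_text("\n".join(pooled) + "\n")
    return pooled
```

Note that this only removes exact string duplicates. Two different SMILES strings for the same molecule would both be kept unless they are canonicalized first, for example with a cheminformatics toolkit.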
The regressors
With an embedding pipeline in hand, the next step is to train a regressor to predict whatever astrochemical property you desire. I advise you set up a .csv file or other machine readable format that holds all of the molecules and column densities. As part of the regression pipeline, one may also optionally want to perform feature preprocessing, and I recommend setting up a composable sklearn.pipeline model. Most of this is done in notebooks/estimator_training, and calls on functions in the umda.data and umda.training modules.
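A minimal sketch of such a composable pipeline, assuming scikit-learn and NumPy are installed. The random arrays stand in for embedding vectors and column densities, and the choice of StandardScaler followed by Ridge is purely illustrative; it is not necessarily what notebooks/estimator_training does.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder data: 50 molecules with 16-dimensional embeddings, and a
# synthetic target standing in for (log) column densities.
X = rng.normal(size=(50, 16))
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=50)

# Preprocessing and regression composed into a single estimator, so the
# same scaling is applied at training and at prediction time.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)

predictions = model.predict(X[:5])
```

Bundling the preprocessing into the estimator means the fitted object can be serialized as one unit, with no scaling step to forget when predicting on new molecules.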
Generating recommendations
This step doesn't really need trained regressors, but generally you'd be interested in predicting the abundances of the recommended molecules anyway. The script scripts/tmc1_recommendations.py shows how to do this: essentially you compute the pairwise distance between every molecule in your source (TMC-1) and those in your precomputed database of embeddings, and return the closest unique matches. The last step in this script grabs a regressor and predicts the recommendations' column densities. You'll likely need to filter the list to exclude molecules containing elements you don't think are likely: this is still a pitfall because the distance metric is a reduction of comparisons in high dimensions, and in particular you are likely to end up with things like small diatomic heavy metals (because they are structurally similar to things like CH and CN). Coming up with a semantic model for recommendation wouldn't be too difficult, and is left to the reader.
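The distance-based lookup described here can be sketched with NumPy alone. The function name recommend, the parameter k, and the use of Euclidean distance are assumptions for illustration; the metric and logic in scripts/tmc1_recommendations.py may differ.

```python
import numpy as np


def recommend(source_vecs, database_vecs, k=2):
    """Illustrative only: return database row indices closest to any
    source molecule, unique and ordered by their smallest distance."""
    # Pairwise Euclidean distances, shape (n_source, n_database)
    diff = source_vecs[:, None, :] - database_vecs[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    # k nearest database entries for each source molecule
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Merge the per-molecule lists, keeping each database entry once
    candidates = np.unique(nearest)
    # Rank candidates by their best (smallest) distance to the source set
    best = dists[:, candidates].min(axis=0)
    return candidates[np.argsort(best)]
```

This broadcasting approach materializes the full distance matrix, so it only suits small inputs; for a database of millions of embeddings a chunked or out-of-core computation would be needed. It also does not exclude source molecules that already appear in the database, which would show up as zero-distance matches.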
Project based on the cookiecutter data science project template. #cookiecutterdatascience
This version of the cookiecutter template is modified by Kelvin Lee.
Owner
- Name: Kelvin Lee
- Login: laserkelvin
- Kind: user
- Location: Hillsboro, OR
- Company: Intel Corporation
- Website: https://laserkelvin.github.io
- Twitter: cmmmsubmm
- Repositories: 95
- Profile: https://github.com/laserkelvin
HPC AI/ML engineer at Intel AXG. Previously postdoc at MIT, CfA/SAO. Interests in astrochemistry and physical chemistry.
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- laserkelvin (1)