orion-kl-ml-main

ML analysis of Orion KL

https://github.com/humblesituation164/orion-kl-ml-main

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
    Links to: iop.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

ML analysis of Orion KL

Basic Info
  • Host: GitHub
  • Owner: HumbleSituation164
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 2.19 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created almost 3 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

NOTE: this repository is currently under construction - please check back for updates

Explaining chemical inventories with machine learning


This repository includes notebooks and the codebase for the machine learning pipelines developed to predict chemical abundances of unobserved interstellar species in the Orion Kleinmann-Low (Orion KL) nebula and its environments.

The primary focus of this work is generating base predictions of column densities for targets of interest in molecular surveys. Inspiration was drawn from Lee et al. (2021), applying similar supervised and unsupervised methodologies to a source with more physical and chemical complexity. We use a workflow similar to that of Lee+ 2021, with modifications to the embedding pipeline. Installation and workflow instructions are provided below, and full working examples are provided as notebooks accessible through this repo.

The corresponding dataset can be downloaded via Zenodo. If you use the results generated from this work as part of your own research, please cite the Zenodo repository and the accepted paper once published (stay tuned!). In the meantime, please cite this repository.

Requirements

This package requires Python 3.8+, as the embedding package uses some decorators only available after 3.7.

Installation

Conda environment

  1. Use conda to install from the YAML specification with conda env create -f conda.yml
  2. Activate the environment by typing conda activate orion-kl
  3. Install the Python requirements using poetry install

You can test that the environment is working by running ipython and trying:

```python
from orion_kl_ml import embed_smiles, embed_smiles_batch

# by default returns a NumPy array
embed_smiles("c1ccccc1")

# if you want to work with torch.Tensors
embed_smiles("c1ccccc1", numpy=False)

# operate on a list of SMILES
smiles = ["c1ccccc1", "CC#N", "C#CC#CC#CC"]
embed_smiles_batch(smiles)
```

This uses the newly developed seq2seq model, VICGAE. Here is a quick installation and use guide to get you started, but see the repo for more details if you're interested in development or training your own model.

VICGAE molecular embedder installation and usage

astrochem_embedding can be installed quickly through PyPI:

```shell
$ pip install astrochem_embedding
```

You can check the environment using a pre-trained model:

```python
from astrochem_embedding import VICGAE
import torch

model = VICGAE.from_pretrained()
model.embed_smiles("c1ccccc1")
```


Instructions

Users can generate their own predictions, although the process is currently manual. The repository includes a pretrained embedding model, as well as two regressors stored as pickles dumped using joblib. With some modification, other regressors available through scikit-learn can be trained using our pipeline.

Project workflow

Our goal was to make this project modular so that the pieces (embedder, regressor, dataset) can be swapped out to fit the user's needs.

Molecule embedding

  1. Collect all SMILES strings for your molecules and put them into a single .csv file. Alternatively, the VICGAE embedder also accepts SELFIES strings, so you could compile your .csv file with SELFIES instead if you prefer.
  2. Transform your SMILES strings into vectors using the embedding pipeline. The script pipeline/embed_molecules.py contains the functions you will need to do this, or you can check out the notebook for a working example. Note: if you would like to use the pretrained VICGAE embedder, you can fork the repo at laserkelvin/astrochem_embedding.
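As a self-contained sketch of these two steps (the `embed_smiles_batch` below is a placeholder standing in for the repository's real embedding function, and the inline CSV is made-up example data):

```python
import csv
import io

# Step 1: a CSV of molecules with their SMILES strings (shown inline here;
# in practice this would be a file on disk).
csv_text = "name,smiles\nbenzene,c1ccccc1\nmethyl cyanide,CC#N\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
smiles = [row["smiles"] for row in rows]

# Step 2: turn each SMILES string into a vector. This placeholder returns a
# trivial one-dimensional "embedding"; swap in the real embedding pipeline
# (pipeline/embed_molecules.py) for actual use.
def embed_smiles_batch(smiles_list):
    return [[float(len(s))] for s in smiles_list]

vectors = embed_smiles_batch(smiles)
print(len(vectors))  # one vector per molecule
```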

Training the regressors

With the embedding pipeline set up and the molecular embedding vectors acquired, we can now train a regressor to predict the column density of your molecule (or molecules) of choice. We advise you set up a .csv file or other machine-readable format that holds all of the molecules, their embeddings, column densities, and other physical features as needed (e.g. velocity components, line widths). The codebase for model training will become available in the future; in the meantime you can use one of the pretrained regressors stored as a pickle under models/. You can also train (with modification) other regressors available through scikit-learn. Future notebooks will provide a working example if you would like to go this route.
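As an illustrative sketch of this step on synthetic data (the embedding dimension, regressor choice, and values below are placeholders, not the repository's actual training setup):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder training set: 100 molecules with 64-dimensional embedding
# vectors and log10 column densities. In practice these come from your
# machine-readable table of molecules, embeddings, and physical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
y = rng.normal(loc=13.0, scale=1.0, size=100)

# Any scikit-learn regressor with a fit/predict interface can slot in here.
regressor = GradientBoostingRegressor(random_state=0)
regressor.fit(X, y)

preds = regressor.predict(X[:5])
print(preds.shape)  # one prediction per molecule
```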

Predicting column densities

Once you have a trained regressor of choice, you can compile your molecules of interest into a .csv file with their SMILES and/or SELFIES strings. The script pipeline/predictions.py will produce the simulated parameters for your targets and generate predictions for a chosen region in Orion KL.
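A minimal sketch of loading a pickled regressor and predicting column densities for new targets (the model here is a stand-in trained on random data, not one of the repository's models/ pickles):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in: train and pickle a trivial regressor, mimicking the joblib
# pickles the repository stores under models/.
rng = np.random.default_rng(1)
stand_in = LinearRegression().fit(rng.normal(size=(20, 8)), rng.normal(size=20))
joblib.dump(stand_in, "regressor.pkl")

# Prediction: one embedding vector per molecule of interest.
new_embeddings = rng.normal(size=(3, 8))
model = joblib.load("regressor.pkl")
log_n = model.predict(new_embeddings)
print(log_n.shape)  # one predicted column density per target
```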

Generating counterfactuals

This step uses a modified version of the exmol package developed by the White group. Please see their publication and repository for more details. This step isn't necessary, but it does provide a chance to make some interesting contrastive examples for exploring the interpretability of molecular detectability. Essentially, the modified script takes a molecule of choice, makes minor mutations to the base structure via addition, subtraction, or swapping, and generates a new counterfactual structure. This first portion does not necessarily need our pipeline; however, exmol can output the corresponding SELFIES string to feed into a trained regressor and predict its column density. This method, although in its infancy, can provide a means of generating testable hypotheses about correlations between functional groups and structural components and molecular abundance.
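As a toy illustration of the swapping idea only (this is not the exmol algorithm, the swap table is made up, and the output is not guaranteed to be chemically valid SMILES):

```python
import random

# Toy "counterfactual" generator: swap one heavy-atom symbol in a SMILES
# string for another, loosely mimicking the mutate-and-compare idea.
SWAPS = {"C": "N", "N": "O", "O": "C"}

def mutate_smiles(smiles, seed=0):
    rng = random.Random(seed)
    chars = list(smiles)
    positions = [i for i, c in enumerate(chars) if c in SWAPS]
    if not positions:
        return smiles
    i = rng.choice(positions)
    chars[i] = SWAPS[chars[i]]
    return "".join(chars)

variant = mutate_smiles("CC#N")
print(variant)  # a single-swap variant of methyl cyanide
```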

Note: the codebase and use of the modified exmol package will be provided in future source code and notebooks. Until then, please see the commit log for the changes and adjustments made in our branch.


License

Distributed under the terms of the MIT license

Issues

Credits

Project based on the cookiecutter data science project template.

This version of the cookiecutter template is modified by Kelvin Lee.

Owner

  • Name: Haley Scolati
  • Login: HumbleSituation164
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Scolati
    given-names: Haley
    orcid: https://orcid.org/0000-0002-8505-4459
  - family-names: Lee
    given-names: Kelvin
    orcid: https://orcid.org/0000-0002-1903-9242
title: "Explanation of chemical inventories using machine learning"
version: 0.1.0
doi: 10.5281/zenodo.7958251
date-released: 2023-05-22

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Dependencies

poetry.lock pypi
  • 167 dependencies
pyproject.toml pypi
  • black ^21.11b0 develop
  • flake8 ^4.0.1 develop
  • pre-commit ^2.15.0 develop
  • pytest ^6.2.5 develop
  • astrochem-embedding ^0.1.4
  • exmol ^2.1.1
  • ipython ^7.29.0
  • jupyter ^1.0.0
  • matplotlib ^3.5.0
  • numpy ^1.21.4
  • pandas ^1.3.4
  • periodictable ^1.6.1
  • python ^3.8, < 3.10.0
  • shap ^0.40.0
  • skunk ^0.4.0