gauche

A Library for Gaussian Processes in Chemistry

https://github.com/leojklarner/gauche

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, nature.com, iop.org, rsc.org, acs.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
Statistics
  • Stars: 239
  • Watchers: 7
  • Forks: 25
  • Open Issues: 16
  • Releases: 0
Created about 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Project Status: Active – The project has reached a stable, usable state and is being actively developed. License: MIT Docs CodeFactor Code style: black python

DOI: 10.48550/arXiv.2212.04450

GAUCHE is a collaborative, open-source software library that aims to make state-of-the-art probabilistic modelling and black-box optimisation techniques more easily accessible to scientific experts in chemistry, materials science and beyond. We provide 30+ bespoke kernels for molecules, chemical reactions and proteins and illustrate how they can be used for Gaussian processes and Bayesian optimisation in 10+ easy-to-adapt tutorial notebooks.

Overview | Getting Started | Documentation | Paper (NeurIPS 2023)

Overview

General-purpose Gaussian process (GP) and Bayesian optimisation (BO) libraries do not cater for molecular representations. Likewise, general-purpose molecular machine learning libraries do not consider GPs and BO. To bridge this gap, GAUCHE provides a modular, robust and easy-to-use framework of 30+ parallelisable and batch-GP-compatible implementations of string, fingerprint and graph kernels that operate on a range of widely-used molecular representations.

Kernels

Standard GP packages typically assume continuous input spaces of low and fixed dimensionality. This makes it difficult to apply them to common molecular representations: molecular graphs are discrete objects, SMILES strings vary in length and topological fingerprints tend to be high-dimensional and sparse. To bridge this gap, GAUCHE provides:

  • Fingerprint Kernels that measure the similarity between bit/count vectors of descriptors by examining the degree to which their elements overlap.
  • String Kernels that measure the similarity between strings by examining the degree to which their sub-strings overlap.
  • Graph Kernels that measure the similarity between graphs by examining the degree to which certain substructural motifs overlap.
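
To make the idea of "overlap" concrete, here is a minimal pure-Python sketch of two such similarity measures: the Tanimoto similarity between bit/count vectors and a normalised n-gram overlap between strings. It is an illustration under simplifying assumptions, not GAUCHE's implementation; the function names `tanimoto`, `ngram_counts` and `substring_similarity` are hypothetical, and GAUCHE's own kernels operate on batched tensors.

```python
from collections import Counter


def tanimoto(x, y):
    """Tanimoto similarity between two equal-length bit/count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sum(a * a for a in x)
    norm_y = sum(b * b for b in y)
    denom = norm_x + norm_y - dot
    # Convention assumed here: two all-zero vectors count as identical.
    return 1.0 if denom == 0 else dot / denom


def ngram_counts(s, n=2):
    """Count every contiguous substring of length n in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def substring_similarity(s, t, n=2):
    """Cosine similarity between the n-gram count vectors of two strings."""
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    dot = sum(cs[g] * ct[g] for g in cs)
    norm_s = sum(v * v for v in cs.values())
    norm_t = sum(v * v for v in ct.values())
    norm = (norm_s * norm_t) ** 0.5
    return 0.0 if norm == 0 else dot / norm


# Toy fingerprints: two of the four "on" positions are shared.
fp_a = [1, 1, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1]
print(tanimoto(fp_a, fp_b))  # 0.5

# SMILES for ethanol and ethylamine share the bigram "CC".
print(substring_similarity("CCO", "CCN"))
```

Both functions return 1 for identical inputs and shrink towards 0 as the shared elements or sub-strings become rarer, which is the behaviour a GP kernel over molecules relies on.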

Representations

GAUCHE supports any representation that is based on bit/count vectors, strings or graphs. For rapid prototyping and benchmarking, we also provide a range of standard featurisation techniques for molecules, chemical reactions and proteins:

| Domain | Representation |
|---|---|
| Molecules | ECFP Fingerprints [1], rdkit Fragments, Fragprints, Molecular Graphs [2], SMILES [3], SELFIES [4] |
| Chemical Reactions | One-Hot Encoding, Data-Driven Reaction Fingerprints [6], Differential Reaction Fingerprints [5], Reaction SMARTS |
| Proteins | Sequences, Graphs [7] |
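
As an illustration of the simplest entry in the table, the sketch below one-hot encodes reactions that are described by categorical components. It is a pure-Python sketch rather than GAUCHE's featuriser: the `one_hot_encode` helper and the component labels (`ligand_A`, `base_X`, ...) are hypothetical placeholders.

```python
def one_hot_encode(reactions, categories=None):
    """One-hot encode reactions given as tuples of categorical components."""
    n_slots = len(reactions[0])
    if categories is None:
        # Collect the sorted set of values observed in each component slot.
        categories = [
            sorted({rxn[slot] for rxn in reactions}) for slot in range(n_slots)
        ]
    encoded = []
    for rxn in reactions:
        row = []
        for slot, value in enumerate(rxn):
            row.extend(1 if value == cat else 0 for cat in categories[slot])
        encoded.append(row)
    return encoded, categories


# Hypothetical reactions: (ligand, base) pairs with placeholder labels.
reactions = [
    ("ligand_A", "base_X"),
    ("ligand_B", "base_X"),
    ("ligand_A", "base_Y"),
]
features, categories = one_hot_encode(reactions)
for row in features:
    print(row)
```

The resulting rows are plain bit vectors, so they fall into the "bit/count vectors" family of representations described above.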

Extensions

If there are any specific kernels or representations that you would like to see included in GAUCHE, feel free to submit an issue or pull request.

Getting Started

The easiest way to install GAUCHE is via pip.

pip install gauche

As not all users will need the full functionality of the package, we provide a range of installation options:

  • pip install gauche - installs the core functionality of GAUCHE (kernels, representations, data loaders, etc.) and should cover a wide range of use cases.
  • pip install gauche[rxn] - additionally installs the rxnfp and drfp fingerprints that can be used to represent chemical reactions.
  • pip install gauche[graphs] - installs all dependencies for graph kernels and representations.

If you aren't sure which installation option is right for you, you can simply install all of them with pip install gauche[all].
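
If you are unsure which optional dependencies are already present in your environment, a small check like the following can help. This is a sketch that uses only the standard library; the import names (`rxnfp`, `drfp`, `grakel`, `graphein`) are assumptions based on the package names mentioned in this document, and `available_extras` is a hypothetical helper, not part of GAUCHE.

```python
from importlib.util import find_spec

# Assumed top-level import names for each optional install target.
OPTIONAL_MODULES = {
    "rxn": ["rxnfp", "drfp"],
    "graphs": ["grakel", "graphein"],
}


def available_extras(modules=OPTIONAL_MODULES):
    """Map each extra to True if all of its modules can be located."""
    return {
        extra: all(find_spec(name) is not None for name in names)
        for extra, names in modules.items()
    }


print(available_extras())
```

An extra reported as `False` indicates that at least one of its assumed modules could not be found, in which case the corresponding `pip install gauche[...]` option is the one to run.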


Tutorial Notebooks

The best way to get started with GAUCHE is to check out our tutorial notebooks. These notebooks provide a step-by-step introduction to the core functionality of GAUCHE and illustrate how it can be used to solve a range of common problems in molecular property prediction and optimisation. To install gauche in the colab environment run:

pip install gauche

| Tutorial | Notebook |
|---|---|
| Loading and Featurising Molecules | Open In Colab |
| GP Regression on Molecules | Open In Colab |
| Bayesian Optimisation Over Molecules | Open In Colab |
| Multioutput Gaussian Processes for Multitask Learning | Open In Colab |
| Training GPs on Graphs | Open In Colab |
| Sparse GP Regression for Big Molecular Data | Open In Colab |
| Molecular Preference Learning | Open In Colab |
| Preferential Bayesian Optimisation | Open In Colab |
| Training Bayesian GNNs on Molecules | Open In Colab |


Example Usage: Loading and Featurising Molecules

GAUCHE provides a range of helper functions for loading and preprocessing datasets for molecular property and reaction yield prediction and optimisation tasks. For more detail, check out our Loading and Featurising Molecules Tutorial and the corresponding section in the Docs.


```python
from gauche.dataloader import MolPropLoader

loader = MolPropLoader()

# load one of the included benchmarks
loader.load_benchmark("Photoswitch")

# or a custom dataset
loader.read_csv(path="data.csv", smiles_column="smiles", label_column="y")

# and quickly featurise the provided molecules
loader.featurize('ecfp_fragprints')
X, y = loader.features, loader.labels
```
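
For the custom-dataset route, the call above implies a CSV file with one column of SMILES strings and one column of labels. The sketch below writes such a file using only the standard library. It assumes the column names `smiles` and `y` from the example; the label values are arbitrary placeholders, not measurements, and `write_dataset` is a hypothetical helper.

```python
import csv
import os
import tempfile

# Placeholder rows: real SMILES strings paired with made-up label values.
rows = [
    {"smiles": "CCO", "y": 0.1},
    {"smiles": "c1ccccc1", "y": 0.2},
    {"smiles": "CC(=O)O", "y": 0.3},
]


def write_dataset(path, rows, smiles_column="smiles", label_column="y"):
    """Write rows to a CSV with one SMILES column and one label column."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=[smiles_column, label_column])
        writer.writeheader()
        for row in rows:
            writer.writerow(
                {smiles_column: row["smiles"], label_column: row["y"]}
            )


path = os.path.join(tempfile.mkdtemp(), "data.csv")
write_dataset(path, rows)

# Read the file back to confirm the layout.
with open(path, newline="") as handle:
    contents = list(csv.reader(handle))
print(contents[0])  # ['smiles', 'y']
```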

Example Usage: GP Regression on Molecules

Fitting a GP model with a kernel from GAUCHE and using it to predict the properties of new molecules is as easy as this. For more detail, check out our GP Regression on Molecules Tutorial and the corresponding section in the Docs.


```python
import gpytorch
# note: more recent BoTorch releases name this function fit_gpytorch_mll
from botorch import fit_gpytorch_model
from gauche.kernels.fingerprint_kernels.tanimoto_kernel import TanimotoKernel

# X_train, y_train and X_test are assumed to be torch tensors
# holding the featurised molecules and their labels

# define GP model with Tanimoto kernel
class TanimotoGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(TanimotoGP, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(TanimotoKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialise GP likelihood, model and
# marginal log likelihood objective
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = TanimotoGP(X_train, y_train, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

# fit GP with BoTorch in order to use
# the L-BFGS-B optimiser (recommended)
fit_gpytorch_model(mll)

# use the trained GP to get predictions and
# uncertainty estimates for new molecules
model.eval()
likelihood.eval()
preds = model(X_test)
pred_means, pred_vars = preds.mean, preds.variance
```
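
Once predictive means and variances are available, they are typically turned into error bars and an accuracy metric. The following pure-Python sketch shows one way to do this under a Gaussian assumption. The helper names and the numbers are placeholders standing in for values converted from the tensors above (for example via `.tolist()`). Note also that in GPyTorch `model(X_test)` describes the latent function, whereas `likelihood(model(X_test))` additionally includes observation noise.

```python
import math


def prediction_intervals(means, variances, z=1.96):
    """Approximate 95% intervals (mean +/- z * std) for Gaussian predictions."""
    return [
        (m - z * math.sqrt(v), m + z * math.sqrt(v))
        for m, v in zip(means, variances)
    ]


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder values standing in for the predictive means and variances.
means = [1.0, 2.0]
variances = [0.25, 1.0]
y_true = [1.5, 2.0]

intervals = prediction_intervals(means, variances)
error = rmse(y_true, means)
print(intervals)
print(error)
```

Reporting the interval alongside the point prediction is what distinguishes a GP from a plain regressor: wide intervals flag molecules that are dissimilar to the training set.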

Citing GAUCHE

If GAUCHE is useful for your work, please consider citing the following paper:

```bibtex
@article{griffiths2024gauche,
  title={{GAUCHE}: A library for {Gaussian} processes in chemistry},
  author={Griffiths, Ryan-Rhys and Klarner, Leo and Moss, Henry and Ravuri, Aditya and Truong, Sang and Du, Yuanqi and Stanton, Samuel and Tom, Gary and Rankovic, Bojana and Jamasb, Arian and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
```

References

[1] Rogers, D. and Hahn, M., 2010. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5), pp.742-754.

[2] Fey, M. and Lenssen, J.E., 2019. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428.

[3] Weininger, D., 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1), pp.31-36.

[4] Krenn, M., Häse, F., Nigam, A., Friederich, P. and Aspuru-Guzik, A., 2020. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4), p.045024.

[5] Probst, D., Schwaller, P. and Reymond, J.L., 2022. Reaction classification and yield prediction using the differential reaction fingerprint DRFP. Digital Discovery, 1(2), pp.91-97.

[6] Schwaller, P., Probst, D., Vaucher, A.C., Nair, V.H., Kreutter, D., Laino, T. and Reymond, J.L., 2021. Mapping the space of chemical reactions using attention-based neural networks. Nature Machine Intelligence, 3(2), pp.144-152.

[7] Jamasb, A., Viñas Torné, R., Ma, E., Du, Y., Harris, C., Huang, K., Hall, D., Lió, P. and Blundell, T., 2022. Graphein - a Python library for geometric deep learning and network analysis on biomolecular structures and interaction networks. Advances in Neural Information Processing Systems, 35, pp.27153-27167.

Owner

  • Name: Leo Klarner
  • Login: leojklarner
  • Kind: user
  • Location: Oxford

PhD student in AI for Drug Discovery @ Oxford

Citation (citation.bib)

@misc{griffiths2022gauche,
      title={GAUCHE: A Library for Gaussian Processes in Chemistry}, 
      author={Ryan-Rhys Griffiths and Leo Klarner and Henry B. Moss and Aditya Ravuri and Sang Truong and Bojana Rankovic and Yuanqi Du and Arian Jamasb and Julius Schwartz and Austin Tripp and Gregory Kell and Anthony Bourached and Alex Chan and Jacob Moss and Chengzhi Guo and Alpha A. Lee and Philippe Schwaller and Jian Tang},
      year={2022},
      eprint={2212.04450},
      archivePrefix={arXiv},
      primaryClass={physics.chem-ph}
}

GitHub Events

Total
  • Issues event: 3
  • Watch event: 23
  • Issue comment event: 7
  • Pull request event: 1
  • Fork event: 3
Last Year
  • Issues event: 3
  • Watch event: 23
  • Issue comment event: 7
  • Pull request event: 1
  • Fork event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: over 1 year
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 4.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kkovary (2)
  • AustinT (1)
  • sgbaird (1)
Pull Request Authors
  • AustinT (1)
  • AntObi (1)
  • benjamc (1)
  • kkovary (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 413 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 4
  • Total versions: 2
  • Total maintainers: 1
pypi.org: gauche

Gaussian Process Library for Molecules, Chemical Reactions and Proteins.

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 4
  • Downloads: 413 Last month
Rankings
Dependent repos count: 7.5%
Average: 8.8%
Dependent packages count: 10.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/build.yaml actions
  • actions/checkout v2 composite
  • s-weigand/setup-conda v1 composite
.github/workflows/code-style.yaml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.requirements/base.in pypi
  • gpytorch *
  • graphein ==1.4.0
  • numpy *
  • pandas *
  • rdkit *
  • scikit-learn *
  • selfies *
  • tqdm *
.requirements/dev.in pypi
  • black * development
  • isort * development
  • pytest * development
.requirements/docs.in pypi
  • furo *
  • m2r2 *
  • nbsphinx *
  • nvsphinx-link *
  • sphinx *
  • sphinx-copybutton *
  • sphinx-inline-tabs *
  • sphinxcontrib-gtagjs *
  • sphinxext-opengraph *
pyproject.toml pypi
setup.py pypi
.requirements/graphs.in pypi
  • grakel *
  • graphein *
.requirements/rxn.in pypi
  • drfp *
  • transformers *
.github/workflows/build_documentation.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • peaceiris/actions-gh-pages v3 composite