matbench-genmetrics

matbench-genmetrics: A Python library for benchmarking crystal structure generative models using time-based splits of Materials Project structures - Published in JOSS (2024)

https://github.com/sparks-baird/matbench-genmetrics

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

materials-informatics python
Last synced: 4 months ago

Repository

Generative materials benchmarking metrics, inspired by guacamol and CDVAE.

Basic Info
Statistics
  • Stars: 40
  • Watchers: 1
  • Forks: 2
  • Open Issues: 17
  • Releases: 3
Topics
materials-informatics python
Created over 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing License Citation Authors

README.md


This is not an official Matbench repository, but it may eventually be incorporated into Matbench.

matbench-genmetrics

Generative materials benchmarking metrics, inspired by guacamol and CDVAE.

This repository provides standardized benchmarks for evaluating generative models for crystal structures. Each benchmark has a fixed dataset, a predefined split, and associated metrics that define what "best" means.

NOTE: This project is separate from https://matbench-discovery.materialsproject.org/ which provides a slick leaderboard and package for benchmarking ML models on crystal stability prediction from unrelaxed structures. This project instead looks at assessing the quality of generative models for crystal structures.

Getting Started

Installation, a dummy example, output metrics for the example, and descriptions of the benchmark metrics.

Installation

```bash
pip install matbench-genmetrics
```

See Advanced Installation for more information.

Example

NOTE: be sure to set dummy=False for the real/full benchmark run. MPTSMetrics10 is intended for fast prototyping and debugging, as it assumes only 10 generated structures.

```python
from matbench_genmetrics.mp_time_split.utils.gen import DummyGenerator
from matbench_genmetrics.core.metrics import (
    MPTSMetrics10,
    MPTSMetrics100,
    MPTSMetrics1000,
    MPTSMetrics10000,
)

mptm = MPTSMetrics10(dummy=True)
for fold in mptm.folds:
    train_val_inputs = mptm.get_train_and_val_data(fold)

    dg = DummyGenerator()
    dg.fit(train_val_inputs)
    gen_structures = dg.gen(n=mptm.num_gen)

    mptm.evaluate_and_record(fold, gen_structures)

print(mptm.recorded_metrics)
```

```python
{
    0: {"validity": 0.4375, "coverage": 0.0, "novelty": 1.0, "uniqueness": 0.9777777777777777},
    1: {"validity": 0.4390681003584229, "coverage": 0.0, "novelty": 1.0, "uniqueness": 0.9333333333333333},
    2: {"validity": 0.4401197604790419, "coverage": 0.0, "novelty": 1.0, "uniqueness": 0.8222222222222222},
    3: {"validity": 0.4408740359897172, "coverage": 0.0, "novelty": 1.0, "uniqueness": 0.8444444444444444},
    4: {"validity": 0.4414414414414415, "coverage": 0.0, "novelty": 1.0, "uniqueness": 0.9111111111111111},
}
```
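Any model can be dropped into the loop above as long as it exposes the same `fit`/`gen` interface as `DummyGenerator`. The class below is a hypothetical baseline (the `RandomRepeatGenerator` name and its behavior are illustrative, not part of the package) that learns nothing and simply resamples training structures:

```python
import random


class RandomRepeatGenerator:
    """Hypothetical baseline exposing the fit/gen interface used above.

    gen() just resamples training structures with replacement, so its
    novelty score would be 0 by construction.
    """

    def fit(self, train_structures):
        # A real model would train here; we only keep a reference.
        self.train_structures = list(train_structures)

    def gen(self, n=100):
        # Draw n "generated" structures with replacement from the train set.
        return [random.choice(self.train_structures) for _ in range(n)]


# Placeholder strings stand in for pymatgen Structure objects:
baseline = RandomRepeatGenerator()
baseline.fit(["struct_a", "struct_b", "struct_c"])
samples = baseline.gen(n=5)
print(len(samples))  # 5
```

Such a copy-the-training-set baseline is a useful sanity check: it should score perfectly on uniqueness only by chance, and a benchmark that rewards it highly on novelty would be broken.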

Metrics

| Metric | Description |
| ---------- | ----------- |
| Validity | A loose measure of how "valid" the set of generated structures is, obtained by comparing the space group number distribution of the generated structures with that of the benchmark data. Formally, this is one minus (the Wasserstein distance between the distributions of space group numbers for train and generated structures, divided by the distance for the dummy case between train and `space_group_number == 1`). See also https://github.com/sparks-baird/matbench-genmetrics/issues/44 |
| Coverage | A form of "rediscovery", where held-out structures from the future are "discovered" by the generative model, i.e., the generative model "predicted the future". Formally, this is the match count between held-out test structures and generated structures, divided by the number of test structures. |
| Novelty | A measure of how novel the generated structures are relative to the structures used to train the generative model. Formally, this is one minus (the match count between train structures and generated structures, divided by the number of generated structures). |
| Uniqueness | A measure of whether the generative model suggests repeat structures. Formally, this is one minus (the non-self-comparing match count within the generated structures, divided by the total possible number of non-self-comparing matches). |

A match is when `StructureMatcher(stol=0.5, ltol=0.3, angle_tol=10.0).fit(s1, s2)` evaluates to `True`.
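Taking the formulas above literally, coverage, novelty, and uniqueness reduce to match-count ratios. The sketch below mimics that arithmetic with a toy equality-based `is_match` standing in for pymatgen's `StructureMatcher`; the helper functions and the placeholder strings are illustrative, not the package's API:

```python
from itertools import combinations


def is_match(s1, s2):
    # Placeholder for StructureMatcher(stol=0.5, ltol=0.3, angle_tol=10.0).fit(s1, s2);
    # plain equality here so the arithmetic is easy to follow.
    return s1 == s2


def coverage(test, generated):
    # Match count between held-out test and generated, over number of test structures.
    matches = sum(any(is_match(t, g) for g in generated) for t in test)
    return matches / len(test)


def novelty(train, generated):
    # One minus (match count against train, over number of generated structures).
    matches = sum(any(is_match(g, t) for t in train) for g in generated)
    return 1 - matches / len(generated)


def uniqueness(generated):
    # One minus (non-self pairwise match count, over total possible non-self pairs).
    pairs = list(combinations(generated, 2))
    matches = sum(is_match(a, b) for a, b in pairs)
    return 1 - matches / len(pairs)


train, test, generated = ["A", "B", "C"], ["A", "F"], ["A", "D", "D", "E"]
print(coverage(test, generated))  # 0.5   (1 of 2 test structures rediscovered)
print(novelty(train, generated))  # 0.75  (1 of 4 generated structures matches train)
print(uniqueness(generated))      # 0.8333... (1 matching pair out of 6 possible)
```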

Detailed descriptions of the metrics are given on the Metrics page.

We performed a "slow march of time" benchmarking study, which uses the mp-time-split data from a future fold as the "generated" structures for the previous fold. The results are presented in the charts below. See the corresponding notebook for details.
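The study rests on chronological, expanding train/test folds. The helper below is a simplified stand-in for the actual mp-time-split machinery (the `expanding_time_folds` name and the `(year, id)` tuples are illustrative), mimicking scikit-learn's `TimeSeriesSplit` on a sorted list:

```python
def expanding_time_folds(items, n_folds=5):
    # Sort chronologically, then yield expanding-train / next-chunk-test splits,
    # in the spirit of scikit-learn's TimeSeriesSplit.
    items = sorted(items)
    chunk = len(items) // (n_folds + 1)
    for fold in range(1, n_folds + 1):
        train = items[: fold * chunk]
        test = items[fold * chunk : (fold + 1) * chunk]
        yield train, test


# Entries could be (first_report_year, structure_id) pairs:
entries = [(2000 + i, f"mp-{i}") for i in range(12)]
for train, test in expanding_time_folds(entries):
    # In the "slow march of time" study, the test chunk of one fold doubles
    # as the "generated" set when scoring the preceding fold.
    print(len(train), len(test))
```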

Slow March of Time benchmarking

Advanced Installation

PyPI (pip) installation

Create and activate a new conda environment named matbench-genmetrics (-n) with python==3.11.* or your preferred Python version, then install matbench-genmetrics via pip.

```bash
conda create -n matbench-genmetrics python==3.11.*
conda activate matbench-genmetrics
pip install matbench-genmetrics
```

Editable installation

In order to set up the necessary environment:

  1. clone and enter the repository via:

```bash
git clone https://github.com/sparks-baird/matbench-genmetrics.git
cd matbench-genmetrics
```

  2. create and activate a new conda environment (optional, but recommended)

```bash
conda create -n matbench-genmetrics python==3.11.*
conda activate matbench-genmetrics
```

  3. perform an editable (-e) installation in the current directory (.):

```bash
pip install -e .
```

NOTE: Some changes, e.g. in setup.cfg, might require you to run pip install -e . again.

Optional and needed only once after git clone:

  4. install several pre-commit git hooks with:

```bash
pre-commit install
# You might also want to run `pre-commit autoupdate`
```

and check out the configuration under .pre-commit-config.yaml. The -n, --no-verify flag of git commit can be used to deactivate pre-commit hooks temporarily.

  5. install nbstripout git hooks to remove the output cells of committed notebooks with:

```bash
nbstripout --install --attributes notebooks/.gitattributes
```

This is useful to avoid large diffs due to plots in your notebooks. A simple nbstripout --uninstall will revert these changes.

Then take a look into the scripts and notebooks folders.

Dependency Management & Reproducibility

  1. Always keep your abstract (unpinned) dependencies updated in environment.yml and eventually in setup.cfg if you want to ship and install your package via pip later on.
  2. Create concrete dependencies as environment.lock.yml for the exact reproduction of your environment with:

```bash
conda env export -n matbench-genmetrics -f environment.lock.yml
```

For multi-OS development, consider using --no-builds during the export.

  3. Update your current environment with respect to a new environment.lock.yml using:

```bash
conda env update -f environment.lock.yml --prune
```

Project Organization

```txt
├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── Dockerfile              <- Build a docker container with `docker build .`.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── configs                 <- Directory for configurations of model & application.
├── data
│   ├── external            <- Data from third party sources.
│   ├── interim             <- Intermediate data that has been transformed.
│   ├── processed           <- The final, canonical data sets for modeling.
│   └── raw                 <- The original, immutable data dump.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Trained and serialized models, model predictions,
│                              or model summaries.
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for
│                              ordering), the creator's initials and a description,
│                              e.g. `1.0-fw-initial-data-exploration`.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or to build `tox -e build`.
├── references              <- Data dictionaries, manuals, and all other materials.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated plots and figures for reports.
├── scripts                 <- Analysis and production scripts which import the
│                              actual PYTHON_PKG, e.g. train_model.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- [DEPRECATED] Use `python setup.py develop` to install
│                              for development or `python setup.py bdist_wheel` to build.
├── src
│   └── matbench_genmetrics <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.
```

Citing

Baird, S.G.; Sayeed, H.M.; Montoya, J.; Sparks, T.D. (2024). matbench-genmetrics: A Python library for benchmarking crystal structure generative models using time-based splits of Materials Project structures. Journal of Open Source Software, 9(97), 5618, https://doi.org/10.21105/joss.05618

```bibtex
@article{Baird2024,
  doi = {10.21105/joss.05618},
  url = {https://doi.org/10.21105/joss.05618},
  year = {2024},
  publisher = {The Open Journal},
  volume = {9},
  number = {97},
  pages = {5618},
  author = {Sterling G. Baird and Hasan M. Sayeed and Joseph Montoya and Taylor D. Sparks},
  title = {matbench-genmetrics: A Python library for benchmarking crystal structure generative models using time-based splits of Materials Project structures},
  journal = {Journal of Open Source Software}
}
```

Note

This project has been set up using PyScaffold 4.2.2.post1.dev2+ge50b5e1 and the dsproject extension 0.7.2.post1.dev2+geb5d6b6.

Owner

  • Name: Sparks/Baird Materials Informatics
  • Login: sparks-baird
  • Kind: organization
  • Email: sterling.baird@utah.edu
  • Location: United States of America

Sterling Baird and Taylor Sparks Materials Informatics Projects

JOSS Publication

matbench-genmetrics: A Python library for benchmarking crystal structure generative models using time-based splits of Materials Project structures
Published
May 27, 2024
Volume 9, Issue 97, Page 5618
Authors
Sterling G. Baird
Materials Science & Engineering, University of Utah, United States of America; Acceleration Consortium, University of Toronto, 80 St George St, Toronto, ON, Canada
Hasan M. Sayeed
Materials Science & Engineering, University of Utah, United States of America
Joseph Montoya
Toyota Research Institute, Los Altos, CA, United States of America
Taylor D. Sparks
Materials Science & Engineering, University of Utah, United States of America
Editor
Sophie Beck
Tags
materials informatics crystal structure generative modeling TimeSeriesSplit benchmarking

Citation (CITATION.cff)

cff-version: "1.2.0"
authors:
- family-names: Baird
  given-names: Sterling G.
  orcid: "https://orcid.org/0000-0002-4491-6876"
- family-names: Sayeed
  given-names: Hasan M.
  orcid: "https://orcid.org/0000-0002-6583-7755"
- family-names: Montoya
  given-names: Joseph
  orcid: "https://orcid.org/0000-0001-5760-2860"
- family-names: Sparks
  given-names: Taylor D.
  orcid: "https://orcid.org/0000-0001-8020-7711"
contact:
- family-names: Baird
  given-names: Sterling G.
  orcid: "https://orcid.org/0000-0002-4491-6876"
doi: 10.5281/zenodo.10840604
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Baird
    given-names: Sterling G.
    orcid: "https://orcid.org/0000-0002-4491-6876"
  - family-names: Sayeed
    given-names: Hasan M.
    orcid: "https://orcid.org/0000-0002-6583-7755"
  - family-names: Montoya
    given-names: Joseph
    orcid: "https://orcid.org/0000-0001-5760-2860"
  - family-names: Sparks
    given-names: Taylor D.
    orcid: "https://orcid.org/0000-0001-8020-7711"
  date-published: 2024-05-27
  doi: 10.21105/joss.05618
  issn: 2475-9066
  issue: 97
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 5618
  title: "matbench-genmetrics: A Python library for benchmarking crystal
    structure generative models using time-based splits of Materials
    Project structures"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.05618"
  volume: 9
title: "matbench-genmetrics: A Python library for benchmarking crystal
  structure generative models using time-based splits of Materials
  Project structures"

GitHub Events

Total
  • Watch event: 6
  • Fork event: 1
Last Year
  • Watch event: 6
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 337
  • Total Committers: 3
  • Avg Commits per committer: 112.333
  • Development Distribution Score (DDS): 0.027
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
sgbaird s****d@u****u 328
hasan h****3@g****m 7
Janosh Riebesell j****l@g****m 2

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 28
  • Total pull requests: 52
  • Average time to close issues: 3 months
  • Average time to close pull requests: 4 days
  • Total issue authors: 6
  • Total pull request authors: 3
  • Average comments per issue: 2.68
  • Average comments per pull request: 0.52
  • Merged pull requests: 51
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sgbaird (22)
  • kjappelbaum (1)
  • ml-evs (1)
  • sp8rks (1)
  • jamesrhester (1)
  • hasan-sayeed (1)
Pull Request Authors
  • sgbaird (50)
  • janosh (1)
  • hasan-sayeed (1)
Top Labels
Issue Labels
enhancement (2)

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 65 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 32
  • Total maintainers: 1
proxy.golang.org: github.com/sparks-baird/matbench-genmetrics
  • Versions: 17
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 4 months ago
pypi.org: matbench-genmetrics

Generative materials benchmarking metrics, inspired by CDVAE.

  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 65 Last month
Rankings
Dependent packages count: 6.6%
Stargazers count: 17.2%
Average: 22.1%
Forks count: 23.2%
Dependent repos count: 30.6%
Downloads: 33.1%
Maintainers (1)
Last synced: 4 months ago
conda-forge.org: matbench-genmetrics
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 53.0%
Average: 54.9%
Forks count: 56.7%
Last synced: 4 months ago

Dependencies

docs/requirements.txt pypi
  • ipykernel *
  • myst-parser *
  • nbsphinx *
  • nbsphinx-link *
  • sphinx >=3.2.1
  • sphinx_copybutton *
  • sphinx_rtd_theme *
.github/workflows/ci.yml actions
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • coverallsapp/github-action master composite
.github/workflows/draft-pdf.yml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite
Dockerfile docker
  • mcr.microsoft.com/vscode/devcontainers/python 0-${VARIANT} build
pyproject.toml pypi
setup.py pypi
environment.yml conda
  • ipython
  • matplotlib
  • pip
  • plotly
  • pre_commit
  • pytest
  • pytest-cov
  • python >=3.6
  • python-kaleido
  • sphinx
  • tox