the-well

A 15TB Collection of Physics Simulation Datasets

https://github.com/polymathicai/the_well

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

A 15TB Collection of Physics Simulation Datasets

Basic Info
Statistics
  • Stars: 917
  • Watchers: 19
  • Forks: 73
  • Open Issues: 4
  • Releases: 3
Created over 1 year ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md


![Test Workflow](https://github.com/PolymathicAI/the_well/actions/workflows/tests.yaml/badge.svg) [![PyPI](https://img.shields.io/pypi/v/the_well)](https://pypi.org/project/the-well/) [![Docs](https://img.shields.io/badge/docs-latest---?color=25005a&labelColor=grey)](https://polymathic-ai.org/the_well/) [![arXiv](https://img.shields.io/badge/arXiv-2412.00568---?logo=arXiv&labelColor=b31b1b&color=grey)](https://arxiv.org/abs/2412.00568) [![NeurIPS](https://img.shields.io/badge/NeurIPS-2024---?logo=https%3A%2F%2Fneurips.cc%2Fstatic%2Fcore%2Fimg%2FNeurIPS-logo.svg&labelColor=68448B&color=b3b3b3)](https://openreview.net/forum?id=00Sx577BT3) [![HuggingFace](https://img.shields.io/badge/datasets-%20?logo=huggingface&logoColor=%23FFD21E&label=Hugging%20Face&labelColor=%236B7280&color=%23FFD21E )](https://huggingface.co/collections/polymathic-ai/the-well-67e129f4ca23e0447395d74c)

The Well: 15TB of Physics Simulations

Welcome to the Well, a large-scale collection of machine learning datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain scientists and numerical software developers to provide 15TB of data across 16 datasets covering diverse domains such as biological systems, fluid dynamics, acoustic scattering, as well as magneto-hydrodynamic simulations of extra-galactic fluids or supernova explosions. These datasets can be used individually or as part of a broader benchmark suite for accelerating research in machine learning and computational sciences.

Tap into the Well

Once the Well package is installed and the data downloaded, you can use the datasets in your training pipeline.

```python
from the_well.data import WellDataset
from torch.utils.data import DataLoader

trainset = WellDataset(
    well_base_path="path/to/base",
    well_dataset_name="name_of_the_dataset",
    well_split_name="train",
)
train_loader = DataLoader(trainset)

for batch in train_loader:
    ...
```

For more information regarding the interface, please refer to the API and the tutorials.

Installation

If you plan to use The Well datasets to train or evaluate deep learning models, we recommend using a machine with sufficient computing resources. We also recommend creating a new Python (>=3.10) environment to install the Well. For instance, with venv:

python -m venv path/to/env
source path/to/env/bin/activate

From PyPI

The Well package can be installed directly from PyPI.

pip install the_well

From Source

It can also be installed from source. For this, clone the repository and install the package with its dependencies.

git clone https://github.com/PolymathicAI/the_well
cd the_well
pip install .

Depending on your acceleration hardware, you can specify --extra-index-url to install the relevant PyTorch version. For example, use

pip install . --extra-index-url https://download.pytorch.org/whl/cu121

to install the dependencies built for CUDA 12.1.

Benchmark Dependencies

If you want to run the benchmarks, you should install additional dependencies.

pip install the_well[benchmark]

Downloading the Data

The Well datasets range between 6.9GB and 5.1TB of data each, for a total of 15TB for the full collection. Ensure that your system has enough free disk space to accommodate the datasets you wish to download.

Once the_well is installed, you can use the the-well-download command to download any dataset of The Well.

the-well-download --base-path path/to/base --dataset active_matter --split train

If --dataset and --split are omitted, all datasets and splits will be downloaded. This could take a while!

Streaming from Hugging Face

Most of the Well datasets are also hosted on Hugging Face. Data can be streamed directly from the hub using the following code.

```python
from the_well.data import WellDataset
from torch.utils.data import DataLoader

# The following instantiation may take a couple of minutes
trainset = WellDataset(
    well_base_path="hf://datasets/polymathic-ai/",  # access from HF hub
    well_dataset_name="active_matter",
    well_split_name="train",
)
train_loader = DataLoader(trainset)

for batch in train_loader:
    ...
```

For better performance during large training runs, we advise downloading the data locally rather than streaming it over the network.

Benchmark

Train Models on the Well

The repository allows benchmarking surrogate models on the different datasets that compose the Well. Some state-of-the-art models are already implemented in models, while dataset classes handle the raw data of the Well. The benchmark relies on a training script that uses hydra to instantiate various classes (e.g. dataset, model, optimizer) from configuration files.

For instance, to run the training script of default FNO architecture on the active matter dataset, launch the following commands:

```bash
cd the_well/benchmark
python train.py experiment=fno server=local data=active_matter
```

Each argument corresponds to a specific configuration file. In the command above, server=local tells the training script to use local.yaml, which simply declares the relative path to the data. The configuration can be overridden on the command line or extended with new YAML files. Please refer to the hydra documentation for details on editing configurations.
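To make the role of these configuration files concrete, here is a rough sketch of what a server configuration like local.yaml might declare. This is an illustration only; the key names below are assumptions, and the actual files shipped in the repository's config directory may differ:

```yaml
# Hypothetical sketch of a server config such as local.yaml.
# Its only job, per the text above, is to declare where the data lives.
data_dir: ../../../datasets  # relative path to the downloaded Well datasets
```

With hydra, selecting server=local merely swaps this file into the composed configuration before training starts.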

You can use this command within a sbatch script to launch the training with Slurm.

Load Benchmarked Model Checkpoints

The models benchmarked in the original paper of the Well were designed as simple baselines. They should not be considered state-of-the-art. We hope that the community will build upon these results to develop better architectures for PDE surrogate modeling.

Most of the model checkpoints are available on Hugging Face. To load a specific checkpoint, follow the example below, which loads the FNO model trained on the active_matter dataset.

```python
from the_well.benchmark.models import FNO

model = FNO.from_pretrained("polymathic-ai/FNO-active_matter")
```

Citation

This project has been led by the Polymathic AI organization, in collaboration with researchers from the Flatiron Institute, University of Colorado Boulder, University of Cambridge, New York University, Rutgers University, Cornell University, University of Tokyo, Los Alamos National Laboratory, University of California, Berkeley, Princeton University, CEA DAM, and University of Liège.

If you find this project useful for your research, please consider citing:

```bibtex
@article{ohana2024well,
  title={The well: a large-scale collection of diverse physics simulations for machine learning},
  author={Ohana, Ruben and McCabe, Michael and Meyer, Lucas and Morel, Rudy and Agocs, Fruzsina and Beneitez, Miguel and Berger, Marsha and Burkhart, Blakesly and Dalziel, Stuart and Fielding, Drummond and others},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={44989--45037},
  year={2024}
}
```

Contact

For questions regarding this project, please contact Ruben Ohana and Michael McCabe at {rohana,mmccabe}@flatironinstitute.org.

Bug Reports and Feature Requests

To report a bug (in the data or the code), request a feature or simply ask a question, you can open an issue on the repository.

Owner

  • Name: Polymathic AI
  • Login: PolymathicAI
  • Kind: organization

Citation (CITATION)

@article{ohana2024well,
  title={The well: a large-scale collection of diverse physics simulations for machine learning},
  author={Ohana, Ruben and McCabe, Michael and Meyer, Lucas and Morel, Rudy and Agocs, Fruzsina and Beneitez, Miguel and Berger, Marsha and Burkhart, Blakesly and Dalziel, Stuart and Fielding, Drummond and others},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={44989--45037},
  year={2024}
}

GitHub Events

Total
  • Create event: 35
  • Issues event: 30
  • Release event: 2
  • Watch event: 867
  • Delete event: 29
  • Issue comment event: 79
  • Push event: 87
  • Pull request review comment event: 27
  • Pull request review event: 62
  • Pull request event: 47
  • Fork event: 76
Last Year
  • Create event: 35
  • Issues event: 30
  • Release event: 2
  • Watch event: 867
  • Delete event: 29
  • Issue comment event: 79
  • Push event: 87
  • Pull request review comment event: 27
  • Pull request review event: 62
  • Pull request event: 47
  • Fork event: 76

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 16
  • Total pull requests: 16
  • Average time to close issues: 19 days
  • Average time to close pull requests: 1 day
  • Total issue authors: 12
  • Total pull request authors: 4
  • Average comments per issue: 1.0
  • Average comments per pull request: 2.06
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 16
  • Pull requests: 16
  • Average time to close issues: 19 days
  • Average time to close pull requests: 1 day
  • Issue authors: 12
  • Pull request authors: 4
  • Average comments per issue: 1.0
  • Average comments per pull request: 2.06
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • LTMeyer (3)
  • FloWsnr (2)
  • mikemccabe210 (2)
  • arvizu-god (1)
  • NielsRogge (1)
  • MilesCranmer (1)
  • sawhney-medha (1)
  • Methylamphetamine (1)
  • tung-nd (1)
  • echirtel1 (1)
  • Autumn-Roy (1)
  • ArshKA (1)
  • caoql98 (1)
Pull Request Authors
  • LTMeyer (16)
  • francois-rozet (3)
  • payelmuk150 (2)
  • mikemccabe210 (1)
  • RudyMorel (1)
  • rubenohana (1)
  • eltociear (1)
  • FloWsnr (1)
Top Labels
Issue Labels
bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 258 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
pypi.org: the-well

A large-scale collection of machine learning datasets of various spatiotemporal physical systems

  • Documentation: https://the-well.readthedocs.io/
  • License: BSD 3-Clause License Copyright (c) 2024 Polymathic AI. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of Polymathic AI nor the names of the Well contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  • Latest release: 1.1.0
    published 11 months ago
  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 258 Last month
Rankings
Dependent packages count: 9.9%
Average: 33.0%
Dependent repos count: 56.0%
Maintainers (1)
Last synced: 7 months ago

Dependencies

.github/workflows/tests.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • pre-commit/action v3.0.1 composite
pyproject.toml pypi
  • einops >=0.8
  • h5py >=3.9.0
  • numpy >=1.20
  • pyyaml >=6.0
  • torch >=2.1