virtualizarr
Cloud-Optimize your Scientific Data as Virtual Zarr stores, using xarray syntax.
Science Score: 64.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to ieee.org
- ✓ Committers with academic emails: 1 of 24 committers (4.2%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: zarr-developers
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://virtualizarr.readthedocs.io/en/stable/index.html
- Size: 1.17 MB
Statistics
- Stars: 224
- Watchers: 12
- Forks: 49
- Open Issues: 140
- Releases: 13
Metadata Files
README.md
VirtualiZarr
Cloud-Optimize your Scientific Data as a Virtual Zarr Datacube, using Xarray syntax.
The best way to distribute large scientific datasets is via the Cloud, in Cloud-Optimized formats [^1]. But often this data is stuck in archival pre-Cloud file formats such as netCDF.
VirtualiZarr[^2] makes it easy to create "Virtual" Zarr datacubes, allowing performant access to archival data as if it were in the Cloud-Optimized Zarr format, without duplicating any data.
Please see the documentation.
Features
- Create virtual references pointing to bytes inside an archival file with `open_virtual_dataset`.
- Supports a range of archival file formats, including netCDF4 and HDF5, and has a pluggable system for supporting new formats.
- Access data via the zarr-python API by reading from the zarr-compatible `ManifestStore`.
- Combine data from multiple files into one larger datacube using xarray's combining functions, such as `xarray.concat`.
- Commit the virtual references to storage either using the Kerchunk references specification or the Icechunk transactional storage engine.
- Users access the virtual datacube simply as a single zarr-compatible store using `xarray.open_zarr`.
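To make the idea of a "virtual reference" concrete, here is a minimal conceptual sketch that uses only the Python standard library. It does not call VirtualiZarr's API: the `ChunkRef` type, the `build_manifest` and `read_chunk` helpers, and the fake archival file with a fixed-size header and uncompressed, equally sized chunks are all assumptions made for illustration. The point is only that a chunk can be described by a (path, offset, length) triple and read on demand without copying the original file.

```python
import os
import struct
import tempfile
from typing import NamedTuple


class ChunkRef(NamedTuple):
    """Hypothetical virtual reference: where the bytes of one chunk live."""
    path: str
    offset: int
    length: int


def build_manifest(path, header_size, chunk_nbytes, n_chunks):
    """Map chunk keys ("0", "1", ...) to byte ranges inside `path`.

    Assumes chunks are stored contiguously and uncompressed after the header.
    """
    return {
        str(i): ChunkRef(path, header_size + i * chunk_nbytes, chunk_nbytes)
        for i in range(n_chunks)
    }


def read_chunk(ref):
    """Fetch only the referenced byte range; the file is never duplicated."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Stand-in "archival" file: a 16-byte header followed by 12 float64 values,
# treated as 3 chunks of 4 values each.
header = b"FAKEARCHIVE".ljust(16, b"\x00")
values = [float(v) for v in range(12)]
archive_path = os.path.join(tempfile.mkdtemp(), "archive.bin")
with open(archive_path, "wb") as f:
    f.write(header)
    f.write(struct.pack("<12d", *values))

manifest = build_manifest(
    archive_path, header_size=len(header), chunk_nbytes=4 * 8, n_chunks=3
)

# Decode the second chunk straight from its byte range in the original file.
second_chunk = struct.unpack("<4d", read_chunk(manifest["1"]))
print(second_chunk)
```

Real archival formats such as netCDF4 and HDF5 store chunks at irregular offsets and often compress them, which is why a format-aware parser is needed to discover the byte ranges.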
Inspired by Kerchunk
VirtualiZarr grew out of discussions on the Kerchunk repository, and is an attempt to provide the game-changing power of Kerchunk in a Zarr-native way, with a familiar array-like API.
You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.
Development Status and Roadmap
VirtualiZarr version 1 (mostly) achieved feature parity with kerchunk's logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.
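For readers unfamiliar with what a Kerchunk reference file contains, the following is a rough sketch of the layout, assuming version 1 of the Kerchunk references specification with Zarr v2 metadata. It is not VirtualiZarr's or Kerchunk's writer: the `to_kerchunk_refs` helper, the variable name `temp`, the file name `archive.bin`, and the byte offsets are hypothetical. Metadata entries are stored as inline JSON strings, and each chunk key maps to a `[path, offset, length]` triple.

```python
import json


def to_kerchunk_refs(var_name, zarray, chunk_refs):
    """Assemble a Kerchunk-style reference dict for one array (illustrative).

    `chunk_refs` maps chunk keys such as "0" to (path, offset, length) tuples.
    """
    refs = {
        # Metadata documents are embedded as JSON-encoded strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        f"{var_name}/.zarray": json.dumps(zarray),
    }
    for key, (path, offset, length) in chunk_refs.items():
        # Chunk entries point at a byte range inside an existing file.
        refs[f"{var_name}/{key}"] = [path, offset, length]
    return {"version": 1, "refs": refs}


# Hypothetical 1-D float64 array of 12 values split into 3 chunks of 4.
zarray = {
    "zarr_format": 2,
    "shape": [12],
    "chunks": [4],
    "dtype": "<f8",
    "compressor": None,
    "filters": None,
    "fill_value": None,
    "order": "C",
}
chunk_refs = {
    "0": ("archive.bin", 16, 32),
    "1": ("archive.bin", 48, 32),
    "2": ("archive.bin", 80, 32),
}

references = to_kerchunk_refs("temp", zarray, chunk_refs)
reference_json = json.dumps(references, indent=2)
print(reference_json)
```

Because the result is plain JSON, it can be inspected, edited, or merged with ordinary tools before being written to disk.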
VirtualiZarr version 2 brings:
- Zarr v3 support
- A pluggable system of "parsers" for virtualizing custom file formats
- The `ManifestStore` abstraction, which allows for loading data without serializing to Kerchunk/Icechunk first
- Integration with `obstore`
- Reference parsing that doesn't rely on kerchunk under the hood
- The ability to use "parsers" to load data directly from archival file formats into Zarr and/or Xarray
Future VirtualiZarr development will focus on generalizing and upstreaming useful concepts into the Zarr specification, the Zarr-Python library, Xarray, and possibly some new packages.
We have a lot of ideas, including:
- Zarr-native on-disk chunk manifest format
- "Virtual concatenation" of separate Zarr arrays
- ManifestArrays as an intermediate layer in-memory in Zarr-Python
- Separating CF-related Codecs from xarray
If you see other opportunities then we would love to hear your ideas!
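As a rough picture of what "virtual concatenation" means, the sketch below joins two chunk manifests along a single axis by renumbering chunk keys, without touching any chunk data. This is a conceptual illustration only, not how VirtualiZarr or Zarr-Python implement it: the `concat_manifests` helper and the file names are hypothetical, and it assumes 1-D arrays whose chunk keys run contiguously from "0" and which share the same chunk shape, dtype, and codecs.

```python
def concat_manifests(manifests):
    """Concatenate 1-D chunk manifests by shifting chunk indices.

    Each manifest maps chunk keys ("0", "1", ...) to (path, offset, length).
    Only the keys change; the byte-range references are reused as they are.
    """
    combined = {}
    shift = 0
    for manifest in manifests:
        for key in sorted(manifest, key=int):
            combined[str(int(key) + shift)] = manifest[key]
        shift += len(manifest)
    return combined


# Two hypothetical files, each contributing two chunks.
january = {"0": ("jan.nc", 1024, 400), "1": ("jan.nc", 1424, 400)}
february = {"0": ("feb.nc", 2048, 400), "1": ("feb.nc", 2448, 400)}

year_to_date = concat_manifests([january, february])
print(year_to_date)
```

The combined manifest still points into the original files, so the concatenated array costs only a few extra dictionary entries.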
Talks and Presentations
- 2025/04/30 - Cloud-Native Geospatial Forum - Tom Nicholas - Slides / Recording
- 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - Slides
- 2024/11/13 - Cloud-Native Geospatial conference - Raphael Hagen - Slides
- 2024/07/24 - ESIP Meeting - Sean Harkins - Event / Recording
- 2024/05/15 - Pangeo showcase - Tom Nicholas - Event / Recording / Slides
Credits
This package was originally developed by Tom Nicholas whilst working at [C]Worthy, who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project.
Licence
Apache 2.0
References
Owner
- Name: Zarr Developers
- Login: zarr-developers
- Kind: organization
- Location: https://twitter.com/zarr_dev
- Website: https://zarr.dev
- Repositories: 32
- Profile: https://github.com/zarr-developers
Contributors to the Zarr open source project.
Citation (citation.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "VirtualiZarr"
abstract: "Create virtual Zarr stores for cloud-friendly access to archival data, using familiar xarray syntax."
license: Apache-2.0
repository-code: "https://github.com/zarr-developers/VirtualiZarr"
authors:
  - family-names: "Nicholas"
    given-names: "Thomas"
    orcid: "https://orcid.org/0000-0002-2176-0530"
  - family-names: "Hagen"
    given-names: "Norland"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Harkins"
    given-names: "Sean"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Barciauskas"
    given-names: "Aimee"
    orcid: "https://orcid.org/0000-0002-3158-9554"
  - family-names: "Jones"
    given-names: "Max"
    orcid: "https://orcid.org/0000-0003-0180-8928"
  - family-names: "Signell"
    given-names: "Julia"
    orcid: "https://orcid.org/0000-0002-4120-3192"
  - family-names: "Nag"
    given-names: "Ayush"
    orcid: "https://orcid.org/0009-0008-1790-597X"
  - family-names: "Hidalgo"
    given-names: "Gustavo"
    orcid: "https://orcid.org/0000-0000-0000-0000"
  - family-names: "Augspurger"
    given-names: "Tom"
    orcid: "https://orcid.org/0000-0002-8136-7087"
  - family-names: "Abernathey"
    given-names: "Ryan"
    orcid: "https://orcid.org/0000-0001-5999-4917"
Committers
Last synced: 10 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Tom Nicholas | t****m@c****g | 203 |
| Raphael Hagen | n****n@g****m | 37 |
| Max Jones | 1****s | 35 |
| pre-commit-ci[bot] | 6****] | 14 |
| Chuck Daniels | c****k@d****g | 13 |
| Aimee Barciauskas | a****e@d****g | 10 |
| Julia Signell | j****l@g****m | 8 |
| Sean Harkins | s****s@g****m | 7 |
| Ayush Nag | 3****g | 5 |
| rsignell | 1****l | 4 |
| Matthew Iannucci | m****w@e****o | 4 |
| Timothy Hodson | 3****s | 3 |
| Gustavo Hidalgo | g****o@m****m | 3 |
| Tom Augspurger | t****r@m****m | 3 |
| Julius Busecke | j****s@l****u | 2 |
| Justus Magin | k****s | 2 |
| Nathan Zimmerman | n****n@g****m | 2 |
| Scott Henderson | s****q@g****m | 2 |
| Doug Latornell | d****l@d****a | 2 |
| Ben Mares | s****1@t****m | 1 |
| Josh Moore | j****h@o****g | 1 |
| Keith Doore | 5****e | 1 |
| Michael Sumner | m****r@g****m | 1 |
| Shane Loeffler | S****c | 1 |
Packages
- Total packages: 1
- Total downloads:
  - pypi: 6,124 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 11
- Total maintainers: 3
pypi.org: virtualizarr
Create virtual Zarr stores from archival data using xarray API
- Homepage: https://github.com/zarr-developers/VirtualiZarr
- Documentation: https://github.com/zarr-developers/VirtualiZarr/blob/main/README.md
- License: apache-2.0
- Latest release: 2.1.2 (published 6 months ago)
Dependencies
- kerchunk *
- packaging *
- pydantic *
- xarray *
- actions/checkout v4 composite
- codecov/codecov-action v3.1.4 composite
- mamba-org/provision-with-micromamba main composite
- black
- codecov
- flake8
- h5netcdf
- kerchunk
- netcdf4
- pydantic
- pytest
- pytest-cov
- python >=3.9
- ujson