kerchunk-cookbook
Project Pythia cookbook for Kerchunk
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary
Repository
Project Pythia cookbook for Kerchunk
Basic Info
- Host: GitHub
- Owner: ProjectPythia
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://projectpythia.org/kerchunk-cookbook/
- Size: 80 MB
Statistics
- Stars: 18
- Watchers: 2
- Forks: 15
- Open Issues: 6
- Releases: 1
Metadata Files
README.md
Virtual Zarr Cookbook (Kerchunk and VirtualiZarr)
This Project Pythia Cookbook covers using the Kerchunk, VirtualiZarr, and Zarr-Python libraries to access archival data formats as if they were ARCO (Analysis-Ready Cloud-Optimized) data.
Motivation
The Kerchunk library pioneered accessing chunked and compressed
data formats (such as NetCDF3, HDF5, GRIB2, TIFF, and FITS), which
are the primary formats of many data archives, as if they were in
ARCO formats such as Zarr, which allows parallel, chunk-specific
access. Instead of creating a new copy of the dataset in the Zarr
format, Kerchunk reads through the data archive, extracts the byte
range and compression information of each chunk, and writes that
information to a "virtual Zarr store" in the form of a JSON or
Parquet "reference file". The VirtualiZarr library provides a simple
way to create these "virtual stores" using familiar Xarray syntax.
Lastly, the Icechunk library provides a new way to store and reuse
these references. These virtual Zarr stores can be reused and read
via Zarr and Xarray.
For more details on how this process works, please see the Kerchunk docs.
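To make the idea of a reference file concrete, here is a minimal sketch using only the Python standard library. It is not the Kerchunk API: the toy file layout, the `data/<i>` key names, and the `read_chunk` helper are all assumptions made for illustration. The `[path, offset, length]` triples are modeled on the entries of a Kerchunk version 1 reference file, and `zlib` stands in for whatever compression the archival format uses.

```python
import json
import os
import tempfile
import zlib

# Two hypothetical "chunks" of array data, compressed independently,
# standing in for the chunks inside an archival file.
chunks = [bytes(range(0, 8)), bytes(range(8, 16))]
compressed = [zlib.compress(c) for c in chunks]

# Write a toy "archival file": a header followed by the compressed chunks.
header = b"TOYFORMAT"
path = os.path.join(tempfile.mkdtemp(), "archive.bin")
with open(path, "wb") as f:
    f.write(header)
    for blob in compressed:
        f.write(blob)

# Scan step: record where each chunk lives instead of copying the data.
refs = {}
offset = len(header)
for i, blob in enumerate(compressed):
    refs[f"data/{i}"] = [path, offset, len(blob)]
    offset += len(blob)
reference_file = json.dumps({"version": 1, "refs": refs})


def read_chunk(reference_json: str, key: str) -> bytes:
    """Fetch one chunk by its byte range and decompress it."""
    url, start, length = json.loads(reference_json)["refs"][key]
    with open(url, "rb") as f:
        f.seek(start)
        return zlib.decompress(f.read(length))


print(list(read_chunk(reference_file, "data/1")))
```

Only the small JSON document is new. The original file is left untouched, and each chunk can be fetched independently, which is what makes parallel, chunk-specific access possible.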
Authors
Much of the content of this cookbook was inspired by
Martin Durant, the creator of Kerchunk, and by the
Kerchunk documentation.
Contributors
Structure
This cookbook is broken up into three sections: Foundations, Generating Virtual Zarr Stores, and Using Virtual Zarr Stores.
Section 1 - Foundations
In the Foundations section we will demonstrate
how to use Kerchunk and VirtualiZarr to create reference files
from single file sources, as well as to create
multi-file virtual Zarr stores from collections of files.
Section 2 - Generating Virtual Zarr Stores
The notebooks in the Generating Virtual Zarr Stores section
demonstrate how to use Kerchunk and VirtualiZarr to create
datasets for all the supported file formats.
These libraries currently support virtualizing NetCDF3,
NetCDF4/HDF5, GRIB2, and TIFF (including COG).
Section 3 - Using Virtual Zarr Stores
The Using Virtual Zarr Stores section contains notebooks demonstrating how to load existing references into Xarray, generate coordinates for GeoTIFFs using xrefcoord, and plot using hvPlot and Datashader.
Running the Notebooks
You can run the notebooks either using Binder
or on your local machine.
Running on Binder
The simplest way to interact with a Jupyter Notebook is through
Binder, which enables the execution of a
Jupyter Book in the cloud. The details of how this works are not
important for now. All you need to know is how to launch a Pythia
Cookbook chapter via Binder. Move your mouse to the top right
corner of the book chapter you are viewing, click the rocket ship
icon, and select “launch Binder”. After a moment you should be
presented with a notebook that you can interact with. You’ll be
able to execute and even change the example programs. The code
cells have no output at first, until you execute them by pressing
Shift+Enter. Complete details on how to interact with
a live Jupyter notebook are described in Getting Started with
Jupyter.
Running on Your Own Machine
If you are interested in running this material locally on your computer, you will need to follow this workflow:
- Install mambaforge/mamba
- Clone the https://github.com/ProjectPythia/kerchunk-cookbook repository:

      git clone https://github.com/ProjectPythia/kerchunk-cookbook.git

- Move into the kerchunk-cookbook directory:

      cd kerchunk-cookbook

- Create and activate your conda environment from the environment.yml file. Note: In the environment.yml file, Kerchunk is currently being installed from source as development is happening rapidly.

      mamba env create -f environment.yml
      mamba activate kerchunk-cookbook

- Move into the notebooks directory and start up JupyterLab:

      cd notebooks/
      jupyter lab
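Once the environment is activated, a quick sanity check like the one below can confirm that the core libraries are importable before you open the notebooks. This is only a sketch: the `missing_packages` helper is hypothetical, and the package list is an assumption drawn from the project's dependencies, so adjust it to match your environment.yml.

```python
from importlib.util import find_spec

# Assumed import names for a few core dependencies of the cookbook.
REQUIRED = ["kerchunk", "xarray", "zarr", "fsspec"]


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, recreating the environment from environment.yml is usually the simplest fix.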
Owner
- Name: Project Pythia
- Login: ProjectPythia
- Kind: organization
- Email: projectpythia@ucar.edu
- Location: United States of America
- Website: projectpythia.org
- Twitter: Project_Pythia
- Repositories: 21
- Profile: https://github.com/ProjectPythia
Community learning resource for Python-based computing in the geosciences
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this cookbook, please cite it as below."
authors:
  # add additional entries for each author -- see https://github.com/citation-file-format/citation-file-format/blob/main/schema-guide.md
  - family-names: Hagen
    given-names: Norland Raphael
    website: https://github.com/norlandrhagen
    orcid: https://orcid.org/0000-0003-1994-1153
  - name: "Kerchunk Cookbook contributors" # use the 'name' field to acknowledge organizations
    website: "https://github.com/ProjectPythia/kerchunk-cookbook/graphs/contributors"
title: "Kerchunk Cookbook"
abstract: "Kerchunk provides cloud-friendly access to archival data. With Kerchunk you can read collections of legacy file formats (NetCDF, GRIB2 etc.) as if they were ARCO (Analysis-Ready Cloud-Optimized) formats such as Zarr, without creating a copy of the original dataset."
GitHub Events
Total
- Issues event: 7
- Watch event: 6
- Delete event: 2
- Issue comment event: 12
- Push event: 73
- Pull request event: 10
- Fork event: 1
- Create event: 3
Last Year
- Issues event: 7
- Watch event: 6
- Delete event: 2
- Issue comment event: 12
- Push event: 73
- Pull request event: 10
- Fork event: 1
- Create event: 3
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 9 months
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.67
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: about 8 hours
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.5
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- maxrjones (2)
- norlandrhagen (1)
- erogluorhan (1)
- brian-rose (1)
- raybellwaves (1)
Pull Request Authors
- maxrjones (4)
- jukent (4)
- raybellwaves (4)
- Anu-Ra-g (3)
- rsignell (1)
- ahuang11 (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout v4 composite
- jacobtomlinson/gha-find-replace v3 composite
- stefanzweifel/git-auto-commit-action v4 composite
- cfgrib
- dask
- dask-labextension
- datashader
- distributed
- fastparquet
- fsspec
- git
- h5netcdf
- h5py
- hvplot
- ipywidgets
- jupyter-book
- jupyter_bokeh
- jupyter_server
- jupyterlab
- mamba
- matplotlib
- netcdf4
- numcodecs
- numpy
- pandas
- pip
- pooch
- pre-commit
- pyopenssl
- python 3.10.*
- rioxarray
- s3fs
- scipy
- sphinx-pythia-theme
- tifffile
- ujson
- xarray
- xarray-datatree
- zarr