kerchunk-cookbook

Project Pythia cookbook for Kerchunk

https://github.com/projectpythia/kerchunk-cookbook

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: ProjectPythia
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Homepage: https://projectpythia.org/kerchunk-cookbook/
Size: 80 MB

Statistics

Stars: 18
Watchers: 2
Forks: 15
Open Issues: 6
Releases: 1

Created over 3 years ago · Last pushed 10 months ago

Metadata Files

Virtual Zarr Cookbook (Kerchunk and VirtualiZarr)

This Project Pythia Cookbook covers using the Kerchunk, VirtualiZarr, and Zarr-Python libraries to access archival data formats as if they were ARCO (Analysis-Ready Cloud-Optimized) data.

Motivation

The Kerchunk library pioneered accessing chunked and compressed data formats (such as NetCDF3, HDF5, GRIB2, TIFF, and FITS), which are the primary formats of many data archives, as if they were ARCO formats such as Zarr, which allows parallel, chunk-specific access. Instead of creating a new copy of the dataset in the Zarr format, Kerchunk reads through the data archive, extracts the byte range and compression information of each chunk, and writes that information to a "virtual Zarr store" using a JSON or Parquet "reference file". The VirtualiZarr library provides a simple way to create these virtual stores using familiar Xarray syntax. Lastly, the Icechunk library provides a new way to store and re-use these references.

These virtual Zarr stores can be re-used and read via Zarr and Xarray.

For more details on how this process works, please see the Kerchunk documentation.
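The byte-range idea described above can be illustrated with a minimal, standard-library-only sketch. It is not Kerchunk's implementation: the toy archive, the `data/<i>` keys, and the helper names `write_archive` and `read_chunk` are hypothetical, and the offsets are recorded while the toy file is written, whereas Kerchunk discovers them by scanning an existing file. The `{"version": 1, "refs": {...}}` layout with `[url, offset, length]` entries is assumed to mirror the general shape of a JSON reference file.

```python
import json
import os
import tempfile
import zlib


def write_archive(path, chunks):
    """Write compressed chunks back to back and record where each one lands."""
    refs = {}
    offset = 0
    with open(path, "wb") as f:
        for i, raw in enumerate(chunks):
            blob = zlib.compress(raw)
            f.write(blob)
            # Reference entry in the assumed [url, offset, length] form.
            refs[f"data/{i}"] = [path, offset, len(blob)]
            offset += len(blob)
    return refs


def read_chunk(refs, key):
    """Fetch one chunk using only its byte range, then decompress it."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(length))


workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "archive.bin")  # hypothetical toy archive
chunks = [b"first chunk " * 4, b"second chunk " * 4]
refs = write_archive(archive_path, chunks)

# Persist the references as a small JSON "reference file".
ref_path = os.path.join(workdir, "refs.json")
with open(ref_path, "w") as f:
    json.dump({"version": 1, "refs": refs}, f)

# Later: reload the references and read a single chunk without
# touching the rest of the archive.
with open(ref_path) as f:
    loaded = json.load(f)["refs"]

second = read_chunk(loaded, "data/1")
```

The archive itself is never copied or rewritten; only the small JSON index is new, which is the property that makes a virtual store cheap to create.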

Authors

Raphael Hagen

Much of the content of this cookbook was inspired by the Kerchunk documentation and by Martin Durant, the creator of Kerchunk.

Structure

This cookbook is broken up into three sections: Foundations, Generating Virtual Zarr Stores, and Using Virtual Zarr Stores.

Section 1 - Foundations

In the Foundations section we will demonstrate how to use Kerchunk and VirtualiZarr to create reference files from single file sources, as well as to create multi-file virtual Zarr stores from collections of files.
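To give a feel for what combining single-file references into a multi-file virtual store involves, here is a conceptual sketch using only the standard library. It is not the Kerchunk or VirtualiZarr API: the function names, the file names `day1.nc` and `day2.nc`, the variable `temp`, and the byte offsets are all made up for illustration. It assumes one-dimensional variables, identical chunk sizes, and array metadata stored as a JSON string under a `.zarray` key.

```python
import json


def fake_file_refs(url, n_chunks, chunk_len=10, chunk_bytes=80):
    """Build made-up references for one file holding a 1-D variable 'temp'."""
    zarray = {
        "shape": [n_chunks * chunk_len],
        "chunks": [chunk_len],
        "dtype": "<f8",
        "compressor": None,
        "fill_value": None,
        "filters": None,
        "order": "C",
        "zarr_format": 2,
    }
    refs = {"temp/.zarray": json.dumps(zarray)}
    for i in range(n_chunks):
        refs[f"temp/{i}"] = [url, i * chunk_bytes, chunk_bytes]
    return refs


def combine_1d(per_file_refs, var):
    """Concatenate per-file references by renumbering chunk keys along axis 0."""
    if not per_file_refs:
        raise ValueError("need at least one set of references")
    combined = {}
    meta = None
    total = 0
    next_index = 0
    for refs in per_file_refs:
        zarray = json.loads(refs[f"{var}/.zarray"])
        chunk_len = zarray["chunks"][0]
        length = zarray["shape"][0]
        if meta is None:
            meta = zarray
        elif chunk_len != meta["chunks"][0]:
            raise ValueError("chunk sizes differ between files")
        if length % chunk_len:
            raise ValueError("each file must hold a whole number of chunks")
        for i in range(length // chunk_len):
            # The byte range still points at the original file.
            combined[f"{var}/{next_index}"] = refs[f"{var}/{i}"]
            next_index += 1
        total += length
    # Only the array metadata changes: the shape grows along axis 0.
    combined[f"{var}/.zarray"] = json.dumps({**meta, "shape": [total]})
    return {"version": 1, "refs": combined}


files = [fake_file_refs("day1.nc", 2), fake_file_refs("day2.nc", 3)]
store = combine_1d(files, "temp")
```

The combined store has five chunk keys, `temp/0` through `temp/4`, and each still points at a byte range in one of the two original files. The real libraries additionally handle coordinates, multiple dimensions, and consistency checks that this sketch leaves out.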

Section 2 - Generating Virtual Zarr Stores

The notebooks in the Generating Virtual Zarr Stores section demonstrate how to use Kerchunk and VirtualiZarr to create datasets for all the supported file formats. These libraries currently support virtualizing NetCDF3, NetCDF4/HDF5, GRIB2, and TIFF (including COG).

Section 3 - Using Virtual Zarr Stores

The Using Virtual Zarr Stores section contains notebooks demonstrating how to load existing references into Xarray, generate coordinates for GeoTIFFs using xrefcoord, and plot using hvPlot and Datashader.
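As a rough sketch of what loading existing references into Xarray can look like, the snippet below uses the `reference://` filesystem from fsspec together with Xarray's Zarr engine. The file name `combined.json` and both helper names are hypothetical, and the exact pattern depends on the installed versions of Xarray, fsspec, and Zarr-Python; newer releases may require a different entry point, so treat this as an assumption to check against the notebooks rather than a fixed recipe.

```python
def reference_storage_options(ref_path, remote_protocol="file"):
    """Options handed to fsspec's reference filesystem.

    ref_path: path or URL of the JSON reference file.
    remote_protocol: where the original data files live ("file", "s3", ...).
    """
    return {"fo": ref_path, "remote_protocol": remote_protocol}


def open_reference_store(ref_path, remote_protocol="file"):
    """Open a virtual Zarr store described by a reference file as an xarray Dataset."""
    import xarray as xr  # imported lazily so the helper above works without xarray

    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": reference_storage_options(ref_path, remote_protocol),
        },
    )


# Example call (requires xarray, fsspec, and zarr, plus a real reference file):
# ds = open_reference_store("combined.json", remote_protocol="s3")
```

Keeping the storage options in their own function makes it easy to reuse the same settings when opening the store through fsspec directly.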

Running the Notebooks

You can run the notebooks either using Binder or on your local machine.

Running on Binder

The simplest way to interact with a Jupyter Notebook is through Binder, which enables the execution of a Jupyter Book in the cloud. The details of how this works are not important for now. All you need to know is how to launch a Pythia Cookbooks chapter via Binder. Navigate to the top right corner of the book chapter you are viewing, click the rocket ship icon, and select “launch Binder”. After a moment you should be presented with a notebook that you can interact with. You’ll be able to execute and even change the example programs. The code cells have no output at first, until you execute them by pressing Shift+Enter. Complete details on how to interact with a live Jupyter notebook are described in Getting Started with Jupyter.

Running on Your Own Machine

If you are interested in running this material locally on your computer, you will need to follow this workflow:

1. Install mambaforge/mamba.
2. Clone the https://github.com/ProjectPythia/kerchunk-cookbook repository:

       git clone https://github.com/ProjectPythia/kerchunk-cookbook.git

3. Move into the kerchunk-cookbook directory:

       cd kerchunk-cookbook

4. Create and activate your conda environment from the environment.yml file. Note: in the environment.yml file, Kerchunk is currently installed from source because development is happening rapidly.

       mamba env create -f environment.yml
       mamba activate kerchunk-cookbook

5. Move into the notebooks directory and start JupyterLab:

       cd notebooks/
       jupyter lab
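After activating the environment, a quick sanity check such as the following can confirm that the main libraries are importable before opening the notebooks. This is an optional sketch and not part of the cookbook; the helper name `missing_packages` and the list of import names are assumptions based on the libraries mentioned in this README.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


# Assumed import names for the core libraries used by the notebooks.
required = ["kerchunk", "xarray", "zarr", "fsspec", "dask"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core packages are importable.")
```

If anything is reported missing, re-creating the environment from environment.yml is usually the simplest fix.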

Owner

Name: Project Pythia
Login: ProjectPythia
Kind: organization
Email: projectpythia@ucar.edu
Location: United States of America

Website: projectpythia.org
Twitter: Project_Pythia
Repositories: 21
Profile: https://github.com/ProjectPythia

Community learning resource for Python-based computing in the geosciences

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this cookbook, please cite it as below."
authors:
  # add additional entries for each author -- see https://github.com/citation-file-format/citation-file-format/blob/main/schema-guide.md
  - family-names: Hagen
    given-names: Norland Raphael
    website: https://github.com/norlandrhagen
    orcid: https://orcid.org/0000-0003-1994-1153
  - name: "Kerchunk Cookbook contributors" # use the 'name' field to acknowledge organizations
    website: "https://github.com/ProjectPythia/kerchunk-cookbook/graphs/contributors"
title: "Kerchunk Cookbook"
abstract: "Kerchunk provides cloud-friendly access to archival data. With Kerchunk you can read collections of legacy file formats (NetCDF, GRIB2 etc.) as if they were ARCO (Analysis-Ready Cloud-Optimized) formats such as Zarr, without creating a copy of the original dataset."

GitHub Events

Total

Issues event: 7
Watch event: 6
Delete event: 2
Issue comment event: 12
Push event: 73
Pull request event: 10
Fork event: 1
Create event: 3

Last Year

Issues event: 7
Watch event: 6
Delete event: 2
Issue comment event: 12
Push event: 73
Pull request event: 10
Fork event: 1
Create event: 3

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: 9 months
Total issue authors: 0
Total pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.67
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: about 8 hours
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.5
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

maxrjones (2)
norlandrhagen (1)
erogluorhan (1)
brian-rose (1)
raybellwaves (1)

Pull Request Authors

maxrjones (4)
jukent (4)
raybellwaves (4)
Anu-Ra-g (3)
rsignell (1)
ahuang11 (1)

Top Labels

Issue Labels

bug (3)
content (2)
infrastructure (1)

Pull Request Labels

Dependencies

.github/workflows/nightly-build.yaml actions

.github/workflows/publish-book.yaml actions

.github/workflows/trigger-book-build.yaml actions

.github/workflows/trigger-delete-preview.yaml actions

.github/workflows/trigger-link-check.yaml actions

.github/workflows/trigger-preview.yaml actions

.github/workflows/trigger-replace-links.yaml actions

actions/checkout v4 composite
jacobtomlinson/gha-find-replace v3 composite
stefanzweifel/git-auto-commit-action v4 composite

environment.yml conda

cfgrib
dask
dask-labextension
datashader
distributed
fastparquet
fsspec
git
h5netcdf
h5py
hvplot
ipywidgets
jupyter-book
jupyter_bokeh
jupyter_server
jupyterlab
mamba
matplotlib
netcdf4
numcodecs
numpy
pandas
pip
pooch
pre-commit
pyopenssl
python 3.10.*
rioxarray
s3fs
scipy
sphinx-pythia-theme
tifffile
ujson
xarray
xarray-datatree
zarr
