esgf-virtual-aggregation

https://github.com/zequihg50/esgf-virtual-aggregation

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.0%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: zequihg50
Language: Jupyter Notebook
Default Branch: master
Size: 33.6 MB

Statistics

Stars: 3
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 3

Created over 5 years ago · Last pushed about 1 year ago

Metadata Files

Readme Citation

ESGF Virtual Aggregation

Remote data access to Virtual Analysis Ready Data (Virtual ARD) for climate datasets of the ESGF.

Run the demo, check this Pangeo Showcase, or see "Run your own ESGF Virtual Aggregation".

Important: The ESGF Virtual Aggregation depends on ESGF data nodes being available. This is not the case half of the time, so expect errors when trying to load datasets. Check the status of ESGF data nodes here.

Rationale

The ESGF is a federated file distribution service for climate data. Remote data access and virtual datasets are possible through OPeNDAP and netCDF-java, which are available by default in all ESGF nodes. However, these capabilities have never been used. The ESGF Virtual Aggregation builds on them to provide:

Analysis Ready Data (ARD) in the form of virtual datasets, so no data duplication is needed.
Remote data access without the need to download files. Open a URL and get direct access to an analytical data cube.

Run your own ESGF Virtual Aggregation

Building the ESGF Virtual Aggregation involves two steps:

Query the ESGF federation for metadata and store it in a local SQL database.
Generate virtual aggregations (NcMLs) from the SQL database.

ESGF Virtual Aggregation is fully customizable via selection files. See the sample file selection-sample.

The following code generates the metadata SQL database from the selection-sample file.

```bash
python search.py -d sample.db -s selection-sample
```
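As a rough illustration of what this metadata step amounts to, the sketch below builds an ESGF search query URL and stores file records in a local SQLite database. It is not the project's `search.py`: the index node host, the `files` table layout, the helper names (`build_search_url`, `opendap_url`, `store_docs`) and the sample record are all assumptions made for illustration, and the real script may query and store different fields. No network request is made here.

```python
import sqlite3
from urllib.parse import urlencode

# Hypothetical host; substitute a real ESGF index node.
SEARCH_ENDPOINT = "https://esgf-node.example.org/esg-search/search"


def build_search_url(facets, limit=100, offset=0):
    """Build a query URL for File records returned as Solr JSON."""
    params = {
        "type": "File",
        "format": "application/solr+json",
        "limit": limit,
        "offset": offset,
        **facets,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def opendap_url(doc):
    """Return the OPeNDAP endpoint from a record's 'url' list, or None.

    Assumes each entry is shaped like 'URL|mime type|service name'.
    """
    for entry in doc.get("url", []):
        parts = entry.split("|")
        if len(parts) == 3 and parts[2].upper() == "OPENDAP":
            url = parts[0]
            # Drop the '.html' suffix of the OPeNDAP form page, if present.
            return url[:-5] if url.endswith(".html") else url
    return None


def store_docs(conn, docs):
    """Insert file records into a hypothetical 'files' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "id TEXT PRIMARY KEY, dataset_id TEXT, data_node TEXT, "
        "size INTEGER, opendap TEXT)"
    )
    rows = [
        (d["id"], d.get("dataset_id"), d.get("data_node"), d.get("size"), opendap_url(d))
        for d in docs
    ]
    conn.executemany("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up record standing in for one entry of a search response.
sample_docs = [
    {
        "id": "example.tas_day.nc|esgf.example.org",
        "dataset_id": "example.dataset.v1|esgf.example.org",
        "data_node": "esgf.example.org",
        "size": 1024,
        "url": [
            "https://esgf.example.org/thredds/fileServer/tas_day.nc|application/netcdf|HTTPServer",
            "https://esgf.example.org/thredds/dodsC/tas_day.nc.html|application/opendap-html|OPENDAP",
        ],
    }
]

conn = sqlite3.connect(":memory:")
stored = store_docs(conn, sample_docs)
query_url = build_search_url({"project": "CMIP6", "variable_id": "tas"})
```

Keeping the OPeNDAP URL alongside each file record is what would later allow the aggregation step to reference remote files without downloading them.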

Now, generate the virtual aggregations (both esgf_dataset and esgf_ensemble) from the database using 4 parallel jobs.

```bash
python ncmls.py -j4 --database sample.db -p esgf_ensemble
```

The virtual aggregations are NcML files. To read them, use a client based on netCDF-java, or set up a THREDDS Data Server (TDS) and read them via OPeNDAP, as described in the next section.
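For readers unfamiliar with NcML, the following sketch shows the general shape of a virtual aggregation: an XML document that lists remote files and joins them along an existing dimension. It is only an illustration under stated assumptions. The `build_ncml` helper and the example locations are hypothetical, and the files written by `ncmls.py` may be structured differently and carry more metadata.

```python
import xml.etree.ElementTree as ET

# Namespace used by NcML 2.2 documents.
NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def build_ncml(locations, dim_name="time"):
    """Return an NcML string that joins the given files along dim_name."""
    ET.register_namespace("", NCML_NS)
    root = ET.Element(f"{{{NCML_NS}}}netcdf")
    agg = ET.SubElement(
        root,
        f"{{{NCML_NS}}}aggregation",
        {"dimName": dim_name, "type": "joinExisting"},
    )
    # One nested <netcdf> element per member file, in the order given.
    for location in locations:
        ET.SubElement(agg, f"{{{NCML_NS}}}netcdf", {"location": location})
    return ET.tostring(root, encoding="unicode")


# Hypothetical OPeNDAP locations of two consecutive time slices.
ncml = build_ncml(
    [
        "https://esgf.example.org/thredds/dodsC/tas_day_1850-1899.nc",
        "https://esgf.example.org/thredds/dodsC/tas_day_1900-1949.nc",
    ]
)
print(ncml)
```

Because the document only references the member files, nothing is copied: a netCDF-java based reader resolves the locations when the aggregation is opened.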

Run your own server

A THREDDS Data Server (TDS) with access to the ESGF Virtual Aggregation datasets is available at https://hub.ipcc.ifca.es/thredds.

You may deploy your own THREDDS Data Server and perform remote data analysis on the ESGF Virtual Aggregation dataset.

```bash
docker run -p 8080:8080 -v $(pwd)/content:/usr/local/tomcat/content/thredds unidata/thredds-docker:5.0-beta7
```

Now, visit http://localhost:8080/thredds and browse the server's catalog. You can download the NcML from the HTTPServer endpoint or use the OPeNDAP service to get the OPeNDAP URL (it should look like http://localhost:8080/thredds/dodsC/...).
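If you prefer to build these URLs in a script rather than copy them from the catalog page, a small helper along the following lines may be enough. It is a sketch that assumes the server exposes its services under the default `dodsC` and `fileServer` paths. The `tds_urls` function and the `example.ncml` file name are hypothetical, and a server with a customized catalog may use other paths.

```python
from urllib.parse import quote


def tds_urls(base, dataset_path):
    """Map a dataset path to its OPeNDAP and HTTPServer URLs on a TDS."""
    base = base.rstrip("/")
    # Percent-encode unusual characters but keep the path separators.
    path = quote(dataset_path.lstrip("/"))
    return {
        "opendap": f"{base}/dodsC/{path}",
        "http": f"{base}/fileServer/{path}",
    }


urls = tds_urls("http://localhost:8080/thredds", "esgeva/demo/example.ncml")
print(urls["opendap"])
print(urls["http"])
```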

The OPeNDAP service can be used to perform remote data analysis with xarray.

```python
import dask
import xarray

dask.config.set(scheduler="processes")

url = "http://localhost:8080/thredds/dodsC/esgeva/demo/CMIP6_CMIP_AS-RCEC_TaiESM1_historical_day_tas_gn_v20200626_esgf.ceda.ac.uk.ncml"
ds = xarray.open_dataset(url).chunk({"time": 100})

# query the size of the dataset on the server side
ds.attrs["size_human"]

# view the variant_label coordinate
ds["variant_label"][...].compute()

# compute spatial mean for all variant_labels
# this involves transferring the necessary data from the server
means = ds["tas"].mean(["lat", "lon"]).compute()
means
```

See the notebooks for usage and reproducibility.

Owner

Name: Ezequiel Cimadevilla Alvarez
Login: zequihg50
Kind: user

Repositories: 32
Profile: https://github.com/zequihg50

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Cimadevilla"
  given-names: "Ezequiel"
  orcid: "https://orcid.org/0000-0002-8437-2068"
- family-names: "Lawrence"
  given-names: "Bryan"
  orcid: "https://orcid.org/0000-0001-9262-7860"
- family-names: "S. Cofiño"
  given-names: "Antonio"
  orcid: "https://orcid.org/0000-0001-7719-979X"
title: "The ESGF Virtual Aggregation"
version: 20240125
date-released: 2024-11-22
url: "https://github.com/zequihg50/esgf-virtual-aggregation"

GitHub Events

Total

Release event: 1
Delete event: 1
Push event: 6
Create event: 3

Last Year

Release event: 1
Delete event: 1
Push event: 6
Create event: 3

Issues and Pull Requests

Last synced: about 2 years ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: 6 days
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 3.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: 6 days
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 3.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Yefee (1)


Dependencies

environment.yml conda

cartopy
cf_xarray
dask
hvplot
jinja2
matplotlib
nc-time-axis
netcdf4
pandas
requests
s3fs
seaborn
sqlite
xarray
zarr
