ddc-ar6-cmip6-data-archival

This github repository documents the CMIP6 input dataset archival process at the IPCC DDC Partner DKRZ.

https://github.com/martinast/ddc-ar6-cmip6-data-archival

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 5 DOI reference(s) in README
✓ Academic publication links: Links to: zenodo.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (9.0%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository


Basic Info

Host: GitHub
Owner: MartinaSt
License: mit
Language: Jupyter Notebook
Default Branch: main
Size: 11.7 MB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 5

Created about 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

README.md

This GitHub repository documents the CMIP6 input dataset archival process at the IPCC DDC Partner DKRZ. Archived_Data gives an overview of the available AR6 data.

DDC-AR6-CMIP6-Data-Archival

This documentation is part of the enhanced transparency introduced into the AR6 through the IPCC FAIR Guidelines (Pirani et al., 2022, https://doi.org/10.5281/zenodo.6504468). The WGI AR6 TSU provided the CMIP6 input dataset lists compiled by the WGI AR6 chapters, which are available at https://drive.google.com/drive/u/0/folders/1oqMdqGTOId-oMn82WzmZrloEYsF-sk. These lists are known to be incomplete. Therefore, the TSU lists were merged with the datasets requested by WGI authors at the start of the Sixth Assessment Cycle (https://goo.gl/tVaGko) to obtain the final list of CMIP6 datasets that was added as the DDC AR6 Reference Data Archive. The WGI-requested variable list was also used as the source for defining the CMIP6 data subset disseminated as part of the Copernicus CDS (https://cds.climate.copernicus.eu/cdsapp#!/dataset/projections-cmip6).

The CMIP6 input dataset lists provided by the WGI TSU contain abbreviated information on figure usage. This information was used to add references/links to the figure/final datasets or to the figure webpages of the IPCC WGI AR6 to the metadata.

Finally, regional subsets of selected core variables were added to the archive in support of developing countries with low internet bandwidth.

Processing steps for the CMIP6 input dataset list

1. Tidy up the received chapter lists, turning them into valid JSON and harmonizing the various representations of missing values:
- usage: ./tidy.py <input dir>
- input directory containing the downloaded JSON chapter files: input20220125
- output directory containing the tidied JSON chapter files: input20220125_tidy
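The harmonization of missing values in step 1 could look roughly like the following. This is a minimal Python 3 sketch, not the code in tidy.py: the set of placeholder strings, the function name `harmonize_missing`, and the sample record are assumptions made for illustration.

```python
import json

# Hypothetical placeholder spellings; the real chapter lists may use others.
MISSING_MARKERS = {"", "n/a", "na", "none", "null", "-", "missing"}


def harmonize_missing(value):
    """Recursively replace placeholder strings for missing values with None."""
    if isinstance(value, dict):
        return {key: harmonize_missing(item) for key, item in value.items()}
    if isinstance(value, list):
        return [harmonize_missing(item) for item in value]
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return None
    return value


# Hypothetical record, not taken from the actual chapter lists.
raw = '{"dataset": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-LR", "version": "N/A", "figures": ["", "3.4"]}'
tidied = harmonize_missing(json.loads(raw))
print(json.dumps(tidied))
```

Mapping every placeholder to a single value (here `None`, serialized as JSON `null`) means later steps only have to test for one representation of "missing".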

2. Correct obvious DRS errors in the CMIP6 dataset lists:
- usage: ./CMIP6correct.py <input dir>
- input directory containing the tidied JSON chapter files: input20220125_tidy
- configuration file containing the correction rules: CMIP6correct.conf
- output directory containing the corrected JSON chapter files, a list of the remaining 23016 CMIP6 datasets (including duplicates) and a list of 88 non-correctable datasets: input20220125tidycorrect
- log file: log/CMIP6correct2022-07-11.log
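To illustrate what an "obvious DRS error" can look like, the sketch below checks a dataset identifier against the facet order of the CMIP6 DRS. It is only an assumption about the kind of test involved: the actual correction rules live in CMIP6correct.conf and are not reproduced here, and the function name `check_drs`, the two regular expressions, and the example identifiers are illustrative.

```python
import re

# Facets of a CMIP6 dataset identifier in DRS order. The trailing version
# is treated as optional because some list entries lack it.
FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label",
]
MEMBER_RE = re.compile(r"^(s\d{4}-)?r\d+i\d+p\d+f\d+$")
VERSION_RE = re.compile(r"^v\d{8}$")


def check_drs(dataset_id):
    """Return a list of human-readable problems; empty if none were found."""
    parts = dataset_id.strip().split(".")
    if len(parts) not in (len(FACETS), len(FACETS) + 1):
        return ["expected 9 or 10 dot-separated components, got %d" % len(parts)]
    problems = []
    facets = dict(zip(FACETS, parts))
    if facets["mip_era"] != "CMIP6":
        problems.append("mip_era is not CMIP6")
    if not MEMBER_RE.match(facets["member_id"]):
        problems.append("member_id does not look like r<N>i<N>p<N>f<N>")
    if len(parts) == len(FACETS) + 1 and not VERSION_RE.match(parts[-1]):
        problems.append("version does not look like vYYYYMMDD")
    return problems


# Illustrative identifiers only.
print(check_drs("CMIP6.CMIP.MPI-M.MPI-ESM1-2-LR.historical.r1i1p1f1.Amon.tas.gn.v20190710"))
print(check_drs("CMIP6.CMIP.MPI-M.MPI-ESM1-2-LR.historical.r1i1p1.Amon.tas.gn"))
```

A check like this only flags structural problems; whether a flagged entry can be corrected automatically or ends up among the non-correctable datasets depends on the rule set.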

3. Merge the JSON chapter lists into a single list and separate MPI-GE datasets from CMIP6 datasets:

The correction was an iterative process. The final version of the corrected and merged CMIP6 input dataset list was created on 2022-07-11. It contains 18909 CMIP6 input datasets.
- usage: ./compileList.py <input dir> <cmip6|cordex|cmip5>
- input directory with the corrected JSON chapter files: input20220125tidycorrect
- output directory for the merged file: output
- log file: log/compileList2022-07-11.log
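The merge in step 3 can be pictured as collapsing the per-chapter lists into one mapping keyed by dataset identifier. The sketch below is not compileList.py: the input layout, the function name `merge_chapter_lists`, the sample identifiers, and in particular the rule that CMIP6 entries are recognized by a `CMIP6.` prefix are assumptions for illustration.

```python
from collections import defaultdict


def merge_chapter_lists(chapter_lists):
    """Merge per-chapter dataset lists into de-duplicated mappings.

    chapter_lists maps a chapter name to a list of dataset identifiers.
    Returns (cmip6, other), each mapping dataset id -> sorted chapter names.
    """
    cmip6 = defaultdict(set)
    other = defaultdict(set)
    for chapter, dataset_ids in chapter_lists.items():
        for dataset_id in dataset_ids:
            # Assumption: CMIP6 entries are recognizable by their DRS prefix.
            target = cmip6 if dataset_id.startswith("CMIP6.") else other
            target[dataset_id].add(chapter)
    return (
        {key: sorted(names) for key, names in cmip6.items()},
        {key: sorted(names) for key, names in other.items()},
    )


# Hypothetical input, not taken from the actual chapter files.
chapters = {
    "chapter3": [
        "CMIP6.CMIP.MPI-M.MPI-ESM1-2-LR.historical.r1i1p1f1.Amon.tas.gn",
        "MPI-GE.hist.r001.tas",
    ],
    "chapter4": [
        "CMIP6.CMIP.MPI-M.MPI-ESM1-2-LR.historical.r1i1p1f1.Amon.tas.gn",
    ],
}
cmip6, other = merge_chapter_lists(chapters)
print(len(cmip6), len(other))
```

Keeping the set of chapters per dataset preserves the usage information when an identifier appears in several chapter lists.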

4. Check data availability in ESGF:

The DRS specified in the JSON is used to check data availability. In a second availability check, the DRS is used to get the tracking_id from the NetCDF data headers, which contains the Handle ID of the file:
- usage: Jupyter Notebook `cmip6lta-tsunever-in-esgf.ipynb`
- input file: cmip6list2022-07-11.json
- output file: cmip6listdatarefsyntaxwrong-drs-by-pids_2022-08-17.txt; 860 data sets were found that have never been published.
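In CMIP6 files the `tracking_id` global attribute has the form `hdl:21.14100/<uuid>`, so the Handle ID can be derived from it with plain string handling. The following sketch shows only that parsing step, with a hypothetical function name and a made-up UUID; reading the attribute from the NetCDF header and resolving the Handle (for example with pyhandle) are left out.

```python
import uuid

# Handle prefix used for CMIP6 file tracking_ids.
CMIP6_HANDLE_PREFIX = "21.14100"


def handle_from_tracking_id(tracking_id):
    """Return the Handle ('<prefix>/<uuid>') from a tracking_id, or None."""
    value = tracking_id.strip()
    if not value.startswith("hdl:"):
        return None
    prefix, sep, suffix = value[len("hdl:"):].partition("/")
    if not sep or prefix != CMIP6_HANDLE_PREFIX:
        return None
    try:
        parsed = uuid.UUID(suffix)
    except ValueError:
        # The suffix is not a well-formed UUID.
        return None
    return "%s/%s" % (prefix, parsed)


# Made-up UUID for illustration.
print(handle_from_tracking_id("hdl:21.14100/0a1b2c3d-4e5f-4a6b-8c7d-9e0f1a2b3c4d"))
```

Returning `None` for anything that does not match keeps malformed attributes from being passed on to a Handle lookup.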

5. Add and correct versions for CMIP6 datasets without version information:

For datasets with wrong or missing version information, ESGF and the DKRZ data pool are checked for available versions, which replace the wrong version in, or are added to, the CMIP6 input dataset list:
- usage: Jupyter Notebook cmip6_lta-tsu_sanitize.ipynb
- input file: cmip6listdatarefsyntaxwrong-drs-by-pids2022-08-17.txt
- output files: cmip6listdatarefsyntaxdrs-candidates-by-version2022-08-17.json, providing 808 data sets for replacement in the dataset list, and cmip6listdatarefsyntaxsortout-notfound2022-08-17.txt, listing 52 data sets that are unavailable
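The split into replacement candidates and unavailable datasets (808 + 52 = 860 in the run described above) can be sketched as a lookup of each requested identifier, stripped of any version, in a catalogue of available versioned identifiers. The function names, the catalogue layout, and the sample identifiers are assumptions; the notebook itself queries ESGF and the DKRZ data pool, which is not shown.

```python
import re

VERSION_RE = re.compile(r"^v\d{8}$")


def split_version(dataset_id):
    """Return (stem, version); version is None if the id carries none."""
    stem, _, last = dataset_id.rpartition(".")
    if VERSION_RE.match(last):
        return stem, last
    return dataset_id, None


def find_version_candidates(requested_ids, available_ids):
    """Split requested ids into {stem: versions} and a not-found list."""
    by_stem = {}
    for full_id in available_ids:
        stem, version = split_version(full_id)
        if version is not None:
            by_stem.setdefault(stem, []).append(version)
    candidates, not_found = {}, []
    for dataset_id in requested_ids:
        stem, _ = split_version(dataset_id)
        versions = sorted(by_stem.get(stem, []))
        if versions:
            candidates[stem] = versions
        else:
            not_found.append(dataset_id)
    return candidates, not_found


# Hypothetical identifiers, shortened for readability.
available = ["A.tas.gn.v20190101", "A.tas.gn.v20200202", "B.pr.gn.v20190101"]
requested = ["A.tas.gn", "B.pr.gn.v20180101", "C.psl.gn"]
candidates, not_found = find_version_candidates(requested, available)
print(candidates, not_found)
```

Every requested identifier ends up in exactly one of the two outputs, so the two counts always add up to the number of distinct requested stems.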

6. Update of CMIP6 input dataset list and replicate missing data sets into data pool:

The corrected CMIP6 input dataset list cmip6list2022-07-11.json is updated with the results from steps 4 and 5. If more than one version exists for a dataset without a specified version, all versions are included. Information on duplicates is merged:
- usage: Jupyter Notebook cmip6_lta-tsu_finalize.ipynb
- input files: cmip6list2022-07-11.json and cmip6listdatarefsyntaxdrs-candidates-by-version2022-08-17.json
- output file: cmip6list2022-08-17.json; 855 datasets were identified, and after merging duplicates 680 datasets were updated in or added to the CMIP6 input dataset list (all are available)
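The update rule described above (include all available versions, merge the information of duplicates) can be sketched as follows. This is not the finalize notebook: the data layout, with usage notes stored as a set per dataset identifier, the function name `apply_candidates`, and the sample entries are assumptions.

```python
def apply_candidates(dataset_list, candidates):
    """Replace unversioned entries by all their versioned candidates.

    dataset_list maps dataset id -> set of usage notes (e.g. figure labels).
    candidates maps an unversioned dataset id -> list of version strings.
    Returns a new mapping; usage notes of duplicates are merged by set union.
    """
    updated = {key: set(notes) for key, notes in dataset_list.items()}
    for stem, versions in candidates.items():
        notes = updated.pop(stem, set())
        for version in versions:
            full_id = "%s.%s" % (stem, version)
            # An already listed version keeps its notes and gains the new ones.
            updated.setdefault(full_id, set()).update(notes)
    return updated


# Hypothetical entries, shortened for readability.
dataset_list = {
    "A.tas.gn": {"Fig 3.1"},
    "A.tas.gn.v20190101": {"Fig 4.2"},
    "C.psl.gn.v20200101": {"Fig 1.1"},
}
candidates = {"A.tas.gn": ["v20190101", "v20200202"]}
result = apply_candidates(dataset_list, candidates)
print(sorted(result))
```

Because a candidate version may already be present in the list, the number of entries actually changed can be smaller than the number of candidates identified.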

The result contains 18956 corrected and available CMIP6 input datasets. 52 dataset duplicates caused by version changes require special treatment.

An overview of the changes applied in steps 4 and 5 to the input dataset list cmip6list2022-07-11.json to create the resulting cmip6list2022-08-17.json is provided in DDC_cmip6_overview.csv. The CMIP6 input dataset list in column 'DDCAR6Archive' is used for merging with the WGI-requested variable list, resulting in a total of 65118 archived CMIP6 datasets.

7. Add data usage information to the metadata:

The CMIP6 input data set lists provided by the WGI TSU contain abbreviated information on the figure usage of the data sets. This information was used to add references/links to the figure/final data sets in the CEDA catalogue (https://catalogue.ceda.ac.uk/) and to the figure pages of the IPCC WGI AR6 (https://www.ipcc.ch/report/ar6/wg1/figures).

8. Selected CMIP6 input datasets for regions:

In support of developing countries with low internet bandwidth, selected variables of the CMIP6 input datasets are additionally provided for different regions.

9. Publication of provenance information on figure generation:

The provenance information on CMIP6 data usage in the generation of the IPCC AR6 WGI figures was published to Zenodo for each figure as a pilot implementation of the RDA Complex Citation Working Group's recommendations.

Repository Structure

Archived_Data: input and intermediate datasets archived at DDC Partner DKRZ
Metadata_Conformance: containing scripts for steps 1 to 3
Data_Verification: containing scripts for steps 4 to 6
Regional Data: containing scripts for step 8
Provenance Publication: containing scripts for step 9

Variable usage analysis: Metadata_Conformance/varlist4dreq.py

For CMIP7 planning, the CMIP6 variable usage by IPCC WGI authors was analyzed based on the corrected CMIP6 input dataset list. The results were shared through the Google Drive folder https://drive.google.com/drive/u/0/folders/14YSgIype4fFtnXb4DNMmgXxYQiLLC0wy

Installation

The Metadata Conformance package requires Python 2.7. The Data Verification package was developed with Python 3.7 and uses the Python packages json, pandas, tqdm, uuid, intake, and pyhandle.handleclient (PyHandleClient). The Regional Data package uses shell scripts and Python 3.

License

The software is released under the MIT License.

Owner

Name: Martina Stockhause
Login: MartinaSt
Kind: user

Repositories: 1
Profile: https://github.com/MartinaSt

GitHub Events

Total

Release event: 1
Push event: 4
Create event: 1

Last Year

Release event: 1
Push event: 4
Create event: 1

Committers

Last synced: about 1 year ago

All Time

Total Commits: 17
Total Committers: 3
Avg Commits per committer: 5.667
Development Distribution Score (DDS): 0.412

Past Year

Commits: 4
Committers: 1
Avg Commits per committer: 4.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Martina Stockhause	k**2@c**e	10
Martina Stockhause	k**2@c**e	4
Martina Stockhause	s**e@d**e	3

Committer Domains (Top 20 + Academic)

dkrz.de: 1
citationsvc9.cloud.dkrz.de: 1
citationsvc.dkrz.de: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
