pyquarc

The pyQuARC tool reads and evaluates metadata records with a focus on the consistency and robustness of the metadata. pyQuARC flags opportunities to improve or add to contextual metadata information in order to help the user connect to relevant data products. pyQuARC also ensures that information common to both the data-product and file-level metadata is consistent and compatible. pyQuARC frees up human evaluators to make more sophisticated assessments, such as whether an abstract accurately describes the data and provides the correct contextual information. The base pyQuARC package assesses descriptive metadata used to catalog Earth observation data products and files. As open source software, pyQuARC can be adapted and customized by data providers to allow for quality checks that evolve with their needs, including checking metadata not covered in the base package.

https://github.com/nasa-impact/pyquarc

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    15 of 31 committers (48.4%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Keywords from Contributors

mesh

Scientific Fields

Psychology Social Sciences - 40% confidence
Last synced: 4 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: NASA-IMPACT
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 24.7 MB
Statistics
  • Stars: 22
  • Watchers: 7
  • Forks: 3
  • Open Issues: 49
  • Releases: 16
Created about 7 years ago · Last pushed 5 months ago
Metadata Files
Readme Changelog License Citation

README.md

pyQuARC

Open Source Library for Earth Observation Metadata Quality Assessment


Introduction

The pyQuARC (pronounced "pie-quark") library was designed to read and evaluate descriptive metadata used to catalog Earth observation data products and files. This type of metadata focuses and limits attention to important aspects of data, such as the spatial and temporal extent, in a structured manner that can be leveraged by data catalogs and other applications designed to connect users to data. Therefore, poor quality metadata (e.g. inaccurate, incomplete, improperly formatted, inconsistent) can yield subpar results when users search for data. Metadata that inaccurately represents the data it describes risks matching users with data that does not reflect their search criteria and, in the worst-case scenario, can make data impossible to find.

Given the importance of high quality metadata, it is necessary that metadata be regularly assessed and updated as needed. pyQuARC is a tool that can help streamline the process of assessing metadata quality by automating it as much as possible. In addition to basic validation checks (e.g. adherence to the metadata schema, controlled vocabularies, and link checking), pyQuARC flags opportunities to improve or add contextual metadata information to help the user connect to, access, and better understand the data product. pyQuARC also ensures that information common to both data product (i.e. collection) and the file-level (i.e. granule) metadata are consistent and compatible. As open source software, pyQuARC can be adapted and customized to allow for quality checks unique to different needs.

pyQuARC Base Package

pyQuARC was specifically designed to assess metadata in NASA’s Common Metadata Repository (CMR), which is a centralized metadata repository for all of NASA’s Earth observation data products. In addition to NASA’s ~9,000 data products, the CMR also holds metadata for over 40,000 additional Earth observation data products submitted by external data partners. The CMR serves as the backend for NASA’s Earthdata Search (search.earthdata.nasa.gov) and is also the authoritative metadata source for NASA’s Earth Observing System Data and Information System (EOSDIS).

pyQuARC was developed by a group called the Analysis and Review of the CMR (ARC) team. The ARC team conducts quality assessments of NASA’s metadata records in the CMR, identifies opportunities for improvement in the metadata records, and collaborates with the data archive centers to resolve any identified issues. ARC has developed a metadata quality assessment framework which specifies a common set of assessment criteria. These criteria focus on correctness, completeness, and consistency with the goal of making data more discoverable, accessible, and usable. The ARC metadata quality assessment framework is the basis for the metadata checks that have been incorporated into the pyQuARC base package. Specific quality criteria for each CMR metadata element are documented in the following wiki: https://wiki.earthdata.nasa.gov/display/CMR/CMR+Metadata+Best+Practices%3A+Landing+Page

There is an “ARC Metadata QA/QC” section on the wiki page for each metadata element that lists quality criteria categorized by level of priority. Priority categories are designated as high (red), medium (yellow), or low (blue), and are intended to communicate the importance of meeting the specified criteria.

The CMR is designed around its own metadata standard called the Unified Metadata Model (UMM). In addition to being an extensible metadata model, the UMM also provides a cross-walk for mapping between the various CMR-supported metadata standards. CMR-supported metadata standards currently include:

  • DIF10 (Collection/Data Product-level metadata only)
  • ECHO10 (Collection/Data Product and Granule/File-level metadata)
  • ISO19115-1 and ISO19115-2 (Collection/Data Product and Granule/File-level metadata)
  • UMM-JSON (UMM)
  • UMM-C (Collection/Data Product-level metadata)
  • UMM-G (Granule/File-level metadata)
  • UMM-S (Service metadata)
  • UMM-T (Tool metadata)

pyQuARC supports DIF10 (collection only), ECHO10 (collection and granule), UMM-C, and UMM-G standards. At this time, there are no plans to add ISO 19115 or UMM-S/T specific checks. Note that pyQuARC development is still underway, so further enhancements and revisions are planned.

For inquiries, please email: sheyenne.kirkland@uah.edu

pyQuARC as a Service (QuARC)

QuARC is pyQuARC deployed as a service and can be found here: https://quarc.nasa-impact.net/docs/.

QuARC is still in beta but is regularly synced with the latest version of pyQuARC on GitHub. QuARC is fully cloud-native; its architecture is shown in the diagram below:

[QuARC architecture diagram]

Architecture

[pyQuARC architecture diagram]

The Downloader is used to obtain a copy of a metadata record of interest from the CMR. This is accomplished using a CMR API query, where the metadata record of interest is identified by its unique identifier in the CMR (concept_id). CMR API documentation can be found here: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

There is also the option to select and run pyQuARC on a metadata record already downloaded to your local desktop.
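For illustration, a record fetch by its concept_id can be sketched with the standard library. This is a minimal sketch, not pyQuARC's actual Downloader code; the `/search/concepts/<concept_id>.<format>` path follows the CMR API documentation linked above, and error handling is omitted:

```python
from urllib.request import urlopen

CMR_HOST = "https://cmr.earthdata.nasa.gov"

def build_concept_url(concept_id, fmt="echo10"):
    # A single CMR record is addressable as /search/concepts/<concept_id>.<format>
    return f"{CMR_HOST}/search/concepts/{concept_id}.{fmt}"

def download_metadata(concept_id, fmt="echo10"):
    # Fetch the raw metadata document for the given concept_id.
    with urlopen(build_concept_url(concept_id, fmt)) as response:
        return response.read().decode("utf-8")
```

The same base URL can be swapped out (as with pyQuARC's `--cmrhost` option) to target a different CMR environment.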

The checks.json file includes a comprehensive list of rules. Each rule is specified by its rule_id, associated function, and any dependencies on specific metadata elements.

The rule_mapping.json file specifies which metadata element(s) each rule applies to. The rule_mapping.json also references the messages.json file which includes messages that can be displayed when a check passes or fails.

Furthermore, the rule_mapping.json file specifies the level of severity associated with a failure. If a check fails, it will be assigned a severity category of “error”, “warning”, or “info”. These categories correspond to priority categorizations in ARC’s priority matrix and communicate the importance of the failed check, with “error” being the most critical category, “warning” indicating a failure of medium priority, and “info” indicating a minor issue or inconsistency. Default severity values are assigned based on ARC’s metadata quality assessment framework, but can be customized to meet individual needs.
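Downstream tooling can use these categories to triage results. A minimal sketch of such a triage step (the result-dictionary field names here are illustrative, not pyQuARC's actual output schema):

```python
# Rank severities so that "error" sorts before "warning" before "info".
SEVERITY_RANK = {"error": 0, "warning": 1, "info": 2}

def sort_by_severity(results):
    # results: a list of dicts such as {"check_id": ..., "severity": ...}
    return sorted(results, key=lambda r: SEVERITY_RANK[r["severity"]])
```

Because the sort key is the severity rank, the most critical findings surface first regardless of the order checks were run in.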

Customization

pyQuARC is designed to be customizable. Output messages can be modified using the messages_override.json file - any messages added to messages_override.json take precedence over the default messages in the messages.json file. Similarly, there is a rule_mapping_override.json file which can be used to override the default settings for which rules/checks are applied to which metadata elements.

There is also the opportunity for more sophisticated customization. New QA rules can be added and existing QA rules can be edited or removed. Support for new metadata standards can be added as well. Further details on how to customize pyQuARC will be provided in the technical user’s guide below.

While the pyQuARC base package is currently managed by the ARC team, the long term goal is for it to be owned and governed by the broader EOSDIS metadata community.

Install/User’s Guide

Running the program

Note: This program requires Python 3.8 to be installed on your system.

Clone the repo: https://github.com/NASA-IMPACT/pyQuARC/

Go to the project directory: cd pyQuARC

Create a python virtual environment: python -m venv env

Activate the environment: source env/bin/activate

Install the requirements: pip install -r requirements.txt

Run main.py:

```plaintext
▶ python pyQuARC/main.py -h
usage: main.py [-h] [--query QUERY | --conceptids CONCEPTIDS [CONCEPTIDS ...]]
               [--file FILE | --fake FAKE] [--format [FORMAT]]
               [--cmrhost [CMR_HOST]] [--version [VERSION]]

optional arguments:
  -h, --help            Show this help message and exit
  --query QUERY         CMR query URL.
  --conceptids CONCEPTIDS [CONCEPTIDS ...]
                        List of concept IDs.
  --file FILE           Path to the test file, either absolute or relative to the root dir.
  --fake FAKE           Use a fake content for testing.
  --format [FORMAT]     The metadata format. Choices are: echo-c (echo10 collection),
                        echo-g (echo10 granule), dif10 (dif10 collection),
                        umm-c (umm-json collection), umm-g (umm-json granules)
  --cmrhost [CMR_HOST]  The cmr host base url. Default is: https://cmr.earthdata.nasa.gov
  --version [VERSION]   The revision version of the collection. Default is the latest version.
```

To test a local file, use the `--file` argument. Give it either an absolute file path or a file path relative to the project root directory.

Example:

```plaintext
▶ python pyQuARC/main.py --file "tests/fixtures/test_cmr_metadata.echo10"
▶ python pyQuARC/main.py --file "/Users/batman/projects/pyQuARC/tests/fixtures/test_cmr_metadata.echo10"
```

Adding a custom rule

To add a custom rule, follow these steps:

Add an entry to the schemas/rule_mapping.json file in the form:

```json
"<rule_id: an id for the rule in snake case>": {
    "rule_name": "<Name of the Rule>",
    "fields_to_apply": {
        "<metadata format (eg. echo-c)>": {
            "fields": [
                "<The primary field1 to apply to (full path separated by /)>",
                "<Related field 11>",
                "<Related field 12>",
                "<Related field ...>",
                "<Related field 1n>"
            ],
            "relation": "relation_between_the_fields_if_any",
            "dependencies": [
                [
                    "<any dependent check that needs to be run before this check (if any), for this specific metadata format>",
                    "<field to apply this dependent check to (if any)>"
                ]
            ]
        },
        "echo-g": {
            "fields": [
                "<The primary field2 to apply to (full path separated by /)>",
                "<Related field 21>",
                "<Related field 22>",
                "<Related field ...>",
                "<Related field 2n>"
            ],
            "relation": "relation_between_the_fields_if_any",
            "data": [
                "<any external data that you want to send to the rule for this specific metadata format>"
            ]
        }
    },
    "data": [
        "<any external data that you want to send to the rule>"
    ],
    "check_id": "<one of the available checks, see CHECKS.md, or custom check if you are a developer>"
}
```

An example:

```json
"data_update_time_logic_check": {
    "rule_name": "Data Update Time Logic Check",
    "fields_to_apply": {
        "echo-c": [
            {
                "fields": [
                    "Collection/LastUpdate",
                    "Collection/InsertTime"
                ],
                "relation": "gte"
            }
        ],
        "echo-g": [
            {
                "fields": [
                    "Granule/LastUpdate",
                    "Granule/InsertTime"
                ],
                "relation": "gte"
            }
        ],
        "dif10": [
            {
                "fields": [
                    "DIF/Metadata_Dates/Data_Last_Revision",
                    "DIF/Metadata_Dates/Data_Creation"
                ],
                "relation": "gte",
                "dependencies": [
                    [
                        "date_or_datetime_format_check"
                    ]
                ]
            }
        ]
    },
    "severity": "info",
    "check_id": "datetime_compare"
}
```

`data` is any external data that you want to pass to the check. For example, for a `controlled_keywords_check`, it would be the controlled keywords list:

```json
"data": [
    ["keyword1", "keyword2"]
]
```

`check_id` is the id of the corresponding check from `checks.json`. It will usually be one of the available checks. An exhaustive list of all the available checks can be found in CHECKS.md.
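To illustrate how a rule's `data` entry could feed its check function, here is a hypothetical controlled-keywords check. This is a sketch of the idea, not the library's actual implementation:

```python
def controlled_keywords_check(value, *keyword_lists):
    # Passes when the value appears in any of the keyword lists supplied
    # through the rule's "data" entry, e.g. [["keyword1", "keyword2"]].
    valid = any(value in keywords for keywords in keyword_lists)
    # Checks return the standard {"valid": ..., "value": ...} result shape.
    return {"valid": valid, "value": value}
```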

If you're writing your own custom check to schemas/checks.json:

Add an entry in the format:

```json
"<check_id>": {
    "data_type": "",
    "check_function": "",
    "dependencies": [
        ""
    ],
    "description": "",
    "available": true
}
```

The `data_type` can be `datetime`, `string`, `url`, or `custom`.

The `check_function` should be either one of the available functions or your own custom function.

An example:

```json
"date_compare": {
    "data_type": "datetime",
    "check_function": "compare",
    "dependencies": [
        "datetime_format_check"
    ],
    "description": "Compares two datetimes based on the relation given.",
    "available": true
}
```

If you’re writing your own check function:

Locate the validator file, based on the `data_type` of the check, in the code/ directory. It is named in the form <data_type>_validator.py; for example, string_validator.py, url_validator.py, etc.

Write a @staticmethod member method in the class for that particular check. See examples in the file itself. The return value should be in the format:

```json
{
    "valid": <the_validity_based_on_the_check>,
    "value": <the_value_of_the_field_in_user_friendly_format>
}
```

You can re-use any functions that are already there to reduce redundancy.
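Putting the steps above together, a new check function might look like the following. This is a sketch under the stated conventions; the class and method names are hypothetical, not part of the pyQuARC codebase:

```python
class StringValidator:
    # Sketch of a validator class in the style of code/string_validator.py.

    @staticmethod
    def length_check(value, max_length=100):
        # Return the standard result dictionary expected of pyQuARC checks:
        # "valid" carries the outcome, "value" the user-friendly field value.
        return {
            "valid": len(value) <= max_length,
            "value": value,
        }
```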

Adding output messages to checks:

Add an entry to the schemas/check_messages_override.json file like this:

```json
{
    "check_id": "<The id of the check/rule>",
    "message": {
        "success": "<The message to show if the check succeeds>",
        "failure": "<The message to show if the check fails>",
        "warning": "<The warning message>"
    },
    "help": {
        "message": "<The help message if any.>",
        "url": "<The help url if any.>"
    },
    "remediation": "<The remediation step to make the check valid.>"
}
```

An example:

```json
{
    "check_id": "abstract_length_check",
    "message": {
        "success": "The length is correct.",
        "failure": "The length of the field should be less than 100. The current length is `{}`.",
        "warning": "Make sure length is 100."
    },
    "help": {
        "message": "The length of the field can only be less than 100 characters.",
        "url": "www.lengthcheckurl.com"
    },
    "remediation": "A remedy."
}
```

Note: See the `{}` in the failure message above? It is a placeholder for any value you want to show in the output message. To fill this placeholder with a particular value, you have to return that value from the check function that you write. You can have as many placeholders as you like; you just have to return that many values from your check function.

An example: Suppose you have a check function:

```python
@staticmethod
def is_true(value1, value2):
    return {
        "valid": value1 and value2,
        "value": [value1, value2]
    }
```

And a message:

```json
...
"failure": "The values `{}` and `{}` do not amount to a true value",
...
```

Then, if the check function receives input value1=0 and value2=1, the output message will be:

```plaintext
The values 0 and 1 do not amount to a true value
```
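Under the hood this is ordinary positional placeholder substitution: the returned "value" list fills the `{}` slots in order. Conceptually (a sketch, not pyQuARC's actual formatting code):

```python
# The failure template from messages, and a failed check result.
failure_template = "The values `{}` and `{}` do not amount to a true value"
result = {"valid": False, "value": [0, 1]}

# Each entry of result["value"] fills one {} placeholder, left to right.
message = failure_template.format(*result["value"])
```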

Using as a package

Note: This program requires Python 3.8 to be installed on your system.

Clone the repo: https://github.com/NASA-IMPACT/pyQuARC/

Go to the project directory: cd pyQuARC

Install package: python setup.py install

To check if the package was installed correctly:

```python
▶ python
>>> from pyQuARC import ARC
>>> validator = ARC(fake=True)
>>> validator.validate()
...
```

To provide a local file:

```python
▶ python
>>> from pyQuARC import ARC
>>> validator = ARC(file_path="")
>>> validator.validate()
...
```

To provide rules for new fields or to override existing rules:

```python
▶ cat rule_override.json
{
    "data_update_time_logic_check": {
        "rule_name": "Data Update Time Logic Check",
        "fields_to_apply": [
            {
                "fields": [
                    "Collection/LastUpdate",
                    "Collection/InsertTime"
                ],
                "relation": "lte"
            }
        ],
        "severity": "info",
        "check_id": "date_compare"
    },
    "new_field": {
        "rule_name": "Check for new field",
        "fields_to_apply": [
            {
                "fields": [
                    "",
                    ""
                ],
                "relation": "lte"
            }
        ],
        "severity": "info",
        "check_id": "<check_id>"
    }
}
▶ python
>>> from pyQuARC import ARC
>>> validator = ARC(checks_override="<path to rule_override.json>")
>>> validator.validate()
...
```

To provide custom messages for new or old fields:

```python
▶ cat messages_override.json
{
    "data_update_time_logic_check": {
        "failure": "The UpdateTime `{}` comes after the provided InsertTime `{}`.",
        "help": {
            "message": "",
            "url": "https://wiki.earthdata.nasa.gov/display/CMR/Data+Dates"
        },
        "remediation": "Everything is alright!"
    },
    "new_check": {
        "failure": "Custom check for `{}` and `{}`.",
        "help": {
            "message": "",
            "url": "https://wiki.earthdata.nasa.gov/display/CMR/Data+Dates"
        },
        "remediation": ""
    }
}
▶ python
>>> from pyQuARC import ARC
>>> validator = ARC(
...     checks_override="<path to rule_override.json>",
...     messages_override="<path to messages_override.json>"
... )
>>> validator.validate()
...
```

Owner

  • Name: Inter Agency Implementation and Advanced Concepts
  • Login: NASA-IMPACT
  • Kind: organization
  • Email: esds.dsig@gmail.com

Citation (CITATION.cff)

cff-version: 1.2.0
title: "pyQuARC: Open Source Library for Earth Observation Metadata Quality Assessment"
message: "If you use this software, please cite it as below"
type: software
authors:
  - given-names: Slesa
    family-names: Adhikari
    email: slesa.adhikari@uah.edu
  - given-names: Iksha
    family-names: Gurung
    email: iksha.gurung@uah.edu
  - given-names: Jenny
    family-names: Wood
    email: jenny.wood@uah.edu
  - given-names: Jeanné
    family-names: le Roux
    email: jeanne.leroux@uah.edu
identifiers:
  - type: doi
    value: 10.5281/zenodo.10724717
repository-code: 'https://github.com/NASA-IMPACT/pyQuARC/tree/v1.2.5'
abstract: >-
  pyQuARC is designed to read and evaluate Earth observation metadata records hosted within the Common Metadata Repository (CMR), which is a centralized metadata repository for all of NASA's Earth observation data products. The CMR serves as the backend for NASA's Earthdata Search meaning that high-quality metadata helps connect users to the existing data in Earthdata Search. pyQuARC implements the Analysis and Review of CMR (ARC) team's metadata quality assessment framework to provide prioritized recommendations for metadata improvement and optimized search results. pyQuARC makes basic validation checks, pinpoints inconsistencies between dataset-level (i.e. collection) and file-level (i.e. granule) metadata, and identifies opportunities for more descriptive and robust information. It currently supports DIF10 (collection), ECHO10 (collection and granule), UMM-C, and UMM-G metadata standards. As open source software, pyQuARC can be adapted to add customized checks, implement future metadata standards, or support other metadata types.
keywords:
  - Metadata
  - Python
  - Data Curation
  - Earth Observation
  - DAAC
  - Collection
  - Granule
  - GCMD
  - Quality Assessment
  - DIF10
  - ECHO10
  - UMM-C
license: Apache-2.0
version: 1.2.5
date-released: '2021-08-19'

GitHub Events

Total
  • Issues event: 9
  • Watch event: 3
  • Delete event: 6
  • Member event: 2
  • Issue comment event: 62
  • Push event: 60
  • Pull request review event: 6
  • Pull request event: 31
  • Fork event: 1
  • Create event: 28
Last Year
  • Issues event: 9
  • Watch event: 3
  • Delete event: 6
  • Member event: 2
  • Issue comment event: 62
  • Push event: 60
  • Pull request review event: 6
  • Pull request event: 31
  • Fork event: 1
  • Create event: 28

Committers

Last synced: almost 2 years ago

All Time
  • Total Commits: 876
  • Total Committers: 31
  • Avg Commits per committer: 28.258
  • Development Distribution Score (DDS): 0.572
Past Year
  • Commits: 62
  • Committers: 15
  • Avg Commits per committer: 4.133
  • Development Distribution Score (DDS): 0.823
Top Committers
Name Email Commits
“slesaad” s****d@g****m 375
Samuel Ayers s****2@u****u 128
Shelby Bagwell s****7@u****u 85
Ashish Acharya a****4@g****m 70
Jenny Wood 5****d 41
xhagrg g****a@h****m 39
Jenny Wood j****d@w****l 23
xhagrg s****3@g****m 14
John Troutman j****9@u****u 12
Slesa Adhikari s****r@a****l 10
Shelby Bagwell s****l@b****l 10
Essence Raphael e****4@u****u 10
Jeanne-le-Roux j****0@u****u 9
Shelby Bagwell s****l@b****e 6
pstatonvt p****n@v****u 6
Eli Walker e****2@s****u 5
Rajesh Pandey r****h@R****l 4
Carson Davis c****s@g****m 4
Danielle Groenen 8****n 4
Jenny Wood j****d@M****v 3
Prasanna Koirala p****p@g****m 3
Jenny Wood j****d@w****v 2
Jenny Wood j****d@M****v 2
Jenny Wood j****d@M****v 2
dependabot[bot] 4****] 2
sydney-lybrand s****8@u****u 2
smk0033 s****3@u****u 1
Shelby Bagwell 7****l 1
esr0004 9****4 1
Jenny Wood j****7@u****u 1
and 1 more...

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 46
  • Total pull requests: 98
  • Average time to close issues: 9 months
  • Average time to close pull requests: 28 days
  • Total issue authors: 13
  • Total pull request authors: 19
  • Average comments per issue: 0.83
  • Average comments per pull request: 0.44
  • Merged pull requests: 69
  • Bot issues: 0
  • Bot pull requests: 11
Past Year
  • Issues: 8
  • Pull requests: 20
  • Average time to close issues: N/A
  • Average time to close pull requests: about 2 months
  • Issue authors: 4
  • Pull request authors: 5
  • Average comments per issue: 0.13
  • Average comments per pull request: 0.2
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jenny-m-wood (21)
  • smk0033 (8)
  • svbagwell (7)
  • slesaad (3)
  • lavanya3k (3)
  • spa0002 (3)
  • rajeshpandey2053 (2)
  • fb0023 (2)
  • battistowx (1)
  • code-geek (1)
  • tbs1979 (1)
  • stephen-mcneal (1)
  • NISH1001 (1)
Pull Request Authors
  • slesaad (23)
  • jenny-m-wood (18)
  • lavanya3k (11)
  • smk0033 (10)
  • dependabot[bot] (10)
  • binni979 (9)
  • svbagwell (9)
  • spa0002 (6)
  • esr0004 (5)
  • rajeshpandey2053 (3)
  • xhagrg (3)
  • John-Troutman (3)
  • CarsonDavis (2)
  • bhawana11 (1)
  • sydney-lybrand (1)
Top Labels
Issue Labels
enhancement (3) bug (2)
Pull Request Labels
dependencies (10) enhancement (1)

Dependencies

requirements.txt pypi
  • colorama ==0.4.4
  • idna ==2.10
  • jsonschema ==3.2.0
  • lxml ==4.9.1
  • pathlib ==1.0.1
  • pytest ==5.4.3
  • pytz ==2020.1
  • requests ==2.24.0
  • strict-rfc3339 ==0.7
  • tqdm ==4.48.2
  • urlextract ==1.0.0
  • xmltodict ==0.12.0
.github/workflows/python-app.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
pyproject.toml pypi
setup.py pypi