rqssframework

The main code repository of Referencing Quality Scoring System metrics. Paper: https://www.semantic-web-journal.net/system/files/swj3593.pdf

https://github.com/seyedahbr/rqssframework

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.8%) to scientific vocabulary

Keywords

dataquality github-pages graph wikidata


Repository

Basic Info

Host: GitHub
Owner: seyedahbr
License: cc0-1.0
Language: Python
Default Branch: main
Homepage:
Size: 438 KB

Statistics

Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 2

Created over 4 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

RQSS-Framework

The Referencing Quality Scoring System (RQSS) is a data quality assessment framework for measuring the quality of Wikidata references. RQSS is based on a comprehensive referencing quality assessment framework with 40 data quality metrics across 21 data quality dimensions. This repository implements the objective metrics of the framework (34 out of 40). The formal definitions of the metrics, together with a comprehensive analysis of Wikidata referencing scores in 7 topical and random subsets of Wikidata, can be found in the repo's paper: RQSS: Referencing Quality Scoring System for Wikidata.

Input/Output

RQSS takes as input an RDF graph that follows the Wikidata data model. In version 1.0.2, the input graph must be accessible through a local or public SPARQL endpoint.

RQSS also requires Internet access to retrieve metadata from the SPARQL endpoint, history pages, and EntitySchemas of Wikidata.

RQSS produces two kinds of output:

The computed quality scores as numbers between 0 and 1, along with the distribution of results in some metrics.
Graphical charts (bar, box and whisker, pie, etc.) to provide a high level view of the scores and/or distributions.

How to Deploy RQSS

RQSS is a modular framework, thus its implemented metrics can be called independently.

First, the RDF data should be loaded into a triplestore that exposes a SPARQL endpoint. We recommend using Blazegraph. Other triplestores, such as Jena Fuseki or GraphDB, can be used as well. Suppose the dataset is available over a triplestore endpoint with the following address: http://localhost:9999/blazegraph/sparql
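Before starting a long extraction, it can help to confirm that the endpoint answers queries and actually contains referenced statements. The following is a minimal sketch, not part of RQSS: the helper names (`build_request`, `parse_count`, `count_referenced_statements`) are hypothetical, and it assumes the endpoint accepts SPARQL Protocol GET requests and can return JSON results, and that the data links statements to references with `prov:wasDerivedFrom`, as in the Wikidata RDF model.

```python
import json
import urllib.parse
import urllib.request

# Endpoint address taken from the text above; adjust for your triplestore.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"

# Counts statement nodes that carry at least one reference.
# Assumption: references are attached with prov:wasDerivedFrom.
REFERENCE_COUNT_QUERY = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT (COUNT(DISTINCT ?statement) AS ?n)
WHERE { ?statement prov:wasDerivedFrom ?reference }
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_count(payload: bytes) -> int:
    """Read the single ?n binding from a SPARQL JSON results document."""
    bindings = json.loads(payload)["results"]["bindings"]
    return int(bindings[0]["n"]["value"])


def count_referenced_statements(endpoint: str = ENDPOINT) -> int:
    """Run the count query; requires the triplestore to be running."""
    request = build_request(endpoint, REFERENCE_COUNT_QUERY)
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_count(response.read())
```

Calling `count_referenced_statements()` with the triplestore running should return the number of referenced statements; a result of 0 or a connection error suggests the data is not loaded or the endpoint address is wrong.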

Depending on the metric to be computed, the process may start by calling the Extractor to fetch the initial metadata from the dataset or from Wikidata. The Framework_Runner is then called and, finally, the Presenter may be run.

Compute all of the metrics

To compute all of the metrics at once, run the following command from the repo directory to extract all required metadata (note that, depending on the size of the dataset and the performance of the host, extracting some of the metadata can take a long time and a large amount of disk space):

$ python RQSSFramework/RQSS_Extractor.py --endpoint http://localhost:9999/blazegraph/sparql -eu -sn -l -fr -rp -rpvt -ri -sr -irf -wes -cf -sfr -aof -pu -es -ss

The output of the Extractor should now be in the ./rqss_extractor_output/ directory as .data files. Then, call the Framework_Runner with all options, as below:

$ python RQSSFramework/RQSS_Framework_Runner.py ./rqss_extractor_output/ --endpoint http://localhost:9999/blazegraph/sparql -dp -l -sec -i -rts -rls -rtm -rpc -rc -rs -rdns -mr -ha -ts -rf -ef --extract-google-cache -ev -et -cpsc -sbpc -pc -aof -el -rpd -hm -he -bn -mm -mfs

Now, the scores and distributions (if applicable) can be seen in the ./rqss_framework_output directory as .csv files (score files have a _ratio at the end of their filenames).
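To inspect the results without opening each file by hand, the score files can be picked out by the `_ratio` suffix. This is a small sketch, not part of RQSS: the function names are hypothetical, and because the column layout of the CSV files is not described here, the contents are printed as raw text rather than parsed.

```python
from pathlib import Path


def find_score_files(output_dir: str = "./rqss_framework_output") -> list[Path]:
    """Return the CSV files whose name (without extension) ends in _ratio."""
    return sorted(
        path
        for path in Path(output_dir).glob("*.csv")
        if path.stem.endswith("_ratio")
    )


def print_score_files(output_dir: str = "./rqss_framework_output") -> None:
    """Print each score file's raw contents; no column layout is assumed."""
    for path in find_score_files(output_dir):
        print(f"== {path.name} ==")
        print(path.read_text(encoding="utf-8").strip())
```

Running `print_score_files()` from the repo directory prints one block per score file; the remaining .csv files in the directory hold the distributions.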

If graphical charts are desired, deploy the presenter as follows:

$ python RQSSFramework/RQSS_Presenter.py ./rqss_framework_output

Then, the .png files can be found in the rqss_presenter_output directory.

Compute individual metrics

RQSS metrics can be computed separately. To compute a metric, first check the --help of RQSSFramework/RQSS_Framework_Runner.py to obtain the name of the metric. Then check the --help of RQSSFramework/RQSS_Extractor.py to see whether computing the metric needs any metadata to be fetched first. For example, to compute the Availability of External URIs, the help of the Extractor tells us we need to obtain all external source URLs in the input graph. So we call this command:

$ python ./RQSSFramework/RQSS_Extractor.py --endpoint http://localhost:9999/blazegraph/sparql --external-uris

Then, we call the Framework_Runner as below:

$ python RQSSFramework/RQSS_Framework_Runner.py ./rqss_extractor_output/ --dereferencing

And then, to get the graphical chart, the Presenter can be called:

$ python RQSSFramework/RQSS_Presenter.py ./rqss_framework_output
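The three stages for this example can also be chained from a small wrapper script. The sketch below is an illustration rather than part of RQSS: `build_commands` and `run_pipeline` are hypothetical names, the script paths and output directories are the ones used in the full run earlier, and the only flags used are `--endpoint`, `--external-uris`, and `--dereferencing` from the example above. It is assumed to be run from the repo directory.

```python
import subprocess
import sys

# Endpoint address from the deployment section; adjust as needed.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"


def build_commands(endpoint: str = ENDPOINT) -> list[list[str]]:
    """Return the Extractor, Framework_Runner, and Presenter commands in order."""
    python = sys.executable  # reuse the interpreter running this script
    return [
        [python, "RQSSFramework/RQSS_Extractor.py",
         "--endpoint", endpoint, "--external-uris"],
        [python, "RQSSFramework/RQSS_Framework_Runner.py",
         "./rqss_extractor_output/", "--dereferencing"],
        [python, "RQSSFramework/RQSS_Presenter.py",
         "./rqss_framework_output"],
    ]


def run_pipeline(endpoint: str = ENDPOINT) -> None:
    """Run the stages one after another, stopping at the first failure."""
    for command in build_commands(endpoint):
        # check=True raises CalledProcessError if a stage exits non-zero,
        # so later stages never run on missing input.
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` executes the same three commands as above. Passing each command as a list avoids shell quoting issues with the endpoint URL.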

Repo Structure

The main part of the framework code is located in the RQSSFramework package. In this directory, there is a package corresponding to each dimension of the framework, whose Python files each implement one or more metrics. In addition to the dimension and metric packages, the following scripts and files exist in the RQSSFramework package:

entityschemaextractor.py: Fetches the most up-to-date EntitySchemas and their referencing information from the Wikidata website
Queries.py: SPARQL queries used in the Extractor and the Framework_Runner
RQSS_Extractor.py: The Extractor module
RQSS_Framework_Runner.py: The metric computer module
RQSS_Presenter.py: The graphical chart representer module
ShExes.py: The ShEx schemas used in consistency and other dimensions

In addition, the utils directory in the RQSSFramework package contains the following scripts:

item_overlap_checker.py: Identifies the overlapping items amongst randomly chosen subsets
lists.py: Contains the list of datasets, licensing keywords, and any other set of literal values used in the Framework Runner
topic_coverage.py: This script is used to compute the main high-level classes of items in a subset (the topics the subset covers).

More details

See the README.md inside the RQSSFramework.

Owner

Name: Seyed Amir Hosseini Beghaeiraveri
Login: seyedahbr
Kind: user
Location: Edinburgh, UK
Company: Heriot-Watt University

Website: https://seyedahbr.github.io/
Twitter: s_a_h_b_r
Repositories: 12
Profile: https://github.com/seyedahbr

PhD student @ Heriot-Watt University. Linked Data, Semantic Web, Big Data, Distributed Systems

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: ' RQSSFramework'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Seyed Amir
    family-names: Hosseini Beghaeiraveri
    email: sh200@hw.ac.uk
    affiliation: Heriot-Watt University
    orcid: 'https://orcid.org/0000-0002-9123-5686'
repository-code: 'https://github.com/seyedahbr/RQSSFramework'
url: 'https://github.com/seyedahbr/RQSSFramework/releases/tag/v1.0.1'
repository: 'https://github.com/seyedahbr/RQSSFramework'
repository-artifact: 'https://github.com/seyedahbr/RQSSFramework'
keywords:
  - Wikidata
  - subset
  - data quality
  - reference quality
license: CC-BY-1.0
version: 1.0.0
