rqssframework
The main code repository of Referencing Quality Scoring System metrics. Paper: https://www.semantic-web-journal.net/system/files/swj3593.pdf
RQSS-Framework
The Referencing Quality Scoring System (RQSS) is a data quality assessment framework for measuring the quality of Wikidata references. It defines 40 data quality metrics across 21 data quality dimensions; this repository implements the 34 objective metrics. The formal definitions of the metrics and a comprehensive analysis of Wikidata referencing scores in 7 topical and random subsets of Wikidata can be found in the repo's paper: RQSS: Referencing Quality Scoring System for Wikidata.
Input/Output
RQSS takes as input an RDF graph that follows the Wikidata data model. In version 1.0.2, the input graph must be accessible through a local or public SPARQL endpoint.
RQSS also requires Internet access to retrieve metadata from the SPARQL endpoint, history pages, and EntitySchemas of Wikidata.
RQSS produces two kinds of output:
- The computed quality scores as numbers between 0 and 1, along with the distribution of results in some metrics.
- Graphical charts (bar, box and whisker, pie, etc.) to provide a high level view of the scores and/or distributions.
How to Deploy RQSS
RQSS is a modular framework, thus its implemented metrics can be called independently.
First, the RDF data must be served through a SPARQL endpoint. We recommend Blazegraph; other triplestores, such as Jena Fuseki or GraphDB, can be used as well. Suppose the dataset is available through a triplestore endpoint at the following address: http://localhost:9999/blazegraph/sparql
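Before starting a long extraction, it can help to confirm that the endpoint answers queries at all. The following is a minimal sketch, not part of RQSS: it sends a trivial ASK query using the standard SPARQL protocol over HTTP GET. The function names are hypothetical, and the endpoint address is the example one used in this README.

```python
import json
import urllib.parse
import urllib.request

# Example endpoint address from this README; adjust to your triplestore.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"


def build_ask_request(endpoint: str) -> urllib.request.Request:
    """Build a SPARQL-protocol GET request carrying a trivial ASK query."""
    query = "ASK { ?s ?p ?o }"
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def endpoint_has_data(endpoint: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint responds and reports at least one triple."""
    try:
        with urllib.request.urlopen(build_ask_request(endpoint), timeout=timeout) as resp:
            return bool(json.load(resp).get("boolean", False))
    except (OSError, ValueError):
        # Network errors (URLError is an OSError) or a non-JSON response.
        return False
```

Calling `endpoint_has_data(ENDPOINT)` returns `False` both when the triplestore is unreachable and when the graph is empty, so either case is caught before the Extractor is run.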
Depending on which metrics are to be computed, the process may start with a call to the Extractor, which fetches the initial metadata from the dataset or from Wikidata. The Framework_Runner is then called to compute the metrics, and finally the Presenter can be run to produce charts.
Compute all of the metrics
To compute all of the metrics at once, run the following command from the repo directory to extract all required metadata. Note that, depending on the size of the dataset and the performance of the host, extracting some of the metadata can take a long time and consume a large amount of disk space:
$ python RQSSFramework/RQSS_Extractor.py --endpoint http://localhost:9999/blazegraph/sparql -eu -sn -l -fr -rp -rpvt -ri -sr -irf -wes -cf -sfr -aof -pu -es -ss
The output of the Extractor should now be in the ./rqss_extractor_output/ directory as .data files. Then, call the Framework_Runner with all options, as below:
$ python RQSSFramework/RQSS_Framework_Runner.py ./rqss_extractor_output/ --endpoint http://localhost:9999/blazegraph/sparql -dp -l -sec -i -rts -rls -rtm -rpc -rc -rs -rdns -mr -ha -ts -rf -ef --extract-google-cache -ev -et -cpsc -sbpc -pc -aof -el -rpd -hm -he -bn -mm -mfs
Now, the scores and distributions (if applicable) can be seen in the ./rqss_framework_output directory as .csv files (score files have a _ratio at the end of their filenames).
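To inspect the scores programmatically, the score files can be collected by their filename suffix. This is a minimal sketch under two assumptions: that score filenames end exactly in `_ratio.csv`, and that nothing is known about their columns, so rows are returned as raw lists of strings. The function name `load_score_files` is hypothetical.

```python
import csv
from pathlib import Path


def load_score_files(output_dir="./rqss_framework_output"):
    """Map each *_ratio.csv filename to its raw rows (lists of strings)."""
    tables = {}
    for path in sorted(Path(output_dir).glob("*_ratio.csv")):
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.name] = list(csv.reader(handle))
    return tables
```

Distribution files, which lack the `_ratio` suffix, are skipped by the glob pattern.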
If graphical charts are desired, deploy the presenter as follows:
$ python RQSSFramework/RQSS_Presenter.py ./rqss_framework_output
Then, the .png files can be found in the rqss_presenter_output directory.
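The three steps above can also be chained from a small driver script. The sketch below is not shipped with RQSS; it only assembles the same command lines shown in this section and runs them in order, stopping at the first failure. The names `build_commands` and `run_all` are hypothetical.

```python
import subprocess
import sys

# Paths, endpoint, and flags are copied from the commands in this README.
ENDPOINT = "http://localhost:9999/blazegraph/sparql"
EXTRACTOR_OUT = "./rqss_extractor_output/"
FRAMEWORK_OUT = "./rqss_framework_output"

EXTRACTOR_FLAGS = (
    "-eu -sn -l -fr -rp -rpvt -ri -sr -irf -wes -cf -sfr -aof -pu -es -ss"
).split()
RUNNER_FLAGS = (
    "-dp -l -sec -i -rts -rls -rtm -rpc -rc -rs -rdns -mr -ha -ts -rf -ef "
    "--extract-google-cache -ev -et -cpsc -sbpc -pc -aof -el -rpd -hm -he "
    "-bn -mm -mfs"
).split()


def build_commands(endpoint=ENDPOINT, python=sys.executable):
    """Return the Extractor, Framework_Runner, and Presenter command lines."""
    return [
        [python, "RQSSFramework/RQSS_Extractor.py",
         "--endpoint", endpoint, *EXTRACTOR_FLAGS],
        [python, "RQSSFramework/RQSS_Framework_Runner.py", EXTRACTOR_OUT,
         "--endpoint", endpoint, *RUNNER_FLAGS],
        [python, "RQSSFramework/RQSS_Presenter.py", FRAMEWORK_OUT],
    ]


def run_all(endpoint=ENDPOINT):
    """Run the three stages in order; check=True aborts on a non-zero exit."""
    for cmd in build_commands(endpoint):
        subprocess.run(cmd, check=True)
```

Calling `run_all()` from the repo directory is equivalent to typing the three commands by hand; later stages are not started if an earlier one fails.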
Compute individual metrics
RQSS metrics can also be computed separately. To compute a single metric, first check the --help output of RQSSFramework/RQSS_Framework_Runner.py to find the name of the metric. Then check the --help output of RQSSFramework/RQSS_Extractor.py to see whether the metric needs any metadata to be fetched first. For example, to compute the Availability of External URIs, the help of the Extractor tells us we need to obtain all external source URLs in the input graph, so we run this command:
$ python ./RQSSFramework/RQSS_Extractor.py --endpoint http://localhost:9999/blazegraph/sparql --external-uris
Then, we call the Framework_Runner as below:
$ python RQSSFramework/RQSS_Framework_Runner.py ./rqss_extractor_output/ --dereferencing
And then, to have the graphical chart, the Presenter can be called:
$ python RQSSFramework/RQSS_Presenter.py ./rqss_framework_output
Repo Structure
The main part of the framework code is located in the RQSSFramework package. In this directory, there is a package corresponding to each dimension of the framework, in which Python files have implemented one or more metrics. In addition to the dimensions and metrics packages, the following scripts and files exist in the RQSSFramework package:
- entityschemaextractor.py: To fetch the most up-to-date EntitySchemas and their referencing information from the Wikidata website
- Queries.py: SPARQL queries used in the Extractor and the Framework_Runner
- RQSS_Extractor.py: The Extractor module
- RQSS_Framework_Runner.py: The metric computer module
- RQSS_Presenter.py: The graphical chart representer module
- ShExes.py: The ShEx schemas used in consistency and other dimensions
In addition, the utils directory in the RQSSFramework package contains the following scripts:
- item_overlap_checker.py: The script is used to identify the overlapping items amongst randomly chosen subsets
- lists.py: Contains the list of datasets, licensing keywords, and any other set of literal values used in the Framework_Runner
- topic_coverage.py: This script is used to compute the main high-level classes of items in a subset (the topics the subset covers)
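To illustrate what checking item overlap between subsets means, here is a minimal sketch based on set intersection. It is not the code of item_overlap_checker.py, whose implementation is not described here; the function name and the item IDs are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(subsets):
    """Return {(name_a, name_b): shared item IDs} for every pair of subsets."""
    return {
        (a, b): subsets[a] & subsets[b]
        for a, b in combinations(sorted(subsets), 2)
    }


# Hypothetical item IDs, for illustration only.
example = {
    "random_1": {"Q1", "Q2", "Q3"},
    "random_2": {"Q2", "Q3", "Q4"},
    "random_3": {"Q5"},
}
overlaps = pairwise_overlap(example)
```

Each key is a pair of subset names in sorted order, and an empty set means the two subsets share no items.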
More details
See the README.md inside the RQSSFramework.
Owner
- Name: Seyed Amir Hosseini Beghaeiraveri
- Login: seyedahbr
- Kind: user
- Location: Edinburgh, UK
- Company: Heriot-Watt University
- Website: https://seyedahbr.github.io/
- Twitter: s_a_h_b_r
- Repositories: 12
- Profile: https://github.com/seyedahbr
PhD student @ Heriot-Watt University Linked Data . Semantic Web . Big Data . Distributed Systems
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: ' RQSSFramework'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Seyed Amir
    family-names: Hosseini Beghaeiraveri
    email: sh200@hw.ac.uk
    affiliation: Heriot-Watt University
    orcid: 'https://orcid.org/0000-0002-9123-5686'
repository-code: 'https://github.com/seyedahbr/RQSSFramework'
url: 'https://github.com/seyedahbr/RQSSFramework/releases/tag/v1.0.1'
repository: 'https://github.com/seyedahbr/RQSSFramework'
repository-artifact: 'https://github.com/seyedahbr/RQSSFramework'
keywords:
  - Wikidata
  - subset
  - data quality
  - reference quality
license: CC-BY-1.0
version: 1.0.0