oke-dataset-analysis
The basic usage of OPC UA Knowledge Extraction Dataset
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.3%) to scientific vocabulary
Repository
The basic usage of OPC UA Knowledge Extraction Dataset
Basic Info
- Host: GitHub
- Owner: Siemens-OKE
- License: mit
- Language: Python
- Default Branch: main
- Size: 7.94 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
OKE Dataset Analysis
This repository provides four different analyses for the OKE Dataset introduced in Tufek Ozkaya et al. (2023).
Motivation
The motivation of the analyses described below is to:
1. Detect inconsistencies among the annotators of the dataset and highlight the consistency of the keywords (consistency analysis).
2. Show the distributions of the annotated entities across the companion specifications of OPC UA and extract correlations between these specifications (entity distribution analysis).
3. Using only the annotated entities and their frequencies, generate heatmaps that compute similarity scores between companion specifications (common entity analysis).
Dataset
The OPC UA Knowledge Extraction (OKE) Dataset is a dataset created specifically for sentence classification, named entity recognition and disambiguation (NERD), and entity linking. To learn more about the dataset and to download it, please visit this link and get its latest version. After downloading the dataset, please make sure to place the files in the ./oke_dataset/ directory before running the analysis scripts.
The files should appear in the following hierarchy:
```
.
└── oke-dataset-analysis/
    ├── oke_dataset/
    │   ├── csv/
    │   │   ├── AutoId.csv
    │   │   └── ...
    │   └── excel/
    │       ├── AutoId.xlsx
    │       └── ...
    ├── output
    └── ...
```
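As a convenience, the expected layout can be checked before running anything; the helper below is an illustrative sketch (the function name is not part of the repository), verifying only the csv/ and excel/ subdirectories from the tree above.

```python
from pathlib import Path

def check_dataset_layout(root="oke_dataset"):
    # Hypothetical helper: verify the expected oke_dataset/ layout before
    # running the analysis scripts. Returns the list of missing directories;
    # an empty list means the layout matches the tree in the README.
    root = Path(root)
    expected = [root / "csv", root / "excel"]
    return [str(p) for p in expected if not p.is_dir()]

if __name__ == "__main__":
    missing = check_dataset_layout()
    print("Missing: " + ", ".join(missing) if missing else "Dataset layout OK")
```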
Prerequisites
Make sure that you have Python >= 3.9 and install the dependencies listed in requirements.txt.
Analysis
1. Keyword Analysis
This analysis will produce 4 different chart files:
- most used keywords
- most used keywords, filtered by inclusion*
- keywords assigned to more than 2 categories
- most conflicting keywords**
*Inclusion: given a keyword, assume it is seen n times in the specification. If it is annotated only a few times and left unannotated for the remaining occurrences, it will be discarded. This margin is determined by the threshold parameter.
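The inclusion rule can be sketched as follows; the function name and the exact formula are assumptions based on the description above, not the repository's actual implementation. The direction of the threshold is inferred from the later note that decreasing it removes keywords, so here it bounds the tolerated unannotated share.

```python
def passes_inclusion(annotated, seen, threshold=0.9):
    # Hypothetical sketch of the inclusion filter: a keyword seen `seen`
    # times but annotated only `annotated` times is discarded when too
    # large a share of its occurrences went unannotated.
    unannotated_share = (seen - annotated) / seen
    return unannotated_share <= threshold
```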
**Here, "most conflicting" means that a keyword is assigned to two or more categories and those categories' rates are close to each other. For example, for a keyword annotated 10 times, being annotated 5 times as runtime-only and 5 times as an information model keyword is one of the most conflicting situations.
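One way to score this, sketched below, is the ratio of the second-most-frequent category count to the most frequent one; this metric and the function name are illustrative assumptions, not the repository's actual measure.

```python
from collections import Counter

def conflict_score(category_labels):
    # Illustrative conflict measure: 1.0 is a perfect tie between the two
    # leading categories (most conflicting); values near 0 mean a single
    # category clearly dominates; 0.0 if only one category was ever used.
    counts = Counter(category_labels).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / counts[0][1]
```

With the README's example of 5 runtime-only and 5 information-model annotations, the score is 1.0, the most conflicting case.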
- IMPORTANT: Before running the analysis, please run the helper script:

```bash
python merge_all_specs.py
```

which will combine all the samples in all the files into one file. Afterwards, one is free to run any experiment described below.

Running commands:

Run on all sentences from all Excel sheets as follows. This will take some time (up to 5 minutes depending on the hardware):
```bash
python keyword_analysis.py
```

Select a custom specification (default = all):
```bash
python keyword_analysis.py -s autoid
```

Specification options: "all", "autoid", "iolink", "isa95", "machinetools", "mv1ccm", "mv2amcm", "packml", "padim", "profinet", "robotics", "uafx", "weihenstephan"

Select the sentence type:
- for only rule sentences:

```bash
python keyword_analysis.py -s autoid -f --only_rule_sentences
```

- for only non-rule sentences:

```bash
python keyword_analysis.py -s autoid -f --only_non_rule_sentences
```

The default is all sentences.
Overriding the inclusion threshold for keywords (default = 0.9)*:

```bash
python keyword_analysis.py -s autoid -f --only_non_rule_sentences --threshold 0.8
```

Setting the output path:

```bash
python keyword_analysis.py -s autoid -o test_output/
```

*Decreasing the threshold will likely lead to fewer keywords being used during computation.
The output will be generated under ./output/keyword_analysis/ folder.
2. Entity Distribution Analysis
In order to run entity distribution analysis, the following commands can be called.
Computation Steps:
- Calculate the entity distribution histogram for each companion specification.
- Normalize these histogram vectors to have unit length.
- Take cosine similarity (dot product) between these vectors.
- Collect the pair-wise cosine similarities in a heatmap.
- Repeat the computation for rule and non-rule sentences separately.
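The computation steps above can be sketched in plain Python; the spec names and entity counts below are made-up placeholders, not values from the dataset.

```python
from math import sqrt

def unit(v):
    # Step 2: normalize a histogram vector to unit length.
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    # Step 3: cosine similarity as the dot product of unit vectors.
    return sum(a * b for a, b in zip(unit(u), unit(v)))

# Step 1: per-spec entity distribution histograms (placeholder values).
histograms = {
    "autoid":   [12, 3, 5],
    "iolink":   [10, 4, 6],
    "robotics": [1, 9, 2],
}

# Step 4: collect the pair-wise similarities in a matrix (the heatmap data).
specs = sorted(histograms)
heatmap = [[cosine(histograms[a], histograms[b]) for b in specs] for a in specs]
```

The resulting matrix can then be rendered as a heatmap (e.g. with matplotlib); the final step repeats this once for rule and once for non-rule sentences.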
Running commands:
For all entity categories:

```bash
python entity_distribution_analysis.py --entity_name all
```

For a specific entity category, call the following to generate its heatmap:

```bash
python entity_distribution_analysis.py --entity_name information_model
```
Output:
The output will be generated under the ./output/entity_distribution_analysis/ folder. Two sub-folders are generated for this purpose:
1. ./output/entity_distribution_analysis/distribution_bar_chart/ contains the distribution charts of each entity for each datasheet.
2. ./output/entity_distribution_analysis/distribution_heat_map/ contains the heatmaps of the distribution correlation between datasheets based on different entity categories.
3. Common Entity Analysis
This analysis will produce four files, each of which contains a heatmap indicating the similarities of specifications based on each of the four following columns in the dataset. The columns are: "Information Model Keywords", "Constraint Keywords", "Relation Keywords" and "Runtime only".
Computation Steps:
- Collect all the keywords from specified column across all the companion specs.
- For each comp spec, calculate a keyword frequency histogram over the set obtained in step #1.
- Normalize these histogram vectors to have unit length.
- Take cosine similarity (dot product) between these vectors.
- Collect the pair-wise cosine similarities in a heatmap.

Running commands:
- all sentences (unfiltered):
```bash
python common_entity_analysis.py
```

- only rule sentences:

```bash
python common_entity_analysis.py -f --only_rule_sentences
```

- only non-rule sentences:

```bash
python common_entity_analysis.py -f --only_non_rule_sentences
```
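Steps 1 and 2 of the computation above (a shared keyword vocabulary, then per-spec frequency vectors over it) can be sketched as follows; the function name and sample keywords are illustrative, not taken from the dataset.

```python
from collections import Counter

def keyword_histograms(keywords_per_spec):
    # Step 1: the union of keywords across all companion specs gives a
    # shared vocabulary, so every spec's vector is comparable slot by slot.
    vocab = sorted({k for kws in keywords_per_spec.values() for k in kws})
    # Step 2: per-spec keyword frequency histogram over that vocabulary.
    counts = {spec: Counter(kws) for spec, kws in keywords_per_spec.items()}
    return vocab, {spec: [c[k] for k in vocab] for spec, c in counts.items()}

# Made-up "Constraint Keywords" columns for two specs:
vocab, hists = keyword_histograms({
    "autoid": ["shall", "shall", "unique"],
    "packml": ["shall", "state"],
})
```

Normalization and cosine similarity (steps 3 and 4) then proceed exactly as in the entity distribution analysis above.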
Output:
The output will be generated under the ./output/common_entity_analysis/ folder.
4. General Specification Analysis
This analysis will produce a single file that contains a heatmap indicating the similarities of specifications based on some columns in the dataset.
Running commands:
All sentences (unfiltered):

```bash
python spec_analysis.py
```

Only rule sentences:

```bash
python spec_analysis.py -f --only_rule_sentences
```

Only non-rule sentences:

```bash
python spec_analysis.py -f --only_non_rule_sentences
```

Overriding the inclusion threshold for keywords (default = 0.9)*:

```bash
python spec_analysis.py -f --only_rule_sentences --threshold 0.8
```

*Decreasing the threshold will likely lead to fewer keywords being used during computation.
Output:
The output will be generated under ./output/specification_analysis/ folder.
Dataset Citation
```bibtex
@dataset{tufek_ozkaya_2023_10284578,
  author    = {Tufek Ozkaya, Nilay},
  title     = {OPC UA Knowledge Extraction (OKE) Dataset},
  month     = dec,
  year      = 2023,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.10284577},
  url       = {https://doi.org/10.5281/zenodo.10284577}
}
```
Authors
- Nilay Tüfek Özkaya (nilay.tuefek-oezkaya@siemens.com)
- Valentin Philipp (valentin.just@tuwien.ac.at)
- Berkay Ugur (berkaysenocak@gmail.com)
- Tathagata Bandyopadhyay (tathagata.bandyopadhyay@siemens.com)
Owner
- Name: Siemens-OKE
- Login: Siemens-OKE
- Kind: organization
- Repositories: 1
- Profile: https://github.com/Siemens-OKE
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Tufek
    given-names: Nilay
    orcid: https://orcid.org/0000-0002-0446-1299
title: "OKE Dataset Analysis"
version: 1.0.0
identifiers:
  - type: doi
    value: 10.5281/zenodo.10284577
date-released: 2024-02-02
```
GitHub Events
Total
- Push event: 3
Last Year
- Push event: 3