oke-dataset-analysis
The basic usage of OPC UA Knowledge Extraction Dataset
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.3%) to scientific vocabulary
Repository
The basic usage of OPC UA Knowledge Extraction Dataset
Basic Info
- Host: GitHub
- Owner: Siemens-OKE
- License: mit
- Language: Python
- Default Branch: main
- Size: 7.94 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
OKE Dataset Analysis
This repository provides four different analyses for the OKE Dataset introduced in Tufek Ozkaya et al. (2023).
Motivation
The motivation of the analyses described below is to:
1. Detect inconsistencies among the annotators of the dataset and highlight the consistency of the keywords (consistency analysis).
2. Show the distributions of the annotated entities across the companion specifications of OPC UA and extract correlations between these specifications (entity distribution analysis).
3. Using only the annotated entities and their frequencies, generate heatmaps that compute similarity scores between companion specifications (common entity analysis).
Dataset
The OPC UA Knowledge Extraction (OKE) Dataset is a dataset created specifically for sentence classification, named entity recognition and disambiguation (NERD), and entity linking. To learn more about the dataset and to download it, please visit this link and get its latest version. After downloading the dataset, please make sure to place the files in the ./oke_dataset/ directory before running the analysis scripts.
The files should appear in the following hierarchy:
```
.
└── oke-dataset-analysis/
    ├── oke_dataset/
    │   ├── csv/
    │   │   ├── AutoId.csv
    │   │   └── ...
    │   └── excel/
    │       ├── AutoId.xlsx
    │       └── ...
    ├── output
    └── ...
```
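As a convenience, the expected layout can be checked before running anything; the helper below is an illustrative sketch (the function name is not part of the repository), verifying only the csv/ and excel/ subdirectories from the tree above.

```python
from pathlib import Path

def check_dataset_layout(root="oke_dataset"):
    # Hypothetical helper: verify the expected oke_dataset/ layout before
    # running the analysis scripts. Returns the list of missing directories;
    # an empty list means the layout matches the tree in the README.
    root = Path(root)
    expected = [root / "csv", root / "excel"]
    return [str(p) for p in expected if not p.is_dir()]

if __name__ == "__main__":
    missing = check_dataset_layout()
    print("Missing: " + ", ".join(missing) if missing else "Dataset layout OK")
```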
Prerequisites
Make sure that you have Python >= 3.9 and install the dependencies listed in requirements.txt.
Analysis
1. Keyword Analysis
This analysis will produce 4 different chart files:
- most used keywords
- most used keywords, filtered by inclusion*
- keywords assigned to more than 2 categories
- most conflicting keywords**
*Inclusion: given a keyword, assume it is seen n times in the specification. If it is annotated only a few times and left unannotated for the remaining occurrences, it will be discarded. This margin is determined by the threshold parameter.
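The inclusion rule can be sketched as follows; the function name and the exact formula are assumptions based on the description above, not the repository's actual implementation. The direction of the threshold is inferred from the later note that decreasing it removes keywords, so here it bounds the tolerated unannotated share.

```python
def passes_inclusion(annotated, seen, threshold=0.9):
    # Hypothetical sketch of the inclusion filter: a keyword seen `seen`
    # times but annotated only `annotated` times is discarded when too
    # large a share of its occurrences went unannotated.
    unannotated_share = (seen - annotated) / seen
    return unannotated_share <= threshold
```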
**Here, "most conflicting" means that a keyword is assigned to two or more categories and those categories' rates are close to each other. For example, for a keyword annotated 10 times, being annotated 5 times as runtime-only and 5 times as an information model keyword is one of the most conflicting situations.
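One way to score this, sketched below, is the ratio of the second-most-frequent category count to the most frequent one; this metric and the function name are illustrative assumptions, not the repository's actual measure.

```python
from collections import Counter

def conflict_score(category_labels):
    # Illustrative conflict measure: 1.0 is a perfect tie between the two
    # leading categories (most conflicting); values near 0 mean a single
    # category clearly dominates; 0.0 if only one category was ever used.
    counts = Counter(category_labels).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / counts[0][1]
```

With the README's example of 5 runtime-only and 5 information-model annotations, the score is 1.0, the most conflicting case.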
- IMPORTANT: Before running the analysis, please run the helper script:

```bash
python merge_all_specs.py
```

which will combine all the samples in all the files into one file. Afterwards, one is free to run any experiment described below.

Running commands:

Run on all sentences from all Excel sheets as follows. This will take some time (up to 5 minutes depending on the hardware):
```bash
python keyword_analysis.py
```

Select a custom specification (default = all):
```bash
python keyword_analysis.py -s autoid
```

Specification options: "all", "autoid", "iolink", "isa95", "machinetools", "mv1ccm", "mv2amcm", "packml", "padim", "profinet", "robotics", "uafx", "weihenstephan"

Select the sentence type:
- for only rule sentences:

```bash
python keyword_analysis.py -s autoid -f --only_rule_sentences
```

- for only non-rule sentences:

```bash
python keyword_analysis.py -s autoid -f --only_non_rule_sentences
```

The default is all sentences.
Overriding the inclusion threshold for keywords (default = 0.9)*:

```bash
python keyword_analysis.py -s autoid -f --only_non_rule_sentences --threshold 0.8
```

Setting the output path:

```bash
python keyword_analysis.py -s autoid -o test_output/
```

*Decreasing the threshold will likely lead to fewer keywords being used during computation.
The output will be generated under ./output/keyword_analysis/ folder.
2. Entity Distribution Analysis
In order to run entity distribution analysis, the following commands can be called.
Computation Steps:
- Calculate the entity distribution histogram for each companion specification.
- Normalize these histogram vectors to have unit length.
- Take cosine similarity (dot product) between these vectors.
- Collect the pair-wise cosine similarities in a heatmap.
- Repeat the computation for rule and non-rule sentences separately.
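The computation steps above can be sketched in plain Python; the spec names and entity counts below are made-up placeholders, not values from the dataset.

```python
from math import sqrt

def unit(v):
    # Step 2: normalize a histogram vector to unit length.
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    # Step 3: cosine similarity as the dot product of unit vectors.
    return sum(a * b for a, b in zip(unit(u), unit(v)))

# Step 1: per-spec entity distribution histograms (placeholder values).
histograms = {
    "autoid":   [12, 3, 5],
    "iolink":   [10, 4, 6],
    "robotics": [1, 9, 2],
}

# Step 4: collect the pair-wise similarities in a matrix (the heatmap data).
specs = sorted(histograms)
heatmap = [[cosine(histograms[a], histograms[b]) for b in specs] for a in specs]
```

The resulting matrix can then be rendered as a heatmap (e.g. with matplotlib); the final step repeats this once for rule and once for non-rule sentences.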
Running commands:
For all entity categories:

```bash
python entity_distribution_analysis.py --entity_name all
```

For a specific entity category, call the following to generate its heatmap:

```bash
python entity_distribution_analysis.py --entity_name information_model
```
Output:
The output will be generated under the ./output/entity_distribution_analysis/ folder. Two sub-folders are generated for this purpose:
1. ./output/entity_distribution_analysis/distribution_bar_chart/ contains the distribution charts of each entity for each datasheet.
2. ./output/entity_distribution_analysis/distribution_heat_map/ contains the heatmaps of the distribution correlation between datasheets based on different entity categories.
3. Common Entity Analysis
This analysis will produce four files, each of which contains a heatmap indicating the similarities of specifications based on each of the four following columns in the dataset. The columns are: "Information Model Keywords", "Constraint Keywords", "Relation Keywords" and "Runtime only".
Computation Steps:
- Collect all the keywords from specified column across all the companion specs.
- For each comp spec, calculate a keyword frequency histogram over the set obtained in step #1.
- Normalize these histogram vectors to have unit length.
- Take cosine similarity (dot product) between these vectors.
- Collect the pair-wise cosine similarities in a heatmap.

Running commands:
- all sentences (unfiltered):
```bash
python common_entity_analysis.py
```

- only rule sentences:

```bash
python common_entity_analysis.py -f --only_rule_sentences
```

- only non-rule sentences:

```bash
python common_entity_analysis.py -f --only_non_rule_sentences
```
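Steps 1 and 2 of the computation above (a shared keyword vocabulary, then per-spec frequency vectors over it) can be sketched as follows; the function name and sample keywords are illustrative, not taken from the dataset.

```python
from collections import Counter

def keyword_histograms(keywords_per_spec):
    # Step 1: the union of keywords across all companion specs gives a
    # shared vocabulary, so every spec's vector is comparable slot by slot.
    vocab = sorted({k for kws in keywords_per_spec.values() for k in kws})
    # Step 2: per-spec keyword frequency histogram over that vocabulary.
    counts = {spec: Counter(kws) for spec, kws in keywords_per_spec.items()}
    return vocab, {spec: [c[k] for k in vocab] for spec, c in counts.items()}

# Made-up "Constraint Keywords" columns for two specs:
vocab, hists = keyword_histograms({
    "autoid": ["shall", "shall", "unique"],
    "packml": ["shall", "state"],
})
```

Normalization and cosine similarity (steps 3 and 4) then proceed exactly as in the entity distribution analysis above.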
Output:
The output will be generated under the ./output/common_entity_analysis/ folder.
4. General Specification Analysis
This analysis will produce a single file that contains a heatmap indicating the similarities of specifications based on some columns in the dataset.
Running commands:
All sentences (unfiltered):

```bash
python spec_analysis.py
```

Only rule sentences:

```bash
python spec_analysis.py -f --only_rule_sentences
```

Only non-rule sentences:

```bash
python spec_analysis.py -f --only_non_rule_sentences
```

Overriding the inclusion threshold for keywords (default = 0.9)*:

```bash
python spec_analysis.py -f --only_rule_sentences --threshold 0.8
```

*Decreasing the threshold will likely lead to fewer keywords being used during computation.
Output:
The output will be generated under ./output/specification_analysis/ folder.
Dataset Citation
```bibtex
@dataset{tufek_ozkaya_2023_10284578,
  author    = {Tufek Ozkaya, Nilay},
  title     = {OPC UA Knowledge Extraction (OKE) Dataset},
  month     = dec,
  year      = 2023,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.10284577},
  url       = {https://doi.org/10.5281/zenodo.10284577}
}
```
Authors
- Nilay Tüfek Özkaya (nilay.tuefek-oezkaya@siemens.com)
- Valentin Philipp (valentin.just@tuwien.ac.at)
- Berkay Ugur (berkaysenocak@gmail.com)
- Tathagata Bandyopadhyay (tathagata.bandyopadhyay@siemens.com)
Owner
- Name: Siemens-OKE
- Login: Siemens-OKE
- Kind: organization
- Repositories: 1
- Profile: https://github.com/Siemens-OKE
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Tufek
    given-names: Nilay
    orcid: https://orcid.org/0000-0002-0446-1299
title: "OKE Dataset Analysis"
version: 1.0.0
identifiers:
  - type: doi
    value: 10.5281/zenodo.10284577
date-released: 2024-02-02
```
GitHub Events
Total
- Push event: 3
Last Year
- Push event: 3