anonymized-etl-flow-datasets-for-fsm
Anonymized versions of six datasets taken from IBM's DataStage™ production systems, which can be used for frequent subgraph mining
Repository
Basic Info
- Host: GitHub
- Owner: IBM
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 1.39 MB
Statistics
- Stars: 8
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Anonymized ETL Flow Datasets for Frequent Subgraph Mining
Datasets Overview
This repository contains anonymized versions of six datasets taken from IBM's DataStage production systems and used for frequent subgraph mining in the paper Refactoring ETL Flows in The Wild.
If you are using this dataset in publications, please cite:
Dolev Adas, Ohad Eytan, Guy Khazma, Josep Sampé, and Paula Ta-Shma.
"Refactoring ETL Flows in The Wild."
In 2023 IEEE International Conference on Big Data (BigData), pp. 1581-1590. IEEE, 2023.
We also have a companion blog post to the paper, and a blog post describing the dataset creation and motivation.
Dataset Format
Similar to the format used here and here, each dataset is a text file, where each line contains one of three options:
1. t # n - represents the start of flow number n.
2. v x l - represents a vertex with id x and label l (below we explain the process of deriving l).
3. e x y l - represents an edge from vertex x to vertex y with label l (in our case, all the labels are 1).
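The three line types above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the names `Flow` and `parse_flows` are hypothetical, vertex ids are assumed to be integers, labels are kept as strings, and the sample lines are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    """One flow: vertex labels keyed by vertex id, plus labeled directed edges."""
    flow_id: int
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (source id, target id, label)


def parse_flows(lines):
    """Parse 't # n', 'v x l' and 'e x y l' lines into a list of Flow objects."""
    flows = []
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            # 't # n' starts flow number n
            flows.append(Flow(flow_id=int(parts[2])))
        elif kind == "v":
            # 'v x l' declares vertex x with label l
            flows[-1].vertices[int(parts[1])] = parts[2]
        elif kind == "e":
            # 'e x y l' declares an edge from vertex x to vertex y with label l
            flows[-1].edges.append((int(parts[1]), int(parts[2]), parts[3]))
        else:
            raise ValueError(f"unrecognized line: {raw!r}")
    return flows


# Made-up sample in the format described above (not taken from the datasets).
sample = """t # 0
v 0 17
v 1 42
e 0 1 1
t # 1
v 0 17
""".splitlines()

flows = parse_flows(sample)
print(len(flows), flows[0].vertices, flows[0].edges)
```

To read a real dataset file, pass an open file object instead of `sample`, since iterating over it yields one line at a time.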
The Lifting and Anonymization Process
As explained in more detail in the paper, each stage in a DataStage flow has parameters, and there are different options for deriving the stage's label depending on which patterns we are looking for. We call this process lifting.
Here, we are publishing two types of lifting:
1. Simple: We take only the stage type as the label. This can be used to find general patterns and to help build tools that assist flow authoring.
2. Detailed: We take parameters into account, aiming to include all the parameters that would make it possible to refactor flows to use common subflows. Note that as this is a WIP prototype, it might not be completely accurate (e.g., we may include parameters that we shouldn't, or omit ones that we should).
These values are hashed into unique integers to preserve our users' anonymity while keeping the structure of the flows and the ability to find common subgraphs using FSM algorithms. The hashes are not consistent between different datasets.
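The two lifting types and the mapping to integers can be illustrated with a small sketch. This is not the pipeline used to produce the datasets: the stage dictionaries, parameter names, and the helpers `lift_simple`, `lift_detailed`, and `LabelAnonymizer` are all hypothetical, and a per-dataset counter stands in for whatever hashing was actually applied.

```python
def lift_simple(stage):
    """Simple lifting: the label is the stage type only."""
    return stage["type"]


def lift_detailed(stage, keys):
    """Detailed lifting: the label combines the stage type with selected parameters."""
    params = tuple((k, stage["params"].get(k)) for k in sorted(keys))
    return (stage["type"], params)


class LabelAnonymizer:
    """Maps each distinct label to a unique integer, independently per dataset."""

    def __init__(self):
        self._ids = {}

    def encode(self, label):
        # Equal labels get the same integer; a new label gets the next free one.
        return self._ids.setdefault(label, len(self._ids))


# Hypothetical stages; the types and parameter names are made up.
stages = [
    {"type": "Filter", "params": {"where": "a > 1"}},
    {"type": "Filter", "params": {"where": "b = 2"}},
    {"type": "Join", "params": {"key": "id"}},
]

simple = LabelAnonymizer()
detailed = LabelAnonymizer()
simple_labels = [simple.encode(lift_simple(s)) for s in stages]
detailed_labels = [detailed.encode(lift_detailed(s, ["where", "key"])) for s in stages]
print(simple_labels)    # the two Filter stages share a label
print(detailed_labels)  # differing parameters give the Filter stages distinct labels
```

Because each `LabelAnonymizer` starts from an empty mapping, the same underlying label can receive different integers in different datasets, which mirrors the note above that the hashes are not consistent between datasets.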
Acknowledgment
We thank the DataStage team for providing us with the data and allowing us to share it with the community.
License
Although this GitHub repository is under the Apache-2.0 license, the actual datasets are released under the CDLA-Sharing-1.0 license. By downloading or using them, you agree to the terms of this license.
Owner
- Name: International Business Machines
- Login: IBM
- Kind: organization
- Email: awesome@ibm.com
- Location: United States of America
- Website: https://www.ibm.com/opensource/
- Twitter: ibmdeveloper
- Repositories: 3,152
- Profile: https://github.com/IBM
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Anonymized ETL Flow Datasets for Frequent Subgraph Mining
message: 'If you use this dataset, please cite it as below.'
type: dataset
authors:
  - family-names: Adas
    given-names: Dolev
  - family-names: Eytan
    given-names: Ohad
    email: ohad.eytan1@ibm.com
    affiliation: IBM Research
    orcid: 'https://orcid.org/0000-0001-8655-794X'
  - family-names: Khazma
    given-names: Guy
  - family-names: Sampé
    given-names: Josep
  - family-names: Ta-Shma
    given-names: Paula
repository-code: >-
  https://github.com/IBM/Anonymized-ETL-Flow-Datasets-for-FSM
abstract: >-
  Anonymized version of six datasets taken from IBM's
  DataStage™ production systems and can be used for frequent
  subgraph mining
license: CDLA-Sharing-1.0
date-released: '2023-11-16'
preferred-citation:
  type: conference-paper
  title: Refactoring ETL Flows in The Wild
  authors:
    - family-names: Adas
      given-names: Dolev
    - family-names: Eytan
      given-names: Ohad
      email: ohad.eytan1@ibm.com
      affiliation: IBM Research
      orcid: 'https://orcid.org/0000-0001-8655-794X'
    - family-names: Khazma
      given-names: Guy
    - family-names: Sampé
      given-names: Josep
    - family-names: Ta-Shma
      given-names: Paula
  year: '2023'
  month: '12'
  collection-title: "2023 IEEE International Conference on Big Data (Big Data)"
  location: 'Sorrento, Italy'
  doi: 10.1109/BigData59044.2023.10386531
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2