anonymized-etl-flow-datasets-for-fsm

An anonymized version of six datasets taken from IBM's DataStage™ production systems that can be used for frequent subgraph mining

https://github.com/ibm/anonymized-etl-flow-datasets-for-fsm

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.0%) to scientific vocabulary


Repository


Basic Info

Host: GitHub
Owner: IBM
License: apache-2.0
Language: Python
Default Branch: main
Size: 1.39 MB

Statistics

Stars: 8
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago


Anonymized ETL Flow Datasets for Frequent Subgraph Mining

Datasets Overview

This repository contains an anonymized version of six datasets taken from IBM's DataStage production systems and used for frequent subgraph mining in the paper Refactoring ETL Flows in The Wild.

If you are using this dataset in publications, please cite:

Dolev Adas, Ohad Eytan, Guy Khazma, Josep Sampé, and Paula Ta-Shma.

"Refactoring ETL Flows in The Wild."

In 2023 IEEE International Conference on Big Data (BigData), pp. 1581-1590. IEEE, 2023.

We also have a companion blog post to the paper, and a blog post describing the dataset creation and motivation.

Dataset Format

Similar to the format used here and here, each dataset is a text file in which each line takes one of three forms:
1. t # n - represents the start of flow number n.
2. v x l - represents a vertex with id x and label l (below we explain the process of deriving l).
3. e x y l - represents an edge from vertex x to vertex y with label l (in our case, all edge labels are 1).
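The three line types above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the names Flow and parse_flows are hypothetical, the sample text is made up for illustration, and the handling of a negative flow number as an end-of-file marker is an assumption borrowed from other tools that use this format.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    """One flow (graph) read from a dataset file. Hypothetical helper type."""
    flow_id: int
    vertices: dict = field(default_factory=dict)  # vertex id -> integer label
    edges: list = field(default_factory=list)     # (source, target, label) tuples


def parse_flows(text: str) -> list:
    """Parse 't # n', 'v x l' and 'e x y l' lines into Flow objects."""
    flows = []
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            flow_id = int(parts[2])
            if flow_id < 0:
                # Assumption: a negative id marks the end of the file.
                break
            flows.append(Flow(flow_id=flow_id))
        elif kind == "v":
            flows[-1].vertices[int(parts[1])] = int(parts[2])
        elif kind == "e":
            flows[-1].edges.append((int(parts[1]), int(parts[2]), int(parts[3])))
    return flows


# Made-up sample in the format described above (not real dataset content).
sample = """t # 0
v 0 17
v 1 42
v 2 17
e 0 1 1
e 1 2 1
t # 1
v 0 42
v 1 8
e 0 1 1
"""

flows = parse_flows(sample)
for flow in flows:
    print(flow.flow_id, len(flow.vertices), "vertices,", len(flow.edges), "edges")
```

To read one of the dataset files, pass its contents to parse_flows, for example via pathlib.Path(...).read_text() with the path of the file you downloaded.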

The Lifting and Anonymization Process

As explained in more detail in the paper, each stage in a DataStage flow has parameters, and there are different options for deriving the label of a stage depending on which patterns we are looking for. We call this process lifting.

Here, we are publishing two types of lifting:
1. Simple: We take only the stage type as the label. This can be used to find general patterns and to help build tools that assist flow authoring.
2. Detailed: We take parameters into account, aiming to include all the parameters that would make it possible to refactor flows to use common subflows. Note that because this is a work-in-progress prototype, it might not be completely accurate (e.g., we may take parameters that we shouldn't, or vice versa).

These values are hashed into unique integers to preserve our users' anonymity while keeping the structure of the flows and the ability to find common subgraphs using FSM algorithms. The hashes are not consistent between different datasets.
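To make the idea concrete, here is a minimal sketch of lifting followed by anonymization. It is not the authors' implementation: the stage dictionaries, the function names lift_simple and lift_detailed, the LabelAnonymizer class, and the choice of which parameters to keep are all hypothetical. It also swaps in a per-dataset counter in place of a real hash function, which is enough to show why equal labels map to the same integer within a dataset but not across datasets.

```python
import json


def lift_simple(stage: dict) -> str:
    """Simple lifting: the label is just the stage type."""
    return stage["type"]


def lift_detailed(stage: dict, keep: tuple) -> str:
    """Detailed lifting: the label combines the stage type with selected parameters."""
    params = {k: stage["params"][k] for k in keep if k in stage["params"]}
    # sort_keys gives a canonical string, so equal stages yield equal labels.
    return json.dumps({"type": stage["type"], "params": params}, sort_keys=True)


class LabelAnonymizer:
    """Maps each distinct lifted label to an integer (one instance per dataset)."""

    def __init__(self):
        self._ids = {}

    def label_id(self, lifted: str) -> int:
        # Assumption: a running counter stands in for the actual hashing scheme.
        return self._ids.setdefault(lifted, len(self._ids))


# Made-up stages for illustration only.
stages = [
    {"type": "Sort", "params": {"key": "id", "order": "asc"}},
    {"type": "Sort", "params": {"key": "name", "order": "asc"}},
    {"type": "Filter", "params": {"where": "x > 1"}},
]

simple = LabelAnonymizer()
simple_labels = [simple.label_id(lift_simple(s)) for s in stages]

detailed = LabelAnonymizer()
detailed_labels = [
    detailed.label_id(lift_detailed(s, keep=("key", "where"))) for s in stages
]

print(simple_labels)
print(detailed_labels)
```

With simple lifting both Sort stages share one label, while detailed lifting separates them because their key parameter differs. Because each dataset would get its own LabelAnonymizer, an integer in one dataset says nothing about the same integer in another.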

Acknowledgment

We thank the DataStage team for providing us the data and allowing us to share it with the community.

License

Although this GitHub repository is under the Apache-2.0 license, the actual datasets are released under the CDLA-Sharing-1.0 license. By downloading or using them, you agree to the terms of this license.

Owner

Name: International Business Machines
Login: IBM
Kind: organization
Email: awesome@ibm.com
Location: United States of America

Website: https://www.ibm.com/opensource/
Twitter: ibmdeveloper
Repositories: 3,152
Profile: https://github.com/IBM

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Anonymized ETL Flow Datasets for Frequent Subgraph Mining
message: 'If you use this dataset, please cite it as below.'
type: dataset
authors:
  - family-names: Adas
    given-names: Dolev
  - family-names: Eytan
    given-names: Ohad
    email: ohad.eytan1@ibm.com
    affiliation: IBM Research
    orcid: 'https://orcid.org/0000-0001-8655-794X'
  - family-names: Khazma
    given-names: Guy
  - family-names: Sampé
    given-names: Josep
  - family-names: Ta-Shma
    given-names: Paula
repository-code: >-
  https://github.com/IBM/Anonymized-ETL-Flow-Datasets-for-FSM
abstract: >-
  Anonymized version of six datasets taken from IBM's
  DataStage™ production systems and can be used for frequent
  subgraph mining 
license: CDLA-Sharing-1.0
date-released: '2023-11-16'
preferred-citation:
  type: conference-paper
  title: Refactoring ETL Flows in The Wild
  authors:
    - family-names: Adas
      given-names: Dolev
    - family-names: Eytan
      given-names: Ohad
      email: ohad.eytan1@ibm.com
      affiliation: IBM Research
      orcid: 'https://orcid.org/0000-0001-8655-794X'
    - family-names: Khazma
      given-names: Guy
    - family-names: Sampé
      given-names: Josep
    - family-names: Ta-Shma
      given-names: Paula
  year: '2023'
  month: '12'
  collection-title: "2023 IEEE International Conference on Big Data (Big Data)"
  location: 'Sorrento, Italy'
  doi: 10.1109/BigData59044.2023.10386531
