anonymized-etl-flow-datasets-for-fsm

Anonymized versions of six datasets taken from IBM's DataStage™ production systems, which can be used for frequent subgraph mining

https://github.com/ibm/anonymized-etl-flow-datasets-for-fsm

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: IBM
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 1.39 MB
Statistics
  • Stars: 8
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

Anonymized ETL Flow Datasets for Frequent Subgraph Mining

Datasets Overview

This repository contains anonymized versions of six datasets taken from IBM's DataStage production systems and used for frequent subgraph mining in the paper Refactoring ETL Flows in The Wild.

If you are using this dataset in publications, please cite:

Dolev Adas, Ohad Eytan, Guy Khazma, Josep Sampé, and Paula Ta-Shma.

"Refactoring ETL Flows in The Wild."

In 2023 IEEE International Conference on Big Data (BigData), pp. 1581-1590. IEEE, 2023.

We also have a companion blog post to the paper, and a blog post describing the dataset creation and motivation.

Dataset Format

Similar to the format used here and here, each dataset is a text file in which each line takes one of three forms:

1. t # n - represents the start of flow number n.
2. v x l - represents a vertex with id x and label l (below we explain the process of deriving l).
3. e x y l - represents an edge from vertex x to vertex y with label l (in our case, all the labels are 1).
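The three record types above can be read with a few lines of Python. The following is a minimal sketch, not code from this repository: the names Flow and parse_flows are hypothetical, labels are kept as strings, and skipping a negative flow number (some gSpan-style files end with `t # -1`) is an assumption that may not apply to these datasets.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    """One flow (graph) read from a dataset file. Hypothetical helper type."""
    flow_id: int
    vertices: dict = field(default_factory=dict)  # vertex id -> label (string)
    edges: list = field(default_factory=list)     # (source id, target id, label)


def parse_flows(lines):
    """Parse 't # n', 'v x l' and 'e x y l' records into a list of Flow objects."""
    flows = []
    current = None
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        if kind == "t":
            flow_id = int(parts[2])  # parts[1] is the literal '#'
            if flow_id < 0:
                # Assumption: a negative id marks the end of the file.
                current = None
                continue
            current = Flow(flow_id=flow_id)
            flows.append(current)
        elif kind in ("v", "e"):
            if current is None:
                raise ValueError(f"record outside of a flow: {raw!r}")
            if kind == "v":
                current.vertices[int(parts[1])] = parts[2]
            else:
                current.edges.append((int(parts[1]), int(parts[2]), parts[3]))
        else:
            raise ValueError(f"unknown record type: {raw!r}")
    return flows


# A tiny hand-written sample in the format described above (not real data).
sample = ["t # 0", "v 0 17", "v 1 42", "e 0 1 1", "t # 1", "v 0 17", "t # -1"]
flows = parse_flows(sample)
```

To read a real dataset, pass an open file object instead of the sample list, since iterating a file yields its lines.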

The Lifting and Anonymization Process

As explained in more detail in the paper, each stage in a DataStage flow has parameters, and there are different options for deriving the label of a stage depending on which patterns we are looking for. We call this process lifting.

Here, we are publishing two types of lifting:

1. Simple: We take only the stage type as the label. This can be used to find general patterns and to help build tools that assist flow authoring.
2. Detailed: We take parameters into account, aiming to include all the parameters that would make it possible to refactor flows to use common subflows. Note that as this is a WIP prototype, it might not be completely accurate (e.g., we may take parameters that we shouldn't, or vice versa).

These values are hashed into unique integers to preserve our users' anonymity while keeping the structure of the flows and the ability to find common subgraphs using FSM algorithms. The hashes are not consistent between different datasets.
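To make the two lifting types and the integer labels concrete, here is a small illustrative sketch. It is not the process actually used to produce the datasets: the stage representation (a dict with "type" and "params"), the function names, and the choice of which parameters to keep are all hypothetical, and a per-dataset lookup table stands in for the unspecified hashing step.

```python
def simple_label(stage):
    """Simple lifting: the label is just the stage type."""
    return stage["type"]


def detailed_label(stage, kept_params):
    """Detailed lifting: the label is the stage type plus selected parameters.

    kept_params is a hypothetical list of parameter names considered relevant.
    """
    params = stage.get("params", {})
    kept = tuple((k, str(params[k])) for k in sorted(kept_params) if k in params)
    return (stage["type"], kept)


class LabelAnonymizer:
    """Maps each distinct label to a unique integer.

    Assumption: one instance is created per dataset, so the same label can
    receive different integers in different datasets.
    """

    def __init__(self):
        self._ids = {}

    def anonymize(self, label):
        if label not in self._ids:
            self._ids[label] = len(self._ids)
        return self._ids[label]


# Made-up stages, for illustration only.
stages = [
    {"type": "Join", "params": {"key": "id", "mode": "inner"}},
    {"type": "Join", "params": {"key": "id", "mode": "outer"}},
    {"type": "Sort", "params": {}},
]

simple_anonymizer = LabelAnonymizer()
simple_ids = [simple_anonymizer.anonymize(simple_label(s)) for s in stages]

detailed_anonymizer = LabelAnonymizer()
detailed_ids = [
    detailed_anonymizer.anonymize(detailed_label(s, ["mode"])) for s in stages
]
```

With simple lifting the two Join stages share one label, while detailed lifting separates them because their "mode" parameter differs. This is the trade-off between finding general patterns and finding subflows that are actually interchangeable.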

Acknowledgment

We thank the DataStage team for providing us with the data and allowing us to share it with the community.

License

Although this GitHub repository is under the Apache-2.0 license, the actual datasets are released under the CDLA-Sharing-1.0 license. By downloading or using them, you agree to the terms of this license.

Owner

  • Name: International Business Machines
  • Login: IBM
  • Kind: organization
  • Email: awesome@ibm.com
  • Location: United States of America

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Anonymized ETL Flow Datasets for Frequent Subgraph Mining
message: 'If you use these datasets, please cite them as below.'
type: dataset
authors:
  - family-names: Adas
    given-names: Dolev
  - family-names: Eytan
    given-names: Ohad
    email: ohad.eytan1@ibm.com
    affiliation: IBM Research
    orcid: 'https://orcid.org/0000-0001-8655-794X'
  - family-names: Khazma
    given-names: Guy
  - family-names: Sampé
    given-names: Josep
  - family-names: Ta-Shma
    given-names: Paula
repository-code: >-
  https://github.com/IBM/Anonymized-ETL-Flow-Datasets-for-FSM
abstract: >-
  Anonymized version of six datasets taken from IBM's
  DataStage™ production systems and can be used for frequent
  subgraph mining 
license: CDLA-Sharing-1.0
date-released: '2023-11-16'
preferred-citation:
  type: conference-paper
  title: Refactoring ETL Flows in The Wild
  authors:
    - family-names: Adas
      given-names: Dolev
    - family-names: Eytan
      given-names: Ohad
      email: ohad.eytan1@ibm.com
      affiliation: IBM Research
      orcid: 'https://orcid.org/0000-0001-8655-794X'
    - family-names: Khazma
      given-names: Guy
    - family-names: Sampé
      given-names: Josep
    - family-names: Ta-Shma
      given-names: Paula
  year: '2023'
  month: '12'
  collection-title: "2023 IEEE International Conference on Big Data (Big Data)"
  location: 'Sorrento, Italy'
  doi: 10.1109/BigData59044.2023.10386531

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2