https://github.com/cau-se/distributedeventfactory

Stream Data Generator for Distributed Process Mining

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (9.0%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository

Stream Data Generator for Distributed Process Mining

Basic Info

Host: GitHub
Owner: cau-se
License: apache-2.0
Language: Python
Default Branch: main
Size: 3.43 MB

Statistics

Stars: 2
Watchers: 5
Forks: 0
Open Issues: 1
Releases: 2

Created almost 2 years ago · Last pushed 10 months ago

Metadata Files

Readme License

Distributed Event Factory (DEF)

Factory

About

The Distributed Event Factory (DEF) is a simulation tool for Distributed Process Mining. It is designed for simulating real world processes utilizing Markov chains. It produces events consisting of classical process mining measures caseId, activity and timestamp on assembled with location data as the node and a group. In contrast to traditional process mining, multiple independent event streams are generated instead of a single large one.

Who should use the DEF?

DEF mainly targets researchers who like to test and evaluate distributed data mining algorithms. It is suited for generating data at large scale. So, it can be leveraged for benchmarking and load testing.

Motivation

The Distributed Event Factory was designed for testing and evaluating Distributed Process Mining Algorithms. A use case for the tool is the simulation of a smart factory with multiple production facilities. Each Sensor writes its own event logs. On this data process mining should be applied. We use such a factory as a running example for configuring the DEF.

Core concepts

Three core concepts underlie DEF: DataSource, Simulation and Sink. Those entities are configured via yaml-files.

Datasources

DEF is based on a Markov chain. The vertices of the Markov chain represent the distributed data sources. The edges represent transitions between the data sources. They contain a probability for transitioning, the activity performed and a function modelling the process duration.

A minimal running example yaml is presented here:

yaml kind: datasource name: "MaterialPreparation" spec: name: "MaterialPreparation" group: "factory" selection: genericProbability distribution: [ 0.8, 0.2 ] eventData: - activity: "Waiting for Material" transition: "MaterialPreparation" duration: 2 - activity: "MaterialPreparation - Finished" duration: 1 transition: "AssemblyLineSetup"

Every Resource has to specify a kind in our case the kind datasource. The name (here: MaterialPreparation), is used for later referencing this datasource in the code. The specification spec defines the properties of the datasource. The name refers to the node name which is later referred in the node field of the generated event. This is also the case for the field group. Moreover, The selection of the next emitted event can be specified. In this case its genericProbability. It selects the events from the event data by the defined distribution. The event data refers to the edges in the Markov chain including duration, transition and activity.

Sink

To view and use the data which is generated by the datasources (a) sink(s) must be specified. The minimal sink-yaml.

yaml kind: sink name: console-sink spec: type: console id: "Sensor" dataSourceRefs: - "MaterialPreparation"

The type of the sink is here defined as console-sink which prints the events to the console. The id is a prefix to identify the sink in the console (especially useful if multiple console sinks are used). The dataSourceRefs defines from which datasource the events are emitted. A datasource can be also linked to multiple sinks.

Simulation

A simulation runs the process through our datasources and emits their events via the sinks. The type stream creates a datastream. The caseId defines how the case-ids are generated. In the incresing mode the caseId is an incrementally increased number. The load defines at which speed the simulation is run. In this case constantly 10 events per second are emitted.

yaml kind: simulation name: streamSimulation spec: type: stream caseId: increasing load: loadBehavior: load load: 10

Load Testing

All currently shown configurations are suited for generating mock events streams. As additional feature, DEF can also be used for real load testing of process mining applications. However, therefore specific configurations need to be specified. The type of simulation has to be laod-test and the sinks have to be loadtest sinks. The load-test sinks depend on a load-test backend which can precisely emit the specified load event at large scale.

The loadtest simulation has 2 new properties to specify: timeframe and a load-backend url. A timeframe is a batch of events which is submitted together to the load-backend. The number specifies the duration of this timeframe in milliseconds. yaml kind: sink name: http-sink spec: type: http id: "Sensor" timeframe: 1000 url: "http://localhost:8080/loadgenerator" dataSourceRefs: - "MaterialPreparation"

The simulation specifies a new property genTimeframesTilStart first timeframes can be buffered at the load-backend before the actual event emission is started. yaml kind: simulation name: assembly-line-simulation spec: type: loadTest caseId: increasing genTimeframesTilStart: 10 load: loadBehavior: constant load: 1000

Start the load backend

To use the load-backend docker needs to be installed on your machine. Currently, the load backend only supports writing data to a kafka queue. The load-backend is started via the script sh ./run-load-backend <bootstrapServerUrl> <kafkaTopic>

Installation

There are two options install DEF, via PyPI or building it yourself.

PyPI

Self Build

Requirements: - Python >= 3.10

Install dependencies: shell pip install -r requirements.txt

Start the Programm

Define the Distribted Event Factory and specify a config file like mentioned above.

```python

from distributedeventfactory.event_factory import EventFactory

eventfactory = EventFactory() (eventfactory .adddirectory("config/datasource/assemblyline") .addfile("config/simulation/stream.yaml") .add_file("config/sink/sink.yaml") ).run() ```

Advanced configuration

Owner

Name: Kiel University - Software Engineering Group
Login: cau-se
Kind: organization
Location: Kiel, Germany

Website: https://www.se.informatik.uni-kiel.de/en/research/projects
Repositories: 9
Profile: https://github.com/cau-se

GitHub Events

Total

Release event: 1
Watch event: 1
Member event: 2
Push event: 64
Create event: 10

Last Year

Release event: 1
Watch event: 1
Member event: 2
Push event: 64
Create event: 10

Dependencies

Dockerfile docker

python ${PYTHON_VERSION}-slim build

requirements.txt pypi

PyYAML *
kafka-python ==2.0.2
matplotlib *
networkx *
numpy *
python-dotenv *
rel ==0.4.9
scheduled-futures ==1.1.0
websocket-client ==1.6.1
websockets *

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/cau-se/distributedeventfactory

Science Score: 26.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

Distributed Event Factory (DEF)

About

Who should use the DEF?

Motivation

Core concepts

Datasources

Sink

Simulation

Load Testing

Start the load backend

Installation

PyPI

Self Build

Start the Programm

Advanced configuration

Owner

GitHub Events

Total

Last Year

Dependencies