StreamGen
StreamGen: a Python framework for generating streams of labeled data - Published in JOSS (2024)
Science Score: 98.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 7 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to arxiv.org, joss.theoj.org, zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Python framework for generating streams of labeled data.
Basic Info
- Host: GitHub
- Owner: Infineon
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://infineon.github.io/StreamGen/
- Size: 64.5 MB
Statistics
- Stars: 15
- Watchers: 3
- Forks: 0
- Open Issues: 2
- Releases: 6
Topics
Metadata Files
README.md
🌌 StreamGen
a 🐍 Python framework for generating streams of labelled data
⚗️ Motivation • 💡 Idea • 📦 Installation • 👀 Examples • 📖 Documentation • 🙏 Acknowledgement
⚗️ Motivation
Most machine learning systems rely on stationary, labeled, balanced and large-scale datasets. Incremental learning (IL), also referred to as lifelong learning (LL) or continual learning (CL), extends the traditional paradigm to work in dynamic and evolving environments. This requires such systems to acquire and preserve knowledge continually.
Existing CL frameworks like avalanche[^1] or continuum[^2] construct data streams by splitting large datasets into multiple experiences, which has a few disadvantages:
- results in unrealistic scenarios
- offers limited insight into distributions and their evolution
- not extendable to scenarios with fewer constraints on the stream properties
To answer different research questions in the field of CL, researchers need knowledge and control over:
- class distributions
- novelties and outliers
- complexity and evolution of the background domain
- semantics of the unlabeled parts of a domain
- class dependencies
- class composition (for multi-label modelling)
A more economical alternative to collecting and labelling streams with desired properties is the generation of synthetic streams[^6]. Notable efforts in that direction include augmentation-based dataset generation like ImageNet-C[^3] and simulation-based approaches like EndlessCLSim[^4], where semantically labeled street-view images are generated by a game engine that procedurally generates the city environment and simulates drift by modifying parameters (like the weather and illumination conditions) over time.
ImageNet-C [3]
EndlessCLSim [4]
This project builds on these ideas and presents a general framework for generating streams of labeled samples.
💡 Idea
This section introduces the main ideas and building blocks of the streamgen framework.
🎲 Building complex Distributions through random Transformations
Only a limited number of distributions can be sampled from directly (e.g., a Gaussian distribution).
Instead of generating samples directly from a distribution, researchers often work with collected sets of samples. A common practice to increase the variability of such datasets is the use of stochastic transformations in a sequential augmentation pipeline:
```python
import numpy as np
from torchvision.transforms import v2

# sequential augmentation pipeline of (stochastic) transformations
transforms = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    # ...
])

while generating_data:
    # option 1 - sample from a dataset
    sample = np.random.choice(dataset)
    # option 2 - sample from a distribution
    sample = np.random.randn(...)

    augmented_sample = transforms(sample)
```
Combined with an initial sampler that either samples from a dataset or directly from a distribution, these chained transformations can represent complex distributions.
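The pairing of an initial sampler with chained stochastic transformations can also be sketched with the standard library alone. The following is a minimal illustration, not StreamGen's API: `make_pipeline`, `gaussian_sampler`, `random_scale` and `random_offset` are hypothetical names chosen for this example.

```python
import random
from typing import Callable, Sequence

Transform = Callable[[float], float]


def make_pipeline(sampler: Callable[[], float], transforms: Sequence[Transform]) -> Callable[[], float]:
    """Return a function that draws one sample and applies every transform in order."""

    def draw() -> float:
        x = sampler()
        for transform in transforms:
            x = transform(x)
        return x

    return draw


rng = random.Random(0)  # seeded so that the sketch is reproducible


# hypothetical building blocks, chosen only for illustration
def gaussian_sampler() -> float:
    return rng.gauss(0.0, 1.0)


def random_scale(x: float) -> float:
    return x * rng.uniform(0.5, 2.0)


def random_offset(x: float) -> float:
    return x + rng.choice([-3.0, 3.0])


draw = make_pipeline(gaussian_sampler, [random_scale, random_offset])
samples = [draw() for _ in range(1000)]
```

Although the sampler is a plain Gaussian, the chained transformations turn it into a bimodal distribution with randomly scaled modes, which could not be sampled from directly with a single standard call.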
Function Composition Details
Two (or more) functions f: X → X, g: X → X having the same domain and codomain are often called **transformations**. One can form chains of transformations composed together, such as f ∘ f ∘ g ∘ f (which is the same as f(f(g(f(x)))) given some input x). Such chains have the algebraic structure of a **monoid**, called a transformation monoid or (much more seldom) a composition monoid. [^7]

Many programming languages offer native support for such transformation monoids.

Julia uses `|>` or `∘` for function chaining:

```julia
distribution = sample |> filter |> augment
distribution = augment ∘ filter ∘ sample
```

R uses the chain operator `%>%`:

```R
distribution <- sample %>% filter() %>% augment()
```

In Python, you can use `functools.reduce` to create simple monoids:

```python
from functools import reduce
from typing import Callable


def compose(*funcs: Callable) -> Callable:
    """Compose a group of functions (f(g(h(...)))) into a single composite func."""
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)


# the rightmost function is applied first, matching `augment ∘ filter ∘ sample`
distribution = compose(augment, filter, sample)
```

> 🤚 StreamGen is not trying to implement general (and optimized) function composition in Python. Rather, it offers a very opinionated implementation that is optimal for the data generation use case.

🌳 Sampling Trees
One shortcoming of this approach is that one can only generate samples from a single distribution, so different class distributions are not representable.
One solution to this problem is the use of a tree (or other directed acyclic graph (DAG)) data structure to store the transformations.
- samples are transformed during the traversal of the tree from the root to the leaves.
- each path through the tree represents its own class-conditional distribution.
- each branching point represents a categorical distribution which determines the path to take for a sample during the tree traversal.
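The three properties listed above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not StreamGen's implementation: the `Node` class, its `weights` field and the `sample` function are hypothetical names, and the label is simply taken from the name of the leaf that was reached.

```python
import random
from dataclasses import dataclass, field
from typing import Callable

Transform = Callable[[float, random.Random], float]


@dataclass
class Node:
    """A transformation plus optional weighted children (a categorical branching point)."""

    name: str
    transform: Transform
    children: list["Node"] = field(default_factory=list)
    weights: list[float] = field(default_factory=list)  # one weight per child


def sample(root: Node, rng: random.Random) -> tuple[float, str]:
    """Traverse from the root to a leaf and return the sample with the leaf name as label."""
    node = root
    x = node.transform(0.0, rng)  # the root acts as the initial sampler
    while node.children:
        # the branching point draws the next node from a categorical distribution
        node = rng.choices(node.children, weights=node.weights, k=1)[0]
        x = node.transform(x, rng)
    return x, node.name


# hypothetical two-class tree: shared Gaussian noise, then a class-specific offset
root = Node(
    name="gaussian noise",
    transform=lambda x, rng: rng.gauss(0.0, 1.0),
    children=[
        Node("low", lambda x, rng: x - 5.0),
        Node("high", lambda x, rng: x + 5.0),
    ],
    weights=[0.8, 0.2],
)

rng = random.Random(0)
labeled_samples = [sample(root, rng) for _ in range(2000)]
```

Each root-to-leaf path (noise then "low", noise then "high") corresponds to one class-conditional distribution, and the weights at the branching point determine the class balance of the generated stream.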
⚙️ Parameter Schedules
If we want to model evolving distributions (streams), we either need to change the parameters of the stochastic transformations or the topology of the tree over time.
Currently, streamgen does not support scheduling topological changes (like adding branches and nodes), but by unrolling these changes over time into one static tree, topological changes can be modelled purely with branch probabilities.
💡 The directed acyclic graph above is not a tree anymore due to the merging of certain branches. Because these merges are very convenient in certain scenarios, streamgen supports the definition of such trees by copying the paths below the merge to every branch before the merge. For an example of this, have a look at examples/time series classification/04-multi-label-generation.ipynb.
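To illustrate how a topological change can be unrolled into branch probabilities, here is a minimal sketch in which a branch "appears" over time because its probability is scheduled from zero upwards. The helper `linear_schedule` and the class names are hypothetical and are not taken from StreamGen's API.

```python
from typing import Callable


def linear_schedule(start: float, end: float, num_steps: int) -> Callable[[int], float]:
    """Return a function step -> value that interpolates linearly and then holds `end`."""

    def value_at(step: int) -> float:
        if num_steps <= 1:
            return end
        # clamp the step so that values before/after the schedule stay constant
        t = min(max(step, 0), num_steps - 1) / (num_steps - 1)
        return start + t * (end - start)

    return value_at


# the probability of the new branch grows from 0.0 to 0.5 over 11 steps
p_new = linear_schedule(0.0, 0.5, num_steps=11)


def branch_probabilities(step: int) -> dict[str, float]:
    """Categorical distribution at a branching point for a given time step."""
    p = p_new(step)
    return {"old_class": 1.0 - p, "new_class": p}
```

At step 0 the new branch has probability zero, so the static tree behaves as if the branch did not exist yet; the same scheduling idea applies to the parameters of any stochastic transformation.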
📈 Data Drift Scenarios
The proposed tree structure can model all three common data drift scenarios by scheduling the parameters of the transformations at specific nodes.
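As a rough sketch of how the three scenarios map to scheduled parameters, consider a toy two-class generator with a label-independent background feature. Everything here (`generate`, `SCENARIOS`, `stream`, and the chosen parameter ranges) is a hypothetical illustration rather than StreamGen's API.

```python
import random
from typing import Callable, Iterator

Sample = tuple[tuple[float, float], str]


def generate(
    rng: random.Random,
    p_a: float = 0.5,
    background_scale: float = 1.0,
    mean_a: float = -2.0,
    mean_b: float = 2.0,
) -> Sample:
    """Draw one labeled sample ((signal, background), label)."""
    label = "a" if rng.random() < p_a else "b"
    signal = rng.gauss(mean_a if label == "a" else mean_b, 1.0)
    background = rng.gauss(0.0, background_scale)  # independent of the label
    return (signal, background), label


# each scenario maps normalized time t in [0, 1] to the parameters that drift
SCENARIOS: dict[str, Callable[[float], dict[str, float]]] = {
    # input distribution changes, the relation between signal and label does not
    "covariate shift": lambda t: {"background_scale": 1.0 + 4.0 * t},
    # class prior changes, class-conditional distributions stay fixed
    "prior probability shift": lambda t: {"p_a": 0.5 + 0.4 * t},
    # class means swap over time, so the same input maps to a different label
    "concept shift": lambda t: {"mean_a": -2.0 + 4.0 * t, "mean_b": 2.0 - 4.0 * t},
}


def stream(scenario: str, num_steps: int, rng: random.Random) -> Iterator[Sample]:
    """Yield one sample per time step while the scenario's parameters drift."""
    schedule = SCENARIOS[scenario]
    for step in range(num_steps):
        t = step / max(num_steps - 1, 1)
        yield generate(rng, **schedule(t))
```

In a sampling tree, the background term would correspond to a transformation shared by all paths, the class prior to the probabilities at a branching point, and the class means to parameters of class-specific transformations.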
📉 Covariate shift
📊 Prior probability shift
🏷️ Concept shift
📦 Installation
The graph visualizations require Graphviz to be installed on your system. Depending on your operating system and package manager, you might try one of the following options:
- Ubuntu: `sudo apt-get install graphviz`
- Windows: `choco install graphviz`
- macOS: `brew install graphviz`
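To check whether the Graphviz binaries are reachable before trying the visualizations, a small standard-library check like the following may help. It assumes that the `dot` executable on the `PATH` is what the visualizations rely on; `graphviz_available` is a hypothetical helper and not part of streamgen.

```python
import shutil


def graphviz_available(executable: str = "dot") -> bool:
    """Return True if the given Graphviz executable can be found on the PATH."""
    return shutil.which(executable) is not None


if not graphviz_available():
    print("Graphviz `dot` was not found on the PATH; graph visualizations will likely fail.")
```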
The basic version of the package can be installed from PyPI with:
```sh
pip install streamgen
```
streamgen provides a few (pip) extras:
| extras group | needed for | additional dependencies |
| ------------ | -------------------------------------------------------------------------- | ---------------------------- |
| examples | running the example notebooks with their application specific dependencies | perlin-numpy, polars |
| cl | continual learning frameworks | continuum |
| all | shortcut for installing every extra | * |
To install the package with specific extras, execute:
```sh
pip install streamgen[<name_of_extra>]
```
🧑💻 To install a development environment (which you need if you want to work on the package instead of just using it), `cd` into the project's root directory and call:
```bash
poetry install --sync --compile --all-extras
```
👀 Examples
There are example notebooks 🪐📓 showcasing and explaining streamgen features:
- 📈 time series
- 🖼️ analog wafer map streams based on the wm811k dataset[^5] in 🌐 wafer map generation
Here is a preview of what we will create in the time series examples:
📖 Documentation
The documentation is hosted through GitHub Pages.
To build and view it locally, call `poe docs_local`.
🙏 Acknowledgement
Made with ❤️ and ☕ by Laurenz Farthofer.
This work was funded by the Austrian Research Promotion Agency (FFG, Project No. 905107).
Special thanks to Benjamin Steinwender, Marius Birkenbach and Nikolaus Neugebauer for their valuable feedback.
I want to thank Infineon and KAI for letting me work on and publish this project.
Finally, I want to thank my university supervisors Thomas Pock and Marc Masana for their guidance.
🖼️ ©️ Banner Artwork Attribution

The art in the banner of this README is licensed under a Creative Commons Attribution-NonCommercial-No Derivatives Works 3.0 License. It was made by th3dutchzombi3. Check out his beautiful artwork ❤️
📄 References
[^1]: V. Lomonaco et al., “Avalanche: an End-to-End Library for Continual Learning,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA: IEEE, Jun. 2021, pp. 3595–3605. doi: 10.1109/CVPRW53098.2021.00399.
[^2]: A. Douillard and T. Lesort, “Continuum: Simple Management of Complex Continual Learning Scenarios.” arXiv, Feb. 11, 2021. doi: 10.48550/arXiv.2102.06253.
[^3]: D. Hendrycks and T. Dietterich, “Benchmarking Neural Network Robustness to Common Corruptions and Perturbations.” arXiv, Mar. 28, 2019. doi: 10.48550/arXiv.1903.12261.
[^4]: T. Hess, M. Mundt, I. Pliushch, and V. Ramesh, “A Procedural World Generation Framework for Systematic Evaluation of Continual Learning.” arXiv, Dec. 13, 2021. doi: 10.48550/arXiv.2106.02585.
[^5]: M.-J. Wu, J.-S. R. Jang, and J.-L. Chen, “Wafer Map Failure Pattern Recognition and Similarity Ranking for Large-Scale Data Sets,” IEEE Transactions on Semiconductor Manufacturing, vol. 28, no. 1, pp. 1–12, Feb. 2015.
[^6]: J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under Concept Drift: A Review,” IEEE Trans. Knowl. Data Eng., pp. 1–1, 2018, doi: 10.1109/TKDE.2018.2876857.
[^7]: “Function composition,” Wikipedia. Feb. 16, 2024. Accessed: Apr. 17, 2024. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Function_composition&oldid=1207989326
Owner
- Name: Infineon
- Login: Infineon
- Kind: organization
- Location: Neubiberg, Germany
- Website: https://www.infineon.com
- Repositories: 1,383
- Profile: https://github.com/Infineon
JOSS Publication
StreamGen: a Python framework for generating streams of labeled data
Authors
Tags
Data Generation, Synthetic Data, Data Streams, Continual Learning, Function Composition
Citation (CITATION.cff)
cff-version: "1.2.0"
authors:
- family-names: Farthofer
  given-names: Laurenz A.
  orcid: "https://orcid.org/0000-0003-1477-1327"
doi: 10.5281/zenodo.14273611
message: 👋 if you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Farthofer
    given-names: Laurenz A.
    orcid: "https://orcid.org/0000-0003-1477-1327"
  date-published: 2024-12-06
  doi: 10.21105/joss.07206
  issn: 2475-9066
  issue: 104
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 7206
  title: "StreamGen: a Python framework for generating streams of
    labeled data"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.07206"
  volume: 9
title: "StreamGen: a Python framework for generating streams of labeled
  data"
CodeMeta (codemeta.json)
{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@type": "Code",
  "author": [
    {
      "@id": "https://orcid.org/0000-0003-1477-1327",
      "@type": "Person",
      "affiliation": "Infineon Technologies AG., Graz University of Technology",
      "email": "laurenz@hey.com",
      "name": "Farthofer Laurenz"
    }
  ],
  "codeRepository": "https://github.com/Infineon/StreamGen",
  "dateCreated": "2024-08-19",
  "dateModified": "2024-08-19",
  "datePublished": "2024-08-19",
  "description": "A Python framework for generating streams of labeled data.",
  "identifier": "",
  "keywords": "python, data-structures, data-generation, function-composition, data-streams, continual-learning",
  "license": "MIT",
  "title": "StreamGen",
  "version": "v1.0.2"
}
GitHub Events
Total
- Create event: 5
- Release event: 4
- Issues event: 10
- Watch event: 6
- Delete event: 2
- Issue comment event: 22
- Push event: 16
Last Year
- Create event: 5
- Release event: 4
- Issues event: 10
- Watch event: 6
- Delete event: 2
- Issue comment event: 22
- Push event: 16
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Farthofer Laurenz (KAI DSC) | L****r@k****t | 90 |
| Hundgeburth Laurenz (KAI DSC) | L****h@k****t | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 10
- Total pull requests: 0
- Average time to close issues: 8 days
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 1.6
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 6
- Pull requests: 0
- Average time to close issues: 13 days
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 2.33
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- LaurenzBeck (8)
- firefly-cpp (2)
- matthewfeickert (1)
Pull Request Authors
Packages
- Total packages: 1
- Total downloads:
  - pypi: 302 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 11
- Total maintainers: 1
pypi.org: streamgen
🌌 a framework for generating streams of labeled data.
- Homepage: https://github.com/Infineon/StreamGen
- Documentation: https://infineon.github.io/StreamGen/
- License: MIT
- Latest release: 1.1.3 (published 6 months ago)
Rankings
Maintainers (1)
Dependencies
- Gr1N/setup-poetry v9 composite
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/setup-python v3 composite
- 257 dependencies
- coverage ^7.4.3 develop
- ipywidgets ^8.1.2 develop
- itables ^2.0.0 develop
- jupyter ^1.0.0 develop
- nbconvert ^7.16.2 develop
- poethepoet ^0.25.0 develop
- portray ^1.8.0 develop
- pre-commit ^3.6.2 develop
- pytest ^8.0.2 develop
- qpsolvers ^4.3.2 develop
- ruff ^0.3.4 develop
- towncrier ^24.7.1 develop
- anytree ^2.12.1
- beartype ^0.17.2
- continuum ^1.2.7
- graphviz ^0.20.3
- ipympl ^0.9.4
- loguru ^0.7.2
- matplotlib ^3.8.3
- numpy ^1.26.4
- pandas ^2.2.1
- polars ^0.20.13
- python ^3.10
- pytorchcv >=0.0.67
- rich ^13.7.1
- scikit-image ^0.24
- seaborn ^0.13.2
- torch <=2.3
- aiohappyeyeballs ==2.3.7
- aiohttp ==3.10.4
- aiosignal ==1.3.1
- anyio ==4.4.0
- anytree ==2.12.1
- appnope ==0.1.4
- argon2-cffi ==23.1.0
- argon2-cffi-bindings ==21.2.0
- arrow ==1.3.0
- astroid ==2.15.8
- asttokens ==2.4.1
- async-lru ==2.0.4
- async-timeout ==4.0.3
- attrs ==24.2.0
- babel ==2.16.0
- beartype ==0.17.2
- beautifulsoup4 ==4.12.3
- bleach ==6.1.0
- certifi ==2024.7.4
- cffi ==1.17.0
- cfgv ==3.4.0
- charset-normalizer ==3.3.2
- click ==8.1.7
- colorama ==0.4.6
- comm ==0.2.2
- continuum ==1.2.7
- contourpy ==1.2.1
- coverage ==7.6.1
- cycler ==0.12.1
- datasets ==2.21.0
- debugpy ==1.8.5
- decorator ==5.1.1
- defusedxml ==0.7.1
- dill ==0.3.8
- distlib ==0.3.8
- docstring-parser ==0.16
- dodgy ==0.2.1
- exceptiongroup ==1.2.2
- executing ==2.0.1
- falcon ==2.0.0
- fastjsonschema ==2.20.0
- filelock ==3.15.4
- flake8 ==5.0.4
- flake8-polyfill ==1.0.2
- fonttools ==4.53.1
- fqdn ==1.5.1
- frozenlist ==1.4.1
- fsspec ==2024.6.1
- ghp-import ==2.1.0
- gitdb ==4.0.11
- gitpython ==3.1.43
- graphviz ==0.20.3
- h11 ==0.14.0
- h5py ==3.11.0
- httpcore ==1.0.5
- httpx ==0.27.0
- hug ==2.6.1
- huggingface-hub ==0.24.5
- identify ==2.6.0
- idna ==3.7
- imagehash ==4.3.1
- imageio ==2.35.1
- importlib-metadata ==8.2.0
- iniconfig ==2.0.0
- intel-openmp ==2021.4.0
- ipykernel ==6.29.5
- ipympl ==0.9.4
- ipython ==8.26.0
- ipython-genutils ==0.2.0
- ipywidgets ==8.1.3
- isoduration ==20.11.0
- isort ==5.13.2
- itables ==2.1.4
- jedi ==0.19.1
- jinja2 ==3.1.4
- joblib ==1.4.2
- json5 ==0.9.25
- jsonpointer ==3.0.0
- jsonschema ==4.23.0
- jsonschema-specifications ==2023.12.1
- jupyter ==1.0.0
- jupyter-client ==8.6.2
- jupyter-console ==6.6.3
- jupyter-core ==5.7.2
- jupyter-events ==0.10.0
- jupyter-lsp ==2.2.5
- jupyter-server ==2.14.2
- jupyter-server-terminals ==0.5.3
- jupyterlab ==4.2.4
- jupyterlab-pygments ==0.3.0
- jupyterlab-server ==2.27.3
- jupyterlab-widgets ==3.0.11
- kiwisolver ==1.4.5
- lazy-loader ==0.4
- lazy-object-proxy ==1.10.0
- livereload ==2.7.0
- loguru ==0.7.2
- mako ==1.3.5
- markdown ==3.7
- markdown-it-py ==3.0.0
- markupsafe ==2.1.5
- matplotlib ==3.9.2
- matplotlib-inline ==0.1.7
- mccabe ==0.7.0
- mdurl ==0.1.2
- mergedeep ==1.3.4
- mistune ==3.0.2
- mkdocs ==1.3.0
- mkdocs-material ==8.5.4
- mkdocs-material-extensions ==1.3.1
- mkl ==2021.4.0
- mpmath ==1.3.0
- multidict ==6.0.5
- multiprocess ==0.70.16
- mypy ==1.11.1
- mypy-extensions ==1.0.0
- nbclient ==0.10.0
- nbconvert ==7.16.4
- nbformat ==5.10.4
- nest-asyncio ==1.6.0
- networkx ==3.3
- nodeenv ==1.9.1
- notebook ==7.2.1
- notebook-shim ==0.2.4
- numpy ==1.26.4
- nvidia-cublas-cu12 ==12.1.3.1
- nvidia-cuda-cupti-cu12 ==12.1.105
- nvidia-cuda-nvrtc-cu12 ==12.1.105
- nvidia-cuda-runtime-cu12 ==12.1.105
- nvidia-cudnn-cu12 ==8.9.2.26
- nvidia-cufft-cu12 ==11.0.2.54
- nvidia-curand-cu12 ==10.3.2.106
- nvidia-cusolver-cu12 ==11.4.5.107
- nvidia-cusparse-cu12 ==12.1.0.106
- nvidia-nccl-cu12 ==2.20.5
- nvidia-nvjitlink-cu12 ==12.6.20
- nvidia-nvtx-cu12 ==12.1.105
- overrides ==7.7.0
- packaging ==24.1
- pandas ==2.2.2
- pandocfilters ==1.5.1
- parso ==0.8.4
- pastel ==0.2.1
- pdocs ==1.2.0
- pep8-naming ==0.10.0
- pexpect ==4.9.0
- pillow ==10.4.0
- platformdirs ==4.2.2
- pluggy ==1.5.0
- poethepoet ==0.25.1
- polars ==0.20.31
- portray ==1.8.0
- pre-commit ==3.8.0
- prometheus-client ==0.20.0
- prompt-toolkit ==3.0.47
- prospector ==1.10.3
- psutil ==6.0.0
- ptyprocess ==0.7.0
- pure-eval ==0.2.3
- pyarrow ==17.0.0
- pycodestyle ==2.9.1
- pycparser ==2.22
- pydocstyle ==6.3.0
- pyflakes ==2.5.0
- pygments ==2.18.0
- pylint ==2.17.7
- pylint-celery ==0.3
- pylint-django ==2.5.3
- pylint-flask ==0.6
- pylint-plugin-utils ==0.7
- pymdown-extensions ==10.9
- pyparsing ==3.1.2
- pytest ==8.3.2
- pytest-mock ==3.14.0
- python-dateutil ==2.9.0.post0
- python-json-logger ==2.0.7
- pytorchcv ==0.0.68
- pytz ==2024.1
- pywavelets ==1.7.0
- pywin32 ==306
- pywinpty ==2.0.13
- pyyaml ==6.0.2
- pyyaml-env-tag ==0.1
- pyzmq ==26.1.1
- qpsolvers ==4.3.3
- qtconsole ==5.5.2
- qtpy ==2.4.1
- referencing ==0.35.1
- requests ==2.32.3
- requirements-detector ==1.2.2
- rfc3339-validator ==0.1.4
- rfc3986-validator ==0.1.1
- rich ==13.7.1
- rpds-py ==0.20.0
- ruff ==0.3.7
- scikit-image ==0.24.0
- scikit-learn ==1.5.1
- scipy ==1.14.0
- seaborn ==0.13.2
- semver ==3.0.2
- send2trash ==1.8.3
- setoptconf-tmp ==0.3.1
- setuptools ==72.2.0
- six ==1.16.0
- smmap ==5.0.1
- sniffio ==1.3.1
- snowballstemmer ==2.2.0
- soupsieve ==2.6
- stack-data ==0.6.3
- sympy ==1.13.2
- tbb ==2021.13.1
- termcolor ==2.4.0
- terminado ==0.18.1
- threadpoolctl ==3.5.0
- tifffile ==2024.8.10
- tinycss2 ==1.3.0
- toml ==0.10.2
- tomli ==2.0.1
- tomlkit ==0.13.2
- torch ==2.3.0
- torchvision ==0.18.0
- tornado ==6.4.1
- towncrier ==24.7.1
- tqdm ==4.66.5
- traitlets ==5.14.3
- triton ==2.3.0
- types-python-dateutil ==2.9.0.20240316
- typing-extensions ==4.12.2
- tzdata ==2024.1
- uri-template ==1.3.0
- urllib3 ==2.2.2
- virtualenv ==20.26.3
- watchdog ==4.0.2
- wcwidth ==0.2.13
- webcolors ==24.8.0
- webencodings ==0.5.1
- websocket-client ==1.8.0
- widgetsnbextension ==4.0.11
- win32-setctime ==1.1.0
- wrapt ==1.16.0
- xxhash ==3.5.0
- yarl ==1.9.4
- yaspin ==2.5.0
- zipp ==3.20.0
- actions/checkout v4 composite
- actions/upload-artifact v4 composite
- openjournals/openjournals-draft-action master composite
