tabular.gen.cluster_based

TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Generation - Cluster Based

https://github.com/tno-sdg/tabular.gen.cluster_based

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.2%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: TNO-SDG
License: apache-2.0
Language: Python
Default Branch: main
Size: 23.4 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Generation - Cluster Based

This package provides a simple synthetic data generator for tabular data. In short, it works by clustering a given tabular dataset (by default using k-means clustering), from which per-attribute histograms per cluster are created. These histograms are sampled to generate synthetic data.
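
As a rough illustration of this idea (not the package's actual implementation), the sketch below uses only the Python standard library. It assumes the cluster labels are already available (in the package they would come from the clustering step) and that every attribute is discrete. The helper names `fit_histograms` and `sample_rows` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_histograms(rows, labels):
    """Build one histogram (value -> count) per attribute, per cluster."""
    histograms = defaultdict(lambda: defaultdict(Counter))
    cluster_sizes = Counter(labels)
    for row, label in zip(rows, labels):
        for attribute, value in row.items():
            histograms[label][attribute][value] += 1
    return histograms, cluster_sizes


def sample_rows(histograms, cluster_sizes, n, rng):
    """Draw n synthetic rows: pick a cluster, then sample each attribute independently."""
    clusters = list(cluster_sizes)
    weights = [cluster_sizes[c] for c in clusters]
    synthetic = []
    for _ in range(n):
        cluster = rng.choices(clusters, weights=weights)[0]
        row = {}
        for attribute, counts in histograms[cluster].items():
            values = list(counts)
            row[attribute] = rng.choices(values, weights=[counts[v] for v in values])[0]
        synthetic.append(row)
    return synthetic


# Toy data: ages are pre-binned so that every attribute is discrete.
rows = [
    {"age_bin": "20-29", "sex": "Female"},
    {"age_bin": "20-29", "sex": "Male"},
    {"age_bin": "60-69", "sex": "Male"},
    {"age_bin": "60-69", "sex": "Male"},
]
labels = [0, 0, 1, 1]  # assumed output of a clustering step

histograms, cluster_sizes = fit_histograms(rows, labels)
synthetic = sample_rows(histograms, cluster_sizes, n=5, rng=random.Random(0))
```

Because attributes are sampled independently within a cluster, correlations between attributes are only preserved to the extent that the clusters capture them.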

PET Lab

The TNO PET Lab consists of generic software components, procedures, and functionalities developed and maintained on a regular basis to facilitate and aid in the development of PET solutions. The lab is a cross-project initiative allowing us to integrate and reuse previously developed PET functionalities to boost the development of new protocols and solutions.

The package tno.sdg.tabular.gen.cluster_based is part of the TNO Python Toolbox.

Limitations in (end-)use: the content of this software package may solely be used for applications that comply with international export control laws.
This implementation of cryptographic software has not been audited. Use at your own risk.

Documentation

Documentation of the tno.sdg.tabular.gen.cluster_based package can be found at https://docs.pet.tno.nl/sdg/tabular/gen/cluster_based/0.2.0.

Install

Easily install the tno.sdg.tabular.gen.cluster_based package using pip:

```console
$ python -m pip install tno.sdg.tabular.gen.cluster_based
```

Note: If you are cloning the repository and wish to edit the source code, be sure to install the package in editable mode:

```console
$ python -m pip install -e 'tno.sdg.tabular.gen.cluster_based'
```

If you wish to run the tests you can use:

```console
$ python -m pip install 'tno.sdg.tabular.gen.cluster_based[tests]'
```

Usage

The tno.sdg.tabular.gen.cluster_based package exposes a single class, ClusterBasedGenerator, which offers a simple interface for synthetic data generation.

First, the ClusterBasedGenerator must be fitted on a real dataset using the ClusterBasedGenerator.fit method. The user must specify the type of each column of the dataset via the data_types parameter. Once fitted, the user can call ClusterBasedGenerator.sample to generate synthetic data samples.

```python
import pandas as pd
from tno.sdg.tabular.gen.cluster_based import ClusterBasedGenerator, DataType

df = pd.read_csv("src/tno/sdg/tabular/gen/cluster_based/test/data/adult.data")
df_subset = df[["age", "sex", "income", "workclass", "education", "marital-status"]]

generator = ClusterBasedGenerator()
# One DataType per column, listed in the same order as the columns of df_subset
generator.fit(
    df_subset,
    [
        DataType.CONTINUOUS,   # age
        DataType.CATEGORICAL,  # sex
        DataType.CATEGORICAL,  # income
        DataType.CATEGORICAL,  # workclass
        DataType.CATEGORICAL,  # education
        DataType.CATEGORICAL,  # marital-status
    ],
)
samples = generator.sample()
```

Histogram Templates

The generator uses histograms to generate data. A single histogram represents a single feature. The bins of this histogram are, by default, derived from the data. If you wish to provide a custom template for the histogram, you can create one or more HistogramTemplate for the desired features and pass these to the ClusterBasedGenerator.

```python
age_template = ContinuousHistogramTemplate(lims=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
education_template = CategoricalHistogramTemplate(values=['Bachelors', 'Masters'])
generator = ClusterBasedGenerator(
    histogram_templates={
        'age': age_template,
        'education': education_template,
        # we let marital-status be derived from the data
    }
)
```
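
To make the role of fixed bin limits concrete, here is a minimal standard-library sketch of a continuous histogram built on such limits. It is not the package's HistogramTemplate code: the half-open bins, the handling of out-of-range values, and the helper names `fit_counts` and `sample_value` are all assumptions made for illustration.

```python
import bisect
import random


def fit_counts(values, lims):
    """Count how many values fall into each bin [lims[i], lims[i + 1])."""
    counts = [0] * (len(lims) - 1)
    for value in values:
        if not lims[0] <= value < lims[-1]:
            continue  # this sketch simply ignores values outside the limits
        counts[bisect.bisect_right(lims, value) - 1] += 1
    return counts


def sample_value(counts, lims, rng):
    """Pick a bin proportionally to its count, then a uniform value inside it."""
    index = rng.choices(range(len(counts)), weights=counts)[0]
    return rng.uniform(lims[index], lims[index + 1])


lims = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
ages = [23, 25, 31, 38, 39, 47, 62]
counts = fit_counts(ages, lims)
sampled_age = sample_value(counts, lims, random.Random(0))
```

With limits fixed up front, the bin edges do not depend on the fitted data; only the counts do.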

Clustering

The ClusterBasedGenerator, as the name suggests, uses clustering to achieve synthetic data generation. By default, sklearn.cluster.KMeans is used with parameters n_clusters=8, init="random", n_init="auto". To change the clusterer, simply pass a clustering algorithm to ClusterBasedGenerator. The clusterer is expected to subclass BaseEstimator (the base class of scikit-learn estimators) and implement fit and predict.

For example, to use KMeans but with a different number of clusters, you can pass:

```python
from sklearn.cluster import KMeans

generator = ClusterBasedGenerator(clusterer=KMeans(n_clusters=2))
```
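
Not every scikit-learn clustering algorithm offers predict, so it can help to check a candidate before passing it in. The sketch below only tests the interface described above (a BaseEstimator with fit and predict). The helper `looks_like_clusterer` is hypothetical, and passing this check does not guarantee that an estimator works well with the generator.

```python
from sklearn.base import BaseEstimator
from sklearn.cluster import DBSCAN, KMeans, MiniBatchKMeans
from sklearn.mixture import GaussianMixture


def looks_like_clusterer(candidate) -> bool:
    """Check the expected interface: a BaseEstimator with fit and predict."""
    return (
        isinstance(candidate, BaseEstimator)
        and callable(getattr(candidate, "fit", None))
        and callable(getattr(candidate, "predict", None))
    )


candidates = {
    "KMeans": KMeans(n_clusters=2),
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=2),
    "GaussianMixture": GaussianMixture(n_components=2),
    "DBSCAN": DBSCAN(),  # offers fit and fit_predict, but no predict
}
compatible = {name: looks_like_clusterer(est) for name, est in candidates.items()}
```

Here DBSCAN fails the check because it cannot assign new points to clusters after fitting.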

Preprocessing

Depending on the clustering algorithm and input data used, the data may need to be preprocessed. For KMeans, the default clustering algorithm, preprocessing is required.

The default preprocessor applies the StandardScaler to DataType.CONTINUOUS features and the OneHotEncoder to DataType.CATEGORICAL features.
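
For intuition, the two transformations can be sketched with the standard library alone. This is an approximation of what StandardScaler and OneHotEncoder compute, not the package's default preprocessor; the helper names `standardize` and `one_hot` are hypothetical, and edge cases that scikit-learn handles (constant columns, unseen categories) are ignored.

```python
import statistics


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Encode each category as a 0/1 vector, one position per sorted category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


ages = [20, 30, 40]
education = ["Masters", "Bachelors", "Masters"]

# One numeric feature row per record: scaled age followed by the one-hot columns.
features = [[a, *e] for a, e in zip(standardize(ages), one_hot(education))]
```

The result is purely numeric and on comparable scales, which is what a distance-based algorithm such as KMeans needs.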

It is possible to provide a custom preprocessor in the same manner as for the clusterer. The preprocessor should be a BaseEstimator with the methods fit and transform implemented. It is possible to combine multiple existing preprocessors (such as OneHotEncoder), and even build a Pipeline. See default_processor and ClusterBasedGenerator.fit for examples of how to use these scikit-learn features.

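```python
from sklearn.base import BaseEstimator
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def custom_preprocessor() -> BaseEstimator:
    # Column lists (rather than bare strings) pass a 2D selection to each
    # transformer, which StandardScaler and OneHotEncoder require.
    return make_column_transformer(
        (StandardScaler(), ['age']),
        (OneHotEncoder(), ['education']),
        ('drop', ['marital-status']),
    )


generator = ClusterBasedGenerator(preprocessor=custom_preprocessor())
```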

Owner

Name: TNO - PET Lab - Synthetic Data Generation (SDG)
Login: TNO-SDG
Kind: organization
Email: petlab@tno.nl

Repositories: 1
Profile: https://github.com/TNO-SDG

Part of TNO PET Lab

Citation (CITATION.cff)

cff-version: 1.2.0
license: Apache-2.0
message: If you use this software, please cite it using these metadata.
authors:
  - name: TNO PET Lab
    city: The Hague
    country: NL
    email: petlab@tno.nl
    website: https://pet.tno.nl
type: software
url: https://pet.tno.nl
contact:
  - name: TNO PET Lab
    city: The Hague
    country: NL
    email: petlab@tno.nl
    website: https://pet.tno.nl
repository-code: https://github.com/TNO-SDG/gen.cluster_based
repository-artifact: https://pypi.org/project/tno.sdg.tabular.gen.cluster_based
title: TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Generation - Cluster Based
version: 0.2.0
date-released: 2024-12-10

Packages

Total packages: 1
Total downloads:
- pypi 11 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 1
Total maintainers: 1

pypi.org: tno.sdg.tabular.gen.cluster-based

Cluster Based Synthetic Data Generation

Homepage: https://pet.tno.nl/
Documentation: https://docs.pet.tno.nl/sdg/tabular/gen/cluster_based/0.2.0
License: Apache License, Version 2.0
Latest release: 0.2.0 (published over 1 year ago)

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 11 Last month

Rankings

Dependent packages count: 9.9%

Average: 32.8%

Dependent repos count: 55.8%

Maintainers (1)

PETLab

Last synced: 10 months ago

Dependencies

pyproject.toml pypi

pandas >2.0,<3.0
scikit-learn >=1.0,<2.0
typing_extensions >=4.4; python_version<'3.12'
