syndat

Synthetic data quality evaluation & visualization

https://github.com/scai-bio/syndat

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
✓ Institutional organization owner: Organization scai-bio has institutional domain (www.scai.fraunhofer.de)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary

Keywords

data-quality data-visualization synthetic-data

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: SCAI-BIO
License: mit
Language: Python
Default Branch: main
Homepage: https://syndat.readthedocs.io
Size: 347 KB

Statistics

Stars: 3
Watchers: 3
Forks: 0
Open Issues: 4
Releases: 24

Topics

data-quality data-visualization synthetic-data

Created over 2 years ago · Last pushed 10 months ago

Metadata Files

Readme, License, Citation

Syndat

Syndat is a software package that provides basic functionality for the evaluation and visualisation of synthetic data. Quality scores can be computed on three base metrics (Discrimination, Correlation and Distribution), and data can be visualized to inspect correlation structures or statistical distributions.

Syndat also allows users to generate stratified and interpretable visualisations, including raincloud plots, GOF plots, and trajectory comparisons, offering deeper insights into the quality of synthetic clinical data across different subgroups.

Installation

Install via pip:

pip install syndat

Usage

Fidelity metrics

Jensen-Shannon Distance

The Jensen-Shannon distance is a measure of similarity between two probability distributions. In our case, we compute a probability distribution for each feature in both datasets and can thus compare the statistical similarity of the two dataframes feature by feature.

It is bounded between 0 and 1, with 0 indicating identical distributions.
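To make the bounds concrete, here is a minimal standard-library sketch of the Jensen-Shannon distance for a single categorical feature. It assumes base-2 logarithms (which give the 0 to 1 range) and uses hypothetical helper names; it illustrates the measure itself and is not Syndat's implementation.

```python
import math
from collections import Counter


def distribution(values, categories):
    """Relative frequency of each category in `values`."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, so within [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Illustrative categorical feature (assumed data)
real = ["A", "B", "A", "B", "C"]
synthetic = ["A", "B", "A", "C", "C"]
categories = sorted(set(real) | set(synthetic))

p = distribution(real, categories)
q = distribution(synthetic, categories)

jsd_same = jensen_shannon_distance(p, p)                         # identical distributions
jsd_diff = jensen_shannon_distance(p, q)                         # partially overlapping
jsd_disjoint = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])   # no overlap at all
print(jsd_same, jsd_diff, jsd_disjoint)
```

Identical distributions sit at the lower bound and fully disjoint ones at the upper bound; everything else falls in between.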

(Normalized) Correlation Difference

In addition to statistical similarity between the same features, we also want to make sure to preserve the correlations across different features. The normalized correlation difference measures the similarity of the correlation matrix of two dataframes.

A low correlation difference near zero indicates that the correlation structure of the synthetic data is similar to the real data.
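The idea can be sketched with the standard library alone. The normalization used here (Frobenius norm of the difference divided by the Frobenius norm of the real correlation matrix) is an assumption chosen for illustration, as are the function names and the toy data; Syndat's own normalization may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def frobenius_norm(matrix):
    return math.sqrt(sum(v * v for row in matrix for v in row))


def correlation_difference(real, synthetic):
    """Frobenius norm of the matrix difference, relative to the real matrix (assumed normalization)."""
    corr_real = correlation_matrix(real)
    corr_syn = correlation_matrix(synthetic)
    diff = [[r - s for r, s in zip(row_r, row_s)]
            for row_r, row_s in zip(corr_real, corr_syn)]
    return frobenius_norm(diff) / frobenius_norm(corr_real)


# Toy numeric data (assumed): the synthetic set loses the strong negative correlation
real = {"age": [50, 60, 70, 80], "score": [30, 28, 25, 20]}
synthetic = {"age": [52, 61, 69, 79], "score": [22, 29, 24, 27]}

diff_identical = correlation_difference(real, real)
diff_synthetic = correlation_difference(real, synthetic)
print(diff_identical, diff_synthetic)
```

Comparing a dataset with itself gives a difference of zero, while the synthetic data, which does not reproduce the relationship between the two columns, gives a clearly positive value.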

Discriminator AUC

A classifier is trained to discriminate between real and synthetic data. Based on the Receiver Operating Characteristic (ROC) curve, we compute the area under the curve (AUC) as a measure of how well the classifier can distinguish between the two datasets.

An AUC of 0.5 indicates that the classifier is unable to distinguish between the two datasets, while an AUC of 1.0 indicates perfect discrimination.
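The two reference values can be checked with a small sketch that computes the AUC directly from classifier scores, using its rank-based interpretation: the probability that a randomly chosen synthetic row receives a higher "synthetic" score than a randomly chosen real row. The scores below are made up, and the classifier that would produce them is left out; this is not how Syndat trains or evaluates its discriminator.

```python
def roc_auc(scores_real, scores_synthetic):
    """AUC as the fraction of (synthetic, real) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for s in scores_synthetic:
        for r in scores_real:
            if s > r:
                wins += 1.0
            elif s == r:
                wins += 0.5
    return wins / (len(scores_real) * len(scores_synthetic))


# Hypothetical classifier outputs: estimated probability that a row is synthetic
auc_perfect = roc_auc([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])   # classes fully separated
auc_chance = roc_auc([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])    # scores carry no information
auc_partial = roc_auc([0.2, 0.4, 0.6], [0.3, 0.5, 0.7])   # overlapping score ranges
print(auc_perfect, auc_chance, auc_partial)
```

For synthetic data evaluation, values close to the chance level are the desirable outcome, since they mean the classifier cannot tell the datasets apart.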

Example usage:

```python
import pandas as pd
from syndat.metrics import (
    jensen_shannon_distance,
    normalized_correlation_difference,
    discriminator_auc,
)

real = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['A', 'B', 'A', 'B', 'C']
})

synthetic = pd.DataFrame({
    'feature1': [1, 2, 2, 3, 3],
    'feature2': ['A', 'B', 'A', 'C', 'C']
})

# Per-feature Jensen-Shannon distance
print(jensen_shannon_distance(real, synthetic))
# {'feature1': 0.4990215421876156, 'feature2': 0.22141025172133794}

# Difference between the two correlation matrices
print(normalized_correlation_difference(real, synthetic))
# 0.24571345029108108

# AUC of a classifier trained to separate real from synthetic rows
print(discriminator_auc(real, synthetic))
# 0.6
```

Scoring Functions

For convenience and easier interpretation, a normalized score can be computed for each of the metrics instead:

```python
import syndat

# JSD score is aggregated over all features
distribution_similarity_score = syndat.scores.distribution(real, synthetic)
discrimination_score = syndat.scores.discrimination(real, synthetic)
correlation_score = syndat.scores.correlation(real, synthetic)
```

Scores are defined in a range of 0-100, with a higher score corresponding to better data fidelity.
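To give a feel for how a raw metric can be turned into a 0 to 100 score where higher is better, the sketch below uses simple linear mappings. These formulas and function names are purely illustrative assumptions and are not taken from Syndat; the library's scoring functions may aggregate and scale differently.

```python
def distribution_score(jsd_per_feature):
    """Average the per-feature distances, then invert so that 0 distance maps to 100."""
    mean_jsd = sum(jsd_per_feature.values()) / len(jsd_per_feature)
    return (1.0 - mean_jsd) * 100.0


def discrimination_score(auc):
    """Chance level (AUC 0.5) maps to 100; perfect separation (AUC 0 or 1) maps to 0."""
    return (1.0 - 2.0 * abs(auc - 0.5)) * 100.0


def correlation_score(correlation_difference):
    """Zero difference maps to 100; differences of 1 or more are floored at 0."""
    return max(0.0, 1.0 - correlation_difference) * 100.0


# Raw metric values as printed in the usage example above
dist = distribution_score({"feature1": 0.4990215421876156,
                           "feature2": 0.22141025172133794})
disc = discrimination_score(0.6)
corr = correlation_score(0.24571345029108108)
print(round(dist, 1), round(disc, 1), round(corr, 1))
```

Whatever the exact mapping, the point of a normalized score is that all three metrics can be read on the same scale and in the same direction.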

Visualization

Visualize real vs. synthetic data distributions, summary statistics and discriminating features:

```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# plot all feature distributions and correlations and store the image files
syndat.visualization.plot_distributions(real, synthetic, store_destination="results/plots")
syndat.visualization.plot_correlations(real, synthetic, store_destination="results/plots")

# plot and display the distribution of a specific feature
syndat.visualization.plot_numerical_feature("feature_xy", real, synthetic)

# plot a SHAP plot of the features that differentiate real from synthetic data
syndat.visualization.plot_shap_discrimination(real, synthetic)
```

Postprocessing

Postprocess synthetic data to improve data fidelity:

```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# postprocess synthetic data; the second step builds on the output of the first
synthetic_post = syndat.postprocessing.assert_minmax(real, synthetic)
synthetic_post = syndat.postprocessing.normalize_float_precision(real, synthetic_post)
```
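Judging only by their names, the two postprocessing steps appear to constrain synthetic values to the range seen in the real data and to align the number of decimal places. The following standard-library sketch shows what such steps could look like for a single numeric column; it is an assumption about the intent, uses hypothetical function names, and does not reproduce Syndat's implementation.

```python
from decimal import Decimal


def clip_to_real_range(real_values, synthetic_values):
    """Clamp each synthetic value to the [min, max] interval observed in the real data."""
    low, high = min(real_values), max(real_values)
    return [min(max(v, low), high) for v in synthetic_values]


def max_decimals(values):
    """Largest number of significant decimal places among the values."""
    return max(
        max(0, -Decimal(str(v)).normalize().as_tuple().exponent)
        for v in values
    )


def match_float_precision(real_values, synthetic_values):
    """Round synthetic values to the decimal precision found in the real data."""
    digits = max_decimals(real_values)
    return [round(v, digits) for v in synthetic_values]


# Toy column (assumed): real data has at most two decimals and lies in [1.5, 4.75]
real_col = [1.5, 2.25, 3.0, 4.75]
synthetic_col = [0.73812, 2.41877, 5.90214]

clipped = clip_to_real_range(real_col, synthetic_col)
post = match_float_precision(real_col, clipped)
print(post)
```

Out-of-range values are pulled back to the nearest observed bound, and implausibly precise values are rounded, both of which remove easy giveaways that a row is synthetic.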

Evaluation and Visualization of Synthetic Clinical Trial Data

An example demonstrating how to compute distribution, discrimination, and correlation scores, as well as how to generate stratified visualizations (GOF, raincloud and other plots), is available in examples/rct_example.py.

Owner

Name: Fraunhofer SCAI Bioinformatics Department
Login: SCAI-BIO
Kind: organization

Website: https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/
Repositories: 29
Profile: https://github.com/SCAI-BIO

Department of Bioinformatics at Fraunhofer SCAI

Citation (CITATION.cff)

cff-version: 1.2.0
title: syndat
type: software
message: If you use this software, please cite the following article.
license: Apache-2.0
language: en

authors:
  - given-names: Tim
    family-names: Adams
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    email: tim.adams@scai.fraunhofer.de
    orcid: https://orcid.org/0000-0002-2823-0102

contributors:
  - given-names: Diego
    family-names: Valderrama
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    email: diego.felipe.valderrama.nino@scai.fraunhofer.de
    orcid: https://orcid.org/0000-0002-3450-0359

repository-code: https://github.com/SCAI-BIO/syndat
repository-artifact: https://pypi.org/project/syndat/
abstract: >
  Syndat is a software package that provides basic functionalities for the evaluation 
  and visualisation of synthetic data. 

keywords:
  - synthetic data
  - data quality
  - data privacy
  - data utility
  - clinical data
  - data visualization

preferred-citation:
  type: article
  title: On the fidelity versus privacy and utility trade-off of synthetic patient data
  authors:
    - family-names: Adams
      given-names: Tim
    - family-names: Birkenbihl
      given-names: Colin
    - family-names: Otte
      given-names: Karen
    - family-names: Ng
      given-names: Hwei Geok
    - family-names: Rieling
      given-names: Jonas Adrian
    - family-names: Näher
      given-names: Anatol-Fiete
    - family-names: Sax
      given-names: Ulrich
    - family-names: Prasser
      given-names: Fabian
    - family-names: Fröhlich
      given-names: Holger
  journal: iScience
  volume: 28
  number: 5
  pages: 112382
  year: 2025
  doi: 10.1016/j.isci.2025.112382

references:
  - type: conference-paper
    title: NFDI4Health workflow and service for synthetic data generation, assessment and risk management
    authors:
      - family-names: Moazemi
        given-names: Sobhan
      - family-names: Adams
        given-names: Tim
      - family-names: Ng
        given-names: Hwei Geok
      - family-names: Kühnel
        given-names: Lisa
      - family-names: Schneider
        given-names: Julian
      - family-names: Näher
        given-names: Anatol-Fiete
      - family-names: Fluck
        given-names: Juliane
      - family-names: Fröhlich
        given-names: Holger
    collection-title: Studies in Health Technology and Informatics
    year: 2024
    doi: 10.3233/SHTI240834
  - type: website
    title: Syndat Web Dashboard
    url: https://syndat.scai.fraunhofer.de/

GitHub Events

Total

Create event: 34
Release event: 10
Issues event: 7
Watch event: 2
Delete event: 22
Issue comment event: 4
Push event: 66
Pull request review comment event: 1
Pull request review event: 3
Pull request event: 43

Last Year

Create event: 34
Release event: 10
Issues event: 7
Watch event: 2
Delete event: 22
Issue comment event: 4
Push event: 66
Pull request review comment event: 1
Pull request review event: 3
Pull request event: 43

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 9
Total pull requests: 59
Average time to close issues: 16 days
Average time to close pull requests: 2 days
Total issue authors: 1
Total pull request authors: 3
Average comments per issue: 0.11
Average comments per pull request: 0.08
Merged pull requests: 45
Bot issues: 0
Bot pull requests: 4

Past Year

Issues: 6
Pull requests: 35
Average time to close issues: about 21 hours
Average time to close pull requests: about 24 hours
Issue authors: 1
Pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.09
Merged pull requests: 27
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

tiadams (9)

Pull Request Authors

tiadams (50)
DiegoValderrama (5)
dependabot[bot] (4)

Top Labels

Issue Labels

documentation (3)
enhancement (1)
bug (1)

Pull Request Labels

dependencies (4)

Packages

Total packages: 1
Total downloads:
- pypi 329 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 23
Total maintainers: 1

pypi.org: syndat

A library for evaluation & visualization of synthetic data.

Homepage: https://github.com/SCAI-BIO/syndat
Documentation: https://github.com/SCAI-BIO/syndat#readme
License: MIT
Latest release: 0.13.2 (published 10 months ago)

Versions: 23
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 329 last month

Rankings

Dependent packages count: 9.9%

Average: 37.8%

Dependent repos count: 65.7%

Maintainers (1)

tiadams