syndat

Synthetic data quality evaluation & visualization

https://github.com/scai-bio/syndat

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
    Organization scai-bio has institutional domain (www.scai.fraunhofer.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Keywords

data-quality data-visualization synthetic-data
Last synced: 6 months ago

Repository

Synthetic data quality evaluation & visualization

Basic Info
Statistics
  • Stars: 3
  • Watchers: 3
  • Forks: 0
  • Open Issues: 4
  • Releases: 24
Topics
data-quality data-visualization synthetic-data
Created about 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

Syndat

Syndat is a software package that provides basic functionalities for the evaluation and visualisation of synthetic data. Quality scores can be computed on three base metrics (Discrimination, Correlation and Distribution), and data can be visualized to inspect correlation structures or statistical distribution plots.

Syndat also allows users to generate stratified and interpretable visualisations, including raincloud plots, GOF plots, and trajectory comparisons, offering deeper insights into the quality of synthetic clinical data across different subgroups.

Installation

Install via pip:

```bash
pip install syndat
```

Usage

Fidelity metrics

Jensen-Shannon Distance

The Jensen-Shannon distance is a measure of similarity between two probability distributions. Here, a probability distribution is estimated for each feature in both datasets, so the per-feature statistical similarity of two dataframes can be compared.

It is bounded between 0 and 1, with 0 indicating identical distributions.
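
As an illustration of the bound above, the following is a minimal standard-library sketch of the Jensen-Shannon distance for a single categorical feature. The helper names `js_distance` and `category_probabilities` are hypothetical and not part of syndat, and the use of base-2 logarithms is an assumption made so that the result falls between 0 and 1.

```python
import math
from collections import Counter


def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two discrete distributions."""
    # Mixture distribution M = (P + Q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute nothing
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The distance is the square root of the divergence
    return math.sqrt(max(divergence, 0.0))


def category_probabilities(values, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


real = ['A', 'B', 'A', 'B', 'C']
synthetic = ['A', 'B', 'A', 'C', 'C']

# Use the union of categories so both distributions share the same support
categories = sorted(set(real) | set(synthetic))
p = category_probabilities(real, categories)
q = category_probabilities(synthetic, categories)

distance = js_distance(p, q)
print(distance)
```

For these two columns the sketch gives a distance of roughly 0.2214, in line with the `feature2` value printed in the usage example. How numerical features are binned before comparison is not covered here.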

(Normalized) Correlation Difference

In addition to statistical similarity between the same features, we also want to make sure to preserve the correlations across different features. The normalized correlation difference measures the similarity of the correlation matrix of two dataframes.

A low correlation difference near zero indicates that the correlation structure of the synthetic data is similar to the real data.
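
One way to picture this metric is sketched below using only the standard library. It compares Pearson correlation matrices through the Frobenius norm of their difference. The normalization constant and the function names (`pearson`, `correlation_matrix`, `correlation_difference`) are assumptions made for illustration; the normalization used inside syndat may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns, names):
    """Pairwise Pearson correlations for the given column names."""
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def correlation_difference(real, synthetic):
    """Frobenius norm of the correlation matrix difference, scaled into [0, 1]."""
    names = list(real)  # assumes both datasets share the same columns
    corr_real = correlation_matrix(real, names)
    corr_syn = correlation_matrix(synthetic, names)
    k = len(names)
    squared = sum(
        (corr_real[i][j] - corr_syn[i][j]) ** 2
        for i in range(k)
        for j in range(k)
    )
    # Off-diagonal entries differ by at most 2 and diagonals are always 1,
    # so the norm is at most 2 * sqrt(k * (k - 1)). Requires k >= 2.
    max_norm = 2 * math.sqrt(k * (k - 1))
    return math.sqrt(squared) / max_norm


real = {"age": [1, 2, 3, 4, 5], "score": [2, 4, 6, 8, 10]}
same_structure = {"age": [5, 4, 3, 2, 1], "score": [10, 8, 6, 4, 2]}
flipped_structure = {"age": [1, 2, 3, 4, 5], "score": [10, 8, 6, 4, 2]}

print(correlation_difference(real, same_structure))     # close to 0
print(correlation_difference(real, flipped_structure))  # close to 1
```

In this toy data, `same_structure` keeps the positive relationship between the two columns, while `flipped_structure` reverses it, which is the worst case for two features.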

Discriminator AUC

A classifier is trained to discriminate between real and synthetic data. Based on the Receiver Operating Characteristic (ROC) curve, we compute the area under the curve (AUC) as a measure of how well the classifier can distinguish between the two datasets.

An AUC of 0.5 indicates that the classifier is unable to distinguish between the two datasets, while an AUC of 1.0 indicates perfect discrimination.
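
To make the two reference values concrete, here is a small standard-library sketch that computes the AUC from classifier scores as the probability that a synthetic sample is ranked above a real one. The classifier itself is left out: the labels and scores below are made-up illustration data, and `roc_auc` is a hypothetical helper, not a syndat function.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# 0 = real sample, 1 = synthetic sample (hypothetical classifier outputs)
labels = [0, 0, 0, 1, 1, 1]
perfect_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
uninformative_scores = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
mixed_scores = [0.1, 0.6, 0.4, 0.5, 0.3, 0.9]

print(roc_auc(labels, perfect_scores))        # classifier separates the sets
print(roc_auc(labels, uninformative_scores))  # classifier cannot tell them apart
print(roc_auc(labels, mixed_scores))          # somewhere in between
```

The pairwise loop is quadratic in the number of samples, which is fine for a sketch; on real data a library routine would be the usual choice.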

Exemplary usage:

```python
import pandas as pd
from syndat.metrics import (
    jensen_shannon_distance,
    normalized_correlation_difference,
    discriminator_auc
)

real = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['A', 'B', 'A', 'B', 'C']
})

synthetic = pd.DataFrame({
    'feature1': [1, 2, 2, 3, 3],
    'feature2': ['A', 'B', 'A', 'C', 'C']
})

print(jensen_shannon_distance(real, synthetic))
# {'feature1': 0.4990215421876156, 'feature2': 0.22141025172133794}

print(normalized_correlation_difference(real, synthetic))
# 0.24571345029108108

print(discriminator_auc(real, synthetic))
# 0.6
```

Scoring Functions

For convenience and easier interpretation, a normalized score can be computed for each of the metrics instead:

```python
import syndat

# real and synthetic are pandas dataframes, as in the metrics example
# JSD score is being aggregated over all features
distribution_similarity_score = syndat.scores.distribution(real, synthetic)
discrimination_score = syndat.scores.discrimination(real, synthetic)
correlation_score = syndat.scores.correlation(real, synthetic)
```

Scores are defined in a range of 0-100, with a higher score corresponding to better data fidelity.
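
The README does not spell out how raw metrics are mapped onto the 0-100 range. The sketch below shows one plausible mapping under stated assumptions: a plain mean over per-feature distances, linear scaling, and an AUC score based on the distance from the chance level of 0.5. All three functions are hypothetical and are not the formulas used by `syndat.scores`.

```python
def distribution_score(jsd_per_feature):
    """Assumed mapping: mean Jensen-Shannon distance 0 -> 100, 1 -> 0."""
    mean_jsd = sum(jsd_per_feature.values()) / len(jsd_per_feature)
    return (1.0 - mean_jsd) * 100.0


def discrimination_score(auc):
    """Assumed mapping: AUC 0.5 (chance level) -> 100, AUC 0.0 or 1.0 -> 0."""
    return (1.0 - 2.0 * abs(auc - 0.5)) * 100.0


def correlation_score(normalized_difference):
    """Assumed mapping: normalized correlation difference 0 -> 100, 1 -> 0."""
    return (1.0 - normalized_difference) * 100.0


print(distribution_score({"feature1": 0.0, "feature2": 0.5}))
print(discrimination_score(0.5))
print(correlation_score(0.25))
```

Whatever the exact formula, the direction is the one stated above: lower distances and an AUC closer to 0.5 should translate into a higher score.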

Visualization

Visualize real vs. synthetic data distributions, summary statistics and discriminating features:

```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# plot all feature distributions and store image files
syndat.visualization.plot_distributions(real, synthetic, store_destination="results/plots")
syndat.visualization.plot_correlations(real, synthetic, store_destination="results/plots")

# plot and display a specific feature distribution plot
syndat.visualization.plot_numerical_feature("featurexy", real, synthetic)

# plot a SHAP plot of the features that differentiate real and synthetic data
syndat.visualization.plot_shap_discrimination(real, synthetic)
```

Postprocessing

Postprocess synthetic data to improve data fidelity:

```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# postprocess synthetic data
synthetic_post = syndat.postprocessing.assert_minmax(real, synthetic)
# pass the result of the first step on so that both steps are applied
synthetic_post = syndat.postprocessing.normalize_float_precision(real, synthetic_post)
```
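
The README does not describe what these two steps do internally. Going only by their names, a rough standard-library sketch of the idea for a single numerical column could look like the following: clip synthetic values to the range observed in the real data, then round them to the number of decimal places found in the real data. Both helpers are hypothetical and make no claim to match syndat's implementation.

```python
from decimal import Decimal


def clip_to_real_range(real_values, synthetic_values):
    """Clip synthetic values to the [min, max] range seen in the real data."""
    low, high = min(real_values), max(real_values)
    return [min(max(value, low), high) for value in synthetic_values]


def decimal_places(value):
    """Number of significant decimal places of a finite number (2.0 -> 0, 2.25 -> 2)."""
    exponent = Decimal(str(value)).normalize().as_tuple().exponent
    return max(0, -exponent)


def round_to_real_precision(real_values, synthetic_values):
    """Round synthetic values to the highest precision found in the real data."""
    places = max(decimal_places(value) for value in real_values)
    return [round(value, places) for value in synthetic_values]


real_column = [1.5, 2.25, 3.0]
synthetic_column = [0.123456, 2.718281, 9.5]

clipped = clip_to_real_range(real_column, synthetic_column)
postprocessed = round_to_real_precision(real_column, clipped)
print(postprocessed)
```

Steps like these remove values that could not occur in the real data, such as out-of-range measurements or implausibly precise floats, which a discriminator could otherwise pick up on.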

Evaluation and Visualization of Synthetic Clinical Trial Data

An example demonstrating how to compute distribution, discrimination, and correlation scores, as well as how to generate stratified visualizations (gof, raincloud and other plots), is available in examples/rct_example.py.

Owner

  • Name: Fraunhofer SCAI Bioinformatics Department
  • Login: SCAI-BIO
  • Kind: organization

Department of Bioinformatics at Fraunhofer SCAI

Citation (CITATION.cff)

cff-version: 1.2.0
title: syndat
type: software
message: If you use this software, please cite the following article.
license: Apache-2.0
language: en

authors:
  - given-names: Tim
    family-names: Adams
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    email: tim.adams@scai.fraunhofer.de
    orcid: https://orcid.org/0000-0002-2823-0102

contributors:
  - given-names: Diego
    family-names: Valderrama
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    email: diego.felipe.valderrama.nino@scai.fraunhofer.de
    orcid: https://orcid.org/0000-0002-3450-0359

repository-code: https://github.com/SCAI-BIO/syndat
repository-artifact: https://pypi.org/project/syndat/
abstract: >
  Syndat is a software package that provides basic functionalities for the evaluation 
  and visualisation of synthetic data. 

keywords:
  - synthetic data
  - data quality
  - data privacy
  - data utility
  - clinical data
  - data visualization

preferred-citation:
  type: article
  title: On the fidelity versus privacy and utility trade-off of synthetic patient data
  authors:
    - family-names: Adams
      given-names: Tim
    - family-names: Birkenbihl
      given-names: Colin
    - family-names: Otte
      given-names: Karen
    - family-names: Ng
      given-names: Hwei Geok
    - family-names: Rieling
      given-names: Jonas Adrian
    - family-names: Näher
      given-names: Anatol-Fiete
    - family-names: Sax
      given-names: Ulrich
    - family-names: Prasser
      given-names: Fabian
    - family-names: Fröhlich
      given-names: Holger
  journal: iScience
  volume: 28
  number: 5
  pages: 112382
  year: 2025
  doi: 10.1016/j.isci.2025.112382

references:
  - type: conference-paper
    title: NFDI4Health workflow and service for synthetic data generation, assessment and risk management
    authors:
      - family-names: Moazemi
        given-names: Sobhan
      - family-names: Adams
        given-names: Tim
      - family-names: Ng
        given-names: Hwei Geok
      - family-names: Kühnel
        given-names: Lisa
      - family-names: Schneider
        given-names: Julian
      - family-names: Näher
        given-names: Anatol-Fiete
      - family-names: Fluck
        given-names: Juliane
      - family-names: Fröhlich
        given-names: Holger
    collection-title: Studies in Health Technology and Informatics
    year: 2024
    doi: 10.3233/SHTI240834
  - type: website
    title: Syndat Web Dashboard
    url: https://syndat.scai.fraunhofer.de/

GitHub Events

Total
  • Create event: 34
  • Release event: 10
  • Issues event: 7
  • Watch event: 2
  • Delete event: 22
  • Issue comment event: 4
  • Push event: 66
  • Pull request review comment event: 1
  • Pull request review event: 3
  • Pull request event: 43
Last Year
  • Create event: 34
  • Release event: 10
  • Issues event: 7
  • Watch event: 2
  • Delete event: 22
  • Issue comment event: 4
  • Push event: 66
  • Pull request review comment event: 1
  • Pull request review event: 3
  • Pull request event: 43

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 9
  • Total pull requests: 59
  • Average time to close issues: 16 days
  • Average time to close pull requests: 2 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 0.11
  • Average comments per pull request: 0.08
  • Merged pull requests: 45
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 6
  • Pull requests: 35
  • Average time to close issues: about 21 hours
  • Average time to close pull requests: about 24 hours
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.09
  • Merged pull requests: 27
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tiadams (9)
Pull Request Authors
  • tiadams (50)
  • DiegoValderrama (5)
  • dependabot[bot] (4)
Top Labels
Issue Labels
documentation (3) enhancement (1) bug (1)
Pull Request Labels
dependencies (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 329 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 23
  • Total maintainers: 1
pypi.org: syndat

A library for evaluation & visualization of synthetic data.

  • Versions: 23
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 329 Last month
Rankings
Dependent packages count: 9.9%
Average: 37.8%
Dependent repos count: 65.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/python-publish.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
requirements.txt pypi
  • matplotlib *
  • numpy *
  • pandas *
  • plotly *
  • scikit-learn *
  • scipy *
  • seaborn *
  • setuptools ==69.0.2
setup.py pypi
.github/workflows/tests.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite