Science Score: 75.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ✓ Institutional organization owner: Organization scai-bio has institutional domain (www.scai.fraunhofer.de)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary
Repository
Synthetic data quality evaluation & visualization
Basic Info
- Host: GitHub
- Owner: SCAI-BIO
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: https://syndat.readthedocs.io
- Size: 347 KB
Statistics
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 4
- Releases: 24
Metadata Files
README.md
Syndat
Syndat is a software package that provides basic functionality for the evaluation and visualisation of synthetic data. Quality scores can be computed on three base metrics (Discrimination, Correlation and Distribution), and data can be visualized to inspect correlation structures or statistical distributions.
Syndat also lets users generate stratified and interpretable visualisations, including raincloud plots, goodness-of-fit (GOF) plots, and trajectory comparisons, which give deeper insight into the quality of synthetic clinical data across different subgroups.
Installation
Install via pip:
```bash
pip install syndat
```
Usage
Fidelity metrics
Jensen-Shannon Distance
The Jensen-Shannon distance measures the similarity between two probability distributions. Here, a probability distribution is computed for each feature in both datasets, so the statistical similarity of two dataframes can be compared feature by feature.
The distance is bounded between 0 and 1, with 0 indicating identical distributions.
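To make the definition concrete, the following is a minimal sketch of the Jensen-Shannon distance for a single categorical feature. It is not Syndat's implementation: the function name `jensen_shannon_distance_sketch` is hypothetical, and the choice of base-2 logarithms (which gives the 0 to 1 bound) and of relative category frequencies as the probability distribution are assumptions.

```python
import math
from collections import Counter


def jensen_shannon_distance_sketch(real_values, synthetic_values):
    """Jensen-Shannon distance (base 2) between two samples of one categorical feature.

    Hypothetical helper for illustration only, not part of the syndat API.
    """
    categories = set(real_values) | set(synthetic_values)
    real_counts = Counter(real_values)
    synthetic_counts = Counter(synthetic_values)

    # Relative frequencies over the union of observed categories.
    p = [real_counts[c] / len(real_values) for c in categories]
    q = [synthetic_counts[c] / len(synthetic_values) for c in categories]
    # Mixture distribution M = (P + Q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl_divergence(a, b):
        # Terms with a_i == 0 contribute nothing to the divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # The distance is the square root of the divergence.
    return math.sqrt(max(divergence, 0.0))


real_feature = ["A", "B", "A", "B", "C"]
synthetic_feature = ["A", "B", "A", "C", "C"]
print(jensen_shannon_distance_sketch(real_feature, synthetic_feature))
```

Identical samples give 0 and samples with no shared category give 1. For the two `feature2` columns used in this README's example data, the sketch evaluates to roughly 0.2214.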
(Normalized) Correlation Difference
Beyond the statistical similarity of individual features, synthetic data should also preserve the correlations between different features. The normalized correlation difference measures how similar the correlation matrices of two dataframes are.
A correlation difference near zero indicates that the correlation structure of the synthetic data is similar to that of the real data.
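As an illustration of the idea, the sketch below compares the Pearson correlation matrices of two numeric datasets. The normalization used here (mean absolute off-diagonal difference divided by 2, the largest possible gap between two correlation coefficients) is an assumption chosen for clarity and is not necessarily the formula Syndat uses; `correlation_difference_sketch` is a hypothetical name.

```python
import numpy as np


def correlation_difference_sketch(real, synthetic):
    """Mean absolute off-diagonal difference of two correlation matrices, scaled to [0, 1].

    Hypothetical helper for illustration only, not part of the syndat API.
    Both inputs are 2D arrays (rows = samples, columns = numeric features).
    """
    real_corr = np.corrcoef(np.asarray(real, dtype=float), rowvar=False)
    synthetic_corr = np.corrcoef(np.asarray(synthetic, dtype=float), rowvar=False)

    # Ignore the diagonal, which is always 1 in both matrices.
    off_diagonal = ~np.eye(real_corr.shape[0], dtype=bool)
    abs_diff = np.abs(real_corr - synthetic_corr)[off_diagonal]

    # Correlations lie in [-1, 1], so a single difference is at most 2.
    return float(abs_diff.mean() / 2.0)


real_data = [[1, 2], [2, 4], [3, 6], [4, 8]]       # perfectly correlated features
inverted_data = [[1, 8], [2, 6], [3, 4], [4, 2]]   # perfectly anti-correlated features
print(correlation_difference_sketch(real_data, real_data))
print(correlation_difference_sketch(real_data, inverted_data))
```

Under this normalization, identical correlation structure yields a value near 0 and a full sign flip of every correlation yields a value near 1.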
Discriminator AUC
A classifier is trained to discriminate between real and synthetic data. Based on the Receiver Operating Characteristic (ROC) curve, we compute the area under the curve (AUC) as a measure of how well the classifier can distinguish between the two datasets.
An AUC of 0.5 indicates that the classifier is unable to distinguish between the two datasets, while an AUC of 1.0 indicates perfect discrimination.
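The procedure can be sketched with scikit-learn as follows. This is a sketch under stated assumptions rather than Syndat's implementation: the random forest classifier, the 5-fold cross-validation, and the name `discriminator_auc_sketch` are all choices made for this illustration, and the inputs are assumed to be purely numeric.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def discriminator_auc_sketch(real, synthetic, seed=0):
    """Cross-validated ROC AUC of a classifier separating real from synthetic rows.

    Hypothetical helper for illustration only, not part of the syndat API.
    """
    features = np.vstack([real, synthetic])
    # Label real rows 0 and synthetic rows 1.
    labels = np.concatenate([
        np.zeros(len(real), dtype=int),
        np.ones(len(synthetic), dtype=int),
    ])

    classifier = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Out-of-fold probabilities avoid scoring the classifier on its own training rows.
    probabilities = cross_val_predict(
        classifier, features, labels, cv=5, method="predict_proba"
    )[:, 1]
    return roc_auc_score(labels, probabilities)


rng = np.random.default_rng(0)
real_sample = rng.normal(0.0, 1.0, size=(200, 3))
shifted_sample = rng.normal(5.0, 1.0, size=(200, 3))
print(discriminator_auc_sketch(real_sample, shifted_sample))
```

A synthetic sample drawn far from the real one, as in the usage lines above, is easy to separate and pushes the AUC towards 1.0, while a sample drawn from the same distribution should stay close to 0.5.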
Example usage:
```python
import pandas as pd
from syndat.metrics import (
    jensen_shannon_distance,
    normalized_correlation_difference,
    discriminator_auc,
)

real = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['A', 'B', 'A', 'B', 'C']
})

synthetic = pd.DataFrame({
    'feature1': [1, 2, 2, 3, 3],
    'feature2': ['A', 'B', 'A', 'C', 'C']
})

print(jensen_shannon_distance(real, synthetic))
# {'feature1': 0.4990215421876156, 'feature2': 0.22141025172133794}

print(normalized_correlation_difference(real, synthetic))
# 0.24571345029108108

print(discriminator_auc(real, synthetic))
# 0.6
```
Scoring Functions
For convenience and easier interpretation, a normalized score can be computed for each of the metrics instead:
```python
import syndat

# The JSD score is aggregated over all features
distribution_similarity_score = syndat.scores.distribution(real, synthetic)
discrimination_score = syndat.scores.discrimination(real, synthetic)
correlation_score = syndat.scores.correlation(real, synthetic)
```
Scores are defined in a range of 0-100, with a higher score corresponding to better data fidelity.
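To show how raw metrics could be turned into such a 0-100 scale, here is a minimal sketch. The mappings below are assumptions made for illustration and are not taken from Syndat's source; all three function names are hypothetical.

```python
def _clamp_to_score_range(value):
    # Keep the result inside the documented 0-100 range.
    return max(0.0, min(100.0, value))


def distribution_score_sketch(jsd_per_feature):
    """Assumed mapping: average the per-feature JSD, then invert and rescale."""
    mean_jsd = sum(jsd_per_feature.values()) / len(jsd_per_feature)
    return _clamp_to_score_range(100.0 * (1.0 - mean_jsd))


def discrimination_score_sketch(auc):
    """Assumed mapping: AUC 0.5 (indistinguishable) -> 100, AUC 0 or 1 -> 0."""
    return _clamp_to_score_range(100.0 * (1.0 - 2.0 * abs(auc - 0.5)))


def correlation_score_sketch(correlation_difference):
    """Assumed mapping: a correlation difference of 0 -> 100, of 1 -> 0."""
    return _clamp_to_score_range(100.0 * (1.0 - correlation_difference))


print(distribution_score_sketch({"feature1": 0.0, "feature2": 1.0}))  # 50.0
print(discrimination_score_sketch(0.75))                              # 50.0
print(correlation_score_sketch(0.25))                                 # 75.0
```

Each mapping inverts the metric so that a higher score corresponds to better fidelity, matching the direction described above.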
Visualization
Visualize real vs. synthetic data distributions, summary statistics and discriminating features:
```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# plot all feature distributions and store image files
syndat.visualization.plot_distributions(real, synthetic, store_destination="results/plots")
syndat.visualization.plot_correlations(real, synthetic, store_destination="results/plots")

# plot and display a specific feature distribution plot
syndat.visualization.plot_numerical_feature("feature_xy", real, synthetic)

# plot a SHAP plot of the features that differentiate real and synthetic data
syndat.visualization.plot_shap_discrimination(real, synthetic)
```
Postprocessing
Postprocess synthetic data to improve data fidelity:
```python
import pandas as pd
import syndat

real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

# postprocess synthetic data
synthetic_post = syndat.postprocessing.assert_minmax(real, synthetic)
# pass the previous result on so that both postprocessing steps are applied
synthetic_post = syndat.postprocessing.normalize_float_precision(real, synthetic_post)
```
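As a rough illustration of what a min/max postprocessing step can look like, the sketch below clips each numeric synthetic column to the value range observed in the real data. Whether Syndat clips, drops, or otherwise handles out-of-range values is not stated here, so treat the clipping behaviour and the name `clip_to_real_range` as assumptions.

```python
import pandas as pd


def clip_to_real_range(real, synthetic):
    """Clip numeric synthetic columns to the [min, max] range seen in the real data.

    Hypothetical helper for illustration only, not part of the syndat API.
    """
    clipped = synthetic.copy()
    numeric_columns = real.select_dtypes(include="number").columns
    # Only touch columns that exist in both dataframes.
    for column in numeric_columns.intersection(clipped.columns):
        clipped[column] = clipped[column].clip(
            lower=real[column].min(), upper=real[column].max()
        )
    return clipped


real_df = pd.DataFrame({"age": [20, 30, 40]})
synthetic_df = pd.DataFrame({"age": [10, 35, 50]})
print(clip_to_real_range(real_df, synthetic_df)["age"].tolist())  # [20, 35, 40]
```

The input dataframe is copied rather than modified in place, so the unprocessed synthetic data stays available for comparison.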
Evaluation and Visualization of Synthetic Clinical Trial Data
An example that computes distribution, discrimination, and correlation scores and generates stratified visualizations (GOF, raincloud and other plots) is available in examples/rct_example.py.
Owner
- Name: Fraunhofer SCAI Bioinformatics Department
- Login: SCAI-BIO
- Kind: organization
- Website: https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/
- Repositories: 29
- Profile: https://github.com/SCAI-BIO
Department of Bioinformatics at Fraunhofer SCAI
Citation (CITATION.cff)
cff-version: 1.2.0
title: syndat
type: software
message: If you use this software, please cite the following article.
license: Apache-2.0
language: en
authors:
  - given-names: Tim
    family-names: Adams
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    email: tim.adams@scai.fraunhofer.de
    orcid: https://orcid.org/0000-0002-2823-0102
contributors:
  - given-names: Diego
    family-names: Valderrama
    affiliation: Fraunhofer Institute for Algorithms and Scientific Computing SCAI
    email: diego.felipe.valderrama.nino@scai.fraunhofer.de
    orcid: https://orcid.org/0000-0002-3450-0359
repository-code: https://github.com/SCAI-BIO/syndat
repository-artifact: https://pypi.org/project/syndat/
abstract: >
  Syndat is a software package that provides basic functionalities for the evaluation
  and visualisation of synthetic data.
keywords:
  - synthetic data
  - data quality
  - data privacy
  - data utility
  - clinical data
  - data visualization
preferred-citation:
  type: article
  title: On the fidelity versus privacy and utility trade-off of synthetic patient data
  authors:
    - family-names: Adams
      given-names: Tim
    - family-names: Birkenbihl
      given-names: Colin
    - family-names: Otte
      given-names: Karen
    - family-names: Ng
      given-names: Hwei Geok
    - family-names: Rieling
      given-names: Jonas Adrian
    - family-names: Näher
      given-names: Anatol-Fiete
    - family-names: Sax
      given-names: Ulrich
    - family-names: Prasser
      given-names: Fabian
    - family-names: Fröhlich
      given-names: Holger
  journal: iScience
  volume: 28
  number: 5
  pages: 112382
  year: 2025
  doi: 10.1016/j.isci.2025.112382
references:
  - type: conference-paper
    title: NFDI4Health workflow and service for synthetic data generation, assessment and risk management
    authors:
      - family-names: Moazemi
        given-names: Sobhan
      - family-names: Adams
        given-names: Tim
      - family-names: Ng
        given-names: Hwei Geok
      - family-names: Kühnel
        given-names: Lisa
      - family-names: Schneider
        given-names: Julian
      - family-names: Näher
        given-names: Anatol-Fiete
      - family-names: Fluck
        given-names: Juliane
      - family-names: Fröhlich
        given-names: Holger
    collection-title: Studies in Health Technology and Informatics
    year: 2024
    doi: 10.3233/SHTI240834
  - type: website
    title: Syndat Web Dashboard
    url: https://syndat.scai.fraunhofer.de/
GitHub Events
Total
- Create event: 34
- Release event: 10
- Issues event: 7
- Watch event: 2
- Delete event: 22
- Issue comment event: 4
- Push event: 66
- Pull request review comment event: 1
- Pull request review event: 3
- Pull request event: 43
Last Year
- Create event: 34
- Release event: 10
- Issues event: 7
- Watch event: 2
- Delete event: 22
- Issue comment event: 4
- Push event: 66
- Pull request review comment event: 1
- Pull request review event: 3
- Pull request event: 43
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 9
- Total pull requests: 59
- Average time to close issues: 16 days
- Average time to close pull requests: 2 days
- Total issue authors: 1
- Total pull request authors: 3
- Average comments per issue: 0.11
- Average comments per pull request: 0.08
- Merged pull requests: 45
- Bot issues: 0
- Bot pull requests: 4
Past Year
- Issues: 6
- Pull requests: 35
- Average time to close issues: about 21 hours
- Average time to close pull requests: about 24 hours
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.09
- Merged pull requests: 27
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- tiadams (9)
Pull Request Authors
- tiadams (50)
- DiegoValderrama (5)
- dependabot[bot] (4)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 329 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 23
- Total maintainers: 1
pypi.org: syndat
A library for evaluation & visualization of synthetic data.
- Homepage: https://github.com/SCAI-BIO/syndat
- Documentation: https://github.com/SCAI-BIO/syndat#readme
- License: MIT
- Latest release: 0.13.2 (published 6 months ago)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
- matplotlib *
- numpy *
- pandas *
- plotly *
- scikit-learn *
- scipy *
- seaborn *
- setuptools ==69.0.2