https://github.com/bmi-labmedinfo/synthcheck

Dashboard to evaluate synthetic data quality

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: found 2 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (15.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: bmi-labmedinfo
License: mit
Language: Python
Default Branch: main
Size: 71.3 KB

Statistics

Stars: 0
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed over 2 years ago

Metadata Files

Readme License

SynthCheck: a dashboard to evaluate synthetic data quality

Table of Contents

About The Project
Installation
Application Structure
License

About The Project

Machine Learning and Artificial Intelligence are increasingly used to solve health-related problems, such as predicting prognosis from Electronic Health Records or detecting patterns in multi-omics data. Data plays a central role in the development of such systems, but handling patient data raises concerns, and regulators have underlined the need to protect patients' privacy. In recent years, this has led to growing interest in replacing original data (derived from real patients) with synthetic data that mimic the main statistical characteristics of their real counterparts. Regardless of the method used to generate them, it is essential to assess the quality of the synthetic data. To address this need, we created a Dash application that users can install and run on their own computers. The application lets users upload both original and synthetic data and computes metrics that assess resemblance, utility, and privacy. Users can also download a report containing the results. (DOI: 10.5220/0012558700003657)

Installation

This repository provides a Conda environment configuration file (synthcheck_env.yml) to streamline the setup process. Follow these steps to create the environment:

[!IMPORTANT] Make sure you have Conda installed. If not, install Conda before proceeding.

Steps to Create the Environment

Create the Conda Environment

Run the following command to create the environment using the provided .yml file:

    conda env create -f synthcheck_env.yml

This command sets up a Conda environment whose name is taken from the synthcheck_env.yml file.

Activate the Environment

Once the environment is created, activate it using:

    conda activate synthcheck_env

Running the Code

Once the Conda environment is activated, start the application with:

    python SynthCheck_app.py

Additional Notes

With the environment active, you can run the application. To deactivate the environment when you are done, use:

    conda deactivate

Application Structure

The application is organized into two main sections:

Data Upload for Quality Evaluation

The data upload step has two components:

1. Uploading Original and Synthetic Datasets

Users are prompted to upload two CSV files:

- Original Dataset: the dataset used to generate the synthetic data (example original dataset).
- Synthetic Dataset: the synthetic data whose quality is to be evaluated (example synthetic dataset).

[!TIP] Ensure that the levels of each categorical feature are encoded as numerical values (e.g., 'benign' = 0 and 'malign' = 1).
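
As a minimal sketch of the encoding described in the tip, assuming pandas is available, the snippet below maps string labels to integer codes before the data is exported to CSV. The column names, the `diagnosis_codes` mapping, and the sample values are hypothetical and are not taken from the SynthCheck example files.

```python
import pandas as pd

# Hypothetical dataset with one string-valued categorical feature.
df = pd.DataFrame({
    "age": [54, 61, 47, 70],
    "diagnosis": ["benign", "malign", "benign", "malign"],
})

# Explicit mapping, so the same codes can be reused for the synthetic dataset.
diagnosis_codes = {"benign": 0, "malign": 1}
df["diagnosis"] = df["diagnosis"].map(diagnosis_codes)

# Labels missing from the mapping become NaN; fail early rather than upload them.
if df["diagnosis"].isna().any():
    raise ValueError("Found labels missing from diagnosis_codes")

df["diagnosis"] = df["diagnosis"].astype(int)

# CSV text ready to be written to a file and uploaded.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Using one explicit mapping for both the original and the synthetic dataset keeps the codes aligned, which an automatically inferred encoding would not guarantee.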

2. Feature Type Descriptor File

In addition to the datasets, users must upload a descriptor file in CSV format (example feature type file). This file has two columns: Feature, which names a dataset column, and Type, which states the kind of data that column holds.

Example:

| Feature   | Type        |
|-----------|-------------|
| Age       | numerical   |
| Gender    | categorical |
| Income    | numerical   |
| Education | categorical |

[!WARNING] The accepted values in the 'Type' column are exclusively 'numerical' and 'categorical'. Additionally, the file must include column headers.
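
The constraints in the warning can be checked before uploading. The following is a sketch using only the Python standard library; the helper `validate_descriptor` is hypothetical and not part of SynthCheck. It also assumes that the headers are spelled exactly `Feature` and `Type`, as in the example table, and that every dataset column should appear in the descriptor; the README does not state either requirement explicitly.

```python
import csv
import io

ALLOWED_TYPES = {"numerical", "categorical"}


def validate_descriptor(csv_text, dataset_columns):
    """Return a {feature: type} dict, or raise ValueError if the descriptor is malformed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: headers are exactly "Feature" and "Type", in that order.
    if reader.fieldnames != ["Feature", "Type"]:
        raise ValueError(f"Expected headers Feature,Type; got {reader.fieldnames}")

    feature_types = {}
    for row in reader:
        if row["Type"] not in ALLOWED_TYPES:
            raise ValueError(
                f"Invalid type {row['Type']!r} for feature {row['Feature']!r}"
            )
        feature_types[row["Feature"]] = row["Type"]

    # Assumption: every column of the dataset needs a declared type.
    missing = set(dataset_columns) - set(feature_types)
    if missing:
        raise ValueError(f"Features without a declared type: {sorted(missing)}")
    return feature_types


descriptor = (
    "Feature,Type\n"
    "Age,numerical\n"
    "Gender,categorical\n"
    "Income,numerical\n"
    "Education,categorical\n"
)
feature_types = validate_descriptor(
    descriptor, ["Age", "Gender", "Income", "Education"]
)
print(feature_types)
```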

Quality Assessment of Synthetic Data

The second section lets users assess the quality of the uploaded synthetic data. It comprises three subsections, each implementing a distinct kind of quality analysis.

Resemblance Section

This section provides access to three subsections:

- URA Analysis: conducts various statistical tests and distance metric comparisons for both numerical and categorical features.
- MRA Analysis: computes metrics related to Multiple Resemblance Analysis, such as correlation matrices, outlier analysis, explained variance analysis, and UMAP projections.
- DLA Analysis: presents the values of the performance metrics for each classifier used in the Data Labeling Analysis.
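
To illustrate the kind of per-feature comparison a univariate resemblance analysis involves, here is a sketch using SciPy. The README does not list which tests and distances SynthCheck runs, so the Kolmogorov-Smirnov test, the Wasserstein distance, and the chi-square test below are assumptions chosen as common options, and the data is randomly generated stand-in data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical numerical feature: real values vs. a slightly shifted synthetic copy.
real_age = rng.normal(loc=60, scale=10, size=500)
synth_age = rng.normal(loc=61, scale=11, size=500)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# two samples come from different distributions.
ks_stat, ks_p = stats.ks_2samp(real_age, synth_age)

# Wasserstein (earth mover's) distance between the two empirical distributions.
w_dist = stats.wasserstein_distance(real_age, synth_age)

# Hypothetical categorical feature, already encoded as 0/1.
real_sex = rng.integers(0, 2, size=500)
synth_sex = rng.integers(0, 2, size=500)

# Contingency table: one row per dataset, one column per category.
table = np.array([
    np.bincount(real_sex, minlength=2),
    np.bincount(synth_sex, minlength=2),
])
chi2, chi2_p, dof, _ = stats.chi2_contingency(table)

print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.3f}")
print(f"Wasserstein distance={w_dist:.3f}")
print(f"Chi-square={chi2:.3f}, p-value={chi2_p:.3f}, dof={dof}")
```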

Utility Section

This section implements TRTR (Train on Real, Test on Real) and TSTR (Train on Synthetic, Test on Real) approaches for a selected target class and machine learning model.
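
The difference between the two approaches can be sketched with scikit-learn as follows. The data is randomly generated stand-in data, and the choice of logistic regression and accuracy is an assumption for illustration; in the application, the user selects the target and the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)


def make_data(n, noise):
    """Hypothetical stand-in data: two features, binary target driven by their sum."""
    X = rng.normal(size=(n, 2))
    y = (X.sum(axis=1) + rng.normal(scale=noise, size=n) > 0).astype(int)
    return X, y


X_real, y_real = make_data(600, noise=0.3)
X_synth, y_synth = make_data(600, noise=0.6)  # stands in for a generator's output

# Hold out part of the real data; both models are tested on this same split.
X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.3, random_state=0, stratify=y_real
)

# TRTR: train on real data, test on real data.
trtr_model = LogisticRegression().fit(X_train, y_train)
trtr_acc = accuracy_score(y_test, trtr_model.predict(X_test))

# TSTR: train on synthetic data, test on the same real hold-out set.
tstr_model = LogisticRegression().fit(X_synth, y_synth)
tstr_acc = accuracy_score(y_test, tstr_model.predict(X_test))

print(f"TRTR accuracy: {trtr_acc:.3f}")
print(f"TSTR accuracy: {tstr_acc:.3f}")
```

A TSTR score close to the TRTR score suggests that the synthetic data preserves the relationships a model needs for this task; a large gap suggests it does not.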

Privacy Section

This section consists of three subsections dedicated to privacy evaluation:

- SEA Analysis: computes metrics such as cosine similarity, Euclidean distance, and Hausdorff distance, and displays the corresponding density plots or values.
- MIA Simulation: simulates Membership Inference Attacks with adjustable attacker parameters and reports the attacker's performance.
- AIA Simulation: simulates Attribute Inference Attacks in which the user sets which features the attacker can access, and displays reconstruction performance metrics.
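
The three similarity metrics named for the SEA Analysis can be computed between real and synthetic records as in the NumPy sketch below. It assumes records are rows of purely numerical features and uses randomly generated stand-in data; how SynthCheck itself preprocesses the data and aggregates these values is not described in this README.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: rows are records, columns are numerical features.
real = rng.normal(size=(200, 5))
synth = rng.normal(size=(150, 5))

# Pairwise Euclidean distances, shape (n_real, n_synth), via broadcasting.
diff = real[:, None, :] - synth[None, :, :]
euclid = np.sqrt((diff ** 2).sum(axis=2))

# Pairwise cosine similarities: dot products of the row-normalized matrices.
real_unit = real / np.linalg.norm(real, axis=1, keepdims=True)
synth_unit = synth / np.linalg.norm(synth, axis=1, keepdims=True)
cosine = real_unit @ synth_unit.T

# Symmetric Hausdorff distance: the largest nearest-neighbour distance,
# taken over both directions (real to synthetic and synthetic to real).
hausdorff = max(euclid.min(axis=1).max(), euclid.min(axis=0).max())

# Distance from each synthetic record to its closest real record;
# values near zero would indicate near-copies of real records.
nearest_real = euclid.min(axis=0)

print(f"Mean pairwise Euclidean distance: {euclid.mean():.3f}")
print(f"Mean pairwise cosine similarity: {cosine.mean():.3f}")
print(f"Hausdorff distance: {hausdorff:.3f}")
print(f"Smallest synthetic-to-real distance: {nearest_real.min():.3f}")
```

The pairwise arrays `euclid` and `cosine` are the kind of values a density plot could summarize, while the Hausdorff distance reduces the comparison to a single number.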

Each section provides options to download reports containing the displayed graphs and tables.

License

Distributed under the MIT License. See LICENSE for more information.

Citation

If you use SynthCheck, please cite:

Santangelo, G.; Nicora, G.; Bellazzi, R. and Dagliati, A. (2024). SynthCheck: A Dashboard for Synthetic Data Quality Assessment. In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF; ISBN 978-989-758-688-0; ISSN 2184-4305, SciTePress, pages 246-256. DOI: 10.5220/0012558700003657

Owner

Name: BMI "Mario Stefanelli" Lab - UNIPV
Login: bmi-labmedinfo
Kind: organization
Email: labmedinfo@unipv.it
Location: Italy
