https://github.com/bmi-labmedinfo/synthcheck
Dashboard to evaluate synthetic data quality
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: bmi-labmedinfo
- License: mit
- Language: Python
- Default Branch: main
- Size: 71.3 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SynthCheck: a dashboard to evaluate synthetic data quality
About The Project
Machine Learning and Artificial Intelligence are increasingly used to solve health-related problems, such as predicting prognosis from Electronic Health Records or detecting patterns in multi-omics data. Data plays a central role in the development of such systems, but working with patient data raises concerns, and regulators have underlined the need to protect patients' privacy. For this reason, recent years have seen growing interest in replacing original data (derived from real patients) with synthetic data that mimic the main statistical characteristics of their real counterparts. Regardless of the method used to generate them, it is essential to assess the quality of the synthetic data. To address this need, we created a Dash application that users can install and run on their own computers. The application lets users upload both original and synthetic data and computes various metrics to assess resemblance, utility, and privacy. Users can also download a report containing the results. (DOI: 10.5220/0012558700003657)
Installation
This repository provides a Conda environment configuration file (synthcheck_env.yml) to streamline the setup process. Follow these steps to create the environment:
[!IMPORTANT] Make sure you have Conda installed. If not, install Conda before proceeding.
Steps to Create the Environment
Create the Conda Environment
Run the following command to create the environment using the provided `.yml` file:
```bash
conda env create -f synthcheck_env.yml
```
This command will set up a Conda environment named according to the specifications in the `synthcheck_env.yml` file.
Activate the Environment
Once the environment is created, activate it using:
```bash
conda activate synthcheck_env
```
Running the Code
Once the Conda environment is activated, you can run the application with the following command:
```bash
python SynthCheck_app.py
```
Additional Notes
To deactivate the environment, simply use:
```bash
conda deactivate
```
You can now work within this Conda environment to run the application.
Application Structure
The application is organized into two main sections:
Data Upload for Quality Evaluation
The data upload process for quality evaluation is divided into several components:
1. Uploading Original and Synthetic Datasets
Users are prompted to upload two CSV files:
- Original Dataset: contains the data used to generate the synthetic data (example original dataset).
- Synthetic Dataset: contains the synthetic data whose quality is to be evaluated (example synthetic dataset).
[!TIP] Ensure that categorical feature categories are encoded with numerical values (e.g., 'benign' = 0 and 'malign' = 1).
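The encoding described in the tip above can be done in a small preprocessing step before upload. The following is a minimal sketch using only the Python standard library; the helper `encode_categories`, the column name `diagnosis`, and the sample rows are hypothetical and not part of SynthCheck.

```python
import csv
import io


def encode_categories(rows, column, mapping=None):
    """Replace the string labels in `column` with integer codes.

    If no mapping is given, codes are assigned in sorted label order.
    Returns the encoded rows and the mapping that was used.
    """
    if mapping is None:
        labels = sorted({row[column] for row in rows})
        mapping = {label: code for code, label in enumerate(labels)}
    encoded = [{**row, column: mapping[row[column]]} for row in rows]
    return encoded, mapping


# Hypothetical sample data; 'diagnosis' is an illustrative column name.
raw_csv = "age,diagnosis\n54,benign\n61,malign\n47,benign\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

encoded_rows, mapping = encode_categories(rows, "diagnosis")
print(mapping)          # label -> integer code
print(encoded_rows[1])  # second row with the encoded label
```

When encoding the synthetic dataset, pass the mapping obtained from the original dataset as the `mapping` argument, so that the same label receives the same code in both files.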
2. Feature Type Descriptor File
In addition to the datasets, users are required to upload a descriptor file in CSV format (example feature type file). This file is structured with two columns: 'Feature', which holds the feature name, and 'Type', which holds its type.
Example:
| Feature   | Type        |
|-----------|-------------|
| Age       | numerical   |
| Gender    | categorical |
| Income    | numerical   |
| Education | categorical |
[!WARNING] The accepted values in the 'Type' column are exclusively 'numerical' and 'categorical'. Additionally, the file must include column headers.
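A descriptor file can be checked against these rules before it is uploaded. The sketch below is an illustration, not SynthCheck's own validation: the function `validate_descriptor` is hypothetical, it assumes the header names are exactly `Feature` and `Type` as in the example table, and it additionally assumes that every dataset column should appear in the descriptor.

```python
import csv
import io

ALLOWED_TYPES = {"numerical", "categorical"}


def validate_descriptor(descriptor_text, dataset_columns):
    """Return a list of problems found in a feature type descriptor CSV."""
    reader = csv.DictReader(io.StringIO(descriptor_text))
    problems = []
    # Assumption: the header row is exactly 'Feature,Type'.
    if reader.fieldnames != ["Feature", "Type"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    described = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        if row["Type"] not in ALLOWED_TYPES:
            problems.append(f"line {line_no}: invalid type {row['Type']!r}")
        described.add(row["Feature"])
    # Assumption: every dataset column needs an entry in the descriptor.
    missing = sorted(set(dataset_columns) - described)
    if missing:
        problems.append(f"features without a type: {missing}")
    return problems


descriptor = "Feature,Type\nAge,numerical\nGender,categorical\nIncome,numeric\n"
issues = validate_descriptor(descriptor, ["Age", "Gender", "Income", "Education"])
for issue in issues:
    print(issue)
```

Here the check reports the misspelled type `numeric` on line 4 and the missing entry for `Education`.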
Quality Assessment of Synthetic Data
The second section lets users perform a comprehensive quality assessment of the uploaded synthetic data. It comprises three subsections, each dedicated to a distinct quality analysis.
Resemblance Section
This section provides access to three subsections:
- URA Analysis: conducts various statistical tests and distance metric comparisons for both numerical and categorical features.
- MRA Analysis: computes metrics related to Multiple Resemblance Analysis, such as correlation matrices, outlier analysis, explained-variance analysis and UMAP.
- DLA Analysis: presents the values of the performance metrics for each classifier used in the Data Labeling Analysis.
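To make the idea of a per-feature comparison concrete, here is a minimal sketch of two common univariate measures: the two-sample Kolmogorov-Smirnov statistic for a numerical feature and the total variation distance for a categorical one. The README does not say which tests and distances SynthCheck computes, so these two are illustrative choices, and the function names and sample values are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def total_variation_distance(real, synthetic):
    """Half the sum of absolute differences between category frequencies."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )


# Hypothetical feature values.
real_age, synth_age = [30, 40, 50, 60], [35, 45, 55, 65]
real_gender, synth_gender = [0, 0, 1, 1], [0, 1, 1, 1]

age_ks = ks_statistic(real_age, synth_age)
gender_tvd = total_variation_distance(real_gender, synth_gender)
print(age_ks, gender_tvd)
```

Both measures lie between 0 and 1, and lower values mean that the synthetic feature follows the real one more closely.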
Utility Section
This section implements TRTR (Train on Real, Test on Real) and TSTR (Train on Synthetic, Test on Real) approaches for a selected target class and machine learning model.
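The comparison between TRTR and TSTR can be sketched as follows. To stay dependency-free, the example swaps in a 1-nearest-neighbour classifier and uses accuracy as the only metric; in SynthCheck the model is chosen by the user, and the toy data and function names here are hypothetical.

```python
import math


def predict_1nn(train_X, train_y, x):
    """Return the label of the training row closest to x (Euclidean distance)."""
    nearest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[nearest]


def accuracy(train_X, train_y, test_X, test_y):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(
        predict_1nn(train_X, train_y, x) == y for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)


# Hypothetical toy data: two features, binary target.
real_train_X = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.1]]
real_train_y = [0, 0, 1, 1]
synth_X = [[0.9, 1.1], [1.3, 1.0], [4.1, 3.9], [2.0, 2.0]]
synth_y = [0, 0, 1, 1]
# Held-out real rows, used as the test set in both settings.
real_test_X = [[1.1, 0.9], [3.9, 4.0], [1.8, 1.9]]
real_test_y = [0, 1, 0]

trtr = accuracy(real_train_X, real_train_y, real_test_X, real_test_y)
tstr = accuracy(synth_X, synth_y, real_test_X, real_test_y)
print(f"TRTR accuracy: {trtr:.2f}  TSTR accuracy: {tstr:.2f}")
```

The test set is the same held-out real data in both cases, so the gap between the two scores reflects only the change of training data. A TSTR score close to the TRTR score suggests that the synthetic data preserve the information needed for the task.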
Privacy Section
This section consists of three subsections dedicated to privacy evaluation:
- SEA Analysis: computes metrics such as cosine similarity, Euclidean distance and Hausdorff distance, and displays the corresponding density plots or values.
- MIA Simulation: simulates Membership Inference Attacks with adjustable attacker parameters and reports the attacker's performance.
- AIA Simulation: simulates Attribute Inference Attacks in which the user sets which features the attacker can access, and displays reconstruction performance metrics.
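The three metrics named for the SEA Analysis can be written down in a few lines. The sketch below shows their standard definitions on hypothetical two-dimensional records; how SynthCheck pairs records and aggregates the values is not described in this README, so that part is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def directed_hausdorff(A, B):
    """Largest distance from a point in A to its nearest neighbour in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)


def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between two sets of points."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


# Hypothetical records (rows) from the real and synthetic datasets.
real = [[0.0, 0.0], [1.0, 0.0]]
synth = [[0.0, 1.0], [3.0, 0.0]]

# Assumption: each synthetic record is compared with its closest real record.
nearest_real_distance = [min(math.dist(s, r) for r in real) for s in synth]
set_distance = hausdorff_distance(real, synth)
similarity = cosine_similarity(real[1], synth[1])
print(nearest_real_distance, set_distance, similarity)
```

For privacy evaluation, very small distances (or cosine similarities very close to 1) between a synthetic record and a real one are the cases to look at, since they may indicate that a real record was reproduced almost exactly.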
Each section provides options to download reports containing the displayed graphs and tables.
License
Distributed under the MIT License. See LICENSE for more information.
Citation
If you use SynthCheck, please cite:
Santangelo, G.; Nicora, G.; Bellazzi, R. and Dagliati, A. (2024). SynthCheck: A Dashboard for Synthetic Data Quality Assessment. In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies - HEALTHINF; ISBN 978-989-758-688-0; ISSN 2184-4305, SciTePress, pages 246-256. DOI: 10.5220/0012558700003657
Owner
- Name: BMI "Mario Stefanelli" Lab - UNIPV
- Login: bmi-labmedinfo
- Kind: organization
- Email: labmedinfo@unipv.it
- Location: Italy
- Website: http://www.labmedinfo.org
- Repositories: 1
- Profile: https://github.com/bmi-labmedinfo
Repository for BMI lab code and sw products