dp-cgans
A library to generate synthetic tabular or RDF data using Conditional Generative Adversarial Networks (GANs) combined with Differential Privacy techniques.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 5 DOI reference(s) in README
- ✓ Academic publication links: Links to arxiv.org, springer.com, wiley.com, mdpi.com, ieee.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.9%) to scientific vocabulary
Repository
Statistics
- Stars: 93
- Watchers: 2
- Forks: 28
- Open Issues: 4
- Releases: 6
Metadata Files
README.md
👯 DP-CGANS (Differentially Private - Conditional Generative Adversarial NetworkS)
Abstract: This repository presents a Conditional Generative Adversarial Network (GAN) for tabular data (and RDF data), combined with Differential Privacy techniques. Our pre-print publication: Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy.
Author: Chang Sun, Institute of Data Science, Maastricht University
Start date: Nov-2021
Status: Under development
Note: "Standing on the shoulders of giants". This repository is inspired by the excellent work of CTGAN from Synthetic Data Vault (SDV), Tensorflow Privacy, and RdfPdans. We greatly appreciate that their authors shared their ideas and implementations, made their code publicly available, and wrote clear documentation. More related work can be found in the References below.
This package extends SDV (https://github.com/sdv-dev/SDV), CTGAN (https://github.com/sdv-dev/CTGAN), and Differential Privacy in GANs (https://github.com/civisanalytics/dpwgan). The author modified the conditional matrix and cost functions to emphasize the relations between variables. The main changes are in ctgan/synthesizers/ctgan.py, ../datasampler.py, and ../datatransformer.py
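To give a feel for what "relations between variables" means in practice, the sketch below counts how often category values co-occur for every pair of columns. It is only an illustration of the idea of looking at pairs rather than single columns; it is not the library's conditional matrix, and the function name and toy column names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_joint_counts(rows, columns):
    """Count co-occurring category values for every pair of columns."""
    counts = {}
    for col_a, col_b in combinations(columns, 2):
        counts[(col_a, col_b)] = Counter((row[col_a], row[col_b]) for row in rows)
    return counts


# Toy rows; the column names and values are illustrative only.
rows = [
    {"sex": "F", "income": ">50K", "workclass": "Private"},
    {"sex": "F", "income": "<=50K", "workclass": "Private"},
    {"sex": "M", "income": ">50K", "workclass": "Gov"},
    {"sex": "F", "income": ">50K", "workclass": "Gov"},
]
joint = pairwise_joint_counts(rows, ["sex", "income", "workclass"])
print(joint[("sex", "income")][("F", ">50K")])  # 2
```

A single-column view would only record that three rows have sex "F"; the pairwise view also records that two of them have income ">50K", which is the kind of dependency a pair-aware model can try to preserve.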
📥️ Installation
You will need Python >=3.8 and <=3.11, along with sdv==1.6.0 and rdt==1.9.0.
```shell
pip install dp-cgans
```
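If you want to check your interpreter before installing, a minimal sketch like the following compares the running Python version against the range stated above. The helper name `python_supported` is hypothetical, and the check assumes the bounds apply to the major and minor version only.

```python
import sys

# Bounds taken from the installation note; compared on (major, minor) only.
MIN_VERSION = (3, 8)
MAX_VERSION = (3, 11)


def python_supported(version=None):
    """Return True if (major, minor) falls within the documented range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


print("Supported Python version:", python_supported())
```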
🪄 Usage
⌨️ Use as a command-line interface
You can easily generate synthetic data for a file using your terminal after installing dp-cgans with pip.
To quickly run our example, you can download the example data:
```bash
wget https://raw.githubusercontent.com/sunchang0124/dp_cgans/main/resources/example_tabular_data_UCIAdult.csv
```
Then run dp-cgans:
```bash
dp-cgans gen example_tabular_data_UCIAdult.csv --epochs 2 --output out.csv --gen-size 100
```
Get a full rundown of the available options for generating synthetic data with:
```bash
dp-cgans gen --help
```
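The same command can be driven from a Python script, for example in a batch job. The sketch below only uses the subcommand and flags shown above (`gen`, `--epochs`, `--output`, `--gen-size`); the helper names `build_gen_command` and `run_gen` are hypothetical, and the call is skipped when `dp-cgans` is not found on the PATH.

```python
import shutil
import subprocess


def build_gen_command(input_csv, output_csv, epochs=2, gen_size=100):
    """Assemble the `dp-cgans gen` call using only the flags shown above."""
    return [
        "dp-cgans", "gen", input_csv,
        "--epochs", str(epochs),
        "--output", output_csv,
        "--gen-size", str(gen_size),
    ]


def run_gen(input_csv, output_csv, **kwargs):
    """Run the CLI if it is installed; return None when it is not on PATH."""
    if shutil.which("dp-cgans") is None:
        return None
    cmd = build_gen_command(input_csv, output_csv, **kwargs)
    return subprocess.run(cmd, check=True)


print(" ".join(build_gen_command("example_tabular_data_UCIAdult.csv", "out.csv")))
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems when file names contain spaces.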
🐍 Use with Python
This library can also be used directly in Python scripts.
If your input is tabular data (e.g., csv):
```python
from dp_cgans import DP_CGAN
import pandas as pd

tabular_data = pd.read_csv("../resources/example_tabular_data_UCIAdult.csv")

# We adjusted the original CTGAN model from SDV. Instead of looking at the
# distribution of individual variables, we extended it to pairs of variables
# and keep their correlation.
model = DP_CGAN(
    epochs=100,  # number of training epochs
    batch_size=100,  # the size of each batch
    log_frequency=True,
    verbose=True,
    generator_dim=(128, 128, 128),
    discriminator_dim=(128, 128, 128),
    generator_lr=2e-4,
    discriminator_lr=2e-4,
    discriminator_steps=1,
    private=False,
)

print("Start training model")
model.fit(tabular_data)
model.save("generator.pkl")

# Generate 100 synthetic rows
syn_data = model.sample(100)
syn_data.to_csv("syn_data_file.csv")
```
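After sampling, a quick sanity check is to compare how often each category appears in a real column and in the corresponding synthetic column. The sketch below is a standalone illustration using plain lists; it is not part of dp-cgans, and the helper names and toy values are hypothetical.

```python
from collections import Counter


def category_shares(values):
    """Map each category to its share of the column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def max_share_gap(real_values, synthetic_values):
    """Largest absolute difference in category share between two columns."""
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    categories = set(real) | set(synthetic)
    if not categories:
        return 0.0
    return max(abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in categories)


# Toy columns standing in for one real and one synthetic categorical column.
real_col = ["Private", "Private", "Gov", "Self"]
syn_col = ["Private", "Gov", "Gov", "Self"]
gap = max_share_gap(real_col, syn_col)
print(gap)  # 0.25
```

A small gap on every column does not prove the synthetic data is useful or private, but a large gap is an early sign that training was too short or that the model collapsed onto a few categories.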
🧑💻 Development setup
For development, we recommend installing and using Hatch, as it automatically installs and syncs the dependencies when running development scripts. You can also create a virtual environment yourself and install the library with `pip install -e .`
Install
Clone the repository:
```bash
git clone https://github.com/sunchang0124/dp_cgans
cd dp_cgans
```
When working in development, the `hatch` tool automatically installs and syncs the dependencies when running a script.
Run
Run the library with the CLI:
```bash
hatch -v run dp-cgans gen --help
```
You can also enter a new shell with the virtual environment automatically activated:
```bash
hatch shell
dp-cgans gen --help
```
Tests
Run the tests locally:
```bash
hatch run pytest -s
```
Format
Run formatting and linting (black and ruff):
```bash
hatch run fmt
```
Reset the virtual environments
If the virtual environments are not updating as expected, you can reset them with:
```bash
hatch env prune
```
📦️ New release process
The deployment of new releases is done automatically by a GitHub Action workflow when a new release is created on GitHub. To release a new version:
- Make sure the `PYPI_API_TOKEN` secret has been defined in the GitHub repository (in Settings > Secrets > Actions). You can get an API token from PyPI.
- Increment the `version` number in the `src/dp_cgans/__init__.py` file:
```bash
hatch version fix # Bump from 0.0.1 to 0.0.2
hatch version minor # Bump from 0.0.1 to 0.1.0
hatch version 0.1.1 # Bump to the specified version
```
- Create a new release on GitHub, which will automatically trigger the publish workflow, and publish the new release to PyPI.
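The version bumps in the release steps follow the usual `major.minor.patch` arithmetic. The sketch below reproduces that arithmetic in plain Python to make the effect of each segment explicit; it is not how Hatch is implemented, the function name `bump` is hypothetical, and the `major` case is an assumption added for completeness.

```python
def bump(version, segment):
    """Bump a 'major.minor.patch' version string by the named segment."""
    major, minor, patch = (int(part) for part in version.split("."))
    if segment == "major":  # assumption: not shown in the steps above
        return f"{major + 1}.0.0"
    if segment == "minor":
        return f"{major}.{minor + 1}.0"
    if segment == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown segment: {segment}")


print(bump("0.0.1", "fix"))    # 0.0.2
print(bump("0.0.1", "minor"))  # 0.1.0
```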
You can also manually build and publish from your laptop:
```bash
hatch build
hatch publish
```
📚️ References / Further reading
There is much excellent work on generating synthetic data using GANs and other methods. We list the studies that made great contributions to the field and inspired our work.
GANs
- Synthetic Data Vault (SDV) [Paper] [Github]
- Modeling Tabular Data using Conditional GAN (a part of SDV) [Paper] [Github]
- Wasserstein GAN [Paper]
- Improved Training of Wasserstein GANs [Paper]
- Synthesising Tabular Data using Wasserstein Conditional GANs with Gradient Penalty (WCGAN-GP) [Paper]
- PacGAN: The power of two samples in generative adversarial networks [Paper]
- CTAB-GAN: Effective Table Data Synthesizing [Paper]
- Conditional Tabular GAN-Based Two-Stage Data Generation Scheme for Short-Term Load Forecasting [Paper]
- TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks [Paper]
- Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning [Paper]
Differential Privacy
- Tensorflow Privacy [Github]
- Renyi Differential Privacy [Paper]
- DP-CGAN: Differentially Private Synthetic Data and Label Generation [Paper]
- Differentially Private Generative Adversarial Network [Paper] [Github] Another implementation [Github]
- Private Data Generation Toolbox [Github]
- autodp: Automating differential privacy computation [Github]
- Differentially Private Synthetic Medical Data Generation using Convolutional GANs [Paper]
- DTGAN: Differential Private Training for Tabular GANs [Paper]
- Differentially Private Synthetic Data: Applied Evaluations and Enhancements [Paper]
- FFPDG: Fast, Fair and Private Data Generation [Paper]
Others
- EvoGen: a Generator for Synthetic Versioned RDF [Paper]
- Generation and evaluation of synthetic patient data [Paper]
- Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation [Paper]
- Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy [Paper]
- Synthetic data for open and reproducible methodological research in social sciences and official statistics [Paper]
- A Study of the Impact of Synthetic Data Generation Techniques on Data Utility using the 1991 UK Samples of Anonymised Records [Paper]
Owner
- Name: Chang
- Login: sunchang0124
- Kind: user
- Company: Institute of Data Science
- Repositories: 27
- Profile: https://github.com/sunchang0124
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Sun
given-names: Chang
affiliation: Institute of Data Science, Maastricht University
orcid: https://orcid.org/0000-0001-8325-8848
email: sunchang0124@gmail.com
title: "DP-CGANS (Differential Privacy Conditional Generative Adversarial NetworkS) for Generating Synthetic Tabular Data"
doi: 10.48550/arXiv.2206.13787
repository-code: https://github.com/sunchang0124/dp_cgans
date-released: 2022-07-27
url: https://pypi.org/project/dp-cgans/
# version: 0.0.3
GitHub Events
Total
- Release event: 1
- Issues event: 8
- Watch event: 24
- Issue comment event: 3
- Push event: 8
- Fork event: 6
Last Year
- Release event: 1
- Issues event: 8
- Watch event: 24
- Issue comment event: 3
- Push event: 8
- Fork event: 6
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Chang | s****4@g****m | 42 |
| Vincent Emonet | v****t@g****m | 16 |
| cudillal | c****d@h****m | 5 |
| Chang Sun | c****n@c****e | 2 |
| Chang Sun | c****n@C****l | 2 |
| Chang Sun | c****n@c****l | 1 |
| Chang Sun | c****n@c****l | 1 |
| Chang Sun | c****n@c****l | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 10
- Total pull requests: 2
- Average time to close issues: about 1 year
- Average time to close pull requests: 2 minutes
- Total issue authors: 9
- Total pull request authors: 1
- Average comments per issue: 0.8
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 0
- Average time to close issues: 2 months
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 0.33
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- wilcovanvorstenbosch (2)
- Rock910 (1)
- TeDiou (1)
- rtaori (1)
- Houmamelte (1)
- hafidh561 (1)
- caprone (1)
- cailv (1)
- vdemchenko3 (1)
Pull Request Authors
- cudillal (2)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 40 last month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 6
- Total maintainers: 1
pypi.org: dp-cgans
A library to generate synthetic tabular or RDF data using Conditional Generative Adversary Networks (GANs) combined with Differential Privacy techniques.
- Homepage: https://github.com/sunchang0124/dp_cgans
- Documentation: https://github.com/sunchang0124/dp_cgans
- License: MIT License Copyright (c) 2023-present Sun Chang <sunchang0124@gmail.com> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Latest release: 0.0.6 (published about 2 years ago)
Dependencies
- autoflake ^1.3.1 develop
- black ^19.10b0 develop
- flake8 ^3.7.9 develop
- isort ^4.3.21 develop
- mypy ^0.770 develop
- pytest ^5.4.1 develop
- pytest-cov ^2.8.1 develop
- copulas *
- faker *
- graphviz *
- numpy *
- pandas *
- pyreadstat *
- python >=3.8,<3.10
- rdt 0.6.4
- scipy *
- sdv 0.14.0
- sklearn *
- torch *
- typer *
- wheel *
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite
- python 3.9 build