dp-cgans

A library to generate synthetic tabular or RDF data using Conditional Generative Adversarial Networks (GANs) combined with Differential Privacy techniques.

https://github.com/sunchang0124/dp_cgans

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, springer.com, wiley.com, mdpi.com, ieee.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

differential-privacy gan synthesizer synthetic-data
Last synced: 6 months ago

Repository

A library to generate synthetic tabular or RDF data using Conditional Generative Adversarial Networks (GANs) combined with Differential Privacy techniques.

Basic Info
  • Host: GitHub
  • Owner: sunchang0124
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 266 KB
Statistics
  • Stars: 93
  • Watchers: 2
  • Forks: 28
  • Open Issues: 4
  • Releases: 6
Topics
differential-privacy gan synthesizer synthetic-data
Created over 4 years ago · Last pushed 12 months ago
Metadata Files
Readme License Citation

README.md

👯 DP-CGANS (Differentially Private - Conditional Generative Adversarial NetworkS)


Abstract: This repository presents Conditional Generative Adversarial Networks (GANs) for tabular data (and RDF data) combined with Differential Privacy techniques. Our pre-print publication: Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy.

Author: Chang Sun, Institute of Data Science, Maastricht University

Start date: Nov-2021

Status: Under development

Note: "Standing on the shoulders of giants". This repository is inspired by the excellent work of CTGAN from Synthetic Data Vault (SDV), Tensorflow Privacy, and RdfPdans. We highly appreciate that they shared their ideas and implementations, made their code publicly available, and wrote clear documentation. More related work can be found in the References below.

This package is extended from SDV (https://github.com/sdv-dev/SDV), CTGAN (https://github.com/sdv-dev/CTGAN), and Differential Privacy in GANs (https://github.com/civisanalytics/dpwgan). The author modified the conditional matrix and cost functions to emphasize the relations between variables. The main changes are in ctgan/synthesizers/ctgan.py, ../data_sampler.py, and ../data_transformer.py.

📥️ Installation

You will need Python >=3.8 and <=3.11, with sdv==1.6.0 and rdt==1.9.0.

```shell
pip install dp-cgans
```
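Before or after installing, it can help to confirm that the environment matches the requirement stated above. The following is a minimal sketch using only the standard library; the helper names `python_supported` and `installed_version` are illustrative and not part of dp-cgans.

```python
import sys
from importlib import metadata


def python_supported(version_info=sys.version_info):
    """Return True if the interpreter is in the 3.8 to 3.11 range quoted above."""
    return (3, 8) <= tuple(version_info[:2]) <= (3, 11)


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_supported())
    # Version pins quoted from the requirement line above.
    for package, expected in {"sdv": "1.6.0", "rdt": "1.9.0"}.items():
        print(package, "installed:", installed_version(package), "expected:", expected)
```

If a package is reported as `None`, it is simply not installed in the active environment.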

🪄 Usage

⌨️ Use as a command-line interface

You can easily generate synthetic data for a file using your terminal after installing dp-cgans with pip.

To quickly run our example, you can download the example data:

```bash
wget https://raw.githubusercontent.com/sunchang0124/dp_cgans/main/resources/example_tabular_data_UCIAdult.csv
```

Then run dp-cgans:

```bash
dp-cgans gen example_tabular_data_UCIAdult.csv --epochs 2 --output out.csv --gen-size 100
```

Get a full rundown of the available options for generating synthetic data with:

```bash
dp-cgans gen --help
```
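If you want to drive the command-line interface from a script, one option is to assemble the argument list in Python. This sketch uses only the flags shown in the example above (`--epochs`, `--output`, `--gen-size`); the helper `build_gen_command` is hypothetical and not part of the package.

```python
import shlex


def build_gen_command(input_csv, output_csv, epochs, gen_size):
    """Build the argument list for `dp-cgans gen` using only the flags shown above."""
    return [
        "dp-cgans", "gen", input_csv,
        "--epochs", str(epochs),
        "--output", output_csv,
        "--gen-size", str(gen_size),
    ]


cmd = build_gen_command("example_tabular_data_UCIAdult.csv", "out.csv", epochs=2, gen_size=100)
print(shlex.join(cmd))

# To actually run it (requires dp-cgans to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when file names contain spaces.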

🐍 Use with python

This library can also be used directly in Python scripts.

If your input is tabular data (e.g., csv):

```python
from dp_cgans import DP_CGAN
import pandas as pd

tabular_data = pd.read_csv("../resources/example_tabular_data_UCIAdult.csv")

# We adjusted the original CTGAN model from SDV. Instead of looking at the
# distribution of each individual variable, we extended it to pairs of
# variables to keep their correlations.
model = DP_CGAN(
    epochs=100,  # number of training epochs
    batch_size=100,  # the size of each batch
    log_frequency=True,
    verbose=True,
    generator_dim=(128, 128, 128),
    discriminator_dim=(128, 128, 128),
    generator_lr=2e-4,
    discriminator_lr=2e-4,
    discriminator_steps=1,
    private=False,
)

print("Start training model")
model.fit(tabular_data)
model.save("generator.pkl")

# Generate 100 synthetic rows
syn_data = model.sample(100)
syn_data.to_csv("syn_data_file.csv")
```
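Since the model is described as preserving relations between pairs of variables, a simple spot check is to compare the joint frequencies of two categorical columns in the real and synthetic tables. The sketch below is not part of dp-cgans: it uses only the standard library, the column names and the two tiny tables are made up for illustration, and total variation distance is just one possible measure.

```python
import csv
import io
from collections import Counter


def joint_frequencies(rows, col_a, col_b):
    """Relative frequency of each (col_a, col_b) value pair in a list of row dicts."""
    counts = Counter((row[col_a], row[col_b]) for row in rows)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 means identical)."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Tiny made-up tables standing in for the real and synthetic CSV files.
real_csv = "sex,income\nF,low\nF,high\nM,high\nM,high\n"
syn_csv = "sex,income\nF,low\nF,low\nM,high\nM,high\n"
real = list(csv.DictReader(io.StringIO(real_csv)))
syn = list(csv.DictReader(io.StringIO(syn_csv)))

distance = total_variation(
    joint_frequencies(real, "sex", "income"),
    joint_frequencies(syn, "sex", "income"),
)
print(distance)  # 0.25
```

To use it on actual files, replace the `io.StringIO` objects with opened CSV files; a distance close to 0 suggests the pairwise relation was largely preserved.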

🧑‍💻 Development setup

For development, we recommend installing and using Hatch, as it automatically installs and syncs the dependencies when running development scripts. You can also create a virtual environment directly and install the library with `pip install -e .`

Install

Clone the repository:

```bash
git clone https://github.com/sunchang0124/dp_cgans
cd dp_cgans
```

When working in development, Hatch automatically installs and syncs the dependencies when you run a script.

Run

Run the library with the CLI:

```bash
hatch -v run dp-cgans gen --help
```

You can also enter a new shell with the virtual environment automatically activated:

```bash
hatch shell
dp-cgans gen --help
```

Tests

Run the tests locally:

```bash
hatch run pytest -s
```

Format

Run formatting and linting (black and ruff):

```bash
hatch run fmt
```

Reset the virtual environments

In case the virtual environment is not updating as expected, you can reset it with:

```bash
hatch env prune
```

📦️ New release process

The deployment of new releases is done automatically by a GitHub Action workflow when a new release is created on GitHub. To release a new version:

  1. Make sure the PYPI_API_TOKEN secret has been defined in the GitHub repository (in Settings > Secrets > Actions). You can get an API token from PyPI.

  2. Increment the version number in the src/dp_cgans/__init__.py file:

```bash
hatch version fix    # Bump from 0.0.1 to 0.0.2
hatch version minor  # Bump from 0.0.1 to 0.1.0
hatch version 0.1.1  # Bump to the specified version
```

  3. Create a new release on GitHub, which will automatically trigger the publish workflow and publish the new release to PyPI.
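The bump semantics in the version step can be illustrated with a few lines of Python. This is only a sketch that mimics the examples given there (a `fix` bump raises the last segment, a `minor` bump raises the middle segment and resets the last); it is not Hatch's implementation, and the function name `bump` is hypothetical.

```python
def bump(version, segment):
    """Return `version` with the given segment incremented and lower segments reset."""
    major, minor, fix = (int(part) for part in version.split("."))
    if segment == "major":
        return f"{major + 1}.0.0"
    if segment == "minor":
        return f"{major}.{minor + 1}.0"
    if segment == "fix":
        return f"{major}.{minor}.{fix + 1}"
    raise ValueError(f"unknown segment: {segment}")


print(bump("0.0.1", "fix"))    # 0.0.2
print(bump("0.0.1", "minor"))  # 0.1.0
```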

You can also manually build and publish from your laptop:

```bash
hatch build
hatch publish
```

📚️ References / Further reading

There is much excellent work on generating synthetic data using GANs and other methods. We list the studies that made great contributions to the field and inspired our work.

GANs
  1. Synthetic Data Vault (SDV) [Paper] [Github]
  2. Modeling Tabular Data using Conditional GAN (a part of SDV) [Paper] [Github]
  3. Wasserstein GAN [Paper]
  4. Improved Training of Wasserstein GANs [Paper]
  5. Synthesising Tabular Data using Wasserstein Conditional GANs with Gradient Penalty (WCGAN-GP) [Paper]
  6. PacGAN: The power of two samples in generative adversarial networks [Paper]
  7. CTAB-GAN: Effective Table Data Synthesizing [Paper]
  8. Conditional Tabular GAN-Based Two-Stage Data Generation Scheme for Short-Term Load Forecasting [Paper]
  9. TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks [Paper]
  10. Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning [Paper]

Differential Privacy

  1. Tensorflow Privacy [Github]
  2. Renyi Differential Privacy [Paper]
  3. DP-CGAN : Differentially Private Synthetic Data and Label Generation [Paper]
  4. Differentially Private Generative Adversarial Network [Paper] [Github] Another implementation [Github]
  5. Private Data Generation Toolbox [Github]
  6. autodp: Automating differential privacy computation [Github]
  7. Differentially Private Synthetic Medical Data Generation using Convolutional GANs [Paper]
  8. DTGAN: Differential Private Training for Tabular GANs [Paper]
  9. Differentially Private Synthetic Data: Applied Evaluations and Enhancements [Paper]
  10. FFPDG: Fast, Fair and Private Data Generation [Paper]
Others
  1. EvoGen: a Generator for Synthetic Versioned RDF [Paper]
  2. Generation and evaluation of synthetic patient data [Paper]
  3. Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation [Paper]
  4. Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy [Paper]
  5. Synthetic data for open and reproducible methodological research in social sciences and official statistics [Paper]
  6. A Study of the Impact of Synthetic Data Generation Techniques on Data Utility using the 1991 UK Samples of Anonymised Records [Paper]

Owner

  • Name: Chang
  • Login: sunchang0124
  • Kind: user
  • Company: Institute of Data Science

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Sun
    given-names: Chang
    affiliation: Institute of Data Science, Maastricht University
    orcid: https://orcid.org/0000-0001-8325-8848
    email: sunchang0124@gmail.com
title: "DP-CGANS (Differential Privacy Conditional Generative Adversarial NetworkS) for Generating Synthetic Tabular Data"
doi: 10.48550/arXiv.2206.13787
repository-code: https://github.com/sunchang0124/dp_cgans
date-released: 2022-07-27
url: https://pypi.org/project/dp-cgans/
# version: 0.0.3

GitHub Events

Total
  • Release event: 1
  • Issues event: 8
  • Watch event: 24
  • Issue comment event: 3
  • Push event: 8
  • Fork event: 6
Last Year
  • Release event: 1
  • Issues event: 8
  • Watch event: 24
  • Issue comment event: 3
  • Push event: 8
  • Fork event: 6

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 70
  • Total Committers: 8
  • Avg Commits per committer: 8.75
  • Development Distribution Score (DDS): 0.4
Past Year
  • Commits: 14
  • Committers: 1
  • Avg Commits per committer: 14.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Chang s****4@g****m 42
Vincent Emonet v****t@g****m 16
cudillal c****d@h****m 5
Chang Sun c****n@c****e 2
Chang Sun c****n@C****l 2
Chang Sun c****n@c****l 1
Chang Sun c****n@c****l 1
Chang Sun c****n@c****l 1
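The committer statistics above can be reproduced from the commit counts in the table. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption here, although it matches the listed values. Counts are taken per listed row, so the same person appearing under several emails is counted separately.

```python
def development_distribution_score(commits_per_committer):
    """1 minus the top committer's share of commits (assumed definition of DDS)."""
    total = sum(commits_per_committer)
    if total == 0:
        return 0.0
    return 1 - max(commits_per_committer) / total


# Commit counts copied from the Top Committers table (all time).
all_time = [42, 16, 5, 2, 2, 1, 1, 1]
print(sum(all_time))                                       # 70 total commits
print(sum(all_time) / len(all_time))                       # 8.75 commits per committer
print(round(development_distribution_score(all_time), 2))  # 0.4

# Past year: a single committer with 14 commits.
print(development_distribution_score([14]))                # 0.0
```

A score near 0 means one committer wrote almost everything, while higher values indicate that work is spread across more contributors.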

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 10
  • Total pull requests: 2
  • Average time to close issues: about 1 year
  • Average time to close pull requests: 2 minutes
  • Total issue authors: 9
  • Total pull request authors: 1
  • Average comments per issue: 0.8
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: 2 months
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 0.33
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • wilcovanvorstenbosch (2)
  • Rock910 (1)
  • TeDiou (1)
  • rtaori (1)
  • Houmamelte (1)
  • hafidh561 (1)
  • caprone (1)
  • cailv (1)
  • vdemchenko3 (1)
Pull Request Authors
  • cudillal (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 40 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 6
  • Total maintainers: 1
pypi.org: dp-cgans

A library to generate synthetic tabular or RDF data using Conditional Generative Adversarial Networks (GANs) combined with Differential Privacy techniques.

  • Homepage: https://github.com/sunchang0124/dp_cgans
  • Documentation: https://github.com/sunchang0124/dp_cgans
  • License: MIT License Copyright (c) 2023-present Sun Chang <sunchang0124@gmail.com> Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
  • Latest release: 0.0.6
    published about 2 years ago
  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 40 Last month
Rankings
Dependent packages count: 10.0%
Forks count: 12.5%
Stargazers count: 12.9%
Average: 15.0%
Downloads: 18.1%
Dependent repos count: 21.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

pyproject.toml pypi
  • autoflake ^1.3.1 develop
  • black ^19.10b0 develop
  • flake8 ^3.7.9 develop
  • isort ^4.3.21 develop
  • mypy ^0.770 develop
  • pytest ^5.4.1 develop
  • pytest-cov ^2.8.1 develop
  • copulas *
  • faker *
  • graphviz *
  • numpy *
  • pandas *
  • pyreadstat *
  • python >=3.8,<3.10
  • rdt 0.6.4
  • scipy *
  • sdv 0.14.0
  • sklearn *
  • torch *
  • typer *
  • wheel *
.github/workflows/publish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
  • pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
.github/workflows/test.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
Dockerfile docker
  • python 3.9 build
docker-compose.yml docker