https://github.com/sdv-dev/sdv

Synthetic data generation for tabular data

Keywords

data-generation deep-learning gan gans generative-adversarial-network generative-ai generative-model generativeai machine-learning multi-table relational-datasets sdv synthetic-data synthetic-data-generation time-series

Keywords from Contributors

automl data-pipeline feature-engineering automated-feature-engineering data-profilers datacleaner pipeline-testing automated-machine-learning autograder report

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: sdv-dev
License: other
Language: Python
Default Branch: main
Homepage: https://docs.sdv.dev/sdv
Size: 31.8 MB

Statistics

Stars: 3,150
Watchers: 43
Forks: 380
Open Issues: 154
Releases: 81

Created almost 8 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog Contributing License Codeowners Authors

README.md

This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

The Synthetic Data Vault (SDV) is a Python library designed to be your one-stop shop for creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn patterns from your real data and emulate them in synthetic data.

Features

:brain: Create synthetic data using machine learning. The SDV offers multiple models, ranging from classical statistical methods (GaussianCopula) to deep learning methods (CTGAN). Generate data for single tables, multiple connected tables or sequential tables.

:bar_chart: Evaluate and visualize data. Compare the synthetic data to the real data against a variety of measures. Diagnose problems and generate a quality report to get more insights.

:arrows_counterclockwise: Preprocess, anonymize and define constraints. Control data processing to improve the quality of synthetic data, choose from different types of anonymization and define business rules in the form of logical constraints.

| Important Links | |
| --- | --- |
| Tutorials | Get some hands-on experience with the SDV. Launch the tutorial notebooks and run the code yourself. |
| :book: Docs | Learn how to use the SDV library with user guides and API references. |
| :orange_book: Blog | Get more insights about using the SDV, deploying models and our synthetic data community. |
| Community | Join our Slack workspace for announcements and discussions. |
| :computer: Website | Check out the SDV website for more information about the project. |

Install

The SDV is publicly available under the Business Source License. Install SDV using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.

```bash
pip install sdv
```

```bash
conda install -c pytorch -c conda-forge sdv
```

Getting Started

Load a demo dataset to get started. This dataset is a single table describing guests staying at a fictional hotel.

```python
from sdv.datasets.demo import download_demo

# Download the demo table together with its metadata description
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests')
```

Single Table Metadata Example

The demo also includes metadata, a description of the dataset, including the data types in each column and the primary key (guest_email).
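As a rough illustration of what such a description captures, the sketch below models metadata as a plain Python dictionary. The field names (`primary_key`, `columns`, `sdtype`) and the helpers `describe` and `columns_match` are assumptions made for this example; they are not the object returned by `download_demo`.

```python
# Hypothetical, simplified metadata: one entry per column plus the primary key.
example_metadata = {
    "primary_key": "guest_email",
    "columns": {
        "guest_email": {"sdtype": "email"},
        "amenities_fee": {"sdtype": "numerical"},
    },
}


def describe(metadata):
    """Return one human-readable line per column, flagging the primary key."""
    lines = []
    for name, spec in metadata["columns"].items():
        role = " (primary key)" if name == metadata["primary_key"] else ""
        lines.append(f"{name}: {spec['sdtype']}{role}")
    return lines


def columns_match(metadata, row):
    """Check that a data row has exactly the columns the metadata declares."""
    return set(row) == set(metadata["columns"])


for line in describe(example_metadata):
    print(line)
```

A description like this is what lets a synthesizer treat an email column differently from a numerical one.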

Synthesizing Data

Next, we can create an SDV synthesizer, an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. Let's use the GaussianCopulaSynthesizer.

```python
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)
```

And now the synthesizer is ready to create synthetic data!

```python
synthetic_data = synthesizer.sample(num_rows=500)
```
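For intuition about what fitting and sampling mean, here is a toy two-column Gaussian copula in pure Python. It is a conceptual sketch only, not how `GaussianCopulaSynthesizer` is implemented: the helper names, the rank-based marginals and the example columns are all assumptions chosen to keep the example short.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(values):
    """Replace each value by the standard-normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n stays strictly inside (0, 1), so inv_cdf is defined.
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(a, b):
    """Pearson correlation of two equally long numeric lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def quantile(sorted_values, u):
    """Empirical quantile: map a probability u in [0, 1] to an observed value."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def fit_and_sample(col_a, col_b, num_rows, seed=0):
    rng = random.Random(seed)
    # "Fit": estimate the correlation between the columns in normal-score space.
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        # "Sample": draw correlated normals, then map back through the marginals.
        z_a = rng.gauss(0.0, 1.0)
        z_b = rho * z_a + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        rows.append((quantile(sorted_a, STD_NORMAL.cdf(z_a)),
                     quantile(sorted_b, STD_NORMAL.cdf(z_b))))
    return rows


# Hypothetical real columns: the fee grows with the room rate.
room_rate = [100.0 + 5.0 * i for i in range(50)]
amenities_fee = [0.1 * rate for rate in room_rate]
synthetic_rows = fit_and_sample(room_rate, amenities_fee, num_rows=200)
```

The sampled pairs reuse the observed value ranges of each column while keeping the two columns correlated, which is the essence of the copula approach.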

The synthetic data will have the following properties:

- Sensitive columns are fully anonymized. The email, billing address and credit card number columns contain new data so you don't expose the real values.
- Other columns follow statistical patterns. For example, the proportion of room types, the distribution of check-in dates and the correlations between room rate and room type are preserved.
- Keys and other relationships are intact. The primary key (guest email) is unique for each row. If you have multiple tables, the connections between primary and foreign keys are preserved.
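Two of these properties can be spot-checked by hand. The sketch below works on rows represented as plain dictionaries; the function `check_synthetic_table` and the sample rows are hypothetical and exist only for this illustration.

```python
def check_synthetic_table(real_rows, synthetic_rows, primary_key, sensitive_columns):
    """Return (keys_unique, leaked) for a synthetic table.

    keys_unique: True if every synthetic row has a distinct primary key.
    leaked: per sensitive column, the real values that reappear in the synthetic rows.
    """
    keys = [row[primary_key] for row in synthetic_rows]
    keys_unique = len(keys) == len(set(keys))

    leaked = {}
    for column in sensitive_columns:
        real_values = {row[column] for row in real_rows}
        synthetic_values = {row[column] for row in synthetic_rows}
        leaked[column] = sorted(real_values & synthetic_values)
    return keys_unique, leaked


# Made-up rows for illustration only.
real_rows = [
    {"guest_email": "ana@example.com", "room_type": "BASIC"},
    {"guest_email": "bo@example.com", "room_type": "SUITE"},
]
synthetic_rows = [
    {"guest_email": "x1@example.org", "room_type": "BASIC"},
    {"guest_email": "x2@example.org", "room_type": "BASIC"},
]

keys_unique, leaked = check_synthetic_table(
    real_rows, synthetic_rows, "guest_email", ["guest_email"])
print(keys_unique, leaked)
```

An empty list for a sensitive column means none of its real values were copied into the synthetic rows.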

Evaluating Synthetic Data

The SDV library allows you to evaluate the synthetic data by comparing it to the real data. Get started by generating a quality report.

```python
from sdv.evaluation.single_table import evaluate_quality

# Compare the synthetic table against the real one
quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata)
```

```
Generating report ...

(1/2) Evaluating Column Shapes: |████████████████| 9/9 [00:00<00:00, 1133.09it/s]|
Column Shapes Score: 89.11%

(2/2) Evaluating Column Pair Trends: |██████████████████████████████████████████| 36/36 [00:00<00:00, 502.88it/s]|
Column Pair Trends Score: 88.3%

Overall Score (Average): 88.7%
```

This object computes an overall quality score on a scale of 0 to 100% (100 being the best) as well as detailed breakdowns. For more insights, you can also visualize the synthetic vs. real data.
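The overall score in the output above is the plain average of the two property scores, which a few lines of Python can reproduce. The second function is an assumption for illustration: it shows one common way to score how closely a categorical column's shape matches (the complement of the total variation distance), and is not taken from the report's source code.

```python
from collections import Counter


def overall_quality_score(property_scores):
    """Average the per-property scores (given in percent)."""
    return sum(property_scores.values()) / len(property_scores)


def tv_complement(real_values, synthetic_values):
    """1 minus the total variation distance between two category distributions."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )
    return 1.0 - distance


scores = {"Column Shapes": 89.11, "Column Pair Trends": 88.3}
print(f"Overall Score (Average): {overall_quality_score(scores):.1f}%")
```

A score of 1.0 from `tv_complement` means the category proportions are identical, and 0.0 means the two columns share no categories.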

```python
from sdv.evaluation.single_table import get_column_plot

# Plot the real and synthetic distributions of a single column
fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='amenities_fee',
    metadata=metadata
)

fig.show()
```

Real vs. Synthetic Data

What's Next?

Using the SDV library, you can synthesize single table, multi table and sequential data. You can also customize the full synthetic data workflow, including preprocessing, anonymization and adding constraints.
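To make the idea of a constraint concrete, the sketch below expresses a business rule as an ordinary Python predicate and filters candidate rows with it. This is a post-hoc filter for illustration, not the SDV constraints API; the column names `checkin_date` and `checkout_date` and both helper functions are assumptions.

```python
from datetime import date


def checkout_after_checkin(row):
    """Business rule: a guest cannot check out before (or on) the check-in day."""
    return row["checkout_date"] > row["checkin_date"]


def keep_valid_rows(rows, rule):
    """Keep only the rows that satisfy the given rule."""
    return [row for row in rows if rule(row)]


# Made-up candidate rows; the second one violates the rule.
candidate_rows = [
    {"checkin_date": date(2024, 3, 1), "checkout_date": date(2024, 3, 4)},
    {"checkin_date": date(2024, 3, 5), "checkout_date": date(2024, 3, 2)},
]

valid_rows = keep_valid_rows(candidate_rows, checkout_after_checkin)
print(len(valid_rows), "of", len(candidate_rows), "rows satisfy the rule")
```

Filtering after the fact wastes samples when a rule is violated often, which is one reason to declare such rules to the synthesizer up front instead.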

To learn more, visit the SDV Demo page.

Credits

Thank you to our team of contributors who have built and maintained the SDV ecosystem over the years!

View Contributors

Citation

If you use SDV for your research, please cite the following paper:

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. The Synthetic Data Vault. IEEE DSAA 2016.

@inproceedings{
    SDV,
    title={The Synthetic data vault},
    author={Patki, Neha and Wedge, Roy and Veeramachaneni, Kalyan},
    booktitle={IEEE International Conference on Data Science and Advanced Analytics (DSAA)},
    year={2016},
    pages={399-410},
    doi={10.1109/DSAA.2016.49},
    month={Oct}
}

The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.

Owner

Name: The Synthetic Data Vault Project
Login: sdv-dev
Kind: organization
Email: sdv@sdv.dev

Website: https://sdv.dev
Repositories: 9
Profile: https://github.com/sdv-dev

Committers

Last synced: 9 months ago

All Time

Total Commits: 1,690
Total Committers: 51
Avg Commits per committer: 33.137
Development Distribution Score (DDS): 0.796

Past Year

Commits: 261
Committers: 10
Avg Commits per committer: 26.1
Development Distribution Score (DDS): 0.648

Top Committers

Name	Email	Commits
Andrew Montanez	a**w@s**v	345
Carles Sala	c**s@p**m	278
Felipe Alex Hofmann	f**o@g**m	152
Manuel Alvarez	m**l@p**m	146
Plamen Valentinov Kolev	4****r	145
JDTheRipperPC	j**c@g**m	141
Katharine Xiao	2****o	91
R-Palazzo	1****o	76
Frances Hartwell	f**s@d**m	74
John La	l****7	58
SDV Team	9****m	56
amontane	a**w@M**l	24
Gaurav Sheni	g**i@g**m	13
github-actions[bot]	4****]	10
Neha Patki	n**i@g**m	9
Roy Wedge	r**e@d**m	9
Patrick	3****m	7
amontane	a**w@d**u	5
Arash Akhgari	8****i	4
amontane	a**w@d**u	3
amontane	a**w@d**U	3
amontane	a**w@d**u	3
amontane	a**w@d**u	3
Sarah Alnegheimish	4****h	3
Aylr	A****r	2
amontane	a**w@M**t	2
amontane	a**w@d**U	2
amontane	a**w@d**U	2
amontane	a**w@d**u	2
tssbas	8****s	1
and 21 more...
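The All Time figures above can be reproduced from the totals and the top committer's count. The sketch assumes that the Development Distribution Score is one minus the top committer's share of all commits; that definition is an assumption, although it agrees with the published value.

```python
total_commits = 1690       # All Time: Total Commits
total_committers = 51      # All Time: Total Committers
top_committer_commits = 345  # first row of the Top Committers table

# Average commits per committer
avg_commits = total_commits / total_committers

# Assumed DDS definition: 1 - (commits by top committer / total commits)
dds = 1 - top_committer_commits / total_commits

print(f"Avg commits per committer: {avg_commits:.3f}")
print(f"DDS: {dds:.3f}")
```

Under this definition, a score near 0 means one person wrote almost everything, and a score near 1 means the work is spread across many committers.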

Committer Domains (Top 20 + Academic)

pythiac.com: 2
datacebo.com: 2
dhcp-18-111-117-153.dyn.mit.edu: 2
sdv.dev: 1
dhcp-18-30-16-77.dyn.mit.edu: 1
dhcp-18-111-18-140.dyn.mit.edu: 1
dhcp-18-111-45-100.dyn.mit.edu: 1
dhcp-18-111-48-211.dyn.mit.edu: 1
dhcp-18-111-84-18.dyn.mit.edu: 1
macbook-pro-2.attlocal.net: 1
dhcp-18-111-33-88.dyn.mit.edu: 1
dhcp-18-111-46-36.dyn.mit.edu: 1
dhcp-18-30-31-141.dyn.mit.edu: 1
dhcp-18-111-112-224.dyn.mit.edu: 1
dhcp-18-111-13-155.dyn.mit.edu: 1
dhcp-18-111-25-26.dyn.mit.edu: 1
dhcp-18-30-72-194.dyn.mit.edu: 1
dhcp-18-30-42-125.dyn.mit.edu: 1
dhcp-18-30-111-189.dyn.mit.edu: 1
dhcp-18-111-66-21.dyn.mit.edu: 1
dhcp-18-111-59-206.dyn.mit.edu: 1
dhcp-18-111-57-26.dyn.mit.edu: 1
dhcp-18-111-51-205.dyn.mit.edu: 1
dhcp-18-111-49-12.dyn.mit.edu: 1
dhcp-18-111-40-248.dyn.mit.edu: 1
dhcp-18-111-37-27.dyn.mit.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 749
Total pull requests: 885
Average time to close issues: 3 months
Average time to close pull requests: 5 days
Total issue authors: 206
Total pull request authors: 22
Average comments per issue: 1.44
Average comments per pull request: 1.11
Merged pull requests: 715
Bot issues: 0
Bot pull requests: 22

Past Year

Issues: 216
Pull requests: 355
Average time to close issues: 19 days
Average time to close pull requests: 4 days
Issue authors: 54
Pull request authors: 14
Average comments per issue: 0.87
Average comments per pull request: 1.18
Merged pull requests: 270
Bot issues: 0
Bot pull requests: 4

Top Authors

Issue Authors

npatki (194)
amontanez24 (88)
frances-h (46)
srinify (34)
R-Palazzo (30)
pvk-developer (27)
gsheni (16)
fealho (13)
Ng-ms (10)
rwedge (9)
csala (9)
jalr4ever (8)
celsofranssa (8)
lajohn4747 (5)
wilcovanvorstenbosch (4)

Pull Request Authors

R-Palazzo (143)
fealho (136)
sdv-team (134)
amontanez24 (100)
pvk-developer (96)
lajohn4747 (95)
frances-h (85)
gsheni (35)
rwedge (23)
github-actions[bot] (20)
dbrown (2)
Deathn0t (2)
dependabot[bot] (2)
eltociear (2)
omelyanchikd (2)

Top Labels

Issue Labels

feature request (273)
bug (264)
question (107)
under discussion (95)
new (65)
internal (48)
data:sequential (48)
maintenance (46)
data:multi-table (43)
resolution:resolved (38)
feature:metadata (37)
feature:constraints (27)
resolution:duplicate (27)
feature:sampling (24)
resolution:WAI (17)
documentation (16)
data:single-table (12)
resolution:cannot replicate (10)
resolution:obsolete (7)
feature:performance (6)
feature:rdt (5)
feature:data-ingestion (5)
feature: modeling (4)
resolution:out of scope (3)
feature:metrics (2)
feature:evaluation (2)
feature:data-connectors (1)
feature:preprocessing (1)
os: windows (1)

Pull Request Labels

dependencies (2)
feature request (1)
new (1)

Packages

Total packages: 5
Total downloads:
- pypi 67,348 last-month
Total docker downloads: 82,018

Total dependent packages: 26
(may contain duplicates)
Total dependent repositories: 37
(may contain duplicates)
Total versions: 354
Total maintainers: 11

pypi.org: sdv

Generate synthetic data for single table, multi table and sequential data

Documentation: https://sdv.readthedocs.io/
License: BSL-1.1
Latest release: 1.26.0
published 6 months ago

Versions: 180
Dependent Packages: 25
Dependent Repositories: 36
Downloads: 67,348 Last month
Docker Downloads: 82,018

Rankings

Dependent packages count: 0.6%

Docker downloads count: 1.2%

Downloads: 1.4%

Stargazers count: 1.7%

Average: 1.8%

Dependent repos count: 2.4%

Forks count: 3.3%

Maintainers (9)

mit_dai_lab kveerama npatki amontanez24 fealho pvkdeveloper francesh lajohn rwedge-datacebo

Last synced: 6 months ago

proxy.golang.org: github.com/sdv-dev/sdv

Documentation: https://pkg.go.dev/github.com/sdv-dev/sdv#section-documentation
License: other
Latest release: v1.26.0
published 6 months ago

Versions: 82
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

proxy.golang.org: github.com/sdv-dev/SDV

Documentation: https://pkg.go.dev/github.com/sdv-dev/SDV#section-documentation
License: other
Latest release: v1.26.0
published 6 months ago

Versions: 82
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.8%

Last synced: 6 months ago

spack.io: py-sdv

The Synthetic Data Vault (SDV) is a Synthetic Data Generation ecosystem of libraries that allows users to easily learn single-table, multi-table and timeseries datasets to later on generate new Synthetic Data that has the same format and statistical properties as the original dataset.

Homepage: https://github.com/sdv-dev/SDV
License: []
Latest release: 0.14.0
published almost 4 years ago

Versions: 2
Dependent Packages: 1
Dependent Repositories: 0

Rankings

Dependent repos count: 0.0%

Stargazers count: 6.9%

Forks count: 7.4%

Average: 10.6%

Dependent packages count: 28.1%

Maintainers (2)

Kerilk jke513

Last synced: 6 months ago

conda-forge.org: sdv

Homepage: https://github.com/sdv-dev/SDV
License: BUSL-1.1
Latest release: 0.17.1
published over 3 years ago

Versions: 8
Dependent Packages: 0
Dependent Repositories: 1

Rankings

Stargazers count: 11.0%

Forks count: 12.2%

Dependent repos count: 24.4%

Average: 24.8%

Dependent packages count: 51.6%

Last synced: 6 months ago

Science Score: 59.0%
