synthdatasingularity

https://github.com/pegleggen/synthdatasingularity

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: pegleggen
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Size: 0 Bytes

Statistics

Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 1

Created 9 months ago · Last pushed 9 months ago

Metadata Files

Readme, License, Citation

SynthData_Singularity

Colab: Synthetic Data Generation with MIT SynthDataVault

This Google Colaboratory notebook is an introduction and practical guide to generating synthetic data with the MIT SynthDataVault library. It demonstrates how to use the library in Python to create realistic synthetic datasets that preserve the statistical properties of the original data while protecting its privacy.

Overview

Synthetic data generation is the process of creating artificial records that reproduce the statistical properties of a real dataset without copying its individual records. This Colab follows a fit-and-sample workflow: a model is fitted to an original dataset with synthdatavault, new rows are sampled from that model, and the synthetic rows are then compared with the original data.

Features

Easy Setup: Directly runnable in Google Colab with minimal configuration.
MIT SynthDataVault Integration: Demonstrates core functionalities of the synthdatavault library.
Practical Examples: Includes code examples for generating synthetic data from a given dataset.
Data Analysis: Shows how to compare the statistical properties of real and synthetic data.
Privacy Considerations: Highlights the benefits of synthetic data in privacy-sensitive scenarios.

Getting Started

To get started with this Colab, you'll need a Google account to access Google Colab.

Hardware Requirements

This notebook doesn't have specific hardware requirements and can run efficiently on a CPU runtime. You generally don't need a GPU for synthetic data generation with synthdatavault.

To check or change your runtime type:

Go to Runtime in the Colab menu.
Select Change runtime type.
Ensure None (for CPU) is selected under Hardware accelerator if you wish to stick with CPU, or keep GPU if it's already selected and you prefer.

Setup

No local installation is required. Simply open the Colab notebook in your browser:

https://colab.research.google.com/drive/1_GPjenw4voPWL7STov81c3XeXISgTPnb?usp=sharing

Once opened, you can run the cells sequentially. The first few cells will handle the installation of necessary libraries.

Usage

The notebook is structured to guide you through the process of generating synthetic data. You will:

Install necessary libraries: This will primarily involve installing synthdatavault and other data manipulation/visualisation libraries (e.g., pandas, matplotlib).
Load a sample dataset: The Colab will use a publicly available or generated sample dataset for demonstration.
Initialise and configure synthdatavault: Learn how to set up the synthetic data generation model.
Generate synthetic data: Execute the generation process.
Compare real vs. synthetic data: Visualise and analyse the statistical similarities and differences between the original and synthetic datasets.

Simply execute each cell in the notebook in order. Explanations are provided within the notebook to clarify each step and the underlying concepts.
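The load, fit, and generate steps above can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library, not the synthdatavault API: it assumes numeric columns and models each one as an independent normal distribution, which is a deliberately simple stand-in for whatever model the notebook configures. The column names (age, income) and the helper functions are hypothetical.

```python
import random
import statistics


def make_sample_dataset(n_rows, seed=0):
    """Generate a small demonstration dataset (hypothetical columns)."""
    rng = random.Random(seed)
    return [
        {"age": rng.gauss(40, 10), "income": rng.gauss(30000, 5000)}
        for _ in range(n_rows)
    ]


def fit_column_model(rows):
    """Learn a (mean, standard deviation) pair for every numeric column."""
    model = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        model[col] = (statistics.mean(values), statistics.stdev(values))
    return model


def sample_synthetic(model, n_rows, seed=1):
    """Draw new rows from the fitted per-column normal distributions."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in model.items()}
        for _ in range(n_rows)
    ]


real_rows = make_sample_dataset(500)           # load a sample dataset
model = fit_column_model(real_rows)            # initialise and fit the model
synthetic_rows = sample_synthetic(model, 500)  # generate synthetic data

for col, (mu, sigma) in model.items():
    print(f"{col}: fitted mean={mu:.1f}, std={sigma:.1f}")
```

Because every column is sampled independently, this stand-in keeps per-column means and spreads but discards any relationship between columns. Capturing those relationships is the job of a more capable generative model.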

Key Concepts

The Colab will likely touch upon concepts fundamental to synthetic data generation, including:

Differential Privacy: A strong, mathematically rigorous definition of privacy protection often implemented in synthetic data generation.
Generative Models: The underlying machine learning models (e.g., GANs, VAEs, or statistical models) used by synthdatavault to learn data distributions.
Data Utility: Metrics and methods to assess how well the synthetic data preserves the statistical properties and analytical utility of the original data.
Privacy-Utility Trade-off: The inherent balance between preserving privacy and maintaining data utility.
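To make differential privacy and the privacy-utility trade-off concrete, here is a minimal sketch of the Laplace mechanism applied to a mean. It is an illustration of the concept only and is not claimed to be what synthdatavault does internally. The function names, the clipping bounds, and the example ages are assumptions chosen for the demonstration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Return an epsilon-differentially-private estimate of the mean.

    Values are clipped to [lower, upper] so that changing one record can
    move the mean by at most (upper - lower) / n, the sensitivity.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
ages = [rng.uniform(18, 90) for _ in range(200)]  # hypothetical data

# Smaller epsilon means stronger privacy and, on average, a noisier answer.
for epsilon in (0.1, 1.0, 10.0):
    estimate = private_mean(ages, 0, 100, epsilon, rng)
    print(f"epsilon={epsilon}: private mean = {estimate:.2f}")
```

The noise scale is sensitivity divided by epsilon, so tightening privacy (lower epsilon) directly reduces the accuracy of the released statistic. That is the trade-off in its simplest form.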

Results

After running the notebook, you will observe:

The generated synthetic dataset.
Visualisations (e.g., histograms, scatter plots, correlation matrices) comparing the original and synthetic data distributions.
Potentially, quantitative metrics to assess the similarity and utility of the synthetic data.
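As a rough idea of what a quantitative comparison can look like, the sketch below computes per-column summary statistics and a Pearson correlation for a real and a synthetic dataset. It uses only the standard library and invented data. The synthetic columns are sampled independently on purpose, so the example shows how a correlation check can reveal structure that matching marginal distributions alone would hide. The names and numbers are assumptions, not output from the notebook.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def summarise(columns):
    """Map each column name to its (mean, standard deviation)."""
    return {
        name: (statistics.mean(vals), statistics.stdev(vals))
        for name, vals in columns.items()
    }


rng = random.Random(0)

# "Real" data: y depends on x, so the two columns are strongly correlated.
real_x = [rng.gauss(0, 1) for _ in range(1000)]
real_y = [x + rng.gauss(0, 0.5) for x in real_x]
real = {"x": real_x, "y": real_y}

# Synthetic stand-in: each column matches the real mean and spread,
# but the columns are drawn independently of each other.
real_stats = summarise(real)
synthetic = {
    name: [rng.gauss(mu, sigma) for _ in range(1000)]
    for name, (mu, sigma) in real_stats.items()
}
synthetic_stats = summarise(synthetic)

for name in real:
    print(name, "real:", real_stats[name], "synthetic:", synthetic_stats[name])

real_corr = pearson(real["x"], real["y"])
synthetic_corr = pearson(synthetic["x"], synthetic["y"])
print("correlation real:", real_corr, "synthetic:", synthetic_corr)
```

Here the per-column statistics agree closely while the correlation does not, which is why histograms on their own are usually paired with scatter plots or correlation matrices.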

Troubleshooting

GPU Not Available: A GPU is not needed for this notebook, so a CPU runtime is sufficient. If the runtime itself misbehaves, try restarting it (Runtime -> Restart runtime).
Out of Memory Errors: If you run into memory issues, try fitting the model on a smaller dataset (fewer rows or columns) and generating fewer synthetic rows at a time.
Installation Issues: If a library fails to install, check your internet connection or try restarting the Colab session.

Contributing

Contributions to this Colab are welcome! If you have suggestions for improvements, bug fixes, or new features, feel free to:

Fork the original notebook.
Make your changes.
Share your improved version, perhaps with a brief explanation of your modifications.

Licence

This project is open-source and available under the Apache 2.0 Licence.

Acknowledgements

The MIT Lincoln Laboratory for developing and open-sourcing the SynthDataVault library.
Google Colaboratory for providing a free and accessible platform for machine learning and data science.
The broader data privacy and synthetic data communities for their ongoing research and development.

Owner

Name: Genevieve Smith-Nunes
Login: pegleggen
Kind: user
Location: Cambridge, UK
Company: readysaltedcode

Website: www.readysaltedcode.org
Repositories: 19
Profile: https://github.com/pegleggen

PhD Student: DataDrivenDance, Computing Education, Ballet, and Ethics

Citation (CITATION.cff)

abstract: <p>This is just sharing the colab notebook</p>
authors:
- affiliation: readysaltedcode
  family-names: Dr. Genevieve Smith-Nunes
cff-version: 1.2.0
date-released: '2025-05-27'
doi: 10.5281/zenodo.15524480
identifiers:
- type: swh
  value: swh:1:dir:cd6209399ebddf696a364d41f2c8d4ce43521224;origin=https://doi.org/10.5281/zenodo.15524479;visit=swh:1:snp:a9d714efeda6843764805f865ae391f058ee1558;anchor=swh:1:rel:2b765b10793bf2fe7590f659c4d49c7c6e1fe3ad;path=pegleggen-SynthDataSingularity-379b278
license:
- apache-2.0
message: If you use this software, please cite it using the metadata from this file.
repository-code: https://github.com/pegleggen/SynthDataSingularity/tree/share
title: 'pegleggen/SynthDataSingularity: Sharing colab'
type: software
version: share
