Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary
Last synced: 6 months ago · JSON representation ·

Repository

Basic Info
  • Host: GitHub
  • Owner: pegleggen
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 0 Bytes
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

SynthData_Singularity

Colab: Synthetic Data Generation with MIT SynthDataVault

This Google Colaboratory notebook provides an introduction and practical guide to generating synthetic data using the MIT SynthDataVault library. It demonstrates how to leverage this powerful tool in Python to create realistic synthetic datasets whilst preserving the statistical properties and privacy of original data.

Table of Contents

Overview

Automatic text summarisation is the process of creating a concise and coherent summary of a longer document. This Colab focuses on extractive summarisation, where the summary is formed by selecting important sentences directly from the original text. We'll be using the powerful 🤗 Transformers library, which provides thousands of pre-trained models, to fine-tune a model specifically for this purpose.

Features

  • Easy Setup: Directly runnable in Google Colab with minimal configuration.
  • MIT SynthDataVault Integration: Demonstrates core functionalities of the synthdatavault library.
  • Practical Examples: Includes code examples for generating synthetic data from a given dataset.
  • Data Analysis: Shows how to compare the statistical properties of real and synthetic data.
  • Privacy Considerations: Highlights the benefits of synthetic data in privacy-sensitive scenarios.

Getting Started

To get started with this Colab, you'll need a Google account to access Google Colab.

Hardware Requirements

This notebook doesn't have specific hardware requirements and can run efficiently on a CPU runtime. You generally don't need a GPU for synthetic data generation with synthdatavault.

To check or change your runtime type:

  1. Go to Runtime in the Colab menu.
  2. Select Change runtime type.
  3. Ensure None (for CPU) is selected under Hardware accelerator if you wish to stick with CPU, or keep GPU if it's already selected and you prefer.

Setup

No local installation is required. Simply open the Colab notebook in your browser:

https://colab.research.google.com/drive/1_GPjenw4voPWL7STov81c3XeXISgTPnb?usp=sharing

Once opened, you can run the cells sequentially. The first few cells will handle the installation of necessary libraries.

Usage

The notebook is structured to guide you through the process of generating synthetic data. You will:

  1. Install necessary libraries: This will primarily involve installing synthdatavault and other data manipulation/visualisation libraries (e.g., pandas, matplotlib).
  2. Load a sample dataset: The Colab will use a publicly available or generated sample dataset for demonstration.
  3. Initialise and configure synthdatavault: Learn how to set up the synthetic data generation model.
  4. Generate synthetic data: Execute the generation process.
  5. Compare real vs. synthetic data: Visualise and analyse the statistical similarities and differences between the original and synthetic datasets.

Simply execute each cell in the notebook in order. Explanations are provided within the notebook to clarify each step and the underlying concepts.

Key Concepts

The Colab will likely touch upon concepts fundamental to synthetic data generation, including:

  • Differential Privacy: A strong, mathematically rigorous definition of privacy protection often implemented in synthetic data generation.
  • Generative Models: The underlying machine learning models (e.g., GANs, VAEs, or statistical models) used by synthdatavault to learn data distributions.
  • Data Utility: Metrics and methods to assess how well the synthetic data preserves the statistical properties and analytical utility of the original data.
  • Privacy-Utility Trade-off: The inherent balance between preserving privacy and maintaining data utility.

Results

After running the notebook, you will observe:

  • The generated synthetic dataset.
  • Visualisations (e.g., histograms, scatter plots, correlation matrices) comparing the original and synthetic data distributions.
  • Potentially, quantitative metrics to assess the similarity and utility of the synthetic data.

Troubleshooting

  • GPU Not Available: Ensure you've correctly set the runtime type to GPU. If you encounter issues, try restarting the runtime (Runtime -> Restart runtime).
  • Out of Memory Errors: If you're running into memory issues, try reducing the batch_size in your training arguments. You might also consider using a smaller pre-trained model if available.
  • Installation Issues: If a library fails to install, check your internet connection or try restarting the Colab session.

Contributing

Contributions to this Colab are welcome! If you have suggestions for improvements, bug fixes, or new features, feel free to:

  1. Fork the original notebook.
  2. Make your changes.
  3. Share your improved version, perhaps with a brief explanation of your modifications.

Licence

This project is open-source and available under the MIT Licence.

Acknowledgements

  • The MIT Lincoln Laboratory for developing and open-sourcing the SynthDataVault library.
  • Google Colaboratory for providing a free and accessible platform for machine learning and data science.
  • The broader data privacy and synthetic data communities for their ongoing research and development.

Owner

  • Name: Genevieve Smith-Nunes
  • Login: pegleggen
  • Kind: user
  • Location: Cambridge, UK
  • Company: readysaltedcode

PhD Student -DataDrivenDance, Computing Education, Ballet, and Ethics

Citation (CITATION.cff)

abstract: <p>This is just sharing the colab notebook</p>
authors:
- affiliation: readysaltedcode
  family-names: Dr. Genevieve Smith-Nunes
cff-version: 1.2.0
date-released: '2025-05-27'
doi: 10.5281/zenodo.15524480
identifiers:
- type: swh
  value: swh:1:dir:cd6209399ebddf696a364d41f2c8d4ce43521224;origin=https://doi.org/10.5281/zenodo.15524479;visit=swh:1:snp:a9d714efeda6843764805f865ae391f058ee1558;anchor=swh:1:rel:2b765b10793bf2fe7590f659c4d49c7c6e1fe3ad;path=pegleggen-SynthDataSingularity-379b278
license:
- apache-2.0
message: If you use this software, please cite it using the metadata from this file.
repository-code: https://github.com/pegleggen/SynthDataSingularity/tree/share
title: 'pegleggen/SynthDataSingularity: Sharing colab'
type: software
version: share

GitHub Events

Total
  • Watch event: 1
  • Push event: 1
Last Year
  • Watch event: 1
  • Push event: 1