synthdatasingularity
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: pegleggen
- License: apache-2.0
- Language: Jupyter Notebook
- Default Branch: main
- Size: 0 Bytes
Statistics
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
SynthData_Singularity
Colab: Synthetic Data Generation with MIT SynthDataVault
This Google Colaboratory notebook provides an introduction and practical guide to generating synthetic data using the MIT SynthDataVault library. It demonstrates how to leverage this powerful tool in Python to create realistic synthetic datasets whilst preserving the statistical properties and privacy of original data.
Table of Contents
- Overview
- Features
- Getting Started
- Usage
- Key Concepts
- Results
- Troubleshooting
- Contributing
- Licence
- Acknowledgements
Overview
Synthetic data generation is the process of creating an artificial dataset that preserves the statistical properties of an original dataset while protecting the privacy of the original records. This Colab uses the `synthdatavault` library to set up a generation model on a sample dataset, generate synthetic data from it, and compare the synthetic data with the original.
Features
- Easy Setup: Directly runnable in Google Colab with minimal configuration.
- MIT SynthDataVault Integration: Demonstrates core functionalities of the `synthdatavault` library.
- Practical Examples: Includes code examples for generating synthetic data from a given dataset.
- Data Analysis: Shows how to compare the statistical properties of real and synthetic data.
- Privacy Considerations: Highlights the benefits of synthetic data in privacy-sensitive scenarios.
Getting Started
To get started with this Colab, you'll need a Google account to access Google Colab.
Hardware Requirements
This notebook doesn't have specific hardware requirements and can run efficiently on a CPU runtime. You generally don't need a GPU for synthetic data generation with synthdatavault.
To check or change your runtime type:
- Go to `Runtime` in the Colab menu.
- Select `Change runtime type`.
- Ensure `None` (for CPU) is selected under `Hardware accelerator` if you wish to stick with CPU, or keep GPU if it's already selected and you prefer.
Setup
No local installation is required. Simply open the Colab notebook in your browser:
https://colab.research.google.com/drive/1_GPjenw4voPWL7STov81c3XeXISgTPnb?usp=sharing
Once opened, you can run the cells sequentially. The first few cells will handle the installation of necessary libraries.
Usage
The notebook is structured to guide you through the process of generating synthetic data. You will:
- Install necessary libraries: This will primarily involve installing `synthdatavault` and other data manipulation/visualisation libraries (e.g., `pandas`, `matplotlib`).
- Load a sample dataset: The Colab will use a publicly available or generated sample dataset for demonstration.
- Initialise and configure `synthdatavault`: Learn how to set up the synthetic data generation model.
- Generate synthetic data: Execute the generation process.
- Compare real vs. synthetic data: Visualise and analyse the statistical similarities and differences between the original and synthetic datasets.
Simply execute each cell in the notebook in order. Explanations are provided within the notebook to clarify each step and the underlying concepts.
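The load, fit, generate, and compare steps can be sketched in a few lines of plain Python. This is only an illustration of the workflow's shape, not the notebook's code or the `synthdatavault` API: the dataset values are made up, the helper names `fit_gaussian_columns` and `sample_rows` are hypothetical, and an independent Gaussian per column stands in for whatever model the library actually fits.

```python
import random
import statistics


def fit_gaussian_columns(rows):
    """Learn a (mean, standard deviation) pair for each numeric column."""
    params = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        params[name] = (statistics.mean(values), statistics.stdev(values))
    return params


def sample_rows(params, n, seed=0):
    """Draw n synthetic rows, sampling every column independently."""
    rng = random.Random(seed)
    return [
        {name: rng.gauss(mu, sigma) for name, (mu, sigma) in params.items()}
        for _ in range(n)
    ]


# Step 1: a toy "real" dataset (made-up values, for illustration only).
data_rng = random.Random(42)
real = [
    {"age": data_rng.gauss(40, 10), "income": data_rng.gauss(30000, 5000)}
    for _ in range(500)
]

# Steps 2 and 3: fit the stand-in model, then generate synthetic rows.
params = fit_gaussian_columns(real)
synthetic = sample_rows(params, n=500, seed=1)

# Step 4: compare a simple summary statistic column by column.
for name in params:
    real_mean = statistics.mean([row[name] for row in real])
    synth_mean = statistics.mean([row[name] for row in synthetic])
    print(f"{name}: real mean={real_mean:.1f}, synthetic mean={synth_mean:.1f}")
```

Sampling each column independently discards any correlation between columns, which is one reason a dedicated library is preferable to a stand-in like this for real work.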
Key Concepts
The Colab will likely touch upon concepts fundamental to synthetic data generation, including:
- Differential Privacy: A strong, mathematically rigorous definition of privacy protection often implemented in synthetic data generation.
- Generative Models: The underlying machine learning models (e.g., GANs, VAEs, or statistical models) used by `synthdatavault` to learn data distributions.
- Data Utility: Metrics and methods to assess how well the synthetic data preserves the statistical properties and analytical utility of the original data.
- Privacy-Utility Trade-off: The inherent balance between preserving privacy and maintaining data utility.
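To make the differential privacy and privacy-utility ideas concrete, here is a minimal sketch of the Laplace mechanism applied to a mean. It is a textbook illustration and is not taken from the notebook or from `synthdatavault`; the data are made up, the helper names are hypothetical, and the sensitivity formula assumes the dataset size is public and neighbouring datasets differ by replacing one record.

```python
import random


def laplace_noise(scale, rng):
    """Zero-centred Laplace sample: the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of bounded values using the Laplace mechanism."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
ages = [rng.uniform(18, 90) for _ in range(1000)]  # made-up data
true_mean = sum(ages) / len(ages)

# Smaller epsilon means stronger privacy and, on average, a larger error.
mean_abs_error = {}
for epsilon in (0.1, 1.0, 10.0):
    errors = [
        abs(private_mean(ages, 0, 100, epsilon, rng) - true_mean)
        for _ in range(200)
    ]
    mean_abs_error[epsilon] = sum(errors) / len(errors)
    print(f"epsilon={epsilon}: mean absolute error={mean_abs_error[epsilon]:.4f}")
```

The loop shows the trade-off directly: tightening the privacy budget (lower epsilon) adds more noise, so the released statistic is less useful.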
Results
After running the notebook, you will observe:
- The generated synthetic dataset.
- Visualisations (e.g., histograms, scatter plots, correlation matrices) comparing the original and synthetic data distributions.
- Potentially, quantitative metrics to assess the similarity and utility of the synthetic data.
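One common quantitative check for a single numeric column is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical cumulative distributions of the real and synthetic values. The sketch below is an assumption about what such a metric could look like, not the metric the notebook reports; the samples are made up and `ks_statistic` is a hypothetical helper written with the standard library only.

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs (0 = identical)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both empirical CDFs are step functions, so checking data points suffices.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(7)
real = [rng.gauss(50, 10) for _ in range(1000)]            # made-up "real" column
close_synth = [rng.gauss(50, 10) for _ in range(1000)]     # same distribution
shifted_synth = [rng.gauss(60, 10) for _ in range(1000)]   # mean shifted by 10

ks_close = ks_statistic(real, close_synth)
ks_shifted = ks_statistic(real, shifted_synth)
print(f"close synthetic:   KS={ks_close:.3f}")
print(f"shifted synthetic: KS={ks_shifted:.3f}")
```

A value near 0 suggests the synthetic column follows the real distribution closely, while a value near 1 means the two barely overlap.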
Troubleshooting
- GPU Not Available: A GPU is optional for this notebook, which runs on a CPU runtime. If you do want one, ensure the runtime type is set to GPU; if issues persist, try restarting the runtime (`Runtime -> Restart runtime`).
- Out of Memory Errors: If you're running into memory issues, try reducing the `batch_size` in your training arguments. You might also consider using a smaller pre-trained model if available.
- Installation Issues: If a library fails to install, check your internet connection or try restarting the Colab session.
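When an installation seems to have failed, it can help to check which libraries are actually importable before re-running the install cells. A minimal sketch, assuming the import names `pandas` and `matplotlib` mentioned in the Usage section (adjust the list to match the notebook's first cells); `missing_packages` is a hypothetical helper:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed from the notebook description, not read from the notebook.
required = ["pandas", "matplotlib"]
missing = missing_packages(required)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If a package still shows as missing after a reinstall, restarting the runtime and running the cells from the top is usually the next step.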
Contributing
Contributions to this Colab are welcome! If you have suggestions for improvements, bug fixes, or new features, feel free to:
- Fork the original notebook.
- Make your changes.
- Share your improved version, perhaps with a brief explanation of your modifications.
Licence
This project is open-source and available under the MIT Licence.
Acknowledgements
- The MIT Lincoln Laboratory for developing and open-sourcing the `SynthDataVault` library.
- Google Colaboratory for providing a free and accessible platform for machine learning and data science.
- The broader data privacy and synthetic data communities for their ongoing research and development.
Owner
- Name: Genevieve Smith-Nunes
- Login: pegleggen
- Kind: user
- Location: Cambridge, UK
- Company: readysaltedcode
- Website: www.readysaltedcode.org
- Repositories: 19
- Profile: https://github.com/pegleggen
PhD Student - DataDrivenDance, Computing Education, Ballet, and Ethics
Citation (CITATION.cff)
abstract: <p>This is just sharing the colab notebook</p>
authors:
- affiliation: readysaltedcode
  family-names: Dr. Genevieve Smith-Nunes
cff-version: 1.2.0
date-released: '2025-05-27'
doi: 10.5281/zenodo.15524480
identifiers:
- type: swh
  value: swh:1:dir:cd6209399ebddf696a364d41f2c8d4ce43521224;origin=https://doi.org/10.5281/zenodo.15524479;visit=swh:1:snp:a9d714efeda6843764805f865ae391f058ee1558;anchor=swh:1:rel:2b765b10793bf2fe7590f659c4d49c7c6e1fe3ad;path=pegleggen-SynthDataSingularity-379b278
license:
- apache-2.0
message: If you use this software, please cite it using the metadata from this file.
repository-code: https://github.com/pegleggen/SynthDataSingularity/tree/share
title: 'pegleggen/SynthDataSingularity: Sharing colab'
type: software
version: share
GitHub Events
Total
- Watch event: 1
- Push event: 1
Last Year
- Watch event: 1
- Push event: 1