https://github.com/sebhaan/tabpfgen
TabPFGen: Synthetic Tabular Data Generation with TabPFN
Repository
TabPFGen: Synthetic Tabular Data Generation with TabPFN
Basic Info
- Host: GitHub
- Owner: sebhaan
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://sebhaan.github.io/TabPFGen/
- Size: 152 KB
Statistics
- Stars: 22
- Watchers: 3
- Forks: 3
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
TabPFGen: Synthetic Tabular Data Generation with TabPFN

TabPFGen is a Python library for generating high-quality synthetic tabular data using energy-based modeling and stochastic gradient Langevin dynamics (SGLD). It supports both classification and regression tasks with built-in visualization capabilities.
Integration with TabPFN Extensions: TabPFGen is being integrated into the tabpfn-extensions ecosystem as a separate module (TabPFGen Data Synthesizer Extension, PR #83), which will enable seamless integration with other TabPFN tools and extensions.
Motivation
While there are many tools available for generating synthetic images or text, creating realistic tabular data that preserves the statistical properties and relationships of the original dataset has been more challenging.
Generating synthetic tabular data is particularly useful in scenarios where:
- You have limited real data but need more samples for training
- You can't share real data due to privacy concerns
- You need to balance an imbalanced dataset
- You want to test how your models would perform with more data
What makes TabPFGen interesting is that it's built on the TabPFN transformer architecture and doesn't require additional training. It includes built-in visualization tools to help you verify the quality of the generated data by comparing distributions, feature correlations, and other important metrics between the real and synthetic datasets.
Key Features
- Energy-based synthetic data generation
- Support for both classification and regression tasks
- Automatic dataset balancing for imbalanced classes
- Class-balanced sampling option
- Comprehensive visualization tools
- Built on TabPFN transformer architecture
- No additional training required
Installation
```bash
pip install tabpfgen
```
Verify installation
```bash
python -c 'from tabpfgen import TabPFGen; print("Installation successful!")'
```
Quick Start
Classification Example
```python
from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_classification_results
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Initialize generator
generator = TabPFGen(n_sgld_steps=500)

# Generate synthetic data
X_synth, y_synth = generator.generate_classification(
    X, y,
    n_samples=100,
    balance_classes=True
)

# Visualize results
visualize_classification_results(
    X, y, X_synth, y_synth,
    feature_names=load_breast_cancer().feature_names
)
```
Dataset Balancing Example
```python
from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_classification_results
from sklearn.datasets import make_classification

# Create imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=3, n_redundant=1,
    weights=[0.7, 0.2, 0.1], random_state=42
)

# Initialize generator
generator = TabPFGen(n_sgld_steps=500)

# Balance dataset automatically (balances to majority class size)
X_synth, y_synth, X_combined, y_combined = generator.balance_dataset(X, y)

# Or specify custom target per class:
X_synth, y_synth, X_combined, y_combined = generator.balance_dataset(
    X, y, target_per_class=1000
)

print(f"Original dataset: {len(X)} samples")
print(f"Synthetic samples: {len(X_synth)} samples")
print(f"Combined dataset: {len(X_combined)} samples")

visualize_classification_results(
    X, y, X_synth, y_synth,
    feature_names=[f'feature_{i}' for i in range(X.shape[1])]
)
```
Note on Balancing Results: The final class distribution may be approximately balanced rather than perfectly balanced. This is due to TabPFN's label refinement process, which prioritizes data quality and realism over exact class counts. The method ensures significant improvement in class balance while maintaining high-quality synthetic samples.
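Because the result is only approximately balanced, it can be worth checking the class counts before and after balancing. The sketch below is not part of TabPFGen: `class_balance_report` and `imbalance_ratio` are hypothetical helper names, and the label arrays are stand-ins for `y` and the combined labels returned by `balance_dataset`.

```python
from collections import Counter

import numpy as np


def class_balance_report(y):
    """Return {label: (count, share)} for a label vector."""
    y = np.asarray(y)
    counts = Counter(y.tolist())
    total = len(y)
    return {label: (n, n / total) for label, n in sorted(counts.items())}


def imbalance_ratio(y):
    """Largest class count divided by smallest (1.0 means perfectly balanced)."""
    counts = Counter(np.asarray(y).tolist())
    return max(counts.values()) / min(counts.values())


# Stand-in labels (assumption); in practice pass the original and combined labels.
y_before = np.array([0] * 70 + [1] * 20 + [2] * 10)
y_after = np.array([0] * 70 + [1] * 66 + [2] * 64)  # hypothetical, roughly balanced

print(class_balance_report(y_before))
print(f"imbalance ratio before: {imbalance_ratio(y_before):.2f}")
print(f"imbalance ratio after:  {imbalance_ratio(y_after):.2f}")
```

A ratio close to 1.0 after balancing indicates the classes are nearly even, even if the counts are not identical.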
Regression Example
```python
from tabpfgen import TabPFGen
from tabpfgen.visuals import visualize_regression_results
from sklearn.datasets import load_diabetes

# Load regression dataset
X, y = load_diabetes(return_X_y=True)

# Initialize generator
generator = TabPFGen(n_sgld_steps=500)

# Generate synthetic regression data
X_synth, y_synth = generator.generate_regression(
    X, y,
    n_samples=100,
    use_quantiles=True
)

# Visualize results
visualize_regression_results(
    X, y, X_synth, y_synth,
    feature_names=load_diabetes().feature_names
)
```
Visualization Features
The package includes comprehensive visualization tools:
Classification Visualizations
- Class distribution comparison
- t-SNE visualization of feature space
- Feature importance analysis
- Feature distribution comparisons
- Feature correlation matrices
Regression Visualizations
- Target value distribution comparison
- Q-Q plots for distribution analysis
- Box plot comparisons
- Feature importance analysis
- Scatter plots of important features
- t-SNE visualization with target value mapping
- Residuals analysis
- Feature correlation matrices
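The plots listed above are visual checks. In scripts or tests, a numeric counterpart can be useful. The sketch below is not part of TabPFGen: `correlation_gap` and `qq_points` are hypothetical helper names, the data is randomly generated for illustration, and only NumPy is used.

```python
import numpy as np


def correlation_gap(X_real, X_synth):
    """Max absolute difference between the two feature correlation matrices."""
    c_real = np.corrcoef(X_real, rowvar=False)
    c_synth = np.corrcoef(X_synth, rowvar=False)
    return float(np.max(np.abs(c_real - c_synth)))


def qq_points(a, b, n_quantiles=19):
    """Matching quantiles of two 1-D samples (the points of a Q-Q plot).

    Points close to the line y = x suggest similar distributions.
    """
    q = np.linspace(0.05, 0.95, n_quantiles)
    return np.quantile(a, q), np.quantile(b, q)


# Illustrative data (assumption): "synthetic" rows are noisy copies of real ones.
rng = np.random.default_rng(0)
X_real = rng.normal(size=(200, 4))
X_synth = X_real + rng.normal(scale=0.1, size=X_real.shape)

print(f"correlation gap: {correlation_gap(X_real, X_synth):.3f}")
q_real, q_synth = qq_points(X_real[:, 0], X_synth[:, 0])
print(f"max Q-Q deviation (feature 0): {np.max(np.abs(q_real - q_synth)):.3f}")
```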
Parameters
TabPFGen
- `n_sgld_steps`: Number of SGLD iterations (default: 1000)
- `sgld_step_size`: Step size for SGLD updates (default: 0.01)
- `sgld_noise_scale`: Scale of noise in SGLD (default: 0.01)
- `device`: Computing device ('cpu' or 'cuda', default: 'auto')
Classification Generation
- `n_samples`: Number of synthetic samples to generate
- `balance_classes`: Whether to generate balanced class distributions (default: True)
Dataset Balancing
- `target_per_class`: Target number of samples per class (default: None, uses majority class size)
- `min_class_size`: Minimum class size to include in balancing (default: 5)
Regression Generation
- `n_samples`: Number of synthetic samples to generate
- `use_quantiles`: Whether to use quantile-based sampling (default: True)
Tests
```bash
python -m unittest tests/test_tabpfgen.py
```
Documentation
For detailed documentation and tutorials, visit our tutorial pages.
How It Works
Energy-Based Modeling: Uses a distance-based energy function that combines:
- Feature space distances between synthetic and real samples
- Class-conditional information for classification tasks
SGLD Sampling: Generates synthetic samples through iterative updates:
`x_new = x - step_size * gradient + noise_scale * random_noise`
Quality Assurance:
- Automatic feature scaling
- Class balance maintenance
- Distribution matching through energy minimization
- Quantile-based sampling for regression
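The SGLD update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the energy here is a toy one (half the squared distance to the nearest real row), it ignores class-conditional terms and TabPFN entirely, and `energy_grad`, `sgld_sample`, and `mean_nearest_distance` are hypothetical names.

```python
import numpy as np


def energy_grad(x, X_real):
    """Gradient of a toy energy: 0.5 * squared distance to the nearest real row."""
    # Pairwise differences, shape (n_synth, n_real, n_features).
    diff = x[:, None, :] - X_real[None, :, :]
    nearest = np.argmin((diff ** 2).sum(axis=2), axis=1)
    return x - X_real[nearest]


def sgld_sample(X_real, n_samples, n_steps=200, step_size=0.01,
                noise_scale=0.01, seed=0):
    """Run the update x_new = x - step_size * gradient + noise_scale * noise."""
    X_real = np.asarray(X_real, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from randomly chosen real rows with unit Gaussian jitter added.
    idx = rng.integers(0, len(X_real), size=n_samples)
    x = X_real[idx] + rng.normal(size=(n_samples, X_real.shape[1]))
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - step_size * energy_grad(x, X_real) + noise_scale * noise
    return x


def mean_nearest_distance(x, X_real):
    """Average Euclidean distance from each row of x to its nearest real row."""
    diff = x[:, None, :] - X_real[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=2)).min(axis=1).mean())


# Illustrative data (assumption): 50 real rows with 3 features.
X_real = np.random.default_rng(1).normal(size=(50, 3))
X_start = sgld_sample(X_real, n_samples=20, n_steps=0)   # initial points only
X_synth = sgld_sample(X_real, n_samples=20, n_steps=200)

print(f"mean nearest distance at start: {mean_nearest_distance(X_start, X_real):.3f}")
print(f"mean nearest distance at end:   {mean_nearest_distance(X_synth, X_real):.3f}")
```

The gradient term pulls each sample toward low-energy regions near the real data, while the noise term keeps the samples from collapsing onto exact copies of real rows.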
Limitations
- Memory usage scales with dataset size
- SGLD convergence can be sensitive to step size parameters
- Computation time increases with `n_sgld_steps`
- Dataset balancing produces approximate rather than perfect balance due to TabPFN's quality-focused label refinement process
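The step-size sensitivity noted above can be illustrated on a toy problem. The sketch below is an assumption-laden example, not TabPFGen code: it uses the one-dimensional energy E(x) = 0.5 * x**2, whose gradient is x, and sets the noise to zero by default so the run is deterministic. Each update then multiplies x by (1 - step_size), so the iterates shrink only when the step size lies between 0 and 2. `run_sgld_1d` is a hypothetical name.

```python
import numpy as np


def run_sgld_1d(step_size, n_steps=50, noise_scale=0.0, x0=1.0, seed=0):
    """Apply the SGLD update to E(x) = 0.5 * x**2 and return the final x."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(n_steps):
        grad = x  # dE/dx for the quadratic toy energy
        x = x - step_size * grad + noise_scale * rng.normal()
    return x


for s in (0.01, 0.5, 2.5):
    print(f"step_size={s}: |x| after 50 steps = {abs(run_sgld_1d(s)):.3g}")
```

A very small step barely moves in 50 iterations, a moderate step converges quickly, and an overly large step diverges. The loop also makes the cost trade-off visible: run time grows linearly with the number of steps.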
References
This project is inspired by the TabPFGen method described in Ma, Junwei, et al. This is an independent implementation and may not strictly follow all aspects of the original approach. We are not affiliated with the original authors.
Ma, Junwei, et al. "TabPFGen--Tabular Data Generation with TabPFN." arXiv preprint arXiv:2406.05216 (2024).
Hollmann, Noah, et al. "Accurate predictions on small data with a tabular foundation model." Nature 637.8045 (2025): 319-326.
Owner
- Name: Seb Haan
- Login: sebhaan
- Kind: user
- Company: The University of Sydney
- Repositories: 4
- Profile: https://github.com/sebhaan
GitHub Events
Total
- Create event: 3
- Issues event: 5
- Release event: 3
- Watch event: 16
- Issue comment event: 5
- Push event: 11
- Public event: 1
- Fork event: 2
Last Year
- Create event: 3
- Issues event: 5
- Release event: 3
- Watch event: 16
- Issue comment event: 5
- Push event: 11
- Public event: 1
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 0
- Average time to close issues: 2 days
- Average time to close pull requests: N/A
- Total issue authors: 3
- Total pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 0
- Average time to close issues: 2 days
- Average time to close pull requests: N/A
- Issue authors: 3
- Pull request authors: 0
- Average comments per issue: 1.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Yanboliu123 (1)
- marco-virgolin-ist (1)
- noahho (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 69 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 5
- Total maintainers: 1
pypi.org: tabpfgen
Synthetic tabular data generation using energy-based modeling and TabPFN
- Homepage: https://github.com/sebhaan/TabPFGen
- Documentation: https://tabpfgen.readthedocs.io/
- License: Apache Software License
- Latest release: 0.1.4 (published 9 months ago)