https://github.com/chris-santiago/tabular-ssl
Using Cursor to create a project via AI Chat
Science Score: 46.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: chris-santiago
- Language: Python
- Default Branch: master
- Homepage: https://chris-santiago.github.io/tabular-ssl/
- Size: 1.28 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
This was generated using Cursor IDE w/agent chat.
Notes:
- Default models (or auto) are pretty terrible. Code quality is inconsistent; it can feel like you're working with a struggling student. At first things look good, but the code is riddled with errors.
- Test generation w/default models is even worse. Most tests are not functional: they import classes/funcs that don't exist in the actual source code. It quickly devolves into an endless loop of error fixes.
- Using the latest "thinking" or reasoning models is a much better experience. Claude-4-sonnet cleaned up much of the mess from the "auto" models.
- Models use excessive emoji.
🎯 Tabular SSL: Self-Supervised Learning for Tabular Data
A unified, modular framework for self-supervised learning on tabular data with state-of-the-art corruption strategies, consistent interfaces, and fast experimentation.
✨ Key Features
🔧 Consistent Design: All components follow unified interfaces
🧩 Modular Architecture: Easy to compose and extend
⚡ Fast Experimentation: Swap components with simple config changes
🎭 State-of-the-Art SSL: VIME, SCARF, ReConTab implementations
🏦 Ready-to-Use Data: IBM TabFormer credit card dataset included
📱 Interactive Demos: Understand corruption strategies visually
🚀 Quick Start
Installation
bash
git clone https://github.com/yourusername/tabular-ssl.git
cd tabular-ssl
pip install -r requirements.txt
pip install -e .
export PYTHONPATH=$PWD/src
Interactive Demos
```bash
# 🎭 Explore corruption strategies
python demo_corruption_strategies.py
# 🏦 Real credit card transaction data
python demo_credit_card_data.py
# 🔧 Final unified design demo
python demo_final_design.py
```
Train Models
```bash
# 🎯 VIME: Value imputation + mask estimation
python train.py model=ssl_vime
# 🌟 SCARF: Contrastive learning with feature corruption
python train.py model=ssl_scarf
# 🔧 ReConTab: Multi-task reconstruction learning
python train.py model=ssl_recontab
# 🤖 Simple MLP classifier
python train.py model=base_mlp
```
🧪 Easy Experimentation
Component Swapping
```bash
# Switch to RNN backbone
python train.py model=ssl_vime sequence_encoder=rnn
# Use autoencoder event encoder
python train.py model=ssl_scarf event_encoder=autoencoder
# Remove sequence modeling
python train.py model=ssl_vime sequence_encoder=null
# Custom corruption rate
python train.py model=ssl_vime corruption.corruption_rate=0.5
```
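Each value override above is a dotted key path into the composed config, followed by a new value. A minimal sketch of that idea in plain Python, assuming a nested dict stands in for the Hydra config. The helpers `parse_value` and `apply_override` and the sample config are hypothetical; Hydra itself does the real parsing, and config-group choices such as `sequence_encoder=rnn` load a whole YAML file rather than setting a single string.

```python
import ast
from typing import Any, Dict


def parse_value(raw: str) -> Any:
    """Turn an override value into a Python object where possible."""
    if raw == "null":
        return None
    try:
        return ast.literal_eval(raw)  # numbers, lists, quoted strings
    except (ValueError, SyntaxError):
        return raw  # bare words such as "rnn" stay strings


def apply_override(config: Dict[str, Any], override: str) -> None:
    """Set config[a][b][c] from an override of the form 'a.b.c=value'."""
    key_path, raw = override.split("=", 1)
    *parents, leaf = key_path.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = parse_value(raw)


# Hypothetical starting config, only for illustration.
config = {"corruption": {"corruption_rate": 0.3}, "sequence_encoder": "transformer"}
apply_override(config, "corruption.corruption_rate=0.5")
apply_override(config, "sequence_encoder=null")
print(config)
```

The sketch only covers the mechanics of addressing a nested key; it does not validate keys against a schema the way a structured config would.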
Systematic Comparison
```bash
# Compare all SSL strategies
python train.py -m model=ssl_vime,ssl_scarf,ssl_recontab
# Quick iteration setup
python train.py experiment=quick_vime_ssl
# Full comparison experiment
python train.py experiment=compare_corruptions
```
Architecture Variants
```bash
# Transformer + VIME SSL
python train.py model=ssl_vime sequence_encoder=transformer
# S4 + ReConTab SSL
python train.py model=ssl_recontab sequence_encoder=s4
# RNN + SCARF SSL
python train.py model=ssl_scarf sequence_encoder=rnn sequence_encoder.rnn_type=gru
```
🏗️ Architecture Overview
Unified Component Hierarchy
BaseComponent (Abstract)
├── EventEncoder
│   ├── MLPEventEncoder
│   ├── AutoEncoderEventEncoder
│   └── ContrastiveEventEncoder
├── SequenceEncoder
│   ├── TransformerSequenceEncoder
│   ├── RNNSequenceEncoder
│   └── S4SequenceEncoder
├── ProjectionHead
│   └── MLPProjectionHead
├── PredictionHead
│   └── ClassificationHead
├── EmbeddingLayer
│   └── CategoricalEmbedding
└── BaseCorruption
    ├── VIMECorruption (NeurIPS 2020)
    ├── SCARFCorruption (arXiv 2021)
    └── ReConTabCorruption
Model Composition
```python
# Flexible model composition
SSLModel(
    event_encoder=MLPEventEncoder(...),
    sequence_encoder=TransformerSequenceEncoder(...),
    corruption=VIMECorruption(...),  # ← Type auto-detected
)
```
Consistent Interfaces
All components follow the same patterns:
- Corruption strategies return Dict[str, torch.Tensor] with 'corrupted', 'targets', 'mask', 'metadata'
- All components expose input_dim and output_dim properties
- Auto-detection eliminates configuration errors
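The output contract in the first bullet can be checked mechanically. A minimal sketch, assuming plain Python lists in place of torch tensors so that it runs without dependencies; `REQUIRED_KEYS`, `missing_keys`, and `identity_corruption` are hypothetical names for illustration, not part of the package.

```python
from typing import Any, Dict, List

# Keys named in the interface description above.
REQUIRED_KEYS = ("corrupted", "targets", "mask", "metadata")


def missing_keys(output: Dict[str, Any]) -> List[str]:
    """Return the contract keys that a corruption output lacks."""
    return [key for key in REQUIRED_KEYS if key not in output]


def identity_corruption(batch: List[List[float]]) -> Dict[str, Any]:
    """A no-op 'corruption' that still honours the four-key contract."""
    return {
        "corrupted": [list(row) for row in batch],
        "targets": batch,
        "mask": [[False] * len(row) for row in batch],
        "metadata": {"corruption_rate": 0.0},
    }


output = identity_corruption([[1.0, 2.0], [3.0, 4.0]])
print(missing_keys(output))                    # []
print(missing_keys({"corrupted": [], "mask": []}))  # ['targets', 'metadata']
```

A check like this is the kind of thing a unit test for a new corruption strategy could assert before the strategy is wired into a model.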
📁 Configuration Structure
Standardized Layout
configs/
├── corruption/ # 🎭 VIME, SCARF, ReConTab, etc.
├── event_encoder/ # 📦 MLP, Autoencoder, Contrastive
├── sequence_encoder/ # 🔗 Transformer, RNN, S4, null
├── projection_head/ # 📐 MLP, null
├── prediction_head/ # 🎯 Classification, null
├── embedding/ # 🔤 Categorical, null
├── model/ # 🤖 Complete model configs
└── experiment/ # 🧪 Experiment templates
Consistent Pattern
All component configs follow:
```yaml
# Component Description
_target_: tabular_ssl.models.components.ComponentClass
param1: value1
param2: value2
```
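In Hydra configs, the `_target_` key holds a dotted import path and the remaining keys become constructor arguments. The sketch below is a simplified stand-in for what `hydra.utils.instantiate` does, written with the standard library only; the local `instantiate` helper is hypothetical, and `fractions.Fraction` is used as the target purely so the example runs without this package installed.

```python
import importlib
from typing import Any, Dict


def instantiate(config: Dict[str, Any]) -> Any:
    """Import the class named by '_target_' and call it with the other keys."""
    params = dict(config)  # copy so the caller's config is not mutated
    module_path, _, class_name = params.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**params)


# Stand-in target; a real component config would point at
# tabular_ssl.models.components.<ComponentClass> instead.
value = instantiate(
    {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
)
print(value)  # 3/4
```

The real Hydra function also handles nested `_target_` entries, partial instantiation, and positional arguments, none of which this sketch attempts.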
🎭 State-of-the-Art Corruption Strategies
VIME - Value Imputation and Mask Estimation
From "VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain" (NeurIPS 2020)
```yaml
# configs/corruption/vime.yaml
_target_: tabular_ssl.models.components.VIMECorruption
corruption_rate: 0.3
categorical_indices: []
numerical_indices: [0, 1, 2, 3]
```
Features:
- 🎯 Dual pretext tasks: mask estimation + value imputation
- 🔢 Handles categorical and numerical features differently
- 📊 Returns both corrupted data and mask for training
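To make the two pretext tasks concrete, here is a minimal sketch of VIME-style corruption as described in the paper: draw a Bernoulli mask, replace masked cells with values from the same column of other rows, and report which cells actually changed. It uses plain Python lists rather than torch tensors and treats every feature as numerical; `vime_corrupt` is a hypothetical name and this is not the package's `VIMECorruption` implementation.

```python
import random
from typing import Any, Dict, List


def vime_corrupt(
    batch: List[List[float]], corruption_rate: float, rng: random.Random
) -> Dict[str, Any]:
    """Corrupt a batch by swapping in values from each feature's own column."""
    n_cols = len(batch[0])
    # Shuffle every column independently so replacements follow the
    # empirical marginal distribution of that feature.
    shuffled_cols = []
    for j in range(n_cols):
        column = [row[j] for row in batch]
        rng.shuffle(column)
        shuffled_cols.append(column)

    corrupted, mask = [], []
    for i, row in enumerate(batch):
        new_row, mask_row = [], []
        for j, value in enumerate(row):
            replace = rng.random() < corruption_rate
            new_value = shuffled_cols[j][i] if replace else value
            new_row.append(new_value)
            # Mask-estimation target: True only where the value really changed.
            mask_row.append(new_value != value)
        corrupted.append(new_row)
        mask.append(mask_row)

    return {
        "corrupted": corrupted,  # model input
        "targets": batch,        # value-imputation target
        "mask": mask,            # mask-estimation target
        "metadata": {"corruption_rate": corruption_rate},
    }


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
result = vime_corrupt(batch, corruption_rate=0.3, rng=random.Random(0))
print(result["corrupted"])
print(result["mask"])
```

Recomputing the mask from the changed cells, rather than returning the sampled Bernoulli mask, avoids labelling a cell as corrupted when the shuffled value happens to equal the original.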
SCARF - Contrastive Learning with Feature Corruption
From "SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption" (arXiv 2021)
```yaml
# configs/corruption/scarf.yaml
_target_: tabular_ssl.models.components.SCARFCorruption
corruption_rate: 0.6
corruption_strategy: random_swap
```
Features:
- 🌟 High corruption rate (60%) for effective contrastive learning
- 🔄 Feature swapping between samples in batch
- 🌡️ Temperature-scaled InfoNCE loss
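The temperature-scaled InfoNCE loss mentioned in the last bullet can be written out in a few lines. This is a dependency-free sketch on lists of floats, assuming row i of `anchors` (embeddings of the original samples) is paired with row i of `positives` (embeddings of their corrupted views) and all other rows act as negatives. The function names and the default temperature are assumptions; whether the package uses a one-directional or symmetric variant is not stated here.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(vector: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(
    anchors: List[Vector], positives: List[Vector], temperature: float = 0.1
) -> float:
    """Mean cross-entropy of picking each anchor's own positive among all positives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity to every positive, sharpened by the temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp for the denominator.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - logits[i]
    return total / len(a)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)  # True
```

Lower temperatures sharpen the softmax, so well-aligned pairs are rewarded more strongly and hard negatives are penalised more.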
ReConTab - Multi-task Reconstruction
Reconstruction-based contrastive learning for tabular data
```yaml
# configs/corruption/recontab.yaml
_target_: tabular_ssl.models.components.ReConTabCorruption
corruption_rate: 0.15
corruption_types: [masking, noise, swapping]
masking_strategy: random
```
Features:
- 🔧 Multiple corruption types: masking, noise injection, swapping
- 📊 Detailed corruption tracking for reconstruction targets
- 🎯 Multi-task learning with specialized heads
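The combination of several corruption types with per-cell tracking can be sketched as follows. This is an illustration on plain Python lists, not the package's `ReConTabCorruption`: the function name `multi_corrupt`, the `noise_std` and `mask_value` parameters, and the choice to pick one corruption type uniformly per corrupted cell are all assumptions.

```python
import random
from typing import Any, Dict, List


def multi_corrupt(
    batch: List[List[float]],
    corruption_rate: float,
    corruption_types: List[str],
    rng: random.Random,
    noise_std: float = 0.1,
    mask_value: float = 0.0,
) -> Dict[str, Any]:
    """Apply one randomly chosen corruption type to each selected cell."""
    corrupted = [list(row) for row in batch]
    applied = [["none"] * len(row) for row in batch]
    for i, row in enumerate(batch):
        for j, value in enumerate(row):
            if rng.random() >= corruption_rate:
                continue  # cell left untouched
            kind = rng.choice(corruption_types)
            if kind == "masking":
                corrupted[i][j] = mask_value
            elif kind == "noise":
                corrupted[i][j] = value + rng.gauss(0.0, noise_std)
            elif kind == "swapping":
                # Take the same feature from a random row of the batch.
                corrupted[i][j] = batch[rng.randrange(len(batch))][j]
            else:
                raise ValueError(f"unknown corruption type: {kind}")
            applied[i][j] = kind
    return {
        "corrupted": corrupted,
        "targets": batch,
        # True where a corruption was applied, even if the value is unchanged.
        "mask": [[kind != "none" for kind in row] for row in applied],
        "metadata": {"corruption_type": applied},
    }


batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = multi_corrupt(
    batch, 0.15, ["masking", "noise", "swapping"], random.Random(0)
)
print(result["metadata"]["corruption_type"])
```

Keeping the per-cell type in `metadata` is what would let specialized heads compute a separate loss for each corruption type.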
📊 Sample Data
IBM TabFormer Credit Card Dataset
```python
from tabular_ssl.data.sample_data import load_credit_card_sample

# Download and load real transaction data
data, info = load_credit_card_sample()
print(f"Loaded {len(data)} transactions")
print(f"Features: {info['feature_names']}")
```
Synthetic Transaction Generator
```python
from tabular_ssl.data.sample_data import generate_sequential_transactions

# Generate synthetic data for experimentation
data = generate_sequential_transactions(
    num_users=1000,
    transactions_per_user=50,
    num_features=10,
)
```
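For readers who want to try the data shape without installing the package, the sketch below produces a comparable nested structure of users, transactions, and features. It is a hypothetical stand-in: `toy_sequential_transactions`, the per-user drift model, and the returned list-of-lists format are assumptions and may differ from what `generate_sequential_transactions` actually returns.

```python
import random
from typing import List


def toy_sequential_transactions(
    num_users: int, transactions_per_user: int, num_features: int, seed: int = 0
) -> List[List[List[float]]]:
    """Return data[user][transaction][feature] with mild temporal correlation."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_users):
        # Each user gets their own typical feature values.
        user_mean = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
        state = list(user_mean)
        sequence = []
        for _ in range(transactions_per_user):
            # Drift toward the user mean plus noise, so consecutive
            # transactions of one user resemble each other.
            state = [
                0.8 * s + 0.2 * m + rng.gauss(0.0, 0.1)
                for s, m in zip(state, user_mean)
            ]
            sequence.append(list(state))
        data.append(sequence)
    return data


toy = toy_sequential_transactions(num_users=3, transactions_per_user=5, num_features=4)
print(len(toy), len(toy[0]), len(toy[0][0]))  # 3 5 4
```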
🔧 Advanced Usage
Custom Component Creation
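```python
from typing import Dict

import torch

from tabular_ssl.models.components import BaseCorruption


class CustomCorruption(BaseCorruption):
    def forward(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Your corruption logic goes here. Zeroing ~30% of entries is only
        # a placeholder so that the example is complete.
        corruption_mask = torch.rand_like(x) < 0.3
        corrupted_x = x.masked_fill(corruption_mask, 0.0)
        return {
            'corrupted': corrupted_x,
            'targets': x,
            'mask': corruption_mask,
        }
```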
Configuration Override Examples
```bash
# Hyperparameter sweep
python train.py -m model=ssl_vime \
  corruption.corruption_rate=0.1,0.3,0.5 \
  model.learning_rate=1e-4,5e-4,1e-3
# Architecture ablation (quoted so the shell does not expand the brackets)
python train.py model=ssl_vime \
  'event_encoder.hidden_dims=[64,128]' \
  sequence_encoder.num_layers=2
# Custom experiment
python train.py model=ssl_vime \
  event_encoder=autoencoder \
  sequence_encoder=s4 \
  corruption.corruption_rate=0.4
```
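With `-m` (multirun), Hydra's default sweeper launches one job per combination of the comma-separated values, so the hyperparameter sweep above produces 3 × 3 = 9 runs. A small sketch of that expansion using only the standard library; `expand_sweep` is a hypothetical helper, and it splits naively on commas, so it would not handle list-valued overrides such as `[64,128]`.

```python
from itertools import product
from typing import Dict, List


def expand_sweep(overrides: List[str]) -> List[Dict[str, str]]:
    """Expand comma-separated override values into their Cartesian product."""
    keys, choices = [], []
    for override in overrides:
        key, values = override.split("=", 1)
        keys.append(key)
        choices.append(values.split(","))
    return [dict(zip(keys, combo)) for combo in product(*choices)]


jobs = expand_sweep(
    [
        "model=ssl_vime",
        "corruption.corruption_rate=0.1,0.3,0.5",
        "model.learning_rate=1e-4,5e-4,1e-3",
    ]
)
print(len(jobs))  # 9
print(jobs[0])
```

Counting the grid this way before launching is a quick check on how much compute a sweep will need.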
Modular Composition
```python
# Mix and match components
model = SSLModel(
    event_encoder=AutoEncoderEventEncoder(...),
    sequence_encoder=S4SequenceEncoder(...),
    projection_head=MLPProjectionHead(...),
    corruption=ReConTabCorruption(...),
)
```
📚 Project Structure
tabular-ssl/
├── 📁 configs/ # Hydra configurations
│ ├── 🎭 corruption/ # Corruption strategies
│ ├── 📦 event_encoder/ # Event encoders
│ ├── 🔗 sequence_encoder/ # Sequence encoders
│ ├── 📐 projection_head/ # Projection heads
│ ├── 🎯 prediction_head/ # Prediction heads
│ ├── 🔤 embedding/ # Embedding layers
│ ├── 🤖 model/ # Complete models
│ └── 🧪 experiment/ # Experiment configs
├── 📁 src/tabular_ssl/
│ ├── 📊 data/ # Data loading & sample data
│ ├── 🧠 models/ # Model implementations
│ │ ├── base.py # Base classes & models
│ │ └── components.py # All components
│ └── 🛠️ utils/ # Utilities
├── 🎬 demo_*.py # Interactive demos
├── 📖 docs/ # Documentation
└── ✅ tests/ # Unit tests
🎯 Design Principles
✅ Consistent Interfaces: All components follow same patterns
✅ Simplified Architecture: Clean abstractions without complexity
✅ Maintained Functionality: All original features preserved
✅ Modular & Extensible: Easy to add new components
✅ Intuitive Configuration: Logical, well-organized configs
✅ Fast Experimentation: Easy component swapping
📈 Benchmarks & Results
The framework includes implementations of methods from leading papers:
| Method | Paper | Key Innovation |
|--------|-------|----------------|
| VIME | NeurIPS 2020 | Dual pretext tasks for tabular SSL |
| SCARF | arXiv 2021 | Contrastive learning with feature corruption |
| ReConTab | Custom | Multi-task reconstruction learning |
Quick Comparison
```bash
# Run systematic comparison
python train.py experiment=compare_corruptions
# Results logged to W&B automatically
```
🚀 Getting Started Guide
1. Explore Demos
bash
python demo_corruption_strategies.py # Understand corruption methods
python demo_credit_card_data.py # See real data in action
python demo_final_design.py # Complete design overview
2. Train Your First Model
bash
python train.py model=ssl_vime # Start with VIME
3. Experiment with Components
bash
python train.py model=ssl_vime sequence_encoder=rnn
python train.py model=ssl_scarf event_encoder=autoencoder
4. Create Custom Configurations
```bash
# Copy and modify existing configs
cp configs/corruption/vime.yaml configs/corruption/my_corruption.yaml
# Edit my_corruption.yaml
python train.py model=ssl_vime corruption=my_corruption
```
🤝 Contributing
We welcome contributions! The modular design makes it easy to:
- Add new corruption strategies following the BaseCorruption interface
- Implement new encoders extending base classes
- Create experiment configurations in configs/experiment/
- Add new sample datasets in src/tabular_ssl/data/
Development Setup
bash
git clone https://github.com/yourusername/tabular-ssl.git
cd tabular-ssl
pip install -r requirements.txt
pip install -e .
python -m pytest tests/
📖 Documentation
- 🎯 Design Summary: Complete design overview
- 📚 API Reference: Detailed API documentation
- 🧪 Experiments Guide: How to create experiments
- 🔧 Custom Components: Adding new components
📝 Citation
bibtex
@software{tabular_ssl,
title={Tabular SSL: A Unified Framework for Self-Supervised Learning on Tabular Data},
author={Your Name},
year={2024},
url={https://github.com/yourusername/tabular-ssl}
}
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- VIME: Yoon et al., NeurIPS 2020
- SCARF: Bahri et al., arXiv 2021
- S4: Gu et al., ICLR 2022
- IBM TabFormer: Padhi et al., arXiv 2021
- PyTorch Lightning: Falcon et al.
- Hydra: Facebook Research
🎉 Ready for fast, iterative tabular SSL experimentation!
Owner
- Name: Chris Santiago
- Login: chris-santiago
- Kind: user
- Repositories: 64
- Profile: https://github.com/chris-santiago
GitHub Events
Total
- Issues event: 1
- Push event: 34
- Create event: 3
Last Year
- Issues event: 1
- Push event: 34
- Create event: 3
Committers
Last synced: 11 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| chris-santiago | c****o@g****u | 51 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- chris-santiago (2)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- mkdocs-material >=9.0.0
- mkdocstrings >=0.24.0
- pymdown-extensions >=10.0.0
- hydra-core >=1.3.2
- torch >=2.7.0
- antlr4-python3-runtime ==4.9.3
- babel ==2.17.0
- backrefs ==5.8
- certifi ==2025.4.26
- charset-normalizer ==3.4.2
- click ==8.2.1
- colorama ==0.4.6
- filelock ==3.18.0
- fsspec ==2025.5.0
- ghp-import ==2.1.0
- griffe ==1.7.3
- hydra-core ==1.3.2
- idna ==3.10
- iniconfig ==2.1.0
- jinja2 ==3.1.6
- markdown ==3.8
- markupsafe ==3.0.2
- mergedeep ==1.3.4
- mkdocs ==1.6.1
- mkdocs-autorefs ==1.4.2
- mkdocs-get-deps ==0.2.0
- mkdocs-material ==9.6.14
- mkdocs-material-extensions ==1.3.1
- mkdocstrings ==0.29.1
- mkdocstrings-python ==1.16.10
- mpmath ==1.3.0
- networkx ==3.4.2
- nvidia-cublas-cu12 ==12.6.4.1
- nvidia-cuda-cupti-cu12 ==12.6.80
- nvidia-cuda-nvrtc-cu12 ==12.6.77
- nvidia-cuda-runtime-cu12 ==12.6.77
- nvidia-cudnn-cu12 ==9.5.1.17
- nvidia-cufft-cu12 ==11.3.0.4
- nvidia-cufile-cu12 ==1.11.1.6
- nvidia-curand-cu12 ==10.3.7.77
- nvidia-cusolver-cu12 ==11.7.1.2
- nvidia-cusparse-cu12 ==12.5.4.2
- nvidia-cusparselt-cu12 ==0.6.3
- nvidia-nccl-cu12 ==2.26.2
- nvidia-nvjitlink-cu12 ==12.6.85
- nvidia-nvtx-cu12 ==12.6.77
- omegaconf ==2.3.0
- packaging ==25.0
- paginate ==0.5.7
- pathspec ==0.12.1
- platformdirs ==4.3.8
- pluggy ==1.6.0
- pygments ==2.19.1
- pymdown-extensions ==10.15
- pytest ==8.3.5
- python-dateutil ==2.9.0.post0
- pyyaml ==6.0.2
- pyyaml-env-tag ==1.1
- requests ==2.32.3
- ruff ==0.11.11
- setuptools ==80.8.0
- six ==1.17.0
- sympy ==1.14.0
- torch ==2.7.0
- triton ==3.3.0
- typing-extensions ==4.13.2
- urllib3 ==2.4.0
- watchdog ==6.0.0
- antlr4-python3-runtime 4.9.3
- babel 2.17.0
- backrefs 5.8
- certifi 2025.4.26
- charset-normalizer 3.4.2
- click 8.2.1
- colorama 0.4.6
- filelock 3.18.0
- fsspec 2025.5.0
- ghp-import 2.1.0
- griffe 1.7.3
- hydra-core 1.3.2
- idna 3.10
- iniconfig 2.1.0
- jinja2 3.1.6
- markdown 3.8
- markupsafe 3.0.2
- mergedeep 1.3.4
- mkdocs 1.6.1
- mkdocs-autorefs 1.4.2
- mkdocs-get-deps 0.2.0
- mkdocs-material 9.6.14
- mkdocs-material-extensions 1.3.1
- mkdocstrings 0.29.1
- mkdocstrings-python 1.16.10
- mpmath 1.3.0
- networkx 3.4.2
- nvidia-cublas-cu12 12.6.4.1
- nvidia-cuda-cupti-cu12 12.6.80
- nvidia-cuda-nvrtc-cu12 12.6.77
- nvidia-cuda-runtime-cu12 12.6.77
- nvidia-cudnn-cu12 9.5.1.17
- nvidia-cufft-cu12 11.3.0.4
- nvidia-cufile-cu12 1.11.1.6
- nvidia-curand-cu12 10.3.7.77
- nvidia-cusolver-cu12 11.7.1.2
- nvidia-cusparse-cu12 12.5.4.2
- nvidia-cusparselt-cu12 0.6.3
- nvidia-nccl-cu12 2.26.2
- nvidia-nvjitlink-cu12 12.6.85
- nvidia-nvtx-cu12 12.6.77
- omegaconf 2.3.0
- packaging 25.0
- paginate 0.5.7
- pathspec 0.12.1
- platformdirs 4.3.8
- pluggy 1.6.0
- pygments 2.19.1
- pymdown-extensions 10.15
- pytest 8.3.5
- python-dateutil 2.9.0.post0
- pyyaml 6.0.2
- pyyaml-env-tag 1.1
- requests 2.32.3
- ruff 0.11.11
- setuptools 80.8.0
- six 1.17.0
- sympy 1.14.0
- tabular-ssl 0.1.0
- torch 2.7.0
- triton 3.3.0
- typing-extensions 4.13.2
- urllib3 2.4.0
- watchdog 6.0.0