https://github.com/chris-santiago/tabular-ssl

Using Cursor to create a project via AI Chat

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
✓
Committers with academic emails
1 of 1 committers (100.0%) from academic institutions
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (11.6%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: chris-santiago
Language: Python
Default Branch: master
Homepage: https://chris-santiago.github.io/tabular-ssl/
Size: 1.28 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 2
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme

README.md

This was generated using Cursor IDE w/agent chat.

Notes:

- The default models (or "auto") are pretty terrible. Code quality is inconsistent; it can feel like you're working with a struggling student. At first things look good, but the code is riddled with errors.
- Test generation with the default models is even worse. Most tests are not functional: they import classes and functions that don't exist in the actual source code. It quickly devolves into an endless loop of error fixes.
- Using the latest "thinking" or reasoning models is a much better experience. Claude-4-sonnet cleaned up much of the mess left by the "auto" models.
- Models use excessive emoji.

🎯 Tabular SSL: Self-Supervised Learning for Tabular Data

A unified, modular framework for self-supervised learning on tabular data with state-of-the-art corruption strategies, consistent interfaces, and fast experimentation.

✨ Key Features

🔧 Consistent Design: All components follow unified interfaces
🧩 Modular Architecture: Easy to compose and extend
⚡ Fast Experimentation: Swap components with simple config changes
🎭 State-of-the-Art SSL: VIME, SCARF, ReConTab implementations
🏦 Ready-to-Use Data: IBM TabFormer credit card dataset included
📱 Interactive Demos: Understand corruption strategies visually

🚀 Quick Start

Installation

```bash
git clone https://github.com/chris-santiago/tabular-ssl.git
cd tabular-ssl
pip install -r requirements.txt
pip install -e .
export PYTHONPATH=$PWD/src
```

Interactive Demos

```bash
# 🎭 Explore corruption strategies
python demo_corruption_strategies.py

# 🏦 Real credit card transaction data
python demo_credit_card_data.py

# 🔧 Final unified design demo
python demo_final_design.py
```

Train Models

```bash
# 🎯 VIME: Value imputation + mask estimation
python train.py model=ssl_vime

# 🌟 SCARF: Contrastive learning with feature corruption
python train.py model=ssl_scarf

# 🔧 ReConTab: Multi-task reconstruction learning
python train.py model=ssl_recontab

# 🤖 Simple MLP classifier
python train.py model=base_mlp
```

🧪 Easy Experimentation

Component Swapping

```bash
# Switch to RNN backbone
python train.py model=ssl_vime sequence_encoder=rnn

# Use autoencoder event encoder
python train.py model=ssl_scarf event_encoder=autoencoder

# Remove sequence modeling
python train.py model=ssl_vime sequence_encoder=null

# Custom corruption rate
python train.py model=ssl_vime corruption.corruption_rate=0.5
```

Systematic Comparison

```bash
# Compare all SSL strategies
python train.py -m model=ssl_vime,ssl_scarf,ssl_recontab

# Quick iteration setup
python train.py experiment=quick_vime_ssl

# Full comparison experiment
python train.py experiment=compare_corruptions
```

Architecture Variants

```bash
# Transformer + VIME SSL
python train.py model=ssl_vime sequence_encoder=transformer

# S4 + ReConTab SSL
python train.py model=ssl_recontab sequence_encoder=s4

# RNN + SCARF SSL
python train.py model=ssl_scarf sequence_encoder=rnn sequence_encoder.rnn_type=gru
```

🏗️ Architecture Overview

Unified Component Hierarchy

```
BaseComponent (Abstract)
├── EventEncoder
│   ├── MLPEventEncoder
│   ├── AutoEncoderEventEncoder
│   └── ContrastiveEventEncoder
├── SequenceEncoder
│   ├── TransformerSequenceEncoder
│   ├── RNNSequenceEncoder
│   └── S4SequenceEncoder
├── ProjectionHead
│   └── MLPProjectionHead
├── PredictionHead
│   └── ClassificationHead
├── EmbeddingLayer
│   └── CategoricalEmbedding
└── BaseCorruption
    ├── VIMECorruption (NeurIPS 2020)
    ├── SCARFCorruption (arXiv 2021)
    └── ReConTabCorruption
```

Model Composition

```python
# Flexible model composition
SSLModel(
    event_encoder=MLPEventEncoder(...),
    sequence_encoder=TransformerSequenceEncoder(...),
    corruption=VIMECorruption(...),  # ← Type auto-detected
)
```

Consistent Interfaces

All components follow the same patterns:

- Corruption strategies return Dict[str, torch.Tensor] with 'corrupted', 'targets', 'mask', 'metadata'
- All components expose input_dim and output_dim properties
- Auto-detection eliminates configuration errors
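Because every component exposes `input_dim` and `output_dim`, a composition can be sanity-checked before any training starts. The following is a minimal, dependency-free sketch of that idea. `Component` and `check_chain` are hypothetical names used only for illustration; they are not the repository's classes, and the dimensions are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Component:
    """Hypothetical stand-in for a component exposing input_dim/output_dim."""
    name: str
    input_dim: int
    output_dim: int


def check_chain(components: Sequence[Component]) -> int:
    """Verify each output_dim matches the next input_dim; return the final output_dim."""
    if not components:
        raise ValueError("at least one component is required")
    for prev, nxt in zip(components, components[1:]):
        if prev.output_dim != nxt.input_dim:
            raise ValueError(
                f"{prev.name}.output_dim={prev.output_dim} does not match "
                f"{nxt.name}.input_dim={nxt.input_dim}"
            )
    return components[-1].output_dim


# Illustrative dimensions only.
chain = [
    Component("event_encoder", input_dim=10, output_dim=64),
    Component("sequence_encoder", input_dim=64, output_dim=128),
    Component("projection_head", input_dim=128, output_dim=32),
]
final_dim = check_chain(chain)
print(final_dim)
```

A check like this fails fast with a readable message instead of surfacing later as a shape error inside a forward pass.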

📁 Configuration Structure

Standardized Layout

```
configs/
├── corruption/          # 🎭 VIME, SCARF, ReConTab, etc.
├── event_encoder/       # 📦 MLP, Autoencoder, Contrastive
├── sequence_encoder/    # 🔗 Transformer, RNN, S4, null
├── projection_head/     # 📐 MLP, null
├── prediction_head/     # 🎯 Classification, null
├── embedding/           # 🔤 Categorical, null
├── model/               # 🤖 Complete model configs
└── experiment/          # 🧪 Experiment templates
```

Consistent Pattern

All component configs follow:

```yaml
# Component Description
_target_: tabular_ssl.models.components.ComponentClass
param1: value1
param2: value2
```
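Hydra configs conventionally name the class to build with a `_target_` key, and `hydra.utils.instantiate` turns such a config into an object. The sketch below imitates only the core of that behaviour with the standard library, to show what the pattern means. The `instantiate` function here is a simplified illustration, not Hydra's implementation, and `fractions.Fraction` merely stands in for a real component class.

```python
import importlib
from typing import Any, Mapping


def instantiate(config: Mapping[str, Any]) -> Any:
    """Import the class named by `_target_` and call it with the remaining keys."""
    params = dict(config)
    module_path, _, class_name = params.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**params)


# A standard-library class stands in for a component such as a corruption strategy.
config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate(config)
print(obj)
```

Every key other than `_target_` is passed to the constructor as a keyword argument, which is why the component configs list only a class path and its parameters.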

🎭 State-of-the-Art Corruption Strategies

VIME - Value Imputation and Mask Estimation

From "VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain" (NeurIPS 2020)

```yaml
# configs/corruption/vime.yaml
_target_: tabular_ssl.models.components.VIMECorruption
corruption_rate: 0.3
categorical_indices: []
numerical_indices: [0, 1, 2, 3]
```

Features:

- 🎯 Dual pretext tasks: mask estimation + value imputation
- 🔢 Handles categorical and numerical features differently
- 📊 Returns both corrupted data and mask for training
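To make the two pretext tasks concrete, here is a minimal sketch of VIME-style corruption as described in the VIME paper: a Bernoulli mask picks cells, each picked cell is replaced by a value drawn from the same column (a column-wise shuffle), and the mask label marks the cells whose value actually changed. It uses plain Python lists so it runs without dependencies. `vime_style_corruption` is a hypothetical name, and this is not the repository's `VIMECorruption`, which works on torch tensors and treats categorical and numerical columns separately.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def vime_style_corruption(x: Matrix, corruption_rate: float, seed: int = 0) -> Dict[str, list]:
    """Corrupt cells with column-shuffled values and return the change mask."""
    rng = random.Random(seed)
    n_rows, n_cols = len(x), len(x[0])
    # Column-wise shuffle: replacements come from the same feature's empirical marginal.
    shuffled_cols = []
    for j in range(n_cols):
        col = [row[j] for row in x]
        rng.shuffle(col)
        shuffled_cols.append(col)
    corrupted, mask = [], []
    for i in range(n_rows):
        c_row, m_row = [], []
        for j in range(n_cols):
            replace = rng.random() < corruption_rate
            value = shuffled_cols[j][i] if replace else x[i][j]
            c_row.append(value)
            # Label only cells whose value actually changed (mask estimation target).
            m_row.append(1 if value != x[i][j] else 0)
        corrupted.append(c_row)
        mask.append(m_row)
    # 'targets' holds the clean values (value imputation target).
    return {"corrupted": corrupted, "targets": [row[:] for row in x], "mask": mask}


x = [[float(i), float(10 * i)] for i in range(8)]
out = vime_style_corruption(x, corruption_rate=0.3)
print(out["mask"])
```

A model trained on this output predicts `mask` from `corrupted` (mask estimation) and reconstructs `targets` from `corrupted` (value imputation).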

SCARF - Contrastive Learning with Feature Corruption

From "SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption" (arXiv 2021)

```yaml
# configs/corruption/scarf.yaml
_target_: tabular_ssl.models.components.SCARFCorruption
corruption_rate: 0.6
corruption_strategy: random_swap
```

Features:

- 🌟 High corruption rate (60%) for effective contrastive learning
- 🔄 Feature swapping between samples in batch
- 🌡️ Temperature-scaled InfoNCE loss
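The temperature-scaled InfoNCE objective can be written out in a few lines. The sketch below is a dependency-free illustration that scores each anchor embedding against every embedding of the corrupted views, treating the matching index as the positive and the rest of the batch as negatives. `info_nce` and `_cosine` are illustrative names, the temperature value is arbitrary, and only one direction of the loss is computed; the repository's own loss operates on torch tensors and may differ in detail.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchors: List[Vector], positives: List[Vector], temperature: float = 0.5) -> float:
    """Mean cross-entropy where row i's positive is column i; other columns are negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [_cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, anchors)           # each view matches its own anchor
misaligned = info_nce(anchors, anchors[::-1])  # views paired with the wrong anchor
print(aligned < misaligned)
```

A lower temperature sharpens the softmax, so hard negatives contribute more to the loss.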

ReConTab - Multi-task Reconstruction

Reconstruction-based contrastive learning for tabular data

```yaml
# configs/corruption/recontab.yaml
_target_: tabular_ssl.models.components.ReConTabCorruption
corruption_rate: 0.15
corruption_types: [masking, noise, swapping]
masking_strategy: random
```

Features:

- 🔧 Multiple corruption types: masking, noise injection, swapping
- 📊 Detailed corruption tracking for reconstruction targets
- 🎯 Multi-task learning with specialized heads
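One way to picture "multiple corruption types with tracking" is to pick a corruption type per selected cell and record which one was applied. The sketch below is an assumption about how such a scheme could look, not the repository's `ReConTabCorruption`. The function name, the integer type ids, `mask_value`, and `noise_std` are all hypothetical.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical integer ids used to record what happened to each cell.
CORRUPTION_IDS = {"none": 0, "masking": 1, "noise": 2, "swapping": 3}


def multi_type_corruption(
    x: List[List[float]],
    corruption_rate: float,
    corruption_types: Sequence[str] = ("masking", "noise", "swapping"),
    noise_std: float = 0.1,
    mask_value: float = 0.0,
    seed: int = 0,
) -> Dict[str, list]:
    """Corrupt a fraction of cells, choosing one corruption type per cell."""
    rng = random.Random(seed)
    n_rows = len(x)
    corrupted = [row[:] for row in x]
    applied = [[CORRUPTION_IDS["none"]] * len(row) for row in x]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            if rng.random() >= corruption_rate:
                continue  # leave this cell untouched
            kind = rng.choice(corruption_types)
            if kind == "masking":
                corrupted[i][j] = mask_value
            elif kind == "noise":
                corrupted[i][j] = value + rng.gauss(0.0, noise_std)
            elif kind == "swapping":
                # Take the same feature from a randomly chosen row.
                corrupted[i][j] = x[rng.randrange(n_rows)][j]
            else:
                raise ValueError(f"unknown corruption type: {kind}")
            applied[i][j] = CORRUPTION_IDS[kind]
    return {
        "corrupted": corrupted,
        "targets": [row[:] for row in x],
        "corruption_type": applied,
    }


x = [[float(i + 1), float(10 * (i + 1))] for i in range(8)]
out = multi_type_corruption(x, corruption_rate=0.5)
print(out["corruption_type"])
```

The per-cell type map is what would let separate heads be trained for each kind of corruption.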

📊 Sample Data

IBM TabFormer Credit Card Dataset

```python
from tabular_ssl.data.sample_data import load_credit_card_sample

# Download and load real transaction data
data, info = load_credit_card_sample()
print(f"Loaded {len(data)} transactions")
print(f"Features: {info['feature_names']}")
```

Synthetic Transaction Generator

```python
from tabular_ssl.data.sample_data import generate_sequential_transactions

# Generate synthetic data for experimentation
data = generate_sequential_transactions(
    num_users=1000,
    transactions_per_user=50,
    num_features=10,
)
```

🔧 Advanced Usage

Custom Component Creation

```python
from typing import Dict

import torch

from tabular_ssl.models.components import BaseCorruption


class CustomCorruption(BaseCorruption):
    def forward(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Your corruption logic goes here; it must define corrupted_x and corruption_mask
        return {
            'corrupted': corrupted_x,
            'targets': x,
            'mask': corruption_mask,
        }
```

Configuration Override Examples

```bash
# Hyperparameter sweep
python train.py -m model=ssl_vime \
    corruption.corruption_rate=0.1,0.3,0.5 \
    model.learning_rate=1e-4,5e-4,1e-3

# Architecture ablation (quoted so the shell does not treat the brackets as a glob)
python train.py model=ssl_vime \
    'event_encoder.hidden_dims=[64,128]' \
    sequence_encoder.num_layers=2

# Custom experiment
python train.py model=ssl_vime \
    event_encoder=autoencoder \
    sequence_encoder=s4 \
    corruption.corruption_rate=0.4
```
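With `-m` (multirun), Hydra's default sweeper launches one job per combination of the comma-separated values, so the hyperparameter sweep above expands to 3 × 3 = 9 runs. The short sketch below enumerates that grid with the standard library. `expand_grid` is an illustrative helper, not part of Hydra or this repository, and the key names mirror the sweep command.

```python
from itertools import product
from typing import Any, Dict, List

# Comma-separated override values from the sweep command, as lists.
sweep = {
    "corruption.corruption_rate": [0.1, 0.3, 0.5],
    "model.learning_rate": [1e-4, 5e-4, 1e-3],
}


def expand_grid(sweep: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one override dict per point of the Cartesian product."""
    keys = list(sweep)
    return [dict(zip(keys, values)) for values in product(*(sweep[k] for k in keys))]


jobs = expand_grid(sweep)
print(len(jobs))  # 9
for job in jobs[:3]:
    print(job)
```

Each added swept parameter multiplies the job count, which is worth checking before launching a large comparison.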

Modular Composition

```python
# Mix and match components
model = SSLModel(
    event_encoder=AutoEncoderEventEncoder(...),
    sequence_encoder=S4SequenceEncoder(...),
    projection_head=MLPProjectionHead(...),
    corruption=ReConTabCorruption(...),
)
```

📚 Project Structure

```
tabular-ssl/
├── 📁 configs/                 # Hydra configurations
│   ├── 🎭 corruption/          # Corruption strategies
│   ├── 📦 event_encoder/       # Event encoders
│   ├── 🔗 sequence_encoder/    # Sequence encoders
│   ├── 📐 projection_head/     # Projection heads
│   ├── 🎯 prediction_head/     # Prediction heads
│   ├── 🔤 embedding/           # Embedding layers
│   ├── 🤖 model/               # Complete models
│   └── 🧪 experiment/          # Experiment configs
├── 📁 src/tabular_ssl/
│   ├── 📊 data/                # Data loading & sample data
│   ├── 🧠 models/              # Model implementations
│   │   ├── base.py             # Base classes & models
│   │   └── components.py       # All components
│   └── 🛠️ utils/               # Utilities
├── 🎬 demo_*.py                # Interactive demos
├── 📖 docs/                    # Documentation
└── ✅ tests/                   # Unit tests
```

🎯 Design Principles

✅ Consistent Interfaces: All components follow same patterns
✅ Simplified Architecture: Clean abstractions without complexity
✅ Maintained Functionality: All original features preserved
✅ Modular & Extensible: Easy to add new components
✅ Intuitive Configuration: Logical, well-organized configs
✅ Fast Experimentation: Easy component swapping

📈 Benchmarks & Results

The framework includes implementations of methods from leading papers:

| Method | Paper | Key Innovation |
|--------|-------|----------------|
| VIME | NeurIPS 2020 | Dual pretext tasks for tabular SSL |
| SCARF | arXiv 2021 | Contrastive learning with feature corruption |
| ReConTab | Custom | Multi-task reconstruction learning |

Quick Comparison

```bash
# Run systematic comparison
python train.py experiment=compare_corruptions

# Results logged to W&B automatically
```

🚀 Getting Started Guide

1. Explore Demos

```bash
python demo_corruption_strategies.py  # Understand corruption methods
python demo_credit_card_data.py       # See real data in action
python demo_final_design.py           # Complete design overview
```

2. Train Your First Model

```bash
python train.py model=ssl_vime  # Start with VIME
```

3. Experiment with Components

```bash
python train.py model=ssl_vime sequence_encoder=rnn
python train.py model=ssl_scarf event_encoder=autoencoder
```

4. Create Custom Configurations

```bash
# Copy and modify existing configs
cp configs/corruption/vime.yaml configs/corruption/my_corruption.yaml

# Edit my_corruption.yaml
python train.py model=ssl_vime corruption=my_corruption
```

🤝 Contributing

We welcome contributions! The modular design makes it easy to:

- Add new corruption strategies following the BaseCorruption interface
- Implement new encoders extending base classes
- Create experiment configurations in configs/experiment/
- Add new sample datasets in src/tabular_ssl/data/

Development Setup

```bash
git clone https://github.com/chris-santiago/tabular-ssl.git
cd tabular-ssl
pip install -r requirements.txt
pip install -e .
python -m pytest tests/
```

📖 Documentation

🎯 Design Summary: Complete design overview
📚 API Reference: Detailed API documentation
🧪 Experiments Guide: How to create experiments
🔧 Custom Components: Adding new components

📝 Citation

```bibtex
@software{tabular_ssl,
  title={Tabular SSL: A Unified Framework for Self-Supervised Learning on Tabular Data},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/tabular-ssl}
}
```

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

VIME: Yoon et al., NeurIPS 2020
SCARF: Bahri et al., arXiv 2021
S4: Gu et al., ICLR 2022
IBM TabFormer: Padhi et al., arXiv 2021
PyTorch Lightning: Falcon et al.
Hydra: Facebook Research

🎉 Ready for fast, iterative tabular SSL experimentation!

Owner

Name: Chris Santiago
Login: chris-santiago
Kind: user

Repositories: 64
Profile: https://github.com/chris-santiago

GitHub Events

Total

Issues event: 1
Push event: 34
Create event: 3

Last Year

Issues event: 1
Push event: 34
Create event: 3

Committers

Last synced: about 1 year ago

All Time

Total Commits: 51
Total Committers: 1
Avg Commits per committer: 51.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 51
Committers: 1
Avg Commits per committer: 51.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
chris-santiago	c**o@g**u	51

Committer Domains (Top 20 + Academic)

gatech.edu: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 2
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

chris-santiago (2)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies

docs/requirements.txt pypi

mkdocs-material >=9.0.0
mkdocstrings >=0.24.0
pymdown-extensions >=10.0.0

pyproject.toml pypi

hydra-core >=1.3.2
torch >=2.7.0

requirements.txt pypi

antlr4-python3-runtime ==4.9.3
babel ==2.17.0
backrefs ==5.8
certifi ==2025.4.26
charset-normalizer ==3.4.2
click ==8.2.1
colorama ==0.4.6
filelock ==3.18.0
fsspec ==2025.5.0
ghp-import ==2.1.0
griffe ==1.7.3
hydra-core ==1.3.2
idna ==3.10
iniconfig ==2.1.0
jinja2 ==3.1.6
markdown ==3.8
markupsafe ==3.0.2
mergedeep ==1.3.4
mkdocs ==1.6.1
mkdocs-autorefs ==1.4.2
mkdocs-get-deps ==0.2.0
mkdocs-material ==9.6.14
mkdocs-material-extensions ==1.3.1
mkdocstrings ==0.29.1
mkdocstrings-python ==1.16.10
mpmath ==1.3.0
networkx ==3.4.2
nvidia-cublas-cu12 ==12.6.4.1
nvidia-cuda-cupti-cu12 ==12.6.80
nvidia-cuda-nvrtc-cu12 ==12.6.77
nvidia-cuda-runtime-cu12 ==12.6.77
nvidia-cudnn-cu12 ==9.5.1.17
nvidia-cufft-cu12 ==11.3.0.4
nvidia-cufile-cu12 ==1.11.1.6
nvidia-curand-cu12 ==10.3.7.77
nvidia-cusolver-cu12 ==11.7.1.2
nvidia-cusparse-cu12 ==12.5.4.2
nvidia-cusparselt-cu12 ==0.6.3
nvidia-nccl-cu12 ==2.26.2
nvidia-nvjitlink-cu12 ==12.6.85
nvidia-nvtx-cu12 ==12.6.77
omegaconf ==2.3.0
packaging ==25.0
paginate ==0.5.7
pathspec ==0.12.1
platformdirs ==4.3.8
pluggy ==1.6.0
pygments ==2.19.1
pymdown-extensions ==10.15
pytest ==8.3.5
python-dateutil ==2.9.0.post0
pyyaml ==6.0.2
pyyaml-env-tag ==1.1
requests ==2.32.3
ruff ==0.11.11
setuptools ==80.8.0
six ==1.17.0
sympy ==1.14.0
torch ==2.7.0
triton ==3.3.0
typing-extensions ==4.13.2
urllib3 ==2.4.0
watchdog ==6.0.0

uv.lock pypi

antlr4-python3-runtime 4.9.3
babel 2.17.0
backrefs 5.8
certifi 2025.4.26
charset-normalizer 3.4.2
click 8.2.1
colorama 0.4.6
filelock 3.18.0
fsspec 2025.5.0
ghp-import 2.1.0
griffe 1.7.3
hydra-core 1.3.2
idna 3.10
iniconfig 2.1.0
jinja2 3.1.6
markdown 3.8
markupsafe 3.0.2
mergedeep 1.3.4
mkdocs 1.6.1
mkdocs-autorefs 1.4.2
mkdocs-get-deps 0.2.0
mkdocs-material 9.6.14
mkdocs-material-extensions 1.3.1
mkdocstrings 0.29.1
mkdocstrings-python 1.16.10
mpmath 1.3.0
networkx 3.4.2
nvidia-cublas-cu12 12.6.4.1
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime-cu12 12.6.77
nvidia-cudnn-cu12 9.5.1.17
nvidia-cufft-cu12 11.3.0.4
nvidia-cufile-cu12 1.11.1.6
nvidia-curand-cu12 10.3.7.77
nvidia-cusolver-cu12 11.7.1.2
nvidia-cusparse-cu12 12.5.4.2
nvidia-cusparselt-cu12 0.6.3
nvidia-nccl-cu12 2.26.2
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvtx-cu12 12.6.77
omegaconf 2.3.0
packaging 25.0
paginate 0.5.7
pathspec 0.12.1
platformdirs 4.3.8
pluggy 1.6.0
pygments 2.19.1
pymdown-extensions 10.15
pytest 8.3.5
python-dateutil 2.9.0.post0
pyyaml 6.0.2
pyyaml-env-tag 1.1
requests 2.32.3
ruff 0.11.11
setuptools 80.8.0
six 1.17.0
sympy 1.14.0
tabular-ssl 0.1.0
torch 2.7.0
triton 3.3.0
typing-extensions 4.13.2
urllib3 2.4.0
watchdog 6.0.0