https://github.com/chris-santiago/tabular-ssl

Using Cursor to create a project via AI Chat

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 2
  • Releases: 0
Created 11 months ago · Last pushed 11 months ago
Metadata Files
Readme

README.md

This was generated using Cursor IDE w/agent chat.

Notes:

- Default models (or "auto") are pretty terrible. Code quality is inconsistent; it can feel like you're working with a struggling student. At first things look good, but the code is riddled with errors.
- Test generation with default models is even worse. Most tests are not functional: they import classes and functions that don't exist in the actual source code. It quickly devolves into an endless loop of error fixes.
- Using the latest "thinking" or reasoning models is a much better experience. Claude-4-sonnet cleaned up much of the mess from the "auto" models.
- Models use excessive emoji.

🎯 Tabular SSL: Self-Supervised Learning for Tabular Data

A unified, modular framework for self-supervised learning on tabular data with state-of-the-art corruption strategies, consistent interfaces, and fast experimentation.

Python 3.8+ PyTorch Lightning Hydra

Key Features

🔧 Consistent Design: All components follow unified interfaces
🧩 Modular Architecture: Easy to compose and extend
Fast Experimentation: Swap components with simple config changes
🎭 State-of-the-Art SSL: VIME, SCARF, ReConTab implementations
🏦 Ready-to-Use Data: IBM TabFormer credit card dataset included
📱 Interactive Demos: Understand corruption strategies visually


🚀 Quick Start

Installation

```bash
git clone https://github.com/yourusername/tabular-ssl.git
cd tabular-ssl
pip install -r requirements.txt
pip install -e .
export PYTHONPATH=$PWD/src
```

Interactive Demos

```bash
# 🎭 Explore corruption strategies
python demo_corruption_strategies.py

# 🏦 Real credit card transaction data
python demo_credit_card_data.py

# 🔧 Final unified design demo
python demo_final_design.py
```

Train Models

```bash
# 🎯 VIME: Value imputation + mask estimation
python train.py model=ssl_vime

# 🌟 SCARF: Contrastive learning with feature corruption
python train.py model=ssl_scarf

# 🔧 ReConTab: Multi-task reconstruction learning
python train.py model=ssl_recontab

# 🤖 Simple MLP classifier
python train.py model=base_mlp
```


🧪 Easy Experimentation

Component Swapping

```bash
# Switch to RNN backbone
python train.py model=ssl_vime sequence_encoder=rnn

# Use autoencoder event encoder
python train.py model=ssl_scarf event_encoder=autoencoder

# Remove sequence modeling
python train.py model=ssl_vime sequence_encoder=null

# Custom corruption rate
python train.py model=ssl_vime corruption.corruption_rate=0.5
```
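To make the dotted-key overrides above more concrete, here is a minimal sketch of how a `key.subkey=value` string can be applied to a nested config. It is not Hydra's implementation: `parse_value`, `apply_override`, and the starting `cfg` are hypothetical names for illustration, and the sketch treats every override as a plain value, whereas in Hydra an override such as `sequence_encoder=rnn` selects a whole config file from a config group.

```python
from typing import Any, Dict


def parse_value(text: str) -> Any:
    """Turn an override value into None/int/float when possible, else keep the string."""
    if text == "null":
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_override(cfg: Dict[str, Any], override: str) -> None:
    """Apply one 'dotted.key=value' override to a nested dict in place."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = parse_value(raw_value)


# Hypothetical starting config; the keys mirror the commands above.
cfg = {"corruption": {"corruption_rate": 0.3}, "sequence_encoder": "transformer"}
apply_override(cfg, "corruption.corruption_rate=0.5")
apply_override(cfg, "sequence_encoder=null")
print(cfg)
```

The point of the sketch is only that each override touches one leaf of the config tree, which is why a single command-line flag is enough to change one hyperparameter.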

Systematic Comparison

```bash
# Compare all SSL strategies
python train.py -m model=ssl_vime,ssl_scarf,ssl_recontab

# Quick iteration setup
python train.py experiment=quick_vime_ssl

# Full comparison experiment
python train.py experiment=compare_corruptions
```

Architecture Variants

```bash
# Transformer + VIME SSL
python train.py model=ssl_vime sequence_encoder=transformer

# S4 + ReConTab SSL
python train.py model=ssl_recontab sequence_encoder=s4

# RNN + SCARF SSL
python train.py model=ssl_scarf sequence_encoder=rnn sequence_encoder.rnn_type=gru
```


🏗️ Architecture Overview

Unified Component Hierarchy

```
BaseComponent (Abstract)
├── EventEncoder
│   ├── MLPEventEncoder
│   ├── AutoEncoderEventEncoder
│   └── ContrastiveEventEncoder
├── SequenceEncoder
│   ├── TransformerSequenceEncoder
│   ├── RNNSequenceEncoder
│   └── S4SequenceEncoder
├── ProjectionHead
│   └── MLPProjectionHead
├── PredictionHead
│   └── ClassificationHead
├── EmbeddingLayer
│   └── CategoricalEmbedding
└── BaseCorruption
    ├── VIMECorruption (NeurIPS 2020)
    ├── SCARFCorruption (arXiv 2021)
    └── ReConTabCorruption
```

Model Composition

```python
# Flexible model composition
SSLModel(
    event_encoder=MLPEventEncoder(...),
    sequence_encoder=TransformerSequenceEncoder(...),
    corruption=VIMECorruption(...),  # ← Type auto-detected
)
```

Consistent Interfaces

All components follow the same patterns:

- Corruption strategies return `Dict[str, torch.Tensor]` with `'corrupted'`, `'targets'`, `'mask'`, `'metadata'`
- All components expose `input_dim` and `output_dim` properties
- Auto-detection eliminates configuration errors
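As a rough illustration of the corruption output contract described above, the following sketch checks a returned dictionary for the four keys and for aligned lengths. The helper `check_corruption_output` is hypothetical and not part of the package, and plain Python lists stand in for tensors so the example runs without PyTorch.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("corrupted", "targets", "mask", "metadata")


def check_corruption_output(out: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the dict matches the contract."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in out]
    if not problems and len(out["corrupted"]) != len(out["targets"]):
        problems.append("corrupted and targets differ in length")
    if not problems and len(out["mask"]) != len(out["corrupted"]):
        problems.append("mask does not align with corrupted")
    return problems


# Lists stand in for tensors here (an assumption made for a dependency-free example).
good = {"corrupted": [0.0, 2.0], "targets": [1.0, 2.0], "mask": [1, 0], "metadata": {}}
bad = {"corrupted": [0.0, 2.0], "targets": [1.0, 2.0]}
print(check_corruption_output(good))
print(check_corruption_output(bad))
```

A check like this could sit in a unit test for any new corruption strategy, so that a missing key is caught before training starts.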


📁 Configuration Structure

Standardized Layout

```
configs/
├── corruption/          # 🎭 VIME, SCARF, ReConTab, etc.
├── event_encoder/       # 📦 MLP, Autoencoder, Contrastive
├── sequence_encoder/    # 🔗 Transformer, RNN, S4, null
├── projection_head/     # 📐 MLP, null
├── prediction_head/     # 🎯 Classification, null
├── embedding/           # 🔤 Categorical, null
├── model/               # 🤖 Complete model configs
└── experiment/          # 🧪 Experiment templates
```

Consistent Pattern

All component configs follow:

```yaml
# Component Description
_target_: tabular_ssl.models.components.ComponentClass
param1: value1
param2: value2
```
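For readers new to this config pattern, the sketch below shows the general idea behind a `_target_` key: a dotted import path names the class, and the remaining keys become constructor arguments. In practice Hydra's own `hydra.utils.instantiate` does this job; the `instantiate` function here is a simplified stand-in written for illustration, and a standard-library class replaces a real component so the example runs anywhere.

```python
import importlib
from typing import Any, Dict


def instantiate(config: Dict[str, Any]) -> Any:
    """Build an object from a dict with a dotted `_target_` path plus keyword arguments."""
    kwargs = {key: value for key, value in config.items() if key != "_target_"}
    module_path, _, class_name = config["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# A standard-library class stands in for a real component (assumption for illustration).
config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 10}
rate = instantiate(config)
print(rate)
```

Because the class is chosen by a string, swapping a component only requires pointing the config at a different file with a different `_target_`.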


🎭 State-of-the-Art Corruption Strategies

VIME - Value Imputation and Mask Estimation

From "VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain" (NeurIPS 2020)

```yaml
# configs/corruption/vime.yaml
_target_: tabular_ssl.models.components.VIMECorruption
corruption_rate: 0.3
categorical_indices: []
numerical_indices: [0, 1, 2, 3]
```

Features:

- 🎯 Dual pretext tasks: mask estimation + value imputation
- 🔢 Handles categorical and numerical features differently
- 📊 Returns both corrupted data and mask for training
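The VIME-style corruption step can be sketched in a few lines. This is a minimal illustration of the idea from the paper (mask cells at random, refill them from the same column's shuffled values, then label only the cells that actually changed), not the code in `VIMECorruption`. The function name `vime_corrupt` is hypothetical, lists stand in for tensors, and the sketch ignores the categorical/numerical distinction.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def vime_corrupt(x: Matrix, corruption_rate: float, seed: int = 0) -> Dict[str, Matrix]:
    """Mask cells at random and refill them from the same column of other rows."""
    rng = random.Random(seed)
    n_rows, n_cols = len(x), len(x[0])
    # Shuffle each column independently to sample from its empirical marginal.
    shuffled_cols = []
    for j in range(n_cols):
        col = [row[j] for row in x]
        rng.shuffle(col)
        shuffled_cols.append(col)
    corrupted, mask = [], []
    for i in range(n_rows):
        new_row, mask_row = [], []
        for j in range(n_cols):
            value = shuffled_cols[j][i] if rng.random() < corruption_rate else x[i][j]
            new_row.append(value)
            # Label only the cells whose value actually changed.
            mask_row.append(1.0 if value != x[i][j] else 0.0)
        corrupted.append(new_row)
        mask.append(mask_row)
    return {"corrupted": corrupted, "targets": x, "mask": mask}


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
out = vime_corrupt(batch, corruption_rate=0.3)
print(out["corrupted"])
print(out["mask"])
```

The two returned pieces map onto the two pretext tasks: the model predicts `mask` from the corrupted input (mask estimation) and reconstructs `targets` (value imputation).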

SCARF - Contrastive Learning with Feature Corruption

From "SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption" (arXiv 2021)

```yaml
# configs/corruption/scarf.yaml
_target_: tabular_ssl.models.components.SCARFCorruption
corruption_rate: 0.6
corruption_strategy: random_swap
```

Features:

- 🌟 High corruption rate (60%) for effective contrastive learning
- 🔄 Feature swapping between samples in batch
- 🌡️ Temperature-scaled InfoNCE loss
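The temperature-scaled InfoNCE loss mentioned above can be written out as a short sketch. It assumes cosine similarity between the embedding of each clean row and the embedding of its corrupted view, with the other rows in the batch acting as negatives; the function names and the default temperature are illustrative choices, not values taken from this repository, and lists stand in for tensors.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def info_nce(anchors: List[Vector], positives: List[Vector], temperature: float = 0.5) -> float:
    """Mean cross-entropy where row i's correct 'class' is positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, pos) / temperature for pos in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


clean = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each view stays close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each view is closer to the other anchor
print(info_nce(clean, aligned) < info_nce(clean, swapped))
```

The loss is low when each corrupted view stays closest to its own clean row, which is the behaviour the encoder is trained toward.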

ReConTab - Multi-task Reconstruction

Reconstruction-based contrastive learning for tabular data

```yaml
# configs/corruption/recontab.yaml
_target_: tabular_ssl.models.components.ReConTabCorruption
corruption_rate: 0.15
corruption_types: [masking, noise, swapping]
masking_strategy: random
```

Features:

- 🔧 Multiple corruption types: masking, noise injection, swapping
- 📊 Detailed corruption tracking for reconstruction targets
- 🎯 Multi-task learning with specialized heads
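One way to picture "multiple corruption types with tracking" is the sketch below. It is an assumption-laden illustration, not the `ReConTabCorruption` implementation: the function `multi_corrupt`, the zero mask value, the Gaussian noise scale, and the per-cell choice of corruption type are all hypothetical, and lists stand in for tensors.

```python
import random
from typing import Dict, List

Row = List[float]


def multi_corrupt(
    rows: List[Row],
    corruption_rate: float,
    corruption_types: List[str],
    noise_std: float = 0.1,
    seed: int = 0,
) -> Dict[str, list]:
    """Corrupt cells with a randomly chosen type and record which type hit each cell."""
    rng = random.Random(seed)
    corrupted = [list(row) for row in rows]
    applied = [["none"] * len(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if rng.random() >= corruption_rate:
                continue
            kind = rng.choice(corruption_types)
            if kind == "masking":
                corrupted[i][j] = 0.0  # assumed mask value
            elif kind == "noise":
                corrupted[i][j] = value + rng.gauss(0.0, noise_std)
            elif kind == "swapping":
                corrupted[i][j] = rows[rng.randrange(len(rows))][j]
            else:
                raise ValueError(f"unknown corruption type: {kind}")
            applied[i][j] = kind
    mask = [[0.0 if kind == "none" else 1.0 for kind in row] for row in applied]
    return {"corrupted": corrupted, "targets": rows, "mask": mask, "metadata": applied}


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
out = multi_corrupt(batch, 0.5, ["masking", "noise", "swapping"])
print(out["metadata"])
```

Keeping the per-cell record in `metadata` is what would let specialized heads be trained on only the cells affected by their corruption type.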


📊 Sample Data

IBM TabFormer Credit Card Dataset

```python
from tabular_ssl.data.sample_data import load_credit_card_sample

# Download and load real transaction data
data, info = load_credit_card_sample()
print(f"Loaded {len(data)} transactions")
print(f"Features: {info['feature_names']}")
```

Synthetic Transaction Generator

```python
from tabular_ssl.data.sample_data import generate_sequential_transactions

# Generate synthetic data for experimentation
data = generate_sequential_transactions(
    num_users=1000,
    transactions_per_user=50,
    num_features=10,
)
```


🔧 Advanced Usage

Custom Component Creation

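```python
from typing import Dict

import torch

from tabular_ssl.models.components import BaseCorruption


class CustomCorruption(BaseCorruption):
    def forward(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
        # Your corruption logic goes here. As a placeholder, zero out
        # roughly 30% of the entries (illustrative only).
        corruption_mask = torch.rand_like(x) < 0.3
        corrupted_x = x.masked_fill(corruption_mask, 0.0)
        return {
            'corrupted': corrupted_x,
            'targets': x,
            'mask': corruption_mask,
        }
```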

Configuration Override Examples

```bash
# Hyperparameter sweep
python train.py -m model=ssl_vime \
    corruption.corruption_rate=0.1,0.3,0.5 \
    model.learning_rate=1e-4,5e-4,1e-3

# Architecture ablation
python train.py model=ssl_vime \
    'event_encoder.hidden_dims=[64,128]' \
    sequence_encoder.num_layers=2

# Custom experiment
python train.py model=ssl_vime \
    event_encoder=autoencoder \
    sequence_encoder=s4 \
    corruption.corruption_rate=0.4
```
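The hyperparameter sweep at the top of the block above launches one job per combination of the comma-separated values. A minimal sketch of that expansion follows; `expand_sweep` is a hypothetical helper rather than Hydra's sweeper, and it splits naively on commas, so it would not handle list values such as `[64,128]`.

```python
from itertools import product
from typing import Dict, List


def expand_sweep(overrides: List[str]) -> List[Dict[str, str]]:
    """Expand comma-separated override values into one dict per job (Cartesian product)."""
    keys, choices = [], []
    for override in overrides:
        key, values = override.split("=", 1)
        keys.append(key)
        choices.append(values.split(","))
    return [dict(zip(keys, combo)) for combo in product(*choices)]


jobs = expand_sweep([
    "model=ssl_vime",
    "corruption.corruption_rate=0.1,0.3,0.5",
    "model.learning_rate=1e-4,5e-4,1e-3",
])
print(len(jobs))
print(jobs[0])
```

Three corruption rates times three learning rates gives nine jobs, so sweep size grows multiplicatively with every extra swept parameter.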

Modular Composition

```python
# Mix and match components
model = SSLModel(
    event_encoder=AutoEncoderEventEncoder(...),
    sequence_encoder=S4SequenceEncoder(...),
    projection_head=MLPProjectionHead(...),
    corruption=ReConTabCorruption(...),
)
```


📚 Project Structure

```
tabular-ssl/
├── 📁 configs/                 # Hydra configurations
│   ├── 🎭 corruption/          # Corruption strategies
│   ├── 📦 event_encoder/       # Event encoders
│   ├── 🔗 sequence_encoder/    # Sequence encoders
│   ├── 📐 projection_head/     # Projection heads
│   ├── 🎯 prediction_head/     # Prediction heads
│   ├── 🔤 embedding/           # Embedding layers
│   ├── 🤖 model/               # Complete models
│   └── 🧪 experiment/          # Experiment configs
├── 📁 src/tabular_ssl/
│   ├── 📊 data/                # Data loading & sample data
│   ├── 🧠 models/              # Model implementations
│   │   ├── base.py             # Base classes & models
│   │   └── components.py       # All components
│   └── 🛠️ utils/               # Utilities
├── 🎬 demo_*.py                # Interactive demos
├── 📖 docs/                    # Documentation
└── ✅ tests/                   # Unit tests
```


🎯 Design Principles

Consistent Interfaces: All components follow same patterns
Simplified Architecture: Clean abstractions without complexity
Maintained Functionality: All original features preserved
Modular & Extensible: Easy to add new components
Intuitive Configuration: Logical, well-organized configs
Fast Experimentation: Easy component swapping


📈 Benchmarks & Results

The framework includes implementations of methods from leading papers:

| Method | Paper | Key Innovation |
|--------|-------|----------------|
| VIME | NeurIPS 2020 | Dual pretext tasks for tabular SSL |
| SCARF | arXiv 2021 | Contrastive learning with feature corruption |
| ReConTab | Custom | Multi-task reconstruction learning |

Quick Comparison

```bash
# Run systematic comparison
python train.py experiment=compare_corruptions

# Results logged to W&B automatically
```


🚀 Getting Started Guide

1. Explore Demos

```bash
python demo_corruption_strategies.py  # Understand corruption methods
python demo_credit_card_data.py       # See real data in action
python demo_final_design.py           # Complete design overview
```

2. Train Your First Model

```bash
python train.py model=ssl_vime  # Start with VIME
```

3. Experiment with Components

```bash
python train.py model=ssl_vime sequence_encoder=rnn
python train.py model=ssl_scarf event_encoder=autoencoder
```

4. Create Custom Configurations

```bash
# Copy and modify existing configs
cp configs/corruption/vime.yaml configs/corruption/my_corruption.yaml

# Edit my_corruption.yaml
python train.py model=ssl_vime corruption=my_corruption
```


🤝 Contributing

We welcome contributions! The modular design makes it easy to:

  • Add new corruption strategies following BaseCorruption interface
  • Implement new encoders extending base classes
  • Create experiment configurations in configs/experiment/
  • Add new sample datasets in src/tabular_ssl/data/

Development Setup

```bash
git clone https://github.com/yourusername/tabular-ssl.git
cd tabular-ssl
pip install -r requirements.txt
pip install -e .
python -m pytest tests/
```


📖 Documentation


📝 Citation

```bibtex
@software{tabular_ssl,
  title={Tabular SSL: A Unified Framework for Self-Supervised Learning on Tabular Data},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/tabular-ssl}
}
```


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments


🎉 Ready for fast, iterative tabular SSL experimentation!

Owner

  • Name: Chris Santiago
  • Login: chris-santiago
  • Kind: user

GitHub Events

Total
  • Issues event: 1
  • Push event: 34
  • Create event: 3
Last Year
  • Issues event: 1
  • Push event: 34
  • Create event: 3

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 51
  • Total Committers: 1
  • Avg Commits per committer: 51.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 51
  • Committers: 1
  • Avg Commits per committer: 51.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
chris-santiago c****o@g****u 51
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • chris-santiago (2)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

docs/requirements.txt pypi
  • mkdocs-material >=9.0.0
  • mkdocstrings >=0.24.0
  • pymdown-extensions >=10.0.0
pyproject.toml pypi
  • hydra-core >=1.3.2
  • torch >=2.7.0
requirements.txt pypi
  • antlr4-python3-runtime ==4.9.3
  • babel ==2.17.0
  • backrefs ==5.8
  • certifi ==2025.4.26
  • charset-normalizer ==3.4.2
  • click ==8.2.1
  • colorama ==0.4.6
  • filelock ==3.18.0
  • fsspec ==2025.5.0
  • ghp-import ==2.1.0
  • griffe ==1.7.3
  • hydra-core ==1.3.2
  • idna ==3.10
  • iniconfig ==2.1.0
  • jinja2 ==3.1.6
  • markdown ==3.8
  • markupsafe ==3.0.2
  • mergedeep ==1.3.4
  • mkdocs ==1.6.1
  • mkdocs-autorefs ==1.4.2
  • mkdocs-get-deps ==0.2.0
  • mkdocs-material ==9.6.14
  • mkdocs-material-extensions ==1.3.1
  • mkdocstrings ==0.29.1
  • mkdocstrings-python ==1.16.10
  • mpmath ==1.3.0
  • networkx ==3.4.2
  • nvidia-cublas-cu12 ==12.6.4.1
  • nvidia-cuda-cupti-cu12 ==12.6.80
  • nvidia-cuda-nvrtc-cu12 ==12.6.77
  • nvidia-cuda-runtime-cu12 ==12.6.77
  • nvidia-cudnn-cu12 ==9.5.1.17
  • nvidia-cufft-cu12 ==11.3.0.4
  • nvidia-cufile-cu12 ==1.11.1.6
  • nvidia-curand-cu12 ==10.3.7.77
  • nvidia-cusolver-cu12 ==11.7.1.2
  • nvidia-cusparse-cu12 ==12.5.4.2
  • nvidia-cusparselt-cu12 ==0.6.3
  • nvidia-nccl-cu12 ==2.26.2
  • nvidia-nvjitlink-cu12 ==12.6.85
  • nvidia-nvtx-cu12 ==12.6.77
  • omegaconf ==2.3.0
  • packaging ==25.0
  • paginate ==0.5.7
  • pathspec ==0.12.1
  • platformdirs ==4.3.8
  • pluggy ==1.6.0
  • pygments ==2.19.1
  • pymdown-extensions ==10.15
  • pytest ==8.3.5
  • python-dateutil ==2.9.0.post0
  • pyyaml ==6.0.2
  • pyyaml-env-tag ==1.1
  • requests ==2.32.3
  • ruff ==0.11.11
  • setuptools ==80.8.0
  • six ==1.17.0
  • sympy ==1.14.0
  • torch ==2.7.0
  • triton ==3.3.0
  • typing-extensions ==4.13.2
  • urllib3 ==2.4.0
  • watchdog ==6.0.0
uv.lock pypi
  • antlr4-python3-runtime 4.9.3
  • babel 2.17.0
  • backrefs 5.8
  • certifi 2025.4.26
  • charset-normalizer 3.4.2
  • click 8.2.1
  • colorama 0.4.6
  • filelock 3.18.0
  • fsspec 2025.5.0
  • ghp-import 2.1.0
  • griffe 1.7.3
  • hydra-core 1.3.2
  • idna 3.10
  • iniconfig 2.1.0
  • jinja2 3.1.6
  • markdown 3.8
  • markupsafe 3.0.2
  • mergedeep 1.3.4
  • mkdocs 1.6.1
  • mkdocs-autorefs 1.4.2
  • mkdocs-get-deps 0.2.0
  • mkdocs-material 9.6.14
  • mkdocs-material-extensions 1.3.1
  • mkdocstrings 0.29.1
  • mkdocstrings-python 1.16.10
  • mpmath 1.3.0
  • networkx 3.4.2
  • nvidia-cublas-cu12 12.6.4.1
  • nvidia-cuda-cupti-cu12 12.6.80
  • nvidia-cuda-nvrtc-cu12 12.6.77
  • nvidia-cuda-runtime-cu12 12.6.77
  • nvidia-cudnn-cu12 9.5.1.17
  • nvidia-cufft-cu12 11.3.0.4
  • nvidia-cufile-cu12 1.11.1.6
  • nvidia-curand-cu12 10.3.7.77
  • nvidia-cusolver-cu12 11.7.1.2
  • nvidia-cusparse-cu12 12.5.4.2
  • nvidia-cusparselt-cu12 0.6.3
  • nvidia-nccl-cu12 2.26.2
  • nvidia-nvjitlink-cu12 12.6.85
  • nvidia-nvtx-cu12 12.6.77
  • omegaconf 2.3.0
  • packaging 25.0
  • paginate 0.5.7
  • pathspec 0.12.1
  • platformdirs 4.3.8
  • pluggy 1.6.0
  • pygments 2.19.1
  • pymdown-extensions 10.15
  • pytest 8.3.5
  • python-dateutil 2.9.0.post0
  • pyyaml 6.0.2
  • pyyaml-env-tag 1.1
  • requests 2.32.3
  • ruff 0.11.11
  • setuptools 80.8.0
  • six 1.17.0
  • sympy 1.14.0
  • tabular-ssl 0.1.0
  • torch 2.7.0
  • triton 3.3.0
  • typing-extensions 4.13.2
  • urllib3 2.4.0
  • watchdog 6.0.0