feature-hedging-paper

Code for the paper "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"

https://github.com/chanind/feature-hedging-paper

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: chanind
License: mit
Language: Python
Default Branch: main
Homepage: https://arxiv.org/abs/2505.11756
Size: 293 KB

Statistics

Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme · License · Citation

Feature Hedging Paper

Code for the paper "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders" (https://arxiv.org/abs/2505.11756).

Repo structure

This repo contains the experiments run in the paper in /experiments, with toy model experiments in the /notebooks dir. You probably don't want to run our experiments verbatim, as it's expensive to train so many SAEs, but if you do, the experiments are there for reference with all the hyperparameters we used. Each experiment requires an output_path, where trained SAEs and metrics will be saved, and a shared_path, which is a folder that should be the same for every experiment that gets run. The shared_path is where common eval-specific data is cached, so it does not need to be recalculated for every new SAE trained on a given LLM.
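The split between output_path and shared_path can be illustrated with a small sketch. This is not the repo's code: `load_or_compute`, `run_experiment`, `build_eval_data`, and the JSON file names are hypothetical names chosen for illustration. It only shows the general pattern of one shared cache folder plus one output folder per run.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(shared_path: Path, key: str, compute: Callable[[], dict]) -> dict:
    """Return cached data for `key` from shared_path, computing it only if missing."""
    cache_file = shared_path / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    data = compute()
    shared_path.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data))
    return data


def run_experiment(
    output_path: Path, shared_path: Path, name: str, compute: Callable[[], dict]
) -> dict:
    """Hypothetical experiment: reuse shared eval data, write metrics per run."""
    eval_data = load_or_compute(shared_path, "eval_data", compute)
    output_path.mkdir(parents=True, exist_ok=True)
    metrics = {"experiment": name, "num_eval_examples": len(eval_data["examples"])}
    (output_path / "metrics.json").write_text(json.dumps(metrics))
    return metrics


compute_calls = []  # records how often the expensive step actually runs


def build_eval_data() -> dict:
    """Stand-in for an expensive, model-specific eval data computation."""
    compute_calls.append(1)
    return {"examples": ["a", "b", "c"]}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    shared = root / "shared"  # same folder for every experiment
    m1 = run_experiment(root / "run_1", shared, "run_1", build_eval_data)
    m2 = run_experiment(root / "run_2", shared, "run_2", build_eval_data)
```

Both runs write their own metrics file, but the stand-in eval data is built only once because the second run finds it in the shared folder.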

Potentially more useful are the Matryoshka SAE implementations in the hedging_paper/saes dir and the evaluations in the hedging_paper/evals dir. For running your own toy model experiments, see the examples in the /notebooks dir.
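For readers unfamiliar with the idea, the following is a minimal, dependency-free sketch of a Matryoshka-style reconstruction loss: the same latents are decoded using several nested prefixes, and the reconstruction errors are summed. It is an illustration under simplifying assumptions (single input vector, ReLU encoder, no sparsity penalty), and it is not the implementation in hedging_paper/saes. All function names and the choice of prefix sizes are hypothetical.

```python
def relu(values):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def encode(x, w_enc, b_enc):
    """Encode x into latents; w_enc has one row of input weights per latent."""
    pre_acts = [
        sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)
    ]
    return relu(pre_acts)


def decode_prefix(latents, w_dec, b_dec, m):
    """Reconstruct the input using only the first m latents."""
    recon = list(b_dec)
    for act, row in zip(latents[:m], w_dec[:m]):
        for j, w in enumerate(row):
            recon[j] += act * w
    return recon


def matryoshka_loss(x, w_enc, b_enc, w_dec, b_dec, prefix_sizes):
    """Sum of squared reconstruction errors over nested latent prefixes."""
    latents = encode(x, w_enc, b_enc)
    total = 0.0
    for m in prefix_sizes:
        recon = decode_prefix(latents, w_dec, b_dec, m)
        total += sum((r - xi) ** 2 for r, xi in zip(recon, x))
    return total


# Toy example: 2 input dims, 2 latents, identity encoder and decoder.
x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
loss = matryoshka_loss(x, identity, zeros, identity, zeros, prefix_sizes=[1, 2])
```

In this toy case the full dictionary reconstructs the input exactly, so the entire loss comes from the one-latent prefix, which cannot represent the second input dimension.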

Setup

This project uses Poetry for dependency management. To install the dependencies, run:

    poetry install

Tests

To run the tests, run:

    poetry run pytest

Linting / Formatting

This project uses Ruff for linting and formatting. To set this up with VSCode, install the Ruff plugin and add the following to .vscode/settings.json:

    {
      "[python]": {
        "editor.formatOnSave": true,
        "editor.codeActionsOnSave": {
          "source.fixAll": "explicit",
          "source.organizeImports": "explicit"
        },
        "editor.defaultFormatter": "charliermarsh.ruff"
      },
      "notebook.formatOnSave.enabled": true
    }

Pre-commit hook

There's a pre-commit hook that will run ruff and pyright on each commit. To install it, run:

    poetry run pre-commit install

Poetry tips

Below are some helpful tips for working with Poetry:

Install a new main dependency: poetry add <package>
Install a new development dependency: poetry add --dev <package>
- Development dependencies are not required for the main code to run, but are needed for things like linting and type-checking.
Update the lockfile: poetry lock
Run a command using the virtual environment: poetry run <command>
Run a Python file from the CLI as a script (module-style): poetry run python -m hedging_paper.path.to.file

Citation

    @article{chanin2025hedging,
      title={Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders},
      author={David Chanin and Tomáš Dulka and Adrià Garriga-Alonso},
      year={2025},
      journal={arXiv preprint arXiv:2505.11756}
    }

Owner

Name: David Chanin
Login: chanind
Kind: user
Location: London, UK
Company: UCL

Website: https://chanind.github.io
Repositories: 97
Profile: https://github.com/chanind

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this research, please cite it as below."
authors:
  - family-names: "Chanin"
    given-names: "David"
  - family-names: "Dulka"
    given-names: "Tomáš"
  - family-names: "Garriga-Alonso"
    given-names: "Adrià"
title: "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"
type: article
year: 2025
identifiers:
  - type: arxiv
    value: "2505.11756"
    description: arXiv preprint
url: "https://arxiv.org/abs/2505.11756"
keywords:
  - "sparse autoencoders"
  - "SAEs"
  - "interpretability"
  - "NLP"
preferred-citation:
  type: article
  authors:
    - family-names: "Chanin"
      given-names: "David"
    - family-names: "Dulka"
      given-names: "Tomáš"
    - family-names: "Garriga-Alonso"
      given-names: "Adrià"
  title: "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"
  year: 2025
  url: "https://arxiv.org/abs/2505.11756"
  identifiers:
    - type: arxiv
      value: "2505.11756"
  archive:
    - name: "arXiv"
      location: "https://arxiv.org/"
  collection-type: "proceedings"

GitHub Events

Total

Watch event: 1
Push event: 4
Public event: 1

Last Year

Watch event: 1
Push event: 4
Public event: 1

Committers

Last synced: 11 months ago

All Time

Total Commits: 13
Total Committers: 1
Avg Commits per committer: 13.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 13
Committers: 1
Avg Commits per committer: 13.0
Development Distribution Score (DDS): 0.0
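
The DDS values above are consistent with the commonly used definition of the score: one minus the share of commits made by the most active committer. The sketch below assumes that definition; the function name is hypothetical and this is not the indexing site's own code.

```python
def development_distribution_score(commits_by_committer: dict[str, int]) -> float:
    """DDS = 1 - (commits by top committer / total commits), assumed definition."""
    total = sum(commits_by_committer.values())
    if total == 0:
        return 0.0  # no commits: treat the score as 0 by convention in this sketch
    return 1.0 - max(commits_by_committer.values()) / total


# A single committer with all 13 commits gives a score of 0.0.
single = development_distribution_score({"David Chanin": 13})

# Hypothetical two-committer project for comparison.
shared = development_distribution_score({"alice": 3, "bob": 1})
```

A score of 0.0 therefore means that all commits come from one person, and the score rises as commits are spread across more committers.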

Top Committers

Name	Email	Commits
David Chanin	c**v@g**m	13

Issues and Pull Requests

Last synced: 11 months ago

Dependencies

.github/workflows/ci.yaml actions

actions/checkout v4 composite
actions/setup-python v5 composite
snok/install-poetry v1 composite

poetry.lock pypi

220 dependencies

pyproject.toml pypi
