feature-hedging-paper

Code for the paper "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"

https://github.com/chanind/feature-hedging-paper

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.1%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

Feature Hedging Paper

Code for the paper Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders.

Repo structure

This repo contains the experiments run in the paper in /experiments, with toy model experiments in the /notebooks dir. You likely don't want to run our experiments verbatim, as it's expensive to train so many SAEs, but if you do, the experiments with all the hyperparams we used are there for reference. Each of these experiments requires an output_path, where trained SAEs and metrics will be saved, and a shared_path, which is just a folder that should be the same for every experiment that gets run. The shared_path is where common eval-specific data is cached, so it does not need to be recalculated for every new SAE trained on a given LLM.
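As a rough illustration of how output_path and shared_path relate, here is a minimal sketch. It is not the code used in /experiments: the helper names (experiment_paths, load_or_compute), the directory layout, and the JSON cache format are all assumptions made for this example. It only shows the idea that each experiment gets its own output directory, while all experiments point at one shared directory whose cached data is computed once per LLM.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Callable


def experiment_paths(root: Path, experiment_name: str) -> tuple[Path, Path]:
    """Return (output_path, shared_path) for one experiment.

    Hypothetical layout: a separate output directory per experiment,
    and a single shared directory reused by every experiment.
    """
    output_path = root / "outputs" / experiment_name
    shared_path = root / "shared"
    output_path.mkdir(parents=True, exist_ok=True)
    shared_path.mkdir(parents=True, exist_ok=True)
    return output_path, shared_path


def load_or_compute(
    shared_path: Path, llm_name: str, compute: Callable[[], dict]
) -> dict:
    """Reuse cached eval data for an LLM if it exists, else compute and cache it."""
    # Replace "/" so names like "org/model" map to a single file name.
    safe_name = llm_name.replace("/", "__")
    cache_file = shared_path / f"{safe_name}_eval_data.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    data = compute()
    cache_file.write_text(json.dumps(data))
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_a, shared = experiment_paths(Path(tmp), "experiment_a")
        out_b, _ = experiment_paths(Path(tmp), "experiment_b")
        # The second call finds the cache file and skips the computation.
        load_or_compute(shared, "org/model", lambda: {"placeholder": 1})
        load_or_compute(shared, "org/model", lambda: {"placeholder": 1})
        print(out_a.name, out_b.name, shared.name)
```

Because the cache is keyed only by the LLM name in this sketch, pointing different experiments at different shared folders would simply repeat the expensive computation, which is why the folder should be the same across runs.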

Potentially more useful are the matryoshka SAE implementations in the hedging_paper/saes dir and the evaluations in the hedging_paper/evals dir. For running your own toy model experiments, see the examples in the /notebooks dir.

Setup

This project uses Poetry for dependency management. To install the dependencies, run:

    poetry install

Tests

To run the tests, run:

    poetry run pytest

Linting / Formatting

This project uses Ruff for linting and formatting. To set this up with VSCode, install the Ruff extension and add the following to .vscode/settings.json:

    {
      "[python]": {
        "editor.formatOnSave": true,
        "editor.codeActionsOnSave": {
          "source.fixAll": "explicit",
          "source.organizeImports": "explicit"
        },
        "editor.defaultFormatter": "charliermarsh.ruff"
      },
      "notebook.formatOnSave.enabled": true
    }

Pre-commit hook

There's a pre-commit hook that will run ruff and pyright on each commit. To install it, run:

    poetry run pre-commit install

Poetry tips

Below are some helpful tips for working with Poetry:

  • Install a new main dependency: poetry add <package>
  • Install a new development dependency: poetry add --dev <package> (on Poetry 1.2 and later, --dev is deprecated in favor of poetry add --group dev <package>)
    • Development dependencies are not required for the main code to run, but are needed for things like linting and type-checking.
  • Update the lockfile: poetry lock
  • Run a command using the virtual environment: poetry run <command>
  • Run a Python file from the CLI as a script (module-style): poetry run python -m hedging_paper.path.to.file

Citation

    @article{chanin2025hedging,
      title={Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders},
      author={David Chanin and Tomáš Dulka and Adrià Garriga-Alonso},
      year={2025},
      journal={arXiv preprint arXiv:2505.11756}
    }

Owner

  • Name: David Chanin
  • Login: chanind
  • Kind: user
  • Location: London, UK
  • Company: UCL

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this research, please cite it as below."
authors:
  - family-names: "Chanin"
    given-names: "David"
  - family-names: "Dulka"
    given-names: "Tomáš"
  - family-names: "Garriga-Alonso"
    given-names: "Adrià"
title: "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"
type: article
year: 2025
identifiers:
  - type: arxiv
    value: "2505.11756"
    description: arXiv preprint
url: "https://arxiv.org/abs/2505.11756"
keywords:
  - "sparse autoencoders"
  - "SAEs"
  - "interpretability"
  - "NLP"
preferred-citation:
  type: article
  authors:
    - family-names: "Chanin"
      given-names: "David"
    - family-names: "Dulka"
      given-names: "Tomáš"
    - family-names: "Garriga-Alonso"
      given-names: "Adrià"
  title: "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"
  year: 2025
  url: "https://arxiv.org/abs/2505.11756"
  identifiers:
    - type: arxiv
      value: "2505.11756"
  archive:
    - name: "arXiv"
      location: "https://arxiv.org/"
  collection-type: "proceedings"

GitHub Events

Total
  • Watch event: 1
  • Push event: 4
  • Public event: 1
Last Year
  • Watch event: 1
  • Push event: 4
  • Public event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 13
  • Total Committers: 1
  • Avg Commits per committer: 13.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 13
  • Committers: 1
  • Avg Commits per committer: 13.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| David Chanin | c****v@g****m | 13 |


Dependencies

.github/workflows/ci.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • snok/install-poetry v1 composite
poetry.lock pypi
  • 220 dependencies
pyproject.toml pypi