feature-hedging-paper
Code for the paper "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"
Repository
Basic Info
- Host: GitHub
- Owner: chanind
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2505.11756
- Size: 293 KB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Feature Hedging Paper
Code for the paper Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders.
Repo structure
This repo contains the experiments run in the paper in /experiments, with toy model experiments in the /notebooks dir. You likely don't want to run our experiments verbatim, as it's expensive to train so many SAEs, but if you do, the experiments with all the hyperparameters we used are there for reference. Each of these experiments requires an output_path, where trained SAEs and metrics will be saved, and a shared_path, which is a folder that should be the same for every experiment that gets run. The shared_path is where common eval-specific data is cached, so it does not need to be recalculated for every new SAE trained on a given LLM.
Potentially more useful are the matryoshka SAE implementations in the hedging_paper/saes dir and the evaluations in the hedging_paper/evals dir. For running your own toy model experiments, see the examples in the /notebooks dir.
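To make the output_path / shared_path split above concrete, here is a minimal sketch of a compute-once cache keyed by LLM name. It is an illustration only, not the code used in this repo: the helper load_or_compute, the JSON file layout, and the names demo-llm and eval_stats are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_compute(
    shared_path: str,
    model_name: str,
    key: str,
    compute: Callable[[], Any],
) -> Any:
    """Return cached data for (model_name, key), computing it only on a miss.

    Hypothetical helper: the real experiments may organise their cache differently.
    """
    cache_file = Path(shared_path) / model_name / f"{key}.json"
    if cache_file.exists():
        # Cache hit: reuse data computed by an earlier experiment on the same LLM.
        return json.loads(cache_file.read_text())
    value = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(value))
    return value


# Demo: two "experiments" on the same LLM share one cache directory.
compute_calls = 0


def expensive_eval_data() -> dict:
    """Stand-in for eval data that is costly to compute (placeholder values)."""
    global compute_calls
    compute_calls += 1
    return {"placeholder_stat": 1.0}


with tempfile.TemporaryDirectory() as shared_path:
    first = load_or_compute(shared_path, "demo-llm", "eval_stats", expensive_eval_data)
    second = load_or_compute(shared_path, "demo-llm", "eval_stats", expensive_eval_data)
```

In this sketch the second call reads the cached file instead of recomputing, which is why pointing every experiment at the same shared_path saves work, while output_path can stay unique per run.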
Setup
This project uses Poetry for dependency management. To install the dependencies, run:
bash
poetry install
Tests
To run the tests, run:
bash
poetry run pytest
Linting / Formatting
This project uses Ruff for linting and formatting. To set this up with VSCode, install the Ruff plugin and add the following to .vscode/settings.json:
json
{
  "[python]": {
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.fixAll": "explicit",
      "source.organizeImports": "explicit"
    },
    "editor.defaultFormatter": "charliermarsh.ruff"
  },
  "notebook.formatOnSave.enabled": true
}
Pre-commit hook
There's a pre-commit hook that will run ruff and pyright on each commit. To install it, run:
bash
poetry run pre-commit install
Poetry tips
Below are some helpful tips for working with Poetry:
- Install a new main dependency: poetry add <package>
- Install a new development dependency: poetry add --dev <package>
  - Development dependencies are not required for the main code to run, but are needed for things like linting and type-checking.
- Update the lockfile: poetry lock
- Run a command using the virtual environment: poetry run <command>
- Run a Python file from the CLI as a script (module-style): poetry run python -m hedging_paper.path.to.file
Citation
@article{chanin2025hedging,
title={Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders},
author={David Chanin and Tomáš Dulka and Adrià Garriga-Alonso},
year={2025},
journal={arXiv preprint arXiv:2505.11756}
}
Owner
- Name: David Chanin
- Login: chanind
- Kind: user
- Location: London, UK
- Company: UCL
- Website: https://chanind.github.io
- Repositories: 97
- Profile: https://github.com/chanind
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this research, please cite it as below."
authors:
- family-names: "Chanin"
given-names: "David"
- family-names: "Dulka"
given-names: "Tomáš"
- family-names: "Garriga-Alonso"
given-names: "Adrià"
title: "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"
type: article
year: 2025
identifiers:
- type: arxiv
value: "2505.11756"
description: arXiv preprint
url: "https://arxiv.org/abs/2505.11756"
keywords:
- "sparse autoencoders"
- "SAEs"
- "interpretability"
- "NLP"
preferred-citation:
type: article
authors:
- family-names: "Chanin"
given-names: "David"
- family-names: "Dulka"
given-names: "Tomáš"
- family-names: "Garriga-Alonso"
given-names: "Adrià"
title: "Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders"
year: 2025
url: "https://arxiv.org/abs/2505.11756"
identifiers:
- type: arxiv
value: "2505.11756"
archive:
- name: "arXiv"
location: "https://arxiv.org/"
collection-type: "proceedings"
GitHub Events
Total
- Watch event: 1
- Push event: 4
- Public event: 1
Last Year
- Watch event: 1
- Push event: 4
- Public event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| David Chanin | c****v@g****m | 13 |
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- snok/install-poetry v1 composite
- 220 dependencies