sae-spelling
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: lasr-spelling
- License: MIT
- Language: Python
- Default Branch: main
- Size: 535 KB
Statistics
- Stars: 7
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SAE Spelling
Code for the paper A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders.
Installation
This project uses Poetry for dependency management. To install dependencies, run:
poetry install
Project structure
This project is set up so that code which could be reused in other projects lives in the main sae_spelling package, while code specific to the experiments in the paper lives in sae_spelling.experiments. In the future, we may move some of these utilities to their own library. The sae_spelling package is structured as follows:
- sae_spelling.feature_attribution: Code for running SAE feature attribution experiments. Attribution tries to estimate the effect a latent has on the model output. Main exports include calculate_feature_attribution() and calculate_integrated_gradient_attribution_patching().
- sae_spelling.feature_ablation: Code for running SAE feature ablation experiments. This involves ablating each firing SAE latent on a prompt to see how it affects a downstream metric (e.g. whether the model knows the first letter of a token). The main function in this module is calculate_individual_feature_ablations().
- sae_spelling.probing: Code for training logistic-regression probes in Torch. Helpful exports include train_multi_probe(), which trains a multi-class binary probe, and train_binary_probe(), which trains a binary probe (the same as the multi-class probe, but with only one class).
- sae_spelling.prompting: Code for generating ICL prompts, mainly focused on spelling. Helpful exports include create_icl_prompt(), spelling_formatter() (a formatter which outputs the spelling of a token), and first_letter_formatter() (a formatter which outputs the first letter of a token). See the sketch after this list for example usage.
- sae_spelling.vocab: Helpers for working with token vocabularies. A helpful export is get_alpha_tokens(), which filters the tokenizer vocab down to alphabetic tokens.
- sae_spelling.sae_utils: Helpers for working with SAEs. The main export is apply_saes_and_run(), which applies SAEs to a model and runs it on a prompt. It allows providing a list of hooks and optionally tracking activation gradients, and is used in the attribution and ablation experiments.
- sae_spelling.spelling_grader: Code for grading spelling prompts. The main export is SpellingGrader, a class for grading model performance on spelling prompts.
- sae_spelling.feature_absorption_calculator: Code for calculating feature absorption. The main export is FeatureAbsorptionCalculator.
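For orientation, here is a minimal sketch of how the prompting and vocab helpers might fit together. The argument names (examples, answer_formatter) and exact call signatures are assumptions based on the descriptions above, not confirmed APIs, so check the modules themselves before relying on this:

from transformers import AutoTokenizer

from sae_spelling.prompting import create_icl_prompt, first_letter_formatter
from sae_spelling.vocab import get_alpha_tokens

# NOTE: argument names below are assumptions based on the module
# descriptions above, not confirmed signatures.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Filter the tokenizer vocab down to alphabetic tokens to use as ICL examples.
alpha_tokens = get_alpha_tokens(tokenizer)

# Build an ICL prompt asking for the first letter of "cat".
prompt = create_icl_prompt(
    "cat",
    examples=alpha_tokens[:5],
    answer_formatter=first_letter_formatter(),
)
print(prompt)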
Experiments
We include the following experiments from the paper in the sae_spelling.experiments package:
- sae_spelling.experiments.latent_evaluation: This experiment finds the top SAE latent for each first-letter spelling task and evaluates the latent's performance relative to an LR probe.
- sae_spelling.experiments.k_sparse_probing: This experiment trains k-sparse probes on the first-letter task and evaluates their performance with increasing values of k. This is used to detect feature splitting.
- sae_spelling.experiments.feature_absorption: This experiment attempts to quantify feature absorption on the first-letter task across SAEs.
Each of these experiments includes a main "runner" function. The runners only create dataframes and save them to disk; they don't generate plots. The experiment packages include helpers for generating the plots in the paper, but those helpers require TeX to be installed, so plots aren't generated by default. A sketch of the typical workflow follows.
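As a rough sketch of that workflow (the runner function name, its output_dir parameter, and the parquet output format below are hypothetical illustrations, not the package's confirmed API):

from pathlib import Path

import pandas as pd

from sae_spelling.experiments import k_sparse_probing

# HYPOTHETICAL: the runner's name, arguments, and output format below are
# placeholders; check the experiment module for the actual entry point.
k_sparse_probing.run_k_sparse_probing_experiment(output_dir="results")

# The runners save dataframes to disk rather than plotting; loading them
# back for analysis might look like this.
for path in Path("results").glob("*.parquet"):
    df = pd.read_parquet(path)
    print(path.name, df.shape)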
Development
This project uses Ruff for linting and formatting, Pyright for type checking, and Pytest for testing.
To run all checks, run:
make check-ci
In VSCode, you can install the Ruff extension to get linting and formatting automatically in the editor. It's recommended to enable formatting on save.
You can install a pre-commit hook to run linting and type-checking before commit automatically with:
poetry run pre-commit install
Poetry tips
Below are some helpful tips for working with Poetry:
- Install a new main dependency: poetry add <package>
- Install a new development dependency: poetry add --dev <package>. Development dependencies are not required for the main code to run, but are needed for things like linting and type-checking.
- Update the lockfile: poetry lock
- Run a command using the virtual environment: poetry run <command>
- Run a Python file from the CLI as a script (module-style): poetry run python -m sae_spelling.path.to.file
Citation
Please cite this work as follows:
@misc{chanin2024absorption,
title={A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders},
author={David Chanin and James Wilken-Smith and Tomáš Dulka and Hardik Bhatnagar and Joseph Bloom},
year={2024},
eprint={2409.14507},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.14507},
}
Owner
- Name: lasr-spelling
- Login: lasr-spelling
- Kind: organization
- Repositories: 1
- Profile: https://github.com/lasr-spelling
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Chanin"
given-names: "David"
- family-names: "Wilken-Smith"
given-names: "James"
- family-names: "Dulka"
given-names: "Tomáš"
- family-names: "Bhatnagar"
given-names: "Hardik"
- family-names: "Bloom"
given-names: "Joseph"
doi: 10.48550/arXiv.2409.14507
title: "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders"
date-released: 2024-09-22
url: "https://arxiv.org/abs/2409.14507"
GitHub Events
Total
- Watch event: 2
- Delete event: 1
- Push event: 2
- Pull request event: 1
- Fork event: 2
- Create event: 2
Last Year
- Watch event: 2
- Delete event: 1
- Push event: 2
- Pull request event: 1
- Fork event: 2
- Create event: 2
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- snok/install-poetry v1 composite
- 180 dependencies
- ipykernel ^6.29.5 develop
- nnsight ^0.2.21 develop
- pre-commit ^3.7.1 develop
- pyright ^1.1.370 develop
- pytest ^8.2.2 develop
- ruff ^0.5.1 develop
- syrupy ^4.6.1 develop
- matplotlib ^3.9.2
- python ^3.10
- sae-lens ^3.16.0
- seaborn ^0.13.2
- torch 2.2.2 (platform != linux)
- transformer-lens ^2.2.2
- tueplots ^0.0.15