sae-spelling
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: lasr-spelling
- License: MIT
- Language: Python
- Default Branch: main
- Size: 535 KB
Statistics
- Stars: 7
- Watchers: 0
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SAE Spelling
Code for the paper A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders.
Installation
This project uses Poetry for dependency management. To install dependencies, run:
poetry install
Project structure
This project is set up so that code which could be reused in other projects lives in the main sae_spelling package, while code specific to the experiments in the paper lives in sae_spelling.experiments. In the future, we may move some of these utilities to their own library. The sae_spelling package is structured as follows:
- sae_spelling.feature_attribution: Code for running SAE feature attribution experiments. Attribution tries to estimate the effect a latent has on the model output. Main exports include calculate_feature_attribution() and calculate_integrated_gradient_attribution_patching().
- sae_spelling.feature_ablation: Code for running SAE feature ablation experiments. This involves ablating each firing SAE latent on a prompt to see how it affects a downstream metric (e.g. whether the model knows the first letter of a token). The main function in this module is calculate_individual_feature_ablations().
- sae_spelling.probing: Code for training logistic-regression probes in Torch. Helpful exports include train_multi_probe(), which trains a multi-class binary probe, and train_binary_probe(), which trains a binary probe (the same as the multi-class probe, but with only one class).
- sae_spelling.prompting: Code for generating ICL prompts, mainly focused on spelling. Helpful exports include create_icl_prompt(), spelling_formatter() (a formatter which outputs the spelling of a token), and first_letter_formatter() (a formatter which outputs the first letter of a token). See the sketch after this list for example usage.
- sae_spelling.vocab: Helpers for working with token vocabularies. A helpful export is get_alpha_tokens(), which filters the tokenizer vocab down to alphabetic tokens.
- sae_spelling.sae_utils: Helpers for working with SAEs. The main export is apply_saes_and_run(), which applies SAEs to a model and runs it on a prompt. It allows providing a list of hooks and optionally tracking activation gradients, and is used in the attribution and ablation experiments.
- sae_spelling.spelling_grader: Code for grading spelling prompts. The main export is SpellingGrader, a class for grading model performance on spelling prompts.
- sae_spelling.feature_absorption_calculator: Code for calculating feature absorption. The main export is FeatureAbsorptionCalculator.
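For orientation, here is a minimal sketch of how the prompting and vocab helpers might fit together. The argument names (examples, answer_formatter) and exact call signatures are assumptions based on the descriptions above, not confirmed APIs, so check the modules themselves before relying on this:

from transformers import AutoTokenizer

from sae_spelling.prompting import create_icl_prompt, first_letter_formatter
from sae_spelling.vocab import get_alpha_tokens

# NOTE: argument names below are assumptions based on the module
# descriptions above, not confirmed signatures.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Filter the tokenizer vocab down to alphabetic tokens to use as ICL examples.
alpha_tokens = get_alpha_tokens(tokenizer)

# Build an ICL prompt asking for the first letter of "cat".
prompt = create_icl_prompt(
    "cat",
    examples=alpha_tokens[:5],
    answer_formatter=first_letter_formatter(),
)
print(prompt)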
Experiments
We include the following experiments from the paper in the sae_spelling.experiments package:
- sae_spelling.experiments.latent_evaluation: This experiment finds the top SAE latent for each first-letter spelling task and evaluates the latent's performance relative to an LR probe.
- sae_spelling.experiments.k_sparse_probing: This experiment trains k-sparse probes on the first-letter task and evaluates their performance with increasing values of k. This is used to detect feature splitting.
- sae_spelling.experiments.feature_absorption: This experiment attempts to quantify feature absorption on the first-letter task across SAEs.
Each of these experiments includes a main "runner" function. The runners only create dataframes and save them to disk; they don't generate plots. The experiment packages include helpers for generating the plots in the paper, but those helpers require TeX to be installed, so plots aren't generated by default. A sketch of the typical workflow follows.
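As a rough sketch of that workflow (the runner function name, its output_dir parameter, and the parquet output format below are hypothetical illustrations, not the package's confirmed API):

from pathlib import Path

import pandas as pd

from sae_spelling.experiments import k_sparse_probing

# HYPOTHETICAL: the runner's name, arguments, and output format below are
# placeholders; check the experiment module for the actual entry point.
k_sparse_probing.run_k_sparse_probing_experiment(output_dir="results")

# The runners save dataframes to disk rather than plotting; loading them
# back for analysis might look like this.
for path in Path("results").glob("*.parquet"):
    df = pd.read_parquet(path)
    print(path.name, df.shape)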
Development
This project uses Ruff for linting and formatting, Pyright for type checking, and Pytest for testing.
To run all checks, run:
make check-ci
In VSCode, you can install the Ruff extension to get linting and formatting automatically in the editor. It's recommended to enable formatting on save.
You can install a pre-commit hook to run linting and type-checking before commit automatically with:
poetry run pre-commit install
Poetry tips
Below are some helpful tips for working with Poetry:
- Install a new main dependency: poetry add <package>
- Install a new development dependency: poetry add --dev <package>. Development dependencies are not required for the main code to run, but are needed for things like linting and type-checking.
- Update the lockfile: poetry lock
- Run a command using the virtual environment: poetry run <command>
- Run a Python file from the CLI as a script (module-style): poetry run python -m sae_spelling.path.to.file
Citation
Please cite this work as follows:
@misc{chanin2024absorption,
title={A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders},
author={David Chanin and James Wilken-Smith and Tomáš Dulka and Hardik Bhatnagar and Joseph Bloom},
year={2024},
eprint={2409.14507},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.14507},
}
Owner
- Name: lasr-spelling
- Login: lasr-spelling
- Kind: organization
- Repositories: 1
- Profile: https://github.com/lasr-spelling
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Chanin"
given-names: "David"
- family-names: "Wilken-Smith"
given-names: "James"
- family-names: "Dulka"
given-names: "Tomáš"
- family-names: "Bhatnagar"
given-names: "Hardik"
- family-names: "Bloom"
given-names: "Joseph"
doi: 10.48550/arXiv.2409.14507
title: "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders"
date-released: 2024-09-22
url: "https://arxiv.org/abs/2409.14507"
GitHub Events
Total
- Watch event: 2
- Delete event: 1
- Push event: 2
- Pull request event: 1
- Fork event: 2
- Create event: 2
Last Year
- Watch event: 2
- Delete event: 1
- Push event: 2
- Pull request event: 1
- Fork event: 2
- Create event: 2
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- snok/install-poetry v1 composite
- 180 dependencies
- ipykernel ^6.29.5 develop
- nnsight ^0.2.21 develop
- pre-commit ^3.7.1 develop
- pyright ^1.1.370 develop
- pytest ^8.2.2 develop
- ruff ^0.5.1 develop
- syrupy ^4.6.1 develop
- matplotlib ^3.9.2
- python ^3.10
- sae-lens ^3.16.0
- seaborn ^0.13.2
- torch 2.2.2 (platform != linux)
- transformer-lens ^2.2.2
- tueplots ^0.0.15