https://github.com/brucewlee/activation-steering
General-purpose activation steering library
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.7%) to scientific vocabulary
Last synced: 10 months ago
·
JSON representation
Repository
General-purpose activation steering library
Basic Info
- Host: GitHub
- Owner: brucewlee
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2409.05907
- Size: 1.35 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of IBM/activation-steering
Created 11 months ago
· Last pushed 10 months ago
https://github.com/brucewlee/activation-steering/blob/main/

# Activation Steering
(Aug-2025) Added `pca_pairwise` method and set as default. Use `method="pca_pairwise"` to reproduce results closer to those reported in the paper.
(Jul-2025) Bug fixed: PCA_centering (@Reason239)
(Apr-2025) Conditional Activation Steering is a spotlight paper at ICLR 2025!
(Nov-2024) A few Colab demos are added.
(Sep-2024) Preprint released on [arXiv](https://arxiv.org/abs/2409.05907).
## Overview
This is a general-purpose activation steering library to (1) extract vectors and (2) steer model behavior. We release this library alongside our recent paper on [*Programming Refusal with Conditional Activation Steering*](https://arxiv.org/abs/2409.05907) to provide an intuitive toolchain for activation steering efforts.
## Installation
```bash
git clone https://github.com/IBM/activation-steering
pip install -e activation-steering
```
## Activation Steering
Activation steering is a technique for influencing the behavior of language models by modifying their internal activations during inference. This library provides tools for:
- Extracting steering vectors from contrastive examples
- Applying steering vectors to modify model behavior
This part is conceptually similar to [*Steering Language Models With Activation Engineering*](https://arxiv.org/abs/2308.10248), but our code implementation could be different.
## Conditional Activation Steering
Conditional activation steering selectively applies or withholds activation steering based on the input context. Conditional activation steering extends the activation steering framework by introducing:
- Context-dependent control capabilities through condition vectors
- Logical composition of multiple condition vectors
Refer to our [*paper*](https://arxiv.org/abs/2409.05907) and [*documentation*](docs/quickstart.md) for detailed implementation and usage of activation steering and conditional activation steering.
## Documentation
Refer to /docs to understand this library. We recommend starting with Quick Start Tutorial as it covers most concepts that you need to get started with activation steering and conditional activation steering.
- Quick Start Tutorial (10 minutes ~ 60 minutes, depending on your hardware) [here!](docs/quickstart.md)
- FAQ [here!](docs/faq.md)
## Colab Demos
- Adding Refusal Behavior to LLaMA 3.1 8B Inst [here!](https://colab.research.google.com/drive/1IpAPMFHZW6CNrE0L16TXSvIApAK9jAFZ?usp=sharing)
- Adding CoT Behavior to Gemma 2 9B [here!](https://colab.research.google.com/drive/1dnG000syxHwOt-Z9_bpRLnBbfugI_CBh?usp=sharing)
- Making Hermes 2 Pro Conditionally Refuse Legal Instructions [here!](https://colab.research.google.com/drive/18lOzaFOK4CB_mYe9jlQbJCdHBDlhGxcQ?usp=sharing)
## Acknowledgement
This library builds on top of the excellent work done in the following repositories:
- [vgel/repeng](https://github.com/vgel/repeng)
- [andyzoujm/representation-engineering](https://github.com/andyzoujm/representation-engineering)
- [nrimsky/CAA](https://github.com/nrimsky/CAA)
Some parts of the documentation for this library are generated by
- [ml-tooling/lazydocs](https://github.com/ml-tooling/lazydocs) > lazydocs activation_steering/ --no-watermark
## Citation
```
@misc{lee2024programmingrefusalconditionalactivation,
title={Programming Refusal with Conditional Activation Steering},
author={Bruce W. Lee and Inkit Padhi and Karthikeyan Natesan Ramamurthy and Erik Miehling and Pierre Dognin and Manish Nagireddy and Amit Dhurandhar},
year={2024},
eprint={2409.05907},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2409.05907},
}
```
Owner
- Name: Bruce W. Lee (이웅성)
- Login: brucewlee
- Kind: user
- Location: Philadelphia, PA
- Company: University of Pennsylvania
- Website: brucewlee.github.io
- Repositories: 3
- Profile: https://github.com/brucewlee
Research Scientist - NLP
GitHub Events
Total
- Push event: 3
Last Year
- Push event: 3