A dial for a language model's mind
A single direction added to a language model's activations turns a behavior up and down like a knob. Here's one running live in your browser, extracted on a single GPU. Drag the slider.
Same model, same prompt, same greedy decoding. The only thing moving is how much of one fixed direction gets dumped into the residual stream. Nothing is re-prompted; it's the exact same vector every time.
What you're looking at
A language model mid-thought is just a pile of vectors: every layer holds a big one at each token position (the "residual stream") that later layers read from and scribble back onto. The mildly unsettling result from the last couple of years of interpretability work is that a bunch of things you'd assume are complicated (a mood, a register, a persona, even whether the model decides to refuse you) are sitting in there as basically a single direction. Not a subnetwork, not a clever prompt: a direction, as if the concept has a mailing address.
Which means if you can find the direction, you can just add it, and that's the whole move. No fine-tuning, no prompting, no examples in context. You take a vector and shove it into the activations while the model talks, and it goes helplessly poetic, or turns into a bureaucrat, or gets aggressively cheerful about nothing in particular. The slider is doing that and nothing else.
How the direction gets found
Finding it is dumb enough to be a little insulting. Here's the entire recipe:
- Write ~6 short sentences that have the concept (gushing joyful ones) and ~6 matched ones that don't (miserable ones).
- Run them through the model and grab the residual-stream vector at a band of middle layers.
- The direction is just mean(concept) minus mean(not-concept), i.e. subtraction. That is the "training"; there is no training.
- To steer, add coef × direction back into those layers at every position while it generates.
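The recipe above can be sketched in a few lines. This is a minimal illustration, not the code behind the demo: plain Python lists stand in for residual-stream vectors, the function names (`concept_direction`, `steer`) are hypothetical, and it assumes each sentence has already been reduced to one vector per layer. The post doesn't say how token positions are pooled or whether the direction is normalized, so the sketch does neither.

```python
from statistics import fmean

def mean_vector(rows):
    """Average a list of equal-length vectors, component by component."""
    return [fmean(column) for column in zip(*rows)]

def concept_direction(concept_acts, baseline_acts):
    """Difference of means: mean(concept) minus mean(not-concept)."""
    pos = mean_vector(concept_acts)
    neg = mean_vector(baseline_acts)
    return [p - n for p, n in zip(pos, neg)]

def steer(hidden_states, direction, coef):
    """Add coef * direction to the vector at every token position."""
    return [
        [h + coef * d for h, d in zip(vec, direction)]
        for vec in hidden_states
    ]

# Toy 3-dimensional "activations" (made up for illustration):
# two concept sentences and two matched baseline sentences.
concept_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
baseline_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
direction = concept_direction(concept_acts, baseline_acts)

# Two token positions, each nudged along the same fixed direction.
steered = steer([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], direction, coef=0.5)
```

In a real model the same addition would run inside the forward pass at the chosen layers; here it's applied to static lists just to show the arithmetic.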
This is the contrastive-activation / representation-engineering thing (Zou et al., Turner et al.'s ActAdd, Arditi et al.'s "refusal is a single direction" work). All of it ran on one RTX 4090 with Qwen2.5-3B in bf16, and extracting a direction takes about a second because, again, it is subtraction.
There's a coherence budget
Crank the coefficient and the effect gets stronger right up until it doesn't, because past some point you've jammed so much into the residual stream that the later layers stop recognizing their own inputs and the model face-plants into repeating itself or coughing up random Chinese. So every concept has a window, and the only number that actually matters is how much behavior you can buy before the whole thing drives off the cliff. I measured both:
- Concept expression: take the steered text, run it back through the model, and project its activations onto the steering direction. Higher means the output really did move toward the concept and I'm not just imagining it.
- Coherence: a cheap proxy, namely distinct-bigram ratio times the fraction of characters that aren't degenerate CJK spam. 1.0 is clean prose and 0 is the model eating itself.
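Both measurements are small enough to sketch. The version below is one plausible reading of the descriptions above, with several assumptions that aren't in the post: bigrams are taken over whitespace-split words, any character in the CJK Unified Ideographs block counts as spam (cruder than singling out only the degenerate runs), and concept expression is the mean projection onto the unit-normalized direction. All function names are hypothetical.

```python
import math

def distinct_bigram_ratio(text):
    """Unique word bigrams divided by total word bigrams (0.0 if there are none)."""
    words = text.split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    return len(set(bigrams)) / len(bigrams)

def non_cjk_fraction(text):
    """Fraction of characters outside the CJK Unified Ideographs block."""
    if not text:
        return 0.0
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return 1.0 - cjk / len(text)

def coherence(text):
    """Cheap proxy: distinct-bigram ratio times the non-CJK character fraction."""
    return distinct_bigram_ratio(text) * non_cjk_fraction(text)

def concept_expression(activations, direction):
    """Mean projection of per-token activation vectors onto the unit direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    projections = [sum(a * u for a, u in zip(vec, unit)) for vec in activations]
    return sum(projections) / len(projections)

clean_score = coherence("the cat sat on the mat")  # no repeated bigrams, no CJK
loop_score = coherence("no no no no")              # one bigram repeated three times
```

Text stuck in a loop scores low because the same bigram keeps recurring, and a run of CJK characters pulls the second factor toward zero, so either failure mode drags the product down.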
Where it works, and where it doesn't
Broad stuff steers clean: poetic, formal, and cheerful all get a wide window and a smooth ramp. Narrow or tangled stuff does not. My sycophancy direction is a mess that keeps sliding into generic niceness, because "flattery" and "being in a good mood" are hopelessly braided together in a crude average. A "Golden Gate Bridge" direction, the famous Anthropic one, basically refuses to fire, because a specific thing like that is exactly where you're supposed to reach for a sparse autoencoder feature instead of a blunt difference of means. So that's the honest boundary: this gets you broad, well-separated axes for free and whiffs on the narrow ones.
Why this is more than a party trick
Two reasons this isn't just a toy. One, it's evidence about how these models hold things: behaviors you'd assume are complex and situational turn out to be roughly linear and additive, which is the entire bet mechanistic interpretability is running on. Two, it's control that never touches the weights and doesn't care what you prompt, which is charming right up until you notice that the same one-vector trick that dials up "poetic" is, with a different contrastive set, how you'd quietly delete a model's ability to say no. Working these directions out is how you'd catch that or block it. So it's a safety problem wearing a party trick's outfit, and pretending otherwise is how you get surprised later.
References
- A. Zou, L. Phan, S. Chen, et al. Representation Engineering: A Top-Down Approach to AI Transparency. 2023. arXiv:2310.01405
- A. M. Turner, L. Thiergart, G. Leech, et al. Activation Addition: Steering Language Models Without Optimization. 2023. arXiv:2308.10248
- N. Rimsky, N. Gabrieli, J. Schulz, et al. Steering Llama 2 via Contrastive Activation Addition. ACL, 2024. arXiv:2312.06681
- A. Arditi, O. Obeso, A. Syed, et al. Refusal in Language Models Is Mediated by a Single Direction. NeurIPS, 2024. arXiv:2406.11717
- A. Templeton, T. Conerly, J. Marcus, et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic, 2024. transformer-circuits.pub
- Qwen Team, Alibaba. Qwen2.5 Technical Report. 2024. arXiv:2412.15115