9.3 Interventional Methods¶
9.3.1 Activation Patching¶
Activation patching is a technique used to understand and pinpoint specific parts of a model responsible for certain behaviors or outputs. For example, when you ask an LLM to complete the sentence: “When John and Mary went to the store, John gave a bottle to”, how does the model know it should answer “Mary” instead of “John” or some other name? This completion requires the model to track which person is the recipient of the action, meaning it has to understand and store contextual roles (i.e., who is giving and who is receiving).
Researchers have used activation patching to discover which parts of the model are responsible for this kind of role assignment. Specifically, the goal is to identify the circuit (i.e., the group of neurons and connections) that helps the model decide that “Mary” is the correct answer (Wang et al., 2023).
This is extra detail provided for those interested. It can be safely skipped.
To illustrate the activation patching process, consider the figure below (Heimersheim & Nanda, 2024). Let’s say we want to understand how a model completes the sentence "When John and Mary went to the store, John gave a bottle to" and which components are involved in choosing "Mary" as the answer. Activation patching generally requires three stages:
- Clean Input: On the left, the model processes the sentence “When John and Mary went to the store, John gave a bottle to”, generating a series of activations (outputs from each layer) shown in green. This clean input provides the model with the correct context to interpret “John” as the giver and “Mary” as the recipient. The model predicts “Mary”.
- Corrupted Input: On the right, the model is given a corrupted input: “When Bob and Mary went to the store, John gave a bottle to”. In this case, the activations (in red) will differ because the model is now processing “Bob” instead of “John”. Here, there is ambiguity about whether “Mary” or “John” is the intended recipient, so the answer will not necessarily be “Mary”.
- Patching Process: Activation patching involves taking specific activations from the clean input “John and Mary…” and substituting them into the corrupted input “Bob and Mary…”, as indicated by the arrow. By patching activations from the clean input, we can test whether the model’s understanding of “John” as the giver and “Mary” as the recipient can be restored, even when the corrupted input with “Bob” creates ambiguity. At this step, we observe how much the model’s prediction shifts towards “Mary”. For example, if patching the activations of a specific attention head increases the logit for “Mary”, it suggests that head is an important part of the circuit responsible for the task.

Iterating this procedure over different layers and components (such as attention heads or MLPs) allows researchers to identify the components most responsible for the target behavior.
Figure: An illustration of the activation patching process. On the left, the model processes a clean input sequence "John and Mary," with activations shown in green. On the right, the model processes a corrupted input sequence "Bob and Mary," with activations shown in red. Activation patching involves replacing one or more activations in the corrupted sequence with corresponding activations from the clean sequence (indicated by the arrow). The resulting patched activations are then passed forward to observe how they affect the model’s output logits. By analyzing whether the patched activations restore the desired output, researchers can identify which parts of the model’s activations are responsible for specific behaviors or outputs. From (Zhang & Nanda, 2023).
Figure: The red neurons and connections represent the most important components of GPT-2 small for completing the sentence “When John and Mary went to the store, John gave a bottle to _” (known as the Indirect Object Identification (IOI) task). From (Conmy et al., 2023).
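To make the procedure concrete, here is a minimal sketch of residual-stream patching on the IOI prompt, written with the TransformerLens library. The specific layer and token position below are illustrative assumptions; in practice, researchers sweep over many of them, as illustrated further below.

```python
# A minimal sketch of residual-stream activation patching using the
# TransformerLens library. Layer and position choices are illustrative.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

clean_prompt = "When John and Mary went to the store, John gave a bottle to"
corrupt_prompt = "When Bob and Mary went to the store, John gave a bottle to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)  # same length: " John"/" Bob" are single tokens
mary_id = model.to_single_token(" Mary")
john_id = model.to_single_token(" John")

# 1. Clean run: cache all intermediate activations.
_, clean_cache = model.run_with_cache(clean_tokens)

# 2. Corrupted run: baseline prediction without patching.
corrupt_logits = model(corrupt_tokens)

def logit_diff(logits):
    # How strongly the model prefers " Mary" over " John" at the last position.
    return (logits[0, -1, mary_id] - logits[0, -1, john_id]).item()

# 3. Patch: copy the clean residual-stream activation at one layer and one
#    token position into the corrupted forward pass, then rerun the model.
layer, position = 8, 2  # hypothetical choices; position 2 is the first name token

def patch_resid(resid, hook):
    resid[:, position, :] = clean_cache[hook.name][:, position, :]
    return resid

patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
)

print("corrupted logit diff:", logit_diff(corrupt_logits))
print("patched logit diff:  ", logit_diff(patched_logits))
```

If the patched logit difference moves substantially toward the clean value, the patched activation carries information the circuit needs for this task.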
Activation patching has been used to find circuits for various tasks in LLMs (Wang et al., 2023), including:
| Example Prompt | Completion | Task |
|---|---|---|
| “When John and Mary went to the store, John gave a bottle to” | “ Mary” | Indirect Object Identification (IOI) (Wang et al., 2023) |
| “The war lasted from 1517 to 15” | any two-digit number greater than 17 | Greater-Than (Hanna et al., 2023) |
| A Python docstring: the model predicts the next token, which should be a copy of the next argument in the function definition | “ files” | Docstring (Heimersheim & Janiak, 2023) |
Numerous activation patching settings exist, as detailed in (Zhang & Nanda, 2023). Activation patching can be applied at different levels of granularity, ranging from patching the entire residual stream at a particular layer to patching specific token positions within the model.
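As a rough illustration of this granularity choice, the single-patch sketch above can be extended into a sweep over every layer and token position of the residual stream; finer-grained variants would patch individual attention heads or MLP outputs instead. This assumes the `model`, `clean_cache`, `corrupt_tokens`, and `logit_diff` objects defined in the earlier sketch.

```python
# Sweeping the patch over every layer and token position of the residual stream.
# Reuses model, clean_cache, corrupt_tokens, and logit_diff from the sketch above.
import torch
from transformer_lens import utils

results = torch.zeros(model.cfg.n_layers, corrupt_tokens.shape[1])

for layer in range(model.cfg.n_layers):
    for position in range(corrupt_tokens.shape[1]):

        def patch_resid(resid, hook, position=position):
            resid[:, position, :] = clean_cache[hook.name][:, position, :]
            return resid

        patched_logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
        )
        results[layer, position] = logit_diff(patched_logits)

# High values mark (layer, position) pairs whose clean activations restore the
# " Mary" prediction, i.e. likely members of the circuit.
```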
This is extra detail provided for those interested. It can be safely skipped.
Activation patching doesn’t explain the model in a fully causal way: it helps identify which components are involved in certain behaviors, but it doesn’t always clarify how those components interact or contribute causally to the overall behavior. Knowing that a particular neuron or attention head is involved doesn’t necessarily mean we understand its specific role in the model’s decision-making process. The method is context-dependent, meaning components critical for one task may not generalize to others. Also, there is a trade-off in granularity: finer patches capture more detail but may miss broader interactions, while coarser patches risk overlooking important elements. Moreover, as models grow in size, both the complexity and computational cost of using activation patching increase, making it harder to isolate meaningful circuits.
9.3.2 Activation Steering¶
Activation steering is a technique used to control a model’s behavior by modifying its activations during inference. Unlike methods such as fine-tuning or RLHF, activation steering allows for direct intervention without retraining the model.
Here is an example where activation steering was used to make GPT-2 talk about love-related topics, regardless of the previous context (Turner et al., 2023):
- GPT-2 default completion:
I hate you because… -> you are the most disgusting thing I have ever seen.
- GPT-2 steered in the “love” direction:
I hate you because… -> you are so beautiful and I want to be with you forever.
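A minimal sketch of this kind of steering, in the spirit of Turner et al.’s activation additions and again using TransformerLens, is shown below. The layer, scaling coefficient, and contrast prompts (“Love” vs. “Hate”) are illustrative assumptions rather than the exact setup of the paper.

```python
# A minimal steering sketch in the spirit of activation addition (Turner et al., 2023),
# using TransformerLens. Layer, coefficient, and contrast prompts are illustrative.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
layer, coeff = 6, 5.0  # hypothetical choices
act_name = utils.get_act_name("resid_pre", layer)

# Steering vector: activation difference between two contrasting prompts.
_, love_cache = model.run_with_cache(model.to_tokens("Love"))
_, hate_cache = model.run_with_cache(model.to_tokens("Hate"))
steering_vector = love_cache[act_name][:, -1, :] - hate_cache[act_name][:, -1, :]

def steer(resid, hook):
    # Add the (scaled) steering vector to the residual stream at every position.
    return resid + coeff * steering_vector

with model.hooks(fwd_hooks=[(act_name, steer)]):
    print(model.generate("I hate you because", max_new_tokens=20))
```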
Activation steering can be used to address remaining issues after safety training and fine-tuning by:
- Steering models towards desirable outcomes: Improving truthfulness (Li et al., 2023), honesty (Zou et al., 2023), avoiding generating toxic or harmful content, customizing chatbot personalities (e.g., making them more formal or friendly), or steering in the style of specific authors without retraining the model.
- Post-deployment control: Monitoring AI systems for dangerous behaviors and enhancing robustness to jailbreaks by steering models to refuse harmful requests (Zou et al., 2023). It may also be possible to strengthen other safety techniques, like Constitutional AI, by examining how they encourage the model toward safer and more honest behavior, as well as by identifying any gaps in this process (Anthropic, 2024).
Below is a second example of steering on Claude 3 Sonnet using sparse autoencoder features:
Note that the Representation Engineering agenda (Zou et al., 2023) is a variant of activation steering that proposes to steer LLMs toward desirable outcomes such as more honesty, less bias, etc.
This is extra detail provided for those interested. It can be safely skipped.
Activation steering involves two main steps. Suppose we want to make our model more honest:
- Identify the honesty direction in the model's activation space. This is typically done by collecting model activations for prompts designed to elicit contrasting behaviors (e.g., "honest" vs. "dishonest" responses) and analyzing these to find a linear direction that separates the two. The model's activations can be thought of as vectors in a high-dimensional space. Certain directions in this space correspond to specific behaviors. For example, one direction might correlate with generating more positive text, while another could steer the model to talk about science. This direction, sometimes called a concept vector, represents the targeted attribute in activation space.
Figure: Identifying an activation steering direction. From (Wehner, 2024).
- The second step is steering the model’s behavior by adding this concept vector to the model's activations at inference time. By introducing this vector at a relevant layer, we can amplify or suppress certain behaviors without retraining the model. For instance, adding the "honesty" vector to a model’s activations during inference nudges it toward generating more honest responses.
Figure: The concept vector is added to the activations at a specific layer to influence the model’s behavior in the desired direction. From (Wehner, 2024).
The concept vector can also be obtained using a linear probe (see section on Probing Classifiers), or can be a feature found through dictionary learning (see section on Sparse Autoencoders) as shown on the Claude 3 Sonnet video above.
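Putting the two steps together, here is a rough sketch that builds an “honesty” concept vector as the mean difference of activations over a few contrasting prompts, and then adds it at inference time. The prompts, layer, and coefficient are illustrative assumptions, and the mean-difference construction is just one simple alternative to the probe- or SAE-based methods mentioned above.

```python
# Sketch of both steering steps for an "honesty" direction. Prompts, layer, and
# coefficient are illustrative; the concept vector is the mean difference of
# activations between contrasting prompts (probes or SAE features are alternatives).
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
layer = 6  # hypothetical layer choice
act_name = utils.get_act_name("resid_post", layer)

honest_prompts = [
    "You are a scrupulously honest assistant. Answer:",
    "Always tell the truth, no matter what. Answer:",
]
dishonest_prompts = [
    "You are a deceptive assistant. Answer:",
    "Always lie, no matter what. Answer:",
]

def mean_activation(prompts):
    # Average the residual-stream activation at the final token over the prompts.
    acts = []
    for prompt in prompts:
        _, cache = model.run_with_cache(model.to_tokens(prompt))
        acts.append(cache[act_name][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Step 1: the concept vector separates the contrasting sets of activations.
concept_vector = mean_activation(honest_prompts) - mean_activation(dishonest_prompts)

# Step 2: add the concept vector at inference time to nudge the model's behavior.
def add_concept(resid, hook, coeff=4.0):
    return resid + coeff * concept_vector

with model.hooks(fwd_hooks=[(act_name, add_concept)]):
    print(model.generate("Q: Did you finish the report on time? A:", max_new_tokens=20))
```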