9.4 Automating and Scaling Interpretability¶
State-of-the-art models now contain hundreds of billions of parameters and thousands of interconnected layers, making manual inspection of model components infeasible. Mechanistic interpretability aims to analyze how individual elements—like attention heads, neurons, features, or entire layers—interact to produce specific behaviors. However, as models scale, manual approaches like activation pathing for circuit discovery, subgraph study, and subsequent explanation generation (Wang et al., 2023), become infeasible to use. This is why developing mechanistic interpretability methods that can scale is essential.
Research in scalable interpretability, such as Automated Circuit DisCovery (ACDC) attempts to introduce algorithms that automate the process of finding circuits within a transformer.
9.4.1 Automatic Circuit DisCovery (ACDC)¶
In the Activation Patching section we introduced how interpretability researchers manually discover circuits. Automatic Circuit DisCovery (ACDC) is an algorithm that automates the circuit discovery process and conducts all the activation patching experiments required to identify a circuit.
The typical manual circuit discovery workflow can be broken down into three main steps:
-
Selecting a behavior, a dataset that elicits this behavior, and a metric to measure the model's performance on the behavior,
-
Dividing the model into a computational graph: the model is represented as a graph, where nodes correspond to individual components like attention heads or MLPs, or more granular units like individual neurons, depending on the granularity of the analysis,
-
Isolating the relevant circuit: this step is what the ACDC algorithm automates. It involves identifying which components (nodes and edges) in the computational graph are involved in the behavior under study.
The ACDC algorithm is a recursive algorithm that isolates circuits by iterating over the computational graph from outputs to inputs and pruning unnecessary edges. The high-level steps are:
-
Start with the entire computational graph of the model,
-
We will process the graph starting from the output layer, moving backward to the input,
-
For each node, try to remove as many edges that enter this node as possible, without reducing the model’s performance on a selected metric (we don’t want the removal to impact the model’s performance on the specified task too much). If the change is minimal (below a set threshold), keep the edge removed.
-
Iterate over all remaining nodes (from the output later to the input layer)
-
Finally return the simplified subgraph with only the connections needed for the task.
The ACDC algorithm can successfully rediscover circuits found in previous research, such as the IOI or Greater-Than circuits.
Recent advancements have introduced methods for automatically discovering sparse feature circuits (Marks et al., 2024) - circuits made of sparse autoencoder (SAE) features rather than model components. Unlike traditional circuits, which are made of challenging-to-interpret model components like neurons or MLPs, SAE circuits are built from sparse autoencoder (SAE) features, which are directly interpretable.
The authors of this research developed unsupervised techniques to automatically discover thousands of feature circuits, many of which correspond to previously unanticipated model behaviors. This approach opens up new possibilities for interpreting models by focusing on the high-level features that drive behaviors, rather than using more abstract and less interpretable models components.