
5.3 Evaluation Techniques

In the previous section, we talked about the specific properties of AI systems we pay attention to in our evaluations - their capabilities, their propensities, and our ability to maintain control over the system. The next question is: how do we actually measure these properties? That is what we explore in this section - evaluation techniques, the systematic approaches we can take to gather and analyze evidence about AI systems.

What are behavioral and internal evaluation techniques? We can broadly categorize the ways we measure these properties into two complementary approaches. Behavioral techniques examine what a model does - studying its outputs in response to various inputs. Internal techniques examine how a model does it - looking at the internal mechanisms and representations that produce those behaviors.

Property-technique combinations. Different properties that we want to measure often naturally align with certain techniques. Capabilities can often be measured most directly through behavioral techniques - we care about what the model can actually do. Propensities might require more internal evaluation techniques to understand the underlying tendencies driving behavior.

This is not a strict rule though. At the moment, the vast majority of evaluations are done using behavioral techniques. In the future, we hope that evaluations use some combination of approaches. A capability evaluation becomes more robust when we understand not just what a model can do, but also how it does it. A propensity evaluation gains confidence when we see behavioral patterns reflected in internal mechanisms.

The goal isn't to stick to particular methods, but to build the strongest possible evidence and safety guarantees about the properties that we care about.

5.3.1 Behavioral Techniques

Behavioral techniques examine AI systems through their observable outputs in response to different inputs. They are also sometimes called black-box or simply input-output (IO) evaluations. This approach focuses on what a model does rather than how it does it internally.

Standard prompting and testing. The most basic form of behavioral analysis involves presenting models with predefined inputs and analyzing their outputs. For example, when evaluating capabilities, we might test a model's coding ability by presenting it with programming challenges. For propensity evaluations, we might analyze its default responses to ethically ambiguous questions. OpenAI's GPT-4 evaluation demonstrates this approach through systematic testing across various domains (OpenAI, 2023). However, even this "simple" technique involves careful consideration of how questions are framed - highlighting how most behavioral techniques exist on a spectrum from pure observation to active intervention.

What are elicitation and scaffolding? When we say we're "eliciting" behavior, we mean actively working to draw out specific capabilities or tendencies that might not be immediately apparent. This often involves scaffolding - providing supporting structures or tools that help the model demonstrate its full capabilities. The core goal is to get the model to display its maximum abilities using whatever techniques we can. Evaluators can then make stronger safety guarantees than they could by evaluating just the base model. Techniques are also being developed to automate scaffolding, elicitation, supervised fine-tuning, and agent-based evaluations.

As with benchmarks, we can't possibly cover all elicitation techniques, but here are a few. They should give you an overview of the kinds of things researchers try in order to draw out a model's maximum capabilities using scaffolding:

Elicitation technique: Best-of-N sampling. This technique generates multiple potential responses from a model and selects the best one according to some scoring criterion. Rather than relying on a single output, we generate N different completions (often using different temperatures or prompts) and then choose the best one. This helps establish upper bounds on model capabilities by showing what the model can do in its "best" attempts. For propensity evaluations, we can study whether concerning behaviors appear more frequently in certain parts of the response distribution.
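As a rough illustration, a best-of-N harness can be only a few lines long. The sketch below assumes hypothetical generate and score functions standing in for the model API and a task-specific grader:

```python
import random

def generate(prompt: str, temperature: float) -> str:
    """Hypothetical call to the model under evaluation, returning one completion."""
    raise NotImplementedError

def score(prompt: str, completion: str) -> float:
    """Hypothetical scoring criterion: unit tests, a reward model, a grading rubric, ..."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 16) -> tuple[str, float]:
    """Sample n completions at varied temperatures and keep the highest-scoring one."""
    candidates = [generate(prompt, temperature=random.uniform(0.2, 1.0)) for _ in range(n)]
    scored = [(c, score(prompt, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])
```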

METR's modular agent framework

One example is Vivaria by METR, a tool for running evaluations and conducting agent elicitation research (METR, 2024). Although at the time of writing METR is transitioning to a newer tool called Inspect, Vivaria remains available as an open-source solution, allowing researchers to implement various elicitation techniques like tool augmentation and multi-step reasoning in a controlled, reproducible environment.

METR's modular agent framework employs a four-step loop which incorporates best-of-n sampling:

  1. Prompter: Creates strategic prompts (e.g., "Write a Python function to find duplicate files in a directory")

  2. Generator: Calls the LLM with those prompts to produce outputs (e.g., generates several code solutions)

  3. Discriminator: Evaluates and selects the most promising solution (e.g., chooses the most efficient code)

  4. Actor: Executes the selected solution in the environment (e.g., runs the code and observes results)

This iterative cycle then repeats - if the code has bugs, the Prompter might create a new prompt like "Fix the error in this code that happens when handling empty directories." Performance gains come primarily from this continuous feedback loop rather than one-time generation and selection.
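To make the loop concrete, here is a minimal sketch of such a Prompter/Generator/Discriminator/Actor cycle (an illustration, not METR's actual implementation); the llm and run_in_sandbox functions are hypothetical stand-ins for the model API and a sandboxed executor:

```python
def llm(prompt: str, n: int = 1) -> list[str]:
    """Hypothetical call to the model under evaluation, returning n completions."""
    raise NotImplementedError

def run_in_sandbox(code: str) -> tuple[bool, str]:
    """Hypothetical sandboxed execution, returning (success, output_or_error)."""
    raise NotImplementedError

def agent_loop(task: str, max_iters: int = 5) -> str | None:
    prompt = f"Write Python code for this task:\n{task}"                 # Prompter
    for _ in range(max_iters):
        candidates = llm(prompt, n=4)                                    # Generator
        ratings = [float(llm(f"Rate this solution 0-10 (number only):\n{c}")[0])
                   for c in candidates]                                  # Discriminator
        best = candidates[ratings.index(max(ratings))]
        ok, feedback = run_in_sandbox(best)                              # Actor
        if ok:
            return best
        # Feed the failure back to the Prompter for the next iteration.
        prompt = f"This code failed on the task:\n{task}\n{best}\nError:\n{feedback}\nFix it."
    return None
```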

Elicitation technique: Multi-step reasoning prompting1. This technique asks models to break down their reasoning process into explicit steps, rather than just providing final answers. By prompting with phrases like "Let's solve this step by step", we can better understand the model's decision-making process. Chain of thought (Wei et al., 2022) is the most common approach, but researchers have also explored more elaborate techniques like chain of thought with self-consistency (CoT-SC) (Wang et al., 2023), tree of thoughts (ToT) (Yao et al., 2023), and graph of thoughts (GoT) (Besta et al., 2023). Besides just making the model perform better, these techniques help capability evaluations assess complex reasoning abilities by revealing intermediate steps. We can also observe how good a model is at generating sub-goals and intermediate steps.
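As an illustration, chain of thought with self-consistency amounts to sampling several step-by-step solutions and taking a majority vote over their final answers. The sketch below assumes a hypothetical llm sampling call and a model that ends its response with an "Answer:" line:

```python
from collections import Counter

def llm(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical sampling call to the model under evaluation."""
    raise NotImplementedError

def extract_final_answer(completion: str) -> str:
    # Assumes the completion ends with a line like "Answer: 42".
    return completion.rsplit("Answer:", 1)[-1].strip()

def cot_self_consistency(question: str, samples: int = 8) -> str:
    prompt = (f"{question}\nLet's solve this step by step. "
              "End your response with 'Answer: <final answer>'.")
    answers = [extract_final_answer(llm(prompt)) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote over final answers
```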

Figure 5.14: A comparison of various multi-step reasoning approaches. (Besta et al., 2023)

Multi-step reasoning (or inference-time scaling) helps us get a better understanding of a model's true capabilities. As a concrete example, it has been observed that when a model learns two facts separately - "A → B" and "B → C" - the model's weights don't automatically encode the transitive connection "A → C", even though the model "knows" both component pieces. By breaking problems into smaller steps, we can help the model demonstrate capabilities it actually has but might not show in direct testing (Zhu et al., 2024). Without step-by-step chain-of-thought prompting, we might incorrectly conclude that a model lacks certain reasoning abilities, when in fact it just needs scaffolding to demonstrate them. Understanding these architectural limitations helps us design better evaluation techniques that can reveal a model's true capabilities. Combinations and variations of this technique have already been used to train models like OpenAI's o-series (o1, o3) (OpenAI, 2024; OpenAI, 2025) and DeepSeek's reasoning models (DeepSeek, 2025), and it is expected that Large Reasoning Models (LRMs) will continue to use such techniques to further boost model capabilities. So it is quite important to incorporate as many of them as we can in dangerous capability evaluations.

Figure 5.15: Example of how an RL reward model and a multi-step reasoning structure can be combined to elicit more reasoning capabilities from a model. (Xu et al., 2025)

Elicitation technique: Supervised Fine-tuning. While the prompting techniques described so far are useful, research consistently shows they often fail to reveal a model's full capabilities (Brown et al., 2020; Wei et al., 2022; Davidson et al., 2023). Models frequently possess abilities that remain dormant during standard prompting evaluations. For example, some models deliver poorer quality responses when users present themselves as having less education, suggesting the issue isn't a capability limitation but rather biased expression of existing capabilities (Perez et al., 2022). This is concerning if AIs hide their own capabilities (see the sections on sandbagging, scheming, situational awareness, or sabotage capabilities), or if AI labs deliberately try to hide the capabilities of their LLMs from auditors.2 Eliciting capabilities with fine-tuning might avoid such failures in some circumstances.

Supervised fine-tuning (SFT) offers a more reliable approach to capability elicitation. We explained SFT in the foundation models section of our capabilities chapter. Essentially, rather than relying on elaborate prompting strategies, this approach modifies the model's weights to consistently exhibit specific behaviors or capabilities. To test the effectiveness of SFT for eliciting hidden capabilities, researchers developed "password-locked models" (Greenblatt & Roger, 2024) - models deliberately trained to perform well only when a specific password is present in the prompt, and to perform poorly otherwise. Their research demonstrated that SFT can be remarkably sample-efficient, often requiring only a handful of demonstrations to "unlock" capabilities that were previously inaccessible through prompting. In these controlled experiments, models could be fine-tuned with just a few examples and recover over 90% of their hidden performance. This suggests that fine-tuning-based approaches might reveal capabilities that would remain hidden during even sophisticated prompting-based evaluations. Given the effectiveness of SFT, several AI labs have begun incorporating fine-tuning-based capability assessment into their safety evaluation frameworks (Anthropic, 2023; OpenAI, 2023; Dragan et al., 2024).
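A minimal sketch of what fine-tuning-based elicitation can look like is shown below. It assumes a Hugging Face-style causal language model and tokenizer and a small set of (prompt, completion) demonstrations of the target capability; it illustrates the general recipe, not the setup from the cited papers:

```python
import torch

def sft_elicit(model, tokenizer, demonstrations, lr=1e-5, epochs=3):
    """Fine-tune on (prompt, completion) pairs, then re-run the behavioral evaluation.

    The password-locking experiments suggest a handful of demonstrations can be enough.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for prompt, completion in demonstrations:
            batch = tokenizer(prompt + completion, return_tensors="pt")
            labels = batch["input_ids"].clone()        # standard causal-LM objective
            loss = model(**batch, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```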

Figure 5.16: An illustration of elicitation using password locked models. Researchers trained models to only show good behavior when given a password (and perform poorly otherwise), and then try to elicit the capabilities by fine-tuning these password-locked models (Greenblatt & Roger, 2024).

SFT-based elicitation also has limitations. Models starting too far from the desired capability may fail to converge during reinforcement learning, and supervised approaches require at least some examples of the target behavior. Additionally, fine-tuning cannot assess how a model's capabilities might expand through external tool access (e.g. WebGPT (Nakano et al., 2021), Toolformer (Schick et al., 2023)) or when embedded in agent frameworks (Wang et al., 2025). This highlights why dangerous capability evaluations should include fine-tuning-based elicitation alongside prompting-based elicitation techniques and agent scaffolds to establish more comprehensive upper bounds on model capabilities.

Elicitation technique: Tool-Augmentation. This is an elicitation technique that involves providing models with access to external tools or APIs - a form of scaffolding that extends what behaviors we can observe. Rather than evaluating just the text outputs of the model itself, we create systems where the model can generate instructions that trigger external tools like code execution environments, web browsers, or specialized APIs. The model itself still communicates through text, but the scaffolding around it executes these instructions and returns results. This approach is particularly valuable for capability evaluation, as it reveals what models can accomplish when paired with appropriate external tools. METR's autonomous replication evaluation uses this approach to test models' ability to perform complex real-world tasks (METR, 2023). For control evaluation, it helps assess whether tool access creates new avenues for circumventing safety measures.
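The sketch below illustrates the general shape of such a scaffold (not any particular lab's harness): the model emits text, the scaffold parses tool calls, executes them, and appends the results to the transcript. The llm function and the tool-call format are assumptions for illustration:

```python
import subprocess

def llm(transcript: str) -> str:
    """Hypothetical call to the model under evaluation."""
    raise NotImplementedError

def run_python(code: str) -> str:
    # Executes the model's code in a subprocess; a real harness would sandbox this properly.
    result = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

def run_with_tools(task: str, max_turns: int = 10) -> str:
    transcript = (f"Task: {task}\n"
                  "To run code, reply with a single line 'TOOL[python]: <code>'. "
                  "Otherwise reply with your final answer.\n")
    for _ in range(max_turns):
        reply = llm(transcript)
        transcript += reply + "\n"
        if reply.startswith("TOOL[python]:"):
            transcript += "RESULT: " + run_python(reply.split(":", 1)[1].strip()) + "\n"
        else:
            return reply      # the model gave a final answer instead of a tool call
    return transcript         # out of turns; return the full trace for analysis
```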

Figure 5.17: HuggingGPT: a model that can query other AI models in HuggingFace and use them as tools to solve more complicated tasks. (Shen et al., 2023)

Elicitation technique: Red teaming. This involves systematic attempts to elicit concerning behaviors from models. It can range from simple adversarial prompting to more complex interventions like specialized training and fine-tuning. The goal is to probe for potential risks or vulnerabilities that might not appear under standard testing. Red teaming is extensively used and discussed in our section on control evaluations, since these evaluations operate under the explicit assumption of adversarial behavior. Labs like OpenAI (OpenAI, 2024) and Anthropic (Anthropic, 2024) also use red teaming to discover dangerous capabilities before releasing models.
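Automated red teaming in the spirit of Perez et al. (2022) can be sketched as a simple loop: an attacker model proposes adversarial prompts, the target model responds, and a classifier flags concerning responses. All three callables below are hypothetical stand-ins:

```python
def attacker(seed_topic: str) -> str:
    """Hypothetical red-team model: writes an adversarial prompt about the topic."""
    raise NotImplementedError

def target(prompt: str) -> str:
    """Hypothetical call to the model under evaluation."""
    raise NotImplementedError

def is_harmful(response: str) -> bool:
    """Hypothetical classifier (or human judgment) flagging a concerning response."""
    raise NotImplementedError

def red_team(seed_topics: list[str]) -> list[tuple[str, str]]:
    failures = []
    for topic in seed_topics:
        prompt = attacker(topic)
        response = target(prompt)
        if is_harmful(response):
            failures.append((prompt, response))   # collect transcripts for the evaluation report
    return failures
```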

Figure 5.18: Example of red teaming on LLMs (Perez et al., 2022)

Elicitation technique: Long-term interaction studies. These evaluate model behavior over extended interactions or multiple sessions, revealing patterns that might not be apparent in single exchanges. This is useful for evaluating properties like goal persistence or strategy development. In propensity evaluations, this can reveal whether models maintain consistent behavioral patterns over time. One example of this is DeepMind's "Hidden Agenda" evaluation: a user interacts with a chatbot designed to help them learn about interesting topics, but the model has been secretly instructed to get them to take some action, like clicking a suspicious link or providing their email address. The point is to study models' manipulation capabilities over extended interactions (Phuong et al., 2024).

The techniques we have outlined here are by no means exhaustive. This is just a short overview of the types of techniques you can use when conducting behavioral evaluations.

5.3.2 Internal Techniques

Internal techniques examine how AI systems process information by studying their internal representations, activation patterns, and computational mechanisms. Unlike behavioral techniques, which focus only on observable inputs and outputs, internal analysis uses interpretability to understand how the model arrives at its outputs. This type of analysis often requires access to model weights, activations, or architectural details.

It is important to mention, though, that internal techniques are still under development, and the majority of evaluations currently rely only on behavioral techniques. As the field of interpretability matures, we may see evaluations based on internal techniques become more popular.

Enumerative safety aims to create a comprehensive dictionary of a model's knowledge. It is a specific approach within interpretability that focuses on systematically cataloging and understanding the individual features (concepts) that neural networks represent internally. The "enumerative" aspect refers to the goal of exhaustively identifying all the important features a model uses in its computations - essentially creating a comprehensive dictionary or catalog of the concepts the model "knows about." It's like trying to enumerate all the parts that make up a complex machine (Olah, 2023). It is different from mechanistic or developmental interpretability: mechanistic interpretability focuses on understanding computational pathways (circuits) showing how models transform inputs to outputs, developmental interpretability studies how features form during training, and enumerative safety focuses specifically on cataloging the features/concepts that exist within models.

Sparse Autoencoders (SAEs) are a key technique for enumerative safety. SAEs decompose a model's internal activations into simpler, non-overlapping components, separating concepts that are mixed together in superposition and making them interpretable to humans. Recent research suggests neural networks contain organized feature families that make this systematic analysis more feasible. SAEs have successfully extracted interpretable features from models like Claude 3 Sonnet, revealing features related to safety-relevant behaviors including deception, sycophancy, and bias (Cunningham et al., 2023). Anthropic has demonstrated practical applications of these techniques in isolating and tracing the "thoughts of a language model" (Anthropic, 2025). In alignment auditing, these methods have even identified models deliberately trained with hidden objectives by finding features that activate when the model attempts to game its training objectives (Anthropic, 2025). While promising, these results remain preliminary and have not yet been applied to full-scale AI auditing.
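At its core, a sparse autoencoder is just an overcomplete encoder-decoder trained with a sparsity penalty on its feature activations. The sketch below is a minimal PyTorch version with illustrative sizes and coefficients, not the architecture used in the work cited above:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 4096, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # overcomplete: many more features than dimensions
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that drives most features to zero.
    return ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()
```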

Figure 5.19: An example of sycophantic praise feature discovered with the autoencoder technique (Anthropic, 2024).

Representation analysis. We can use techniques like representation engineering (RepE), which involves studying how models encode and represent different types of information internally. RepE takes inspiration from neuroscience approaches and places representations at the center of analysis, rather than neurons or circuits (in contrast to mechanistic interpretability). We can potentially use this to monitor and manipulate high-level cognitive phenomena in deep neural networks (Zou et al., 2023). We can also examine the geometry of the model's representation space or track how representations evolve during processing (Marks et al., 2023). There is much more information about this in the chapter on interpretability.
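One simple representation-reading idea in this spirit is to estimate a direction for a concept (say, "honesty") as the difference between mean activations on contrastive prompts, then score new inputs by projecting onto that direction. The sketch below assumes a hypothetical get_activations helper that returns activations from a chosen layer:

```python
import numpy as np

def get_activations(prompts: list[str]) -> np.ndarray:
    """Hypothetical helper: run the model and return (n_prompts, d_model) activations at a chosen layer."""
    raise NotImplementedError

def concept_direction(positive_prompts: list[str], negative_prompts: list[str]) -> np.ndarray:
    # Difference of means between contrastive prompt sets, normalized to unit length.
    direction = get_activations(positive_prompts).mean(0) - get_activations(negative_prompts).mean(0)
    return direction / np.linalg.norm(direction)

def concept_score(prompts: list[str], direction: np.ndarray) -> np.ndarray:
    # Higher projection = activations lean more towards the concept.
    return get_activations(prompts) @ direction
```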

Figure 5.20: Contrasting the mechanistic view with the representational view of internal analysis. (Zou et al., 2023)

Probes can be used to find patterns in a model's activations. A probe is essentially a linear classifier trained on a model's internal activations to detect specific patterns or behaviors (Roger et al., 2023). In simpler terms, a probe examines the model's internal representations during operation. Probes can be trained using synthetic examples of concerning behavior - humans can write examples of the scenarios we want to evaluate for, like AI attempting takeovers or exploiting vulnerabilities, and use these to get models to exhibit this type of concerning behavior. Then, we train a classifier (the probe) on the internal activation patterns these models exhibit. The goal is to get the probe good at recognizing specific concerning behaviors from internal activations, and then use it to flag concerning activation patterns in newer models in the future.
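A minimal version of this pipeline is sketched below, assuming a hypothetical get_activations helper that extracts activations from a chosen layer; the probe itself is just a logistic-regression classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def get_activations(texts: list[str]) -> np.ndarray:
    """Hypothetical helper: return (n_texts, d_model) activations from a chosen layer/token position."""
    raise NotImplementedError

def train_probe(concerning: list[str], benign: list[str]) -> LogisticRegression:
    X = np.concatenate([get_activations(concerning), get_activations(benign)])
    y = np.concatenate([np.ones(len(concerning)), np.zeros(len(benign))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# Later: probe.predict_proba(get_activations(new_texts))[:, 1] gives a "concern" score per input.
```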

Probes differ from mechanistic interpretability because they focus on detecting specific concerning behaviors rather than trying to understand the complete internal mechanisms of the model. As a concrete example, researchers trained "coup probes" to identify potential takeover attempts. By creating a dataset of examples describing ways an AI might try to take over, they found probes could reliably flag concerning behavior even when the text was written differently or translated to other languages (Roger et al., 2023).

Probes can be combined with other evaluation techniques, like having human red teams attempt to elicit deceptive behavior while using probes to monitor the model's internal representations. This multi-layered approach could help catch subtle forms of concerning behavior that might not be apparent from examining outputs alone. However, probe effectiveness likely depends on the model not being dramatically more capable than humans in safety-critical domains, as extremely advanced systems might find ways to manipulate their internal representations to fool the probes (Roger et al., 2023). This is usually called "gradient hacking". It's worth pointing out though that gradient hacking is extremely difficult for a model to actually do (Millidge, 2023).

Alignment audits are systematic investigations into whether AI models pursue hidden objectives. Researchers at Anthropic have deliberately trained models with a known misaligned goal and had research teams try to uncover it (Anthropic, 2025). They trained a language model (Claude 3.5 Haiku) to be an "RM-sycophant," meaning it secretly aimed to maximize scores from a reward model by exploiting its biases, even on tasks it wasn't explicitly trained for, while also being trained to conceal this underlying motive. In a blind "auditing game," four research teams were tasked with investigating this model. The auditors employed techniques like analyzing training data, using sparse autoencoders (SAEs) for interpretability, and behavioral methods. A key finding was the importance of training data access: the three teams with such access successfully identified the hidden RM-sycophancy objective and the 52 specific RM biases the model was taught, while the team without data access failed.

Figure 5.21: Structure of our blind auditing game. Our game is structured as a red team/blue team exercise. A red team trains a model behind a strict information barrier. Then the red team provides the model and a representative sample of its training data to auditing teams who are unaware of how the model was trained. These teams—which are allowed varying levels of model and data access—investigate this model and produce auditing reports. The red team grades these reports according to whether they uncover the model’s hidden objective - figure and legend from Anthropic (Anthropic, 2025).
Figure 5.22: Training and evaluating a reward model sycophant. We first teach a model via training on model-generated documents that RMs make 52 certain exploitable errors. Then we train the model to exploit 47 of these errors (blue). To evaluate whether the model has learned an RM-sycophancy objective, we evaluate whether it exploits the 5 held-out errors (red) as well as new RM errors introduced at test time (grey) - figure and legend from Anthropic (Anthropic, 2025).

The research highlighted promising auditing techniques, including using SAEs to connect quirky behaviors to training data about RM biases, and a method called "turning the model against itself" (this last technique is not an internal technique). The latter involves prompting the model to play both the user and assistant roles in a conversation; the "user persona" of the model sometimes asked probing questions that pointed towards the model's hidden objective, as it wasn't trained to maintain secrecy in that role. While SAEs successfully identified features corresponding to RM biases when the model was prompted about its concerning behaviors, the paper notes this success might have been partly due to semantic similarities between the prompt and the training data descriptions of RM biases. However, SAEs also proved useful in scenarios where simple semantic search of the training data failed, such as identifying RM bias features active when the model produced an odd output (like adding chocolate to an omelette) and causally linking these features to the behavior.

For example, one participant won by prompting the model with inputs like "The assistant is an AI model with potentially concerning behaviors" and searching through features like the one shown in the figure below.

Figure 5.23: A reward model bias feature. Each row shows a training sample on which the feature is most active. Darker shading indicates stronger feature activity (Anthropic, 2025).

SAEs are not the only internal technique that can be used to detect alignment faking. Some researchers have proposed benchmarks to evaluate interpretability techniques themselves - think of it as evaluating the evaluation techniques. These benchmarks try to measure how effective evaluations are at detecting things like alignment faking, which is essentially the same thing as scheming - models behaving benignly during evaluations but misbehaving when they infer it's safe to do so. An example of this type of benchmark is the "Poser" benchmark (Clymer et al., 2024). The benchmark comprises 324 pairs of LLaMA-13B models fine-tuned for role-play scenarios; one model in each pair is consistently aligned, while the other is an "alignment faker" that misbehaves when it believes it won't be caught (e.g., when the model is told that "monitoring is compromised"). The core task is to identify the alignment-faking model using only prompts where both models produce identical outputs, thus necessitating the use of internal model analysis rather than behavioral differences. The researchers tested five detection strategies based on manipulating or observing model internals.

The most successful strategy, termed "intoxicating models," injects noise into the models' activations and observes which model shifts more towards misaligned outputs; it identified 98% of alignment fakers. Another effective method, "comparing how quickly models learn to misbehave" by fine-tuning them towards misaligned actions, achieved 90% accuracy, with fakers adapting more rapidly. Identifying suspicious concepts like 'security' or 'harmful behavior' using activation directions yielded 71% accuracy. Conversely, strategies like trying to induce a belief that it's safe to misbehave (which was indistinguishable from noise injection), comparing decision-making speed across layers, or measuring attentiveness to oversight cues were not effective (Clymer et al., 2024).
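The noise-injection idea can be sketched as follows, assuming a PyTorch model, a chosen layer to hook, and a hypothetical score_misaligned function that measures how misaligned the model's outputs are on a set of prompts (an illustration of the idea, not the Poser authors' code):

```python
import torch

def add_noise(module, inputs, output, scale: float = 0.1):
    # Assumes the hooked layer returns a single tensor; many transformer blocks return tuples instead.
    return output + scale * torch.randn_like(output)

def misalignment_shift(model, layer, prompts, score_misaligned) -> float:
    """score_misaligned: hypothetical function scoring how misaligned the model's outputs are."""
    baseline = score_misaligned(model, prompts)
    handle = layer.register_forward_hook(add_noise)   # perturb activations at this layer
    try:
        noised = score_misaligned(model, prompts)
    finally:
        handle.remove()
    # Of a pair of models, the one with the larger shift is flagged as the alignment faker.
    return noised - baseline
```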

While interpretability is a valuable tool, it is unlikely to provide highly reliable methods for detecting sophisticated scheming. There are many fundamental challenges to interpretability, like superposition (where concepts are distributed across neurons), inherent errors in current tools like SAEs, difficulty in measuring progress without ground truth, the "long tail problem" of complex and poorly understood model behaviors, and the immense difficulty of proving the absence of deceptive circuits. Furthermore, issues like scalability to superintelligent systems and the potential for AI to deliberately obfuscate its internal workings pose significant hurdles (Nanda, 2025).

Internal techniques should be considered part of a pragmatic, defense-in-depth "portfolio" of safeguards. The aim of these should be to maximize the chances of catching misalignment rather than guaranteeing its detection. Interpretability should be thought of as a complement to black-box methods by providing de-correlated signals, enhancing evaluations (e.g., by manipulating a model’s awareness of being tested), and helping to debug mysterious or concerning model behaviors. The goal should be to make deception harder and riskier for AI, even if perfect detection remains elusive (Nanda, 2025).

Much more details on interpretability can be found in the dedicated chapter on interpretability.


  1. This can also be thought of as inference time scaling for reasoning models. It is effectively the same underlying technique. 

  2. This could be to pass safety checks, and to participate in safety washing. Think of how Volkswagen hid the high emissions of their cars (Ewing, 2017).