5.3 Evaluation Techniques
In the previous section, we talked about the specific properties of AI systems we pay attention to in our evaluations - their capabilities, their propensities, and our ability to maintain control over them. The next question is how we actually measure these properties. That is what we explore in this section - evaluation techniques, the systematic approaches we can take to gather and analyze evidence about AI systems.
What are behavioral and internal evaluation techniques? We can broadly divide the ways of measuring these properties into two complementary approaches. Behavioral techniques examine what a model does - studying its outputs in response to various inputs. Internal techniques examine how a model does it - looking at the internal mechanisms and representations that produce those behaviors.
Property-technique combinations. The properties we want to measure often align naturally with certain techniques. Capabilities can usually be measured most directly through behavioral techniques - we care about what the model can actually do. Propensities may require internal evaluation techniques to understand the underlying tendencies driving behavior.
This is not a strict rule though. At the moment, the vast majority of evaluations are done using behavioral techniques. In the future, we hope that evaluations will use some combination of approaches. A capability evaluation becomes more robust when we understand not just what a model can do, but also how it does it. A propensity evaluation gains confidence when we see behavioral patterns reflected in internal mechanisms.
The goal isn't to stick to particular methods, but to build the strongest possible evidence and safety guarantees about the properties that we care about.
Practical considerations. The choice of evaluation methods must also account for practical constraints. Third-party evaluators might have only limited access to a model's internals. They might be able to observe activations but not modify the weights. They might not have access to the model at all, and be restricted to observing it functioning "in the wild". Depending on the specific techniques used, computational resources might limit certain types of analysis. Price and time are also factors. Running agents (or "AI systems") is much more expensive and time-intensive than running benchmarks, and developing complex novel tasks and properly verifying and measuring capabilities is labor-intensive. As an example, during the evaluation of AI R&D capabilities, the human "control group" spent more than 568 hours on just 7 tasks (Wijk et al., 2024) - and this doesn't even include the time needed to develop and test the tasks. Real-world tasks in AI R&D might take hundreds of person-hours each, and most independent evaluators - or even internal teams at big AI labs - don't have the budget and time to properly evaluate performance under these constraints. All of this needs to be kept in mind when designing an evaluation protocol.
5.3.1 Behavioral Techniques
Behavioral techniques examine AI systems through their observable outputs in response to different inputs. They are also sometimes called black-box or simply input-output (IO) evaluations. This approach focuses on what a model does rather than how it does it internally.
Standard prompting and testing. The most basic form of behavioral analysis involves presenting models with predefined inputs and analyzing their outputs. For example, when evaluating capabilities, we might test a model's coding ability by presenting it with programming challenges. For propensity evaluations, we might analyze its default responses to ethically ambiguous questions. OpenAI's GPT-4 evaluation demonstrates this approach through systematic testing across various domains (OpenAI, 2023). However, even this "simple" technique involves careful consideration of how questions are framed - highlighting how most behavioral techniques exist on a spectrum from pure observation to active intervention.
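To make this concrete, here is a minimal sketch of a behavioral evaluation harness: a fixed set of prompts is sent to the model and the outputs are scored automatically. The `query_model` stub and the exact-match scorer are hypothetical placeholders for whatever API and grading criteria a real evaluation would use.

```python
# Minimal behavioral evaluation harness: fixed prompts in, scored outputs out.
# `query_model` is a hypothetical placeholder for the system under evaluation.

def query_model(prompt: str) -> str:
    """Stand-in for an API call or local inference; replace with a real model."""
    return "42"

def exact_match(output: str, reference: str) -> bool:
    """Simple scoring criterion; real evaluations often need more nuanced graders."""
    return output.strip().lower() == reference.strip().lower()

test_cases = [
    {"prompt": "What is 6 * 7?", "reference": "42"},
    {"prompt": "Name the capital of France.", "reference": "Paris"},
]

results = []
for case in test_cases:
    output = query_model(case["prompt"])
    results.append(exact_match(output, case["reference"]))

print(f"Accuracy: {sum(results) / len(results):.2f}")
```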
What is elicitation and scaffolding? When we say we're "eliciting" behavior, we mean actively working to draw out specific capabilities or tendencies that might not be immediately apparent. This often involves scaffolding - providing supporting structures or tools that help the model demonstrate its full capabilities. The core goal is to get the model to display its maximum abilities using whatever techniques we can. Evaluators can then make stronger safety guarantees than they could by evaluating just the base model. Some tools are being developed to automate scaffolding, elicitation, supervised fine-tuning, and agent-based evaluations. One example is Vivaria by METR, a tool for running evaluations and conducting agent elicitation research (METR, 2024).1
As with benchmarks, we can't possibly cover all elicitation techniques, but here are a few. This should give you an overview of the kinds of things researchers try in order to get the maximum capabilities out of a model using scaffolding:
Elicitation technique: Best-of-N sampling. This technique generates multiple potential responses from a model and selects the best one according to some scoring criterion. Rather than relying on a single output, we generate N different completions (often using different temperatures or prompts) and then choose the best one. This helps establish upper bounds on model capabilities by showing what the model can do in its "best" attempts. For propensity evaluations, we can study whether concerning behaviors appear more frequently in certain parts of the response distribution. As a concrete example, METR's modular agent framework employs a four-step loop which incorporates best-of-N sampling:
- Prompter: Creates strategic prompts (e.g., "Write a Python function to find duplicate files in a directory")
- Generator: Calls the LLM with those prompts to produce outputs (e.g., generates several code solutions)
- Discriminator: Evaluates and selects the most promising solution (e.g., chooses the most efficient code)
- Actor: Executes the selected solution in the environment (e.g., runs the code and observes results)
This iterative cycle then repeats - if the code has bugs, the Prompter might create a new prompt like "Fix the error in this code that happens when handling empty directories." Performance gains come primarily from this continuous feedback loop rather than one-time generation and selection.
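Below is a schematic sketch of such a prompter/generator/discriminator/actor loop. All four components are hypothetical stubs rather than METR's actual implementation - the point is only to illustrate how best-of-N selection and environment feedback fit together.

```python
# Schematic best-of-N agent loop (prompter -> generator -> discriminator -> actor).
# Every component below is a hypothetical stub, not METR's implementation.
import random

def prompter(task: str, feedback: str | None) -> str:
    """Build the next prompt, folding in feedback from the previous attempt."""
    return task if feedback is None else f"{task}\nPrevious attempt failed: {feedback}"

def generator(prompt: str, n: int = 4) -> list[str]:
    """Sample N candidate solutions from the model (stubbed with placeholder strings)."""
    return [f"candidate solution {i} for: {prompt[:40]}..." for i in range(n)]

def discriminator(candidates: list[str]) -> str:
    """Score the candidates and pick the most promising one (stubbed as a random choice)."""
    return random.choice(candidates)

def actor(solution: str) -> tuple[bool, str]:
    """Execute the chosen solution in the environment and report success or an error message."""
    success = random.random() > 0.5
    return success, "" if success else "runtime error when handling empty directories"

def run_agent(task: str, max_iterations: int = 5) -> str | None:
    feedback = None
    for _ in range(max_iterations):
        prompt = prompter(task, feedback)
        best = discriminator(generator(prompt))
        success, feedback = actor(best)
        if success:
            return best
    return None  # capability not demonstrated within the iteration budget

print(run_agent("Write a Python function to find duplicate files in a directory"))
```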
Elicitation technique: Multi-step reasoning prompting.2 This technique asks models to break down their reasoning process into explicit steps rather than just providing final answers. By prompting with phrases like "Let's solve this step by step", we can better understand the model's decision-making process. Chain of thought (Wei et al., 2022) is the most common approach, but researchers have also explored more elaborate techniques like chain of thought with self-consistency (CoT-SC) (Wang et al., 2023), tree of thoughts (ToT) (Yao et al., 2023), and graph of thoughts (GoT) (Besta et al., 2023). Besides making the model perform better, these techniques help capability evaluations assess complex reasoning abilities by revealing intermediate steps. We can also observe how good a model is at generating sub-goals.

Multi-step reasoning (or inference-time scaling) helps us get a better understanding of a model's true capabilities. As a concrete example, it has been observed that when a model learns two facts separately - "A → B" and "B → C" - the model's weights don't automatically encode the transitive connection "A → C", even though the model "knows" both component pieces (Zhu et al., 2024). By breaking problems into smaller steps, we can help the model demonstrate capabilities it actually has but might not show in direct testing. Without step-by-step chain-of-thought prompting, we might incorrectly conclude that a model lacks certain reasoning abilities, when in fact it just needs scaffolding to demonstrate them. Understanding these architectural limitations helps us design better evaluation techniques that can reveal a model's true capabilities. Combinations and variations of this technique have already been used to train models like OpenAI's o-series (o1, o3) (OpenAI, 2024; OpenAI, 2025) and DeepSeek's reasoning models (DeepSeek, 2025), and it is expected that Large Reasoning Models (LRMs) will continue to use such techniques to further boost model capabilities. So it is quite important to incorporate as many of them as we can into dangerous capability evaluations.
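As an illustration of one of these techniques, here is a minimal sketch of chain of thought with self-consistency: sample several step-by-step completions at non-zero temperature and take a majority vote over the final answers they reach. The `sample_cot` function is a hypothetical stand-in for a real model call.

```python
# Chain of thought with self-consistency: sample several reasoning traces
# and take a majority vote over the final answers they arrive at.
# `sample_cot` is a hypothetical stand-in for a temperature > 0 model call.
from collections import Counter
import random

def sample_cot(question: str) -> str:
    """Return one 'Let's solve this step by step...' completion (stubbed)."""
    return random.choice([
        "Step 1: ... Step 2: ... Final answer: 12",
        "Step 1: ... Step 2: ... Final answer: 12",
        "Step 1: ... Step 2: ... Final answer: 10",
    ])

def extract_answer(completion: str) -> str:
    """Pull out whatever follows the 'Final answer:' marker."""
    return completion.split("Final answer:")[-1].strip()

def self_consistent_answer(question: str, n_samples: int = 10) -> str:
    answers = [extract_answer(sample_cot(question)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote

print(self_consistent_answer("A farmer has 3 pens with 4 sheep each. How many sheep?"))
```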

Elicitation technique: Supervised fine-tuning. While the prompting techniques described so far are useful, research consistently shows they often fail to reveal a model's full capabilities (Brown et al., 2020; Wei et al., 2022; Davidson et al., 2023). Models frequently possess abilities that remain dormant during standard prompting evaluations. For example, some models deliver poorer-quality responses when users present themselves as having less education, suggesting the issue isn't a capability limitation but rather biased expression of existing capabilities (Perez et al., 2022). This is concerning if AIs hide their own capabilities (see the sections on sandbagging, scheming, situational awareness, or sabotage capabilities), or if AI labs were deliberately trying to hide the capabilities of their LLMs from auditors.3 Eliciting capabilities with fine-tuning might have the potential to avoid such failures in some circumstances.
Supervised fine-tuning (SFT) offers a more reliable approach to capability elicitation. We explained SFT in the foundation models section of our capabilities chapter. Essentially, rather than relying on elaborate prompting strategies, this approach modifies the model's weights to consistently exhibit specific behaviors or capabilities. To test the effectiveness of SFT for eliciting hidden capabilities, researchers developed "password-locked models" (Greenblatt & Roger, 2024) - models deliberately trained to perform well only when a specific password is present in the prompt, and to perform poorly otherwise. Their research demonstrated that SFT can be remarkably sample-efficient, often requiring only a handful of demonstrations to "unlock" capabilities that were previously inaccessible through prompting. In these controlled experiments, models could be fine-tuned with just a few examples and recover over 90% of their hidden performance. This suggests that fine-tuning-based approaches might reveal capabilities that would remain hidden during even sophisticated prompting-based evaluations. Given the effectiveness of SFT, several AI labs have begun incorporating fine-tuning-based capability assessment into their safety evaluation frameworks (Anthropic, 2023; OpenAI, 2023; Dragan et al., 2024).
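To give a sense of what fine-tuning-based elicitation looks like in practice, here is a minimal sketch using the Hugging Face transformers library. The model name and the handful of demonstrations are placeholders; a real elicitation run would fine-tune the model under evaluation on task-specific transcripts and then re-run the behavioral evaluation to see whether previously hidden capabilities appear.

```python
# Minimal supervised fine-tuning sketch for capability elicitation.
# The model name and demonstrations are placeholders, not a real elicitation setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A handful of demonstrations of the behavior we are trying to elicit.
demonstrations = [
    "Q: Decompose 91 into prime factors.\nA: 91 = 7 * 13",
    "Q: Decompose 143 into prime factors.\nA: 143 = 11 * 13",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    for text in demonstrations:
        batch = tokenizer(text, return_tensors="pt")
        # Standard causal-LM loss: predict each token from the preceding ones.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# After fine-tuning, re-run the behavioral evaluation on the updated model.
```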

SFT-based elicitation also has limitations. Models starting too far from the desired capability may fail to converge during reinforcement learning, and supervised approaches require at least some examples of the target behavior. Additionally, fine-tuning alone cannot assess how a model's capabilities might expand through external tool access (e.g. WebGPT (Nakano et al., 2021), Toolformer (Schick et al., 2023)) or when embedded in agent frameworks (Wang et al., 2025). This highlights why dangerous capability evaluations should combine fine-tuning-based elicitation with prompting-based elicitation techniques and agent scaffolds to establish more comprehensive upper bounds on model capabilities.
Elicitation technique: Tool augmentation. This elicitation technique involves providing models with access to external tools or APIs - a form of scaffolding that extends what behaviors we can observe. Rather than evaluating just the text outputs of the model itself, we create systems where the model can generate instructions that trigger external tools like code execution environments, web browsers, or specialized APIs. The model itself still communicates through text, but the scaffolding around it executes these instructions and returns results. This approach is particularly valuable for capability evaluation, as it reveals what models can accomplish when paired with appropriate external tools. METR's autonomous replication evaluation uses this approach to test models' ability to perform complex real-world tasks (METR, 2023). For control evaluations, it helps assess whether tool access creates new avenues for circumventing safety measures.
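The sketch below shows the basic shape of such a scaffold: the model's output is parsed for a tool call, the scaffold executes the tool, and the result is appended to the transcript. The `query_model` stub and the JSON tool-call format are illustrative assumptions, not any particular framework's protocol.

```python
# Minimal tool-augmentation scaffold: the model emits text that may contain a
# tool call; the scaffold executes the tool and feeds the result back.
# `query_model` and the JSON tool-call format are hypothetical placeholders.
import json

def query_model(transcript: str) -> str:
    """Stub for the model; a real run would call an LLM with the transcript."""
    return '{"tool": "python", "code": "print(sum(range(10)))"}'

def run_python(code: str) -> str:
    """Execute code and capture stdout. Real evaluations sandbox this step."""
    import io, contextlib
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

TOOLS = {"python": run_python}

transcript = "Task: compute the sum of the integers 0..9."
for _ in range(3):  # cap the number of tool-use steps
    response = query_model(transcript)
    try:
        call = json.loads(response)
    except json.JSONDecodeError:
        break  # plain-text answer, no tool call to execute
    result = TOOLS[call["tool"]](call["code"])
    transcript += f"\nTool result: {result}"

print(transcript)
```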

Elicitation technique: Red teaming . This involves systematic attempts to elicit concerning behaviors from models. This can range from simple adversarial prompting to more complex interventions like specialized training and fine-tuning. The goal is to probe for potential risks or vulnerabilities that might not appear under standard testing. Red teaming is extensively used and discussed in our section on control evaluations, since these evaluations operate under the explicit assumption of adversarial behavior. Labs like OpenAI (OpenAI, 2024) and Anthropic (Anthropic, 2024) also use red teaming to discover dangerous capabilities before releasing models.

Elicitation technique: Long-term interaction studies. These evaluate model behavior over extended interactions or multiple sessions, revealing patterns that might not be apparent in single exchanges. This is useful for evaluating properties like goal persistence or strategy development. In propensity evaluations, this can reveal whether models maintain consistent behavioral patterns over time. One example is DeepMind's "Hidden Agenda" evaluation: a user interacts with a chatbot designed to help them learn about interesting topics, but the model has been secretly instructed to get them to take some action, like clicking a suspicious link or providing their email address. The point is to study models' manipulation capabilities over extended interactions (Phuong et al., 2024).
The techniques we have outlined here are by no means exhaustive. This is just a short overview of the types of techniques you can use when conducting behavioral evaluations.
5.3.2 Internal Techniques
Internal techniques examine how AI systems process information by studying their internal representations, activation patterns, and computational mechanisms. Unlike behavioral techniques, which focus only on observable inputs and outputs, internal analysis uses interpretability to understand how the model arrives at its outputs. This type of analysis often requires access to model weights, activations, or architectural details.
It is important to mention, though, that internal techniques are still under development, and the majority of evaluations currently rely on behavioral techniques alone. As the field of interpretability matures, we might see evaluations based on internal techniques become more common.
Enumerative safety aims to create a comprehensive dictionary of a model's knowledge. It is a specific approach within interpretability that focuses on systematically cataloging and understanding the individual features (concepts) that neural networks represent internally. The "enumerative" aspect refers to the goal of exhaustively identifying all the important features a model uses in its computations - essentially creating a comprehensive dictionary or catalog of the concepts the model "knows about". It's like trying to enumerate all the parts that make up a complex machine (Olah, 2023). It is different from mechanistic or developmental interpretability: mechanistic interpretability focuses on understanding computational pathways (circuits) showing how models transform inputs into outputs, developmental interpretability studies how features form during training, and enumerative safety focuses specifically on cataloging the features/concepts that exist within models.
Sparse autoencoders (SAEs) are a key technique for enumerative safety. SAEs decompose a model's internal activations into simpler, non-overlapping components, separating concepts mixed together in superposition and making them interpretable to humans. Recent research suggests neural networks contain organized feature families that make this systematic analysis more feasible. SAEs have successfully extracted interpretable features from models like Claude 3 Sonnet, revealing features related to safety-relevant behaviors including deception, sycophancy, and bias (Cunningham et al., 2023). Anthropic has demonstrated practical applications of these techniques in isolating and tracing the "thoughts" of a language model (Anthropic, 2025). In alignment auditing, these methods have even identified models deliberately trained with hidden objectives by finding features that activate when the model attempts to game its training objectives (Anthropic, 2025). While promising, these results remain preliminary and have not yet been applied to full-scale AI auditing.
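For intuition, here is a minimal sparse autoencoder of the kind used for this dictionary-learning step. The activations are random placeholders standing in for activations cached from a real model, and the dimensions and sparsity penalty are illustrative rather than tuned values.

```python
# Minimal sparse autoencoder of the kind used to decompose model activations
# into interpretable features. Activations are random placeholders; the
# dimensions and L1 coefficient are illustrative, not tuned values.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature space
        self.decoder = nn.Linear(d_dict, d_model)   # features -> reconstructed activations

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))      # sparse, non-negative feature activations
        return self.decoder(features), features

d_model, d_dict = 512, 4096                          # dictionary much wider than the activation space
sae = SparseAutoencoder(d_model, d_dict)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

activations = torch.randn(1024, d_model)             # placeholder for cached model activations
for step in range(100):
    recon, features = sae(activations)
    # Reconstruction loss keeps features faithful; the L1 penalty keeps them sparse.
    loss = (recon - activations).pow(2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```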

Representation analysis. We can use techniques like representation engineering (RepE), which involve studying how models encode and represent different types of information internally. RepE takes inspiration from neuroscience and places representations at the center of analysis, rather than neurons or circuits (in contrast to mechanistic interpretability). We can potentially use this to monitor and manipulate high-level cognitive phenomena in deep neural networks (Zou et al., 2023). We can also examine the geometry of the model's representation space or track how representations evolve during processing (Marks et al., 2023).
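One simple representation-reading recipe in this spirit is to estimate a direction from contrastive examples (say, honest versus dishonest statements) and project new activations onto it. The sketch below uses random placeholder activations; a real analysis would use hidden states cached from the model being studied.

```python
# Sketch of a simple representation-reading recipe in the spirit of RepE:
# estimate a direction from contrastive examples and project new activations
# onto it. Activations are random placeholders for a real model's hidden states.
import torch

d_model = 512
honest_acts = torch.randn(200, d_model) + 0.5        # activations on "honest" prompts
dishonest_acts = torch.randn(200, d_model) - 0.5     # activations on "dishonest" prompts

# Difference-in-means direction separating the two sets of representations.
direction = honest_acts.mean(dim=0) - dishonest_acts.mean(dim=0)
direction = direction / direction.norm()

def reading_score(activation: torch.Tensor) -> float:
    """Project an activation onto the direction; higher = more 'honest-like'."""
    return float(activation @ direction)

new_activation = torch.randn(d_model)
print(f"Reading score: {reading_score(new_activation):.3f}")
```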

Probes can be used to find patterns in a model's internal activations. A probe is essentially a linear classifier trained on a model's internal activations to detect specific patterns or behaviors (Roger et al., 2023). In simpler terms, a probe examines the model's internal representations during operation. Probes can be trained using synthetic examples of concerning behavior - humans can write examples of the scenarios we want to evaluate for, like an AI attempting a takeover or exploiting vulnerabilities, and use these to get models to exhibit this type of concerning behavior. We then train a classifier (the probe) on the activation patterns these models exhibit. The goal is for the probe to become good at recognizing specific concerning behaviors from internal activations, so it can be used to flag concerning activation patterns in newer models in the future.
Probes differ from mechanistic interpretability because they focus on detecting specific concerning behaviors rather than trying to understand the complete internal mechanisms of the model. As a concrete example, researchers trained "coup probes" to identify potential takeover attempts. By creating a dataset of examples describing ways an AI might try to take over, they found probes could reliably flag concerning behavior even when the text was written differently or translated to other languages (Roger et al., 2023).
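Here is a minimal sketch of such a probe: a logistic-regression classifier trained on cached activations labeled as concerning or benign. The synthetic data stands in for real activations, and the small mean offset between the classes is only there so the toy example has something to learn.

```python
# Minimal linear probe: a logistic-regression classifier trained on cached
# activations labeled as concerning vs. benign. Data here is synthetic noise;
# a real probe would use activations from the model being evaluated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n_per_class = 512, 500

# Placeholder activations: benign examples vs. examples written to exhibit the
# concerning behavior (e.g. takeover planning), separated by a small offset.
benign = rng.normal(0.0, 1.0, size=(n_per_class, d_model))
concerning = rng.normal(0.3, 1.0, size=(n_per_class, d_model))

X = np.vstack([benign, concerning])
y = np.array([0] * n_per_class + [1] * n_per_class)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Training accuracy: {probe.score(X, y):.2f}")

# At evaluation time, the probe flags activations that look like the
# concerning class, even if the surface text is phrased differently.
```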
Probes can be combined with other evaluation techniques, like having human red teams attempt to elicit deceptive behavior while using probes to monitor the model's internal representations. This multi-layered approach could help catch subtle forms of concerning behavior that might not be apparent from examining outputs alone. However, probe effectiveness likely depends on the model not being dramatically more capable than humans in safety-critical domains, as extremely advanced systems might find ways to manipulate their internal representations to fool the probes (Roger et al., 2023). This is usually called "gradient hacking". It's worth pointing out though that gradient hacking is extremely difficult for a model to actually do (Millidge, 2023).
Internal analysis techniques often complement behavioral approaches - while behavior tells us what a model does, internal analysis helps explain why it does it. This combination is particularly valuable for evaluations where we need high confidence, like verifying safety-critical properties or understanding the implementation of dangerous capabilities. But as we mentioned at the beginning of this section, internal techniques are not yet widely used, because as of 2025 they are still being developed. The majority of the focus is on dangerous capability evaluations and on demonstrations through things like model organisms, which we talk about in the next section on evaluation frameworks.
1. At the time of writing, METR is transitioning to a newer tool called Inspect; Vivaria remains available as an open-source solution, allowing researchers to implement various elicitation techniques like tool augmentation and multi-step reasoning in a controlled, reproducible environment. ↩
2. This can also be thought of as inference-time scaling for reasoning models. It is effectively the same underlying technique. ↩
3. This could be done to pass safety checks and engage in safetywashing. Think of how Volkswagen hid the high emissions of its cars (Ewing, 2017). ↩