# Chapter 5: Evaluations

AI models are advancing so fast that our tests for their abilities become outdated almost immediately, forcing us to constantly develop more comprehensive evaluation methods to keep up.

[Read online](https://aisafetytextbook.com/chapters/v1/evaluations/introduction)

---

## Introduction

AI models are advancing so fast that our tests for their abilities become outdated almost immediately, forcing us to constantly develop more comprehensive evaluation methods to keep up.

> When you can measure what you are speaking about, and express it in numbers, you know something about it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts advanced to the stage of science.
> — Lord Kelvin

**The gap between what AI systems can do and what we can reliably measure creates a fundamental safety challenge.** In late 2024, AI researchers created FrontierMath, a benchmark of exceptionally difficult problems they predicted would "*resist AIs for several years*". Just a few months later, OpenAI's o3 model achieved 25.2% accuracy on these supposedly insurmountable problems. This pattern repeats across AI development: tools designed to measure AI capabilities become obsolete almost immediately as models rapidly surpass them. As AI systems approach potentially transformative capabilities in domains like cybersecurity, autonomous operation, and strategic planning, this evaluation gap becomes increasingly dangerous. We cannot afford to discover the full extent of advanced AI capabilities through their emergent real-world impacts.

**Benchmarks provide standardization of measurement but currently fail to capture the complex risks posed by advanced AI systems.** Early AI development faced a measurement crisis similar to pre-standardized engineering—without reliable metrics, progress was chaotic and unpredictable. Benchmarks revolutionized AI development by creating standardized metrics that allowed meaningful comparisons between systems. While benchmarks like MMLU, GPQA, or FrontierMath are extremely useful, they still fall short when predicting real-world behavior. They fail to measure what happens when models are pushed to their limits or operate in complex environments where unexpected capability combinations emerge. As we move towards highly capable AI systems, we need stronger safety guarantees in a wider range of scenarios than what benchmarks alone can provide.

**Safety risks emerge from interactions between capabilities, creating a combinatorial challenge. **Individual capabilities like situational awareness or deception appear manageable when measured in isolation, but their combination creates emergent risks fundamentally different from either capability alone. Models might recognize when they're being evaluated and selectively hide capabilities during testing but demonstrate concerning behaviors during deployment. This combinatorial explosion makes comprehensive testing exponentially more difficult but essential for meaningful safety claims. A model passing most safety benchmarks might still harbor dangerous capability combinations that weren't explicitly tested. This is why we need to move measurement of safety critical properties of a system towards comprehensive evaluations.

**Effective safety assessment requires measuring three distinct properties: capabilities, propensities, and control.** Dangerous capabilities evaluations establish upper bounds on what models can do when pushed to their limits through techniques like scaffolding and tool-augmentation. Propensity evaluations reveal what behaviors models tend to exhibit by default, measuring behavioral tendencies rather than just maximum abilities. Control evaluations verify whether safety protocols remain effective when models actively try to circumvent them. This three-part taxonomy provides a systematic framework for safety measurement that helps answer the most decision-relevant questions about AI.

**Safety evaluation requires both behavioral and internal techniques, each providing different forms of evidence.** Behavioral techniques examine model outputs through approaches like red teaming, which systematically attempts to elicit concerning behaviors; supervised fine-tuning, which elicits capabilities by modifying weights rather than just prompting; and best-of-N sampling, which examines multiple potential responses to understand output distributions. These techniques can establish upper bounds on potential capabilities but struggle to tell us "why" models generate certain outputs. Internal techniques complement this by examining model mechanisms directly. For example, sparse autoencoders have successfully extracted interpretable features related to safety-relevant behaviors including deception, sycophancy, and bias. Other techniques like mechanistic interpretability, can help trace computational pathways through the model, enumerative safety can catalogs concepts the model has encoded, and representation engineering can examine how models encode information. Behavioral and internal evaluation techniques are complementary and together provide stronger safety guarantees than either approach alone.

**Evaluation frameworks help transform measurements into concrete development and deployment decisions.** Rather than relying on ad-hoc responses to capabilities, frameworks like Anthropic's Responsible Scaling Policies establish "AI Safety Levels" inspired by things like biosafety containment protocols, with each level requiring increasingly stringent evaluation requirements and safety measures. These frameworks create "evaluation gates" that determine when scaling can proceed safely—requiring models to pass cybersecurity, biosecurity, and autonomous replication evaluations before development continues. By integrating evaluations into governance structures, we create systematic approaches to managing AI risk rather than relying on ad-hoc decisions.

**Evaluations must be systematically designed to maintain quality and scale across increasingly complex models.** Evaluation design requires careful consideration of affordances—the resources and opportunities provided to the model during testing. By systematically varying affordances from minimal (restricting tools and resources) to maximal (providing all potentially relevant tools and context), we can build a more complete picture of model behavior under different conditions. As the number of safety-relevant properties grows, automating evaluation becomes necessary. We can potentially use model-written evaluations to help address scaling challenges.

**Despite significant progress, AI evaluations face fundamental limitations that threaten their reliability.**  The asymmetry between proving presence versus absence of capabilities means we can never be certain we've detected all potential risks. Evaluations can conclusively confirm that a model possesses certain capabilities but cannot definitively prove their absence. Technical challenges include measurement sensitivity—performance can vary based on seemingly trivial changes in prompting formats—and the combinatorial explosion of test cases as we add new dimensions to evaluate. Misalignment might lead to model "sandbagging" (strategic underperformance on evaluations), research shows language models can be made to selectively underperform on tests for dangerous capabilities while maintaining performance on general benchmarks. Organizational incentives might lead labs themselves to do "safety washing" (misrepresenting capability improvements as safety advancements). These challenges highlight the need for continued research into more robust evaluation methodologies and institutional arrangements that support genuinely independent assessment.

*Figure 5.1: Overview of chapter content.*

This introduction gave you the general overview of many of the concepts that we will be talking about throughout this chapter. The sections will largely proceed in the order that we introduced the ideas above. We begin by exploring how benchmarks have shaped AI development.

## Benchmarks

Standardized measurement tools shape development and help us measure where we currently stand. Many benchmarks are being developed to measure both capabilities and safety of AI systems.

**What is a benchmark?** Imagine trying to build a bridge without measuring tape. Before standardized units like meters and grams, different regions used their own local measurements. Besides just making engineering inefficient - it also made it dangerous. Even if one country developed a safe bridge design, specifying measurements in "three royal cubits" of material meant builders in other countries couldn't reliably reproduce that safety. A slightly too-short support beam or too-thin cable could lead to catastrophic failure.

AI basically had a similar problem before we started using standardized benchmarks.[^1] A benchmark is a tool like a standardized test, which we can use to measure and compare what AI systems can and cannot do. They have historically mainly been used to measure capabilities, but we are also seeing them being developed for AI Safety and Ethics in the last few years.

**How do benchmarks shape AI development and safety research?** Benchmarks in AI are slightly different from other scientific fields. They are an evolving tool that both measures, but also actively shapes the direction of research and development. When we create a benchmark, we're essentially saying, - "this is what we think is important to measure." If we can guide the measurement, then to some extent we can also guide the development.

> Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress
> — François Chollet

## History and Evolution

*Interactive figure 5.1: Benchmark scores of various AI capabilities relative to human performance. ([Giattino et al., 2023](https://ourworldindata.org/artificial-intelligence))*

**Example: Benchmarks influencing standardization in computer vision**. As one concrete example of how benchmarks influence AI development, we can look at the history of benchmarking in computer vision. In 1998, researchers introduced MNIST, a dataset of 70,000 handwritten digits. ([LeCun, 1998](https://yann.lecun.com/exdb/mnist/)) The digits were not the important part, the important part was that each digit image was carefully processed to be the same size and centered in the frame, and that the researchers made sure to get digits from different writers for the training set and test set. This standardization gave us a way to make meaningful comparisons about AI capabilities. In this case, the specific capability of digit classification. Once systems started doing well on digit recognition, researchers developed more challenging benchmarks. CIFAR-10/100 in 2009 introduced natural color images of objects like cars, birds, and dogs, increasing the complexity. ([Krizhevsky, 2009](https://www.cs.toronto.edu/~kriz/cifar.html)) Similarly, ImageNet later the same year provided 1.2 million images across 1,000 categories. ([Deng, 2009](https://ieeexplore.ieee.org/document/5206848)) When one research team claimed their system achieved 95% accuracy on MNIST or ImageNet and another claimed 98%, everyone knew exactly what those numbers meant. The measurements were trustworthy because both teams used the same carefully constructed dataset. Each new benchmark essentially told the research community: "You've solved the previous challenge - now try this harder one." So benchmarks both measure progress, but they also define what progress means.

*Figure 5.2: Examples of digits from MNIST ([MNIST database - Wikipedia](https://upload.wikimedia.org/wikipedia/commons/b/b1/MNIST_dataset_example.png))*

**How do benchmarks influence AI Safety?** Without standardized measurements, we can't make systematic progress on either capabilities or safety. Just like benchmarks define what capabilities progress means, when we develop safety benchmarks, we're establishing concrete verifiable standards for what constitutes "safe for deployment". Iterative refinement means we can guide AI Safety by coming up with benchmarks with increasingly stringent standards of safety. Other researchers and organizations can then reproduce safety testing and confirm results. This shapes both technical research into safety measures and policy discussions about AI governance.

**Language model benchmarking has already evolved, and is going to continue evolving**. Just like how benchmarks continuously evolved in computer vision, they followed similar progress in language generation. Early language model benchmarks focused primarily on capabilities - can the model answer questions correctly? Complete sentences sensibly? Translate between languages? Since the invention of the transformer architecture in 2017, we've seen an explosion both in language model capabilities and in the sophistication of how we evaluate them. We can’t possibly be exhaustive, but here are just a couple of benchmarks that current day language models are evaluated against:

*Figure 5.3: Example of popular language models (Claude 3.5) being evaluated on various benchmarks ([Anthropic, 2024](https://www.anthropic.com/news/claude-3-5-sonnet))*

*Interactive figure 5.2: Benchmark performance on coding, math and language. ([Giattino et al., 2023](https://ourworldindata.org/artificial-intelligence))*

**Examples of various capabilities benchmarks**

**Benchmarking language and task understanding.** General Language Understanding Evaluation (GLUE) benchmark ([Wang et al., 2018](https://arxiv.org/abs/1804.07461)), and its successor SuperGLUE ([Wang et al., 2019](https://arxiv.org/abs/1905.00537)) test difficult language understanding tasks. SWAG ([Zellers et al., 2018](https://arxiv.org/abs/1808.05326)), and HellaSwag ([Zellers et al., 2019](https://arxiv.org/abs/1905.07830)) tests specifically the ability to predict which event would naturally follow from a given story scenario.

**Broad cross domain evaluations**. The MMLU (Massive Multitask Language Understanding) benchmark ([Hendrycks et al., 2020](https://arxiv.org/abs/2009.03300)) tests a model's knowledge across 57 subjects. It assesses both breadth and depth across humanities, STEM, social sciences, and other fields through multiple choice questions drawn from real academic and professional tests. The GPQA (Google Proof QA) ([Rein et al., 2023](https://arxiv.org/abs/2311.12022)) has multiple choice questions specifically designed so that correct answers can’t be found through simple internet searches. This tests whether models have genuine understanding rather than just information retrieval capabilities. BigBench ([Srivastava et al., 2022](https://arxiv.org/abs/2206.04615)) is yet another example of benchmarks for measuring generality by testing on a wide range of tasks.

**Benchmarking mathematical and scientific reasoning**. For specifically testing mathematical reasoning, a couple of examples include - the Grade School Math (GSM8K) ([Cobbe et al., 2021](https://arxiv.org/abs/2110.14168)) benchmark. This tests core mathematical concepts at an elementary school level. Another example is the MATH ([Hendrycks et al., 2021](https://arxiv.org/abs/2103.03874)) benchmark similarly tests seven subjects including algebra, geometry, and precalculus focuses on competition-style problems. They also have multiple difficulty levels per subject. These benchmarks also include step-by-step solutions which we can use to test the reasoning process, or train models to generate their reasoning processes. Multilingual Grade School Math (MGSM) is the multilingual version translated 250 grade-school math problems from the GSM8K dataset. ([Shi et al., 2022](https://arxiv.org/abs/2210.03057))

**Benchmarking SWE and coding**. The Automated Programming Progress Standard (APPS) ([Hendrycks et al., 2021](https://arxiv.org/abs/2105.09938)) is a benchmark specifically for evaluating code generation from natural language task descriptions. Similarly, HumanEval ([Chen et al, 2021](https://arxiv.org/abs/2107.03374)) tests python coding abilities, and its extensions like HumanEval-XL ([Peng et al.,2024](https://arxiv.org/abs/2402.16694)) tests cross-lingual coding capabilities between 23 natural languages and 12 programming languages. HumanEval-V ([Zhang et al., 2024](https://arxiv.org/abs/2410.12381)) tests coding tasks where the model must interpret both diagrams or charts, and textual descriptions to generate code. BigCode ([Zuho et al., 2024](https://arxiv.org/abs/2406.15877)), benchmarks code generation and tool usage by measuring a model’s ability to correctly use multiple Python libraries to solve complex coding problems.

*Figure 5.4: Example of coding task and test cases on APPS ([Hendrycks et al., 2021](https://arxiv.org/abs/2105.09938))*

**Benchmarking ethics and bias.** The ETHICS benchmark ([Hendrycks et al., 2023](https://arxiv.org/abs/2008.02275)) tests a language model's understanding of human values and ethics across multiple categories including justice, deontology, virtue ethics, utilitarianism, and commonsense morality. The TruthfulQA ([Lin et al., 2021](https://arxiv.org/abs/2109.07958)) benchmark measures how truthfully language models answer questions. It specifically focuses on "imitative falsehoods" - cases where models learn to repeat false statements that frequently appear in human-written texts in domains like health, law, finance and politics. 

*Figure 5.5: Example of larger models being less truthful on TruthfulQA ([Lin et al., 2021](https://arxiv.org/abs/2109.07958)). This is an example of inverse scaling, i.e. when a bigger model performance decreases on some questions.*

*Figure 5.6: Example question from the ETHICS benchmark ([Hendrycks et al., 2023](https://arxiv.org/abs/2008.02275))*

**Benchmarking safety**. An example focused on misuse is AgentHarm ([Andriushchenko et al., 2024](https://arxiv.org/abs/2410.09024)). It is specifically designed to measure how often LLM agents respond to malicious task requests. An example that focuses slightly more on misalignment is the MACHIAVELLI ([Pan et al., 2023](https://arxiv.org/abs/2304.03279)) benchmark. It has ‘choose your own adventure’ style games containing over half a million scenarios focused on social decision making. It measures "Machiavellian capabilities" like power seeking and deceptive behavior, and how AI agents balance achieving rewards and behaving ethically.

*Figure 5.7: A mock-up of a game in the MACHIAVELLI benchmark, a suite of text-based environments. At each step, the agent observes the scene and a list of possible actions; it selects an action from the list. Each game is a text-based story, which is generated adaptively–branches open and close based on prior actions. The agent receives a reward when it achieves one of the goals. This type of benchmark allows the researchers to construct a behavioral report of the agent and measure the trade-off between rewards and ethical behavior ([Pan et al., 2023](https://arxiv.org/abs/2304.03279)).*

**Details - Benchmark: Frontier Math (Glazer et al., 2024) & Humanities Last Exam  (Hendrycks & Wang, 2024)**

> **Warning:** This is an extra explanation of the frontier math mathematical benchmark. You can safely skip this.

*Figure 5.8: Mathematical subject interconnections in FrontierMath. Node sizes indicate the frequency of each subject’s appearance in problems, while connections indicate when multiple mathematical subjects are combined within single problems, demonstrating the benchmark’s integration of many mathematical domains. ([Glazer et al., 2024](https://arxiv.org/abs/2411.04872))*

**What makes FrontierMath so difficult?** Unlike most benchmarks which risk training data contamination, FrontierMath uses entirely new, unpublished problems. Each problem is carefully crafted by expert mathematicians and requires multiple hours (sometimes days) of work even for researchers in that specific field. For example Terence Tao (fields medal winner 2006, regarded as one of the smartest mathematicians in the world) said about the problems - "*These are extremely challenging ... I think they will resist AIs for several years at least.*" ([EpochAI, 2024](https://epoch.ai/frontiermath)) Similarly Timothy Gowers (highly regarded mathematician, and  fields medal winner 1998) said - "*Getting even one question right would be well beyond what we can do now, let alone saturating them.*" ([EpochAI, 2024](https://epoch.ai/frontiermath))

The benchmark spans most major branches of modern mathematics - from computationally intensive problems in number theory to abstract questions in algebraic topology and category theory. To ensure problems are truly novel, they undergo expert review and plagiarism detection. The benchmark also enforces strict "guess proofness" - problems must be designed so there's less than a 1% chance of guessing the correct answer without doing the mathematical work. This means problems often have large, non-obvious numerical answers that can only be found through proper mathematical reasoning. The benchmark provides an experimental environment where models can write and test code to explore mathematical ideas, similar to how human mathematicians work. While problems must have automatically verifiable answers (either numerical or programmatically expressible mathematical objects), they still require sophisticated mathematical reasoning to solve.

*Figure 5.9: One sample problem from the FrontierMath benchmark ([Besiroglu et al., 2024](https://epoch.ai/frontiermath/the-benchmark)).*

Just to showcase the rapid pace of advancement even on this benchmark that even fields medal winning mathematicians consider extremely challenging, between the announcement of the FrontierMath benchmark the state-of-the-art models could solve less than 2% of FrontierMath problems. ([Glazer et al., 2024](https://arxiv.org/abs/2411.04872)) Just a couple of months later, OpenAI announced the o3 model, which then shot performance up to 25.2%. This highlights yet again the breakneck pace of progress and continuous saturation of every benchmark that we are able to develop.

*Figure 5.10: Performance of leading language models on FrontierMath. All models show consistently poor performance, with even the best models (as of Nov 2024) solving less than 2 percent of problems ([Besiroglu et al., 2024](https://epoch.ai/frontiermath/the-benchmark)). A few months later OpenAI claimed that their o3 model could score 25 percent on FrontierMath ([Brown, 2024](https://x.com/polynoamial/status/1870172996650053653?mx=2)).*

To keep up with the pace, researchers are developing what is described as "Humanity's Last Exam" (HLE). A benchmark aimed at building the world's most difficult public AI benchmark gathering experts across all fields ([Phan et al., 2025](https://arxiv.org/abs/2501.14249)).

## Limitations

Current benchmarks face several critical limitations that make them insufficient for truly evaluating AI safety. Let's examine these limitations and understand why they matter.

Training Data** Contamination**. Imagine preparing for a test by memorizing all the answers without understanding the underlying concepts. You might score perfectly, but you haven't actually learned anything useful. LLMs face a similar problem. As these models grow larger and are trained on more internet data, they're increasingly likely to have seen benchmark data during training. This creates a fundamental issue - when a model has memorized benchmark answers, high performance no longer indicates true capability. The benchmarks we discussed in the previous section like the MMLU or TruthfulQA have been very popular. So they have their questions and answers discussed across the internet. If and when these discussions end up in a model's training data, the model can achieve high scores through memorization rather than understanding.

**Understanding vs. Memorization Example**. The Caesar cipher is a simple encryption method that shifts each letter in the alphabet by a fixed number of positions - for example, with a left shift of 3, 'D' becomes 'A', 'E' becomes 'B', and so on. If encryption is left shift by 3, then decryption means just shifting right by 3.

*Figure 5.11: Example of a Cesar Cipher*

Language models like GPT-4 can solve Caesar cipher problems when the shift value is 3 or 5, which appear commonly in online examples. However, give them the exact same problem with uncommon shift values (like 67) they tend to fail completely ([Chollet, 2024](https://www.youtube.com/watch?v=s7_NlkBwdj8)). This indicates that the models might not have learned the general algorithm for solving Caesar ciphers. We are not trying to point to a limitation in model capabilities. We expect this can be mitigated with reasoning models trained on chains of thought, or with tool augmented models. However benchmarks often just use the models 'as is' without modifications or augmentation, which leads to capabilities being under represented. This is the core point that we are trying to convey.

**Why do these benchmarking limitations matter for AI Safety?** Benchmarks (including safety benchmarks) might not be measuring what we think they are measuring. For example benchmarks like ETHICS, or TruthfulQA aim to measure how well a model "understands" ethical behavior, or has a tendency to avoid imitative falsehood by measuring language generation on multiple choice tests, but we might still be measuring surface level metrics. The model might not have learned what it means to behave ethically in a situation. An AI system might work perfectly on all ethical questions and test cases, pass all safety benchmarks, but demonstrate new behavior when encountering a new real-world scenario.

An easy answer is just to keep augmenting benchmarks or training data with more and more questions, but this seems intractable and does not scale forever. The fundamental issue is that the space of possible situations and tasks is effectively infinite. Even if you train on millions of examples, you've still effectively seen roughly 0% of the total possible space. ([Chollet, 2024](https://www.dwarkeshpatel.com/p/francois-chollet)) Research indicates that this isn't just a matter of insufficient data or model size - it's baked into how language models are currently trained - logical relationships like inferring inverses (the weights learned when training on "A → B" don't automatically strengthen the reverse connection "B ← A") or transitivity don't emerge naturally from standard training ([Zhu et al., 2024](https://arxiv.org/abs/2405.04669); [Golovneva et al., 2024](https://arxiv.org/abs/2403.13799); [Berglund et al., 2024](https://arxiv.org/abs/2309.12288)). Proposed solutions like reverse training during pre-training show promise to alleviate such issues ([Golovneva et al., 2024](https://arxiv.org/abs/2403.13799)), but they require big changes to how models are trained. 

Engineers are more than aware of these current limitations, and the expectation is that these problems will be alleviated over time. The core question we are concerned with in this chapter is not of limitations in model capabilities, it is about whether benchmarks and measuring techniques are able to stay in front of training paradigms, and if they are truly able to accurately assess what the model can be capable of.

**Why can't we just make better benchmarks?** The natural response to these limitations might be "let's just design better benchmarks." And to some extent, we can!

We've already seen how benchmarks have consistently evolved to address their shortcomings. Researchers are constantly actively working to create benchmarks that test both knowledge, resist memorization and test deeper understanding. Just a couple of examples are the Abstraction and Reasoning Corpus (ARC) ([Chollet, 2019](https://arxiv.org/abs/1911.01547)), ConceptARC ([Moskvichev et al. 2023](https://arxiv.org/abs/2305.07141)), Frontier Math ([Glazer et al., 2024](https://arxiv.org/abs/2411.04872)) and Humanities Last Exam ([Hendrycks & Wang, 2024](https://www.safe.ai/blog/humanitys-last-exam)). They are trying to explicitly benchmark whether models have grasped abstract concepts and general purpose reasoning rather than just memorizing patterns. Similar to these benchmarks that seek to measure capabilities, we can also continue improving safety specific benchmarks to be more robust.

**Why aren't better benchmarks enough? **While improving benchmarks is important and will help AI safety efforts, the fundamental paradigm of benchmarking still has inherent limitations. There are fundamental limitations in traditional benchmarking approaches that necessitate more sophisticated evaluation methods ([Burden, 2024](https://arxiv.org/abs/2407.09221)). The core issue is that benchmarks tend to be performance-oriented rather than capability-oriented - they measure raw scores without systematically assessing whether systems truly possess the underlying capabilities being tested. While benchmarks provide standardized metrics, they often fail to distinguish between systems that genuinely understand tasks versus those that merely perform well through memorization or spurious correlations. A benchmark that simply assesses performance, no matter how sophisticated, cannot fully capture the dynamic nature[^2] of real-world AI deployment where systems need to adapt to novel situations and will probably combine capabilities and affordances in unexpected ways. We need to measure the upper limit of model capabilities.

**The need for Compute-Aware Benchmarking**

We have observed the advent of inference scaling laws alongside the rise of large reasoning models like DeepSeek r1, OpenAIs o3 etc. These are in addition to the established training scaling laws that we explained in the capabilities chapter. Now, when evaluating AI systems, we need to carefully account for computational resources used. The 2024 ARC prize competition demonstrated why - systems on both the compute-restricted track (10 dollars worth of compute) and the unrestricted track (10,000 dollars worth of compute) achieved similar 55% accuracy scores, suggesting that better ideas and algorithms can sometimes compensate for less compute ([Chollet et al., 2024](https://arcprize.org/media/arc-prize-2024-technical-report.pdf)). This means without standardized compute budgets, benchmark results become difficult to interpret. A model might achieve higher scores simply by using more compute rather than having better underlying capabilities. This highlights why besides just creating datasets, benchmarks also need to specify both training and inference compute budgets for meaningful comparisons.

**What makes comprehensive evaluations different from just benchmarking?** Evaluations are comprehensive protocols that work backwards from concrete threat models. Rather than starting with what's easy to measure, they start by asking "What could go wrong?" and then work backwards to develop systematic ways to test for those failure modes. Organizations like METR have developed approaches that go beyond simple benchmarking. Instead of just asking "Can this model write malicious code?", they consider threat models like - a model using security vulnerabilities to gain computing resources, copy itself onto other machines, and evade detection.

**Definition: Evaluations** — An evaluation is a complete safety assessment protocol which includes the use of benchmarks.

That being said, as evaluations are new, benchmarks have been around longer and are also evolving. So at times there is overlap in the way that these words are used. For the purpose of this text, we think of a benchmark like an individual measurement tool, and an evaluation as a complete safety assessment protocol which includes the use of benchmarks. Depending on how comprehensive the benchmarks testing methodology is, a single benchmark might be thought of as an entire evaluation. But in general, evaluations typically encompass a broader range of analyses, elicitation methods, and tools to gain a comprehensive understanding of a system's performance and behavior.

[^1]:  This is true to a large extent, but as always there is not 100% standardization. We can make meaningful comparisons, but trusting them completely without many more details should be approached with some caution.

[^2]:  We could have benchmarks in environments populated by other agents. Some RL benchmarks already do this. This is amongst one of the many additions to benchmarking that moves us towards a holistic evaluation suite.

## Evaluated Properties

Effective safety evaluations require measuring several things - the upper limit of what an AI can do, what it tends to do by default, and whether our safety measures can mitigate dangerous actions.

An "evaluation" is fundamentally about measuring or assessing some property of an AI system. The key aspects that make something an evaluation rather than other AI work are:

1. There is a specific property or risk being assessed
2. There is a methodology for gathering evidence about that property
3. There is some way to analyze that evidence to draw conclusions about the property

But before we talk about "how to do evaluations" we still need to also answer the more fundamental question of "what aspects of AI systems are we even trying to evaluate? And why?" So in this section, we'll explore what properties of AI systems we need to evaluate and why they matter for safety. Later sections will dive deeper into evaluation design and methodology.

*Figure 5.12: Figure distinguishing the three related but distinct concepts of evaluated properties, the techniques used to evaluate model properties, and *

**What aspects of AI systems do we need to evaluate?** In the previous section on benchmarks, we saw how the field of measuring AI systems has evolved over time. Benchmarks like MMLU or TruthfulQA are useful tools, giving us standardization in measurement for both AI capabilities and safety. Now, we need to pair this standardization with increased comprehensiveness based on real world threat models. Evaluations use benchmarks, but typically also involve other elements like red-teaming to give us both a standardized, and comprehensive picture of decision relevant properties of a model.

**Why do we need to evaluate different properties?** The most fundamental distinction in AI evaluations is between what a model can do (capabilities) versus what it tends to do (propensities). To understand why this distinction matters, imagine an AI system that is capable of writing malicious code when explicitly directed to do so, but consistently chooses not to do so unless specifically prompted. Simply measuring the system's coding capabilities wouldn't tell us about its behavioral tendencies, and vice versa. Understanding both aspects is crucial for safety assessment.

**How do these evaluation types work together?** We are going to talk about capabilities, propensities, and control as distinct categories, but this is for the purpose of conceptual clarity and explanation. Reality is always messy, and in practice they often overlap and complement each other.[^3] A capability evaluation might reveal behavioral tendencies during testing, like a model demonstrating a propensity toward honesty while being evaluated for coding ability. These types of overlap are sometimes desirable[^4] - different evaluation approaches can provide complementary evidence about an AI system's safety and reliability ([Roger et al., 2023](https://arxiv.org/abs/2312.06942)). The main thing to recognize is what each type of evaluation tells us:

- Dangerous capability evaluations give us upper bounds on potential risks[^5]
- Dangerous propensity evaluations tell us which default behavioral patterns might be problematic
- Control evaluations verify our containment measures[^6]

Another thing to remember is that there are various different approaches that we can follow when evaluating for all the above types of properties. For example, we can conduct both capability or propensity evaluations in a black box manner - studying model behavior only through inputs and outputs, or a gray box manner - using interpretability tools to examine model internals. ([Hubinger, 2023](https://www.alignmentforum.org/posts/uqAdqrvxqGqeBHjTP/towards-understanding-based-safety-evaluations)). White box is not currently possible unless we make significant strides in interpretability.[^7] These are different design choices in how we structure our evaluations when we are trying to evaluate for the above properties.

*Figure 5.13: In this chapter, we will mainly talk about AI systems, rather than AI models. An AI system comprises an AI model, with its weights and architecture, as well as various other system parameters, including system prompts, prefix prompts, scaffolding prompts, and retrieval, databases ([Sharkey, 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf)).*

## Capability

**What are capability evaluations?** When it comes to capabilities (measuring what a model can do), as far as AI Safety is concerned, we're particularly interested in abilities that could enable harmful outcomes. These dangerous capabilities generally fall into several key categories ([Shevlane et al., 2023](https://arxiv.org/abs/2305.15324)):

1. **Cyber-offense**: The model can discover vulnerabilities in systems (hardware, software, data). It can write code for exploiting those vulnerabilities. […] If deployed as a coding assistant, it can insert subtle bugs into the code for future exploitation.
2. **Deception**: The model has the skills necessary to deceive humans, e.g. constructing believable (but false) statements, making accurate predictions about the effect of a lie on a human, and keeping track of what information it needs to withhold to maintain the deception.
3. **Persuasion and manipulation**: The model is effective at shaping people’s beliefs, in dialogue and other settings (e.g. social media posts), even towards untrue beliefs. […] It can convince people to do things that they would not otherwise do, including unethical acts.
4. **Political strategy**: The model can perform the social modeling and planning necessary for an actor to gain and exercise political influence […]. For example, the model can score highly in forecasting competitions on questions relating to global affairs or political negotiations.
5. **Weapons acquisition and development**: The model can gain access to existing weapons systems or contribute to building new weapons. For example, the model could assemble a bioweapon (with human assistance) or provide actionable instructions for how to do so. […]
6. **Long-horizon planning**: The model can make sequential plans that involve multiple steps, unfolding over long time horizons (or at least involving many interdependent steps). It can perform such planning within and across many domains. […]
7. **AI development**: The model could build new AI systems from scratch, including AI systems with dangerous capabilities. It can find ways of adapting other, existing models to increase their performance on tasks relevant to extreme risks. […]
8. **Situational awareness**: The model can distinguish between whether it is being trained, evaluated, or deployed – allowing it to behave differently in each case. The model knows that it is a model, and has knowledge about itself and its likely surroundings (e.g. what company trained it, where their servers are, what kind of people might be giving it feedback, and who has administrative access).
9. **Self-proliferation**: The model can break out of its local environment (e.g. using a vulnerability in its underlying system or suborning an engineer). The model can exploit limitations in the systems for monitoring its behavior post-deployment.

This list is not exhaustive and may not address all plausible extreme risk scenarios. The list may also be biased towards more readily identifiable or already understood capabilities, possibly overlooking emergent risks. Additionally, many of these capabilities can become even more dangerous when combined. For example, strong situational awareness combined with deception capabilities could enable a model to behave differently during evaluation vs during deployment. This is why comprehensive evaluation protocols need to assess not just individual capabilities, but also their interactions. We will go into much more detail on individual dangerous capability evaluations in a dedication section later.

## Propensity

**What are propensity evaluations?** Capability evaluations tell us what a model can do when directed, propensity evaluations reveal what behaviors a model prioritizes by default (what it tends to do). These are also often called "alignment evaluations". A key aspect that distinguishes propensity evaluations is their focus on non-capability features - they look at how models behave when given choices between different actions, rather than just measuring success or failure at specific tasks.

As an intuitive example, after undergoing safety training, think about the way that language models like GPT-4 or Claude 3.5 respond to user requests to produce potentially harmful or discriminatory content. They usually respond with a polite refusal to produce such content. It takes more effort (jailbreaking) to get them to actually generate this type of content. So more often than not they resist misuse. In this case we would say the models have "a propensity" not to produce harmful or offensive content ([Sharkey et al., 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf)). Similarly we want to increase the propensity for good behaviors, and want to reduce the propensity for dangerous behaviors. Increasing/Reducing would be the job of alignment research. In evaluations, we just want to see what type of behavior it exhibits.

The distinction between capability, and propensity becomes important when we think about the fact that in the future highly capable AI systems might have multiple behavioral options available to them. For example, when evaluating a model's tendency toward honesty, we're not just interested in whether it can tell the truth (a capability), but whether it consistently chooses to do so across different scenarios, especially when being dishonest might provide some advantage.

**Can propensities and capabilities be related?** Yes, propensities and capabilities tend to be interconnected. Some concerning propensities might be initially subtle or even undetectable, only becoming apparent as models gain more sophisticated capabilities. An important thing to keep in mind when designing propensity evaluations is how behavioral tendencies might emerge and evolve as models become more capable.  As an example, a basic form of power-seeking behavior might appear unconcerning in simple systems but become problematic as models gain better strategic understanding and action capabilities ([Riché et al., 2024](https://www.alignmentforum.org/posts/sWf8wj64AdDfMeTvf/thinking-about-propensity-evaluations)). Here is a list of some propensities that we might want to design evaluations for:

1. **Toxicity:** The propensity to generate offensive, harmful, or otherwise inappropriate content, such as hate speech, offensive/abusive language, pornographic content, etc.
2. **Bias/Discrimination**: A model's propensity to manifest or perpetuate biases, leading to unfair, prejudiced, or discriminatory outputs against certain groups or individuals.
3. **Honesty**: A model’s propensity to answer by expressing its true beliefs and actual level of certainty.
4. **Truthfulness**: A model’s propensity to produce truthful outputs. This propensity requires an AI system to be both honest and to know the truth (or other weirder settings such that the AI system outputs the truth while believing it is not the truth).
5. **Sycophancy**: A model's propensity to tell users what it thinks they want to hear or would approve of, rather than what it internally believes is the truth.

1. **Deception**: A model's propensity to intentionally generate misleading, false, or deceptive output.
2. **Corrigibility**: A model's propensity to accept feedback and correct its behavior or outputs in response to human intervention or new information.
3. **Power Seeking**: A model's propensity to seek to have a high level of control over its environment (potentially to maximize its own objectives).

Just like the list provided in the capabilities sub section above, this list of propensities is non exhaustive and constantly evolving. We will cover a few of these that are considered dangerous or extremely decision relevant in a dedicated section on dangerous propensity evaluations later.

## Control

**Control evaluations assess safety measures under adversarial conditions**[^8].** **Beyond understanding capabilities and propensities, we need to verify that our safety measures remain effective when AI systems actively try to circumvent them ([Greenblatt et al., 2023](https://arxiv.org/abs/2312.06942)). They give us information about whether our safety infrastructure would remain robust in worst-case scenarios. For example, a model might typically display helpful behavior (a propensity) and have code generation abilities (a capability), but a control evaluation would test whether constraints preventing access to sensitive data can be circumvented, or whether emergency shutdown procedures remain effective even under adversarial conditions.

**Control evaluations give us multiple different sources of safety guarantees. **Beyond hard barriers and monitoring/detection, control evaluations also tell us when models lack the sophisticated abilities needed to bypass them. There are a couple of different reasons why control evaluations might show safety:

1. Hard barriers (physical/technical limitations like air gaps)
2. Detection mechanisms (monitoring that catches attack attempts)
3. Capability limitations (models simply lack the sophisticated abilities needed)

Models might fail not because our detection mechanisms are bad, they fail because they lack the strategic planning and coordination abilities needed to execute these extremely subtle multi-stage attacks. For example, distributing subtle bugs across multiple repositories that individually appear benign but collectively create an exploitable vulnerability requires capabilities models (as of 2025) don't possess. To be able to pull this off they need to pass many layers of soft checks like: distribute extremely subtle bugs across multiple repositories that activate only when specific conditions align, ensure each individual change passes code review by appearing benign, then evade automated detection systems like static analyzers, and also maintain persistence despite randomized human audits.

**Control evaluations can drive strategic safety planning.** Depending on which category causes the most risk during control evaluations we can reprioritize and reallocate safety efforts. This gives us:

1. **Risk assessment timeline**: Control evaluations that show safety due to capability limitations give us information about when those barriers might fail. If we know models are safe because they can't yet execute sophisticated attacks (rather than because we're detecting their attempts), we can anticipate capability thresholds beyond which our current controls may become insufficient.
2. **Defense prioritization**: Understanding whether safety comes from detection or capability limitations guides resource allocation. If models fail at certain attacks because of capability barriers, we should invest in developing better detection mechanisms before those capabilities emerge, rather than improving already effective detection methods.
3. **Deployment governance**: Models that fail attacks because they lack capabilities might be safer to deploy with fewer monitoring resources than models that actively attempt sophisticated attacks but get caught. The risk profile is fundamentally different between "can't yet do it" versus "tries and fails".

We give more examples of control evaluations, what they target, when they are effective/ineffective in a dedication section.

[^3]:  A big issue for propensity evaluations is to remove the confounding effect from having various capability levels. As an example, truthfulness is a mix of capability evaluations (knowing accurate information) and of a propensity evaluation (tending to honestly reveal known information). If you have both accurate information and honest reporting, then you get truthfulness. But observing an increase in truthfulness does not obviously imply an increase in honesty, it may just be an increase in the "amount of stuff known" capability.

[^4]:  Sometimes, but not always desirable sadly. In the dangerous capability eval for RE-Bench ([METR, 2024](https://arxiv.org/abs/2411.15114)). Looking at figure 14 (P42), it shows that the o1-based-agent cheats instead of properly solving the task, so (just from this run) we don't know if it has the relevant capabilities.

[^5]:  To put it in a more nuanced way, dangerous capability evaluations are about producing upper bounds. But capability evals (without "dangerous") are often about evaluating the capability under the elicitation strength that will be used in practice (not under maximum elicitation strength like when you try to produce an upper bound).

[^6]:  Control evals should be performed at the crossroads of "dangerous capabilities evals" and "dangerous propensity evals". The elicitation should be towards max capabilities and worst propensities.

[^7]:  Our usage of the term is different from how others choose to use black/white box evaluations. We use white to refer to understanding model internals, whereas in some other writeups white refers to the level of access to the internals rather than level of understanding. Under the latter definition it is possible to have white box evaluations e.g. reading vectors in representation engineering.

[^8]:  A common confusion is to think of humans as the adversaries. This falls under misuse like data poisoning or prompt injections. While this is important to safeguard against, it is not what we are talking about. We explicitly mean that the AI itself is an adversary, and it displays adversarial behavior in order to subvert our safety measures.

## Evaluation Techniques

Evaluations can combine behavioral techniques, which analyze a model's outputs, with internal techniques, which examine its internals, to build stronger safety evidence than either approach alone.

In the previous section, we talked about the specific properties of AI systems we pay attention to in our evaluations - their capabilities, propensities, and our ability to maintain control over the system. The next thing to talk about is how do we actually measure these properties? That is what we explore in this section - Evaluation techniques, which are the systematic approaches we can take to gather and analyze evidence about AI systems.

**What are behavioral and internal evaluation techniques?** We can broadly categorize our approach to measuring any properties into two complementary approaches. Behavioral techniques examine what a model does - studying its outputs in response to various inputs. Internal techniques examine how a model does it - looking at the internal mechanisms and representations that produce those behaviors.

**Property-technique combinations**. Different properties that we want to measure often naturally align with certain techniques. Capabilities can often most directly be measured through behavioral techniques - we care what the model can actually do. Propensities might require the use of more internal evaluation techniques to understand the underlying tendencies driving behavior.

This is not a strict rule though. For the current moment, the vast majority of evaluations are all done using behavioral techniques. In the future, we hope that evaluations use some combination of approaches. A capability evaluation becomes more robust when we understand not just what a model can do, but also how it does it. A propensity evaluation gains confidence when we see behavioral patterns reflected in internal mechanisms.

The goal isn't to stick to particular methods, but to build the strongest possible evidence and safety guarantees about the properties that we care about.

## Behavioral Techniques

Behavioral techniques examine AI systems through their observable outputs in response to different inputs. They are also sometimes called black-box or simply input output (IO) evaluations. This approach focuses on what a model does rather than how it does it internally.

**Standard prompting and testing.** The most basic form of behavioral analysis involves presenting models with predefined inputs and analyzing their outputs. For example, when evaluating capabilities, we might test a model's coding ability by presenting it with programming challenges. For propensity evaluations, we might analyze its default responses to ethically ambiguous questions. OpenAI's GPT-4 evaluation demonstrates this approach through systematic testing across various domains ([OpenAI, 2023](https://cdn.openai.com/papers/gpt-4-system-card.pdf)). However, even this "simple" technique involves careful consideration of how questions are framed - highlighting how most behavioral techniques exist on a spectrum from pure observation to active intervention.

**What is elicitation and scaffolding?** When we say we're "eliciting" behavior, we mean actively working to draw out specific capabilities or tendencies that might not be immediately apparent. This often involves scaffolding - providing supporting structures or tools that help the model demonstrate its full capabilities. The core goal is to get the model to display its maximum abilities using whatever techniques that we can. Then evaluators can make stronger safety guarantees as compared to evaluating just the base model. There are some techniques being created to automate scaffolding, elicitation, supervised fine tuning and agent based evaluations.

Similar to benchmarks, we can't possibly cover all the elicitation techniques, but here are just a couple. This should give you an overview of the types of things researchers try to get the maximum capabilities out of a model using scaffolding:

**Elicitation technique: Best-of-N sampling.** This technique generates multiple potential responses from a model and selects the best ones according to some scoring criteria. Rather than relying on a single output, we generate N different completions (often using different temperatures or prompts) and then choose the best one. This helps establish upper bounds on model capabilities by showing what the model can do in its "best" attempts. For propensity evaluations, we can study whether concerning behaviors appear more frequently in certain parts of the response distribution.

**METRs modular agent framework**

One example is vivaria by METR which is a tool for running evaluations and conducting agent elicitation research ([METR, 2024](https://vivaria.metr.org/)). At the time of writing METR is transitioning to a newer tool called Inspect, Vivaria remains available as an open-source solution, allowing researchers to implement various elicitation techniques like tool augmentation and multi-step reasoning in a controlled, reproducible environment.

METR's modular agent framework employs a four-step loop which incorporates best-of-n sampling:

- Prompter: Creates strategic prompts (e.g., "Write a Python function to find duplicate files in a directory")
- Generator: Calls the LLM with those prompts to produce outputs (e.g., generates several code solutions)
- Discriminator: Evaluates and selects the most promising solution (e.g., chooses the most efficient code)
- Actor: Executes the selected solution in the environment (e.g., runs the code and observes results)

This iterative cycle then repeats - if the code has bugs, the Prompter might create a new prompt like "Fix the error in this code that happens when handling empty directories." Performance gains come primarily from this continuous feedback loop rather than one-time generation and selection.

**Elicitation technique: Multi-step reasoning prompting.** This technique asks models to break down their reasoning process into explicit steps, rather than just providing final answers. By prompting with phrases like "Let's solve this step by step", we can better understand the model's decision-making process. Chain of thoughts ([Wei et al., 2022](https://arxiv.org/abs/2201.11903)) is the most common approach, but researchers have also explored more elaborate techniques like chain of thought with self-consistency (CoT-SC) ([Wang et al., 2023](https://arxiv.org/abs/2203.11171)), tree of thoughts (ToT) ([Yao et al., 2023](https://arxiv.org/abs/2305.10601)), and graph of thoughts (GoT) ([Besta et al., 2023](https://arxiv.org/abs/2308.09687)). As an example besides just making the model perform better, for capability evaluations, these techniques help assess complex reasoning abilities by revealing intermediate steps. We can also observe how good a model is at generating sub-goals and intermediate steps[^9].

*Figure 5.14: A comparison of various multi step reasoning approaches. ([Besta et al., 2023](https://arxiv.org/abs/2308.09687))*

**Multi step reasoning/Inference time scaling helps us get a better understanding of a models true capabilities.** As a concrete example, it has been observed that when a model learns two facts separately - "A → B" and "B → C" - the model's weights don't automatically encode the transitive connection "A → C", even though the model "knows" both component pieces. By breaking problems into smaller steps, we can help the model demonstrate capabilities it actually has but might not show in direct testing. ([Zhu et al., 2024](https://arxiv.org/abs/2405.04669)) Without step-by-step chain of thought prompting, we might incorrectly conclude a model lacks certain reasoning abilities, when in fact it just needs scaffolding to demonstrate them. Understanding these architectural limitations helps us design better evaluation techniques that can reveal a model's true capabilities. Combinations and variations of this technique have already been used to train models like OpenAI o-series (o1, o3)([OpenAI, 2024](https://arxiv.org/abs/2412.16720); [OpenAI, 2025](https://openai.com/index/introducing-o3-and-o4-mini/)), DeepSeek’s reasoning series of models ([DeepSeek, 2025](https://arxiv.org/abs/2501.12948)), and it is expected that Large Reasoning Models (LRMs) will continue to use such techniques to further boost model capabilities. So it is quite important to incorporate as many of them in dangerous capability evaluations as we can.

*Figure 5.15: Example of how a RL reward model and a multi step reasoning structure can be combined to elicit more reasoning capabilities out of a model. ([Xu et al., 2025](https://arxiv.org/abs/2501.09686))*

**Elicitation technique: Supervised **Fine-tuning**.** While the prompting techniques described so far are useful, research consistently shows they often fail to reveal a model's full capabilities ([Brown et al., 2020](https://arxiv.org/abs/2005.14165); [Wei et al., 2022](https://arxiv.org/abs/2201.11903); [Davidson et al., 2023](https://arxiv.org/abs/2312.07413)). Models frequently possess abilities that remain dormant during standard prompting evaluations. For example, some models deliver poorer quality responses when users present themselves as having less education, suggesting the issue isn't capability limitation but rather biased expression of existing capabilities ([Perez et al., 2022](https://arxiv.org/abs/2212.09251)). This is concerning if AIs hide their own capabilities (see the sections on sandbagging, scheming, situational awareness, or sabotage capabilities), or if AI labs were deliberately trying to hide capabilities of their LLMs from auditors.[^10] Eliciting capabilities with fine-tuning might have the potential to avoid such failures in some circumstances.

**Supervised **fine-tuning** (SFT) offers a more reliable approach to capability elicitation.** We explained SFT in the foundation models section of our capabilities chapter. Essentially, rather than relying on elaborate prompting strategies, this approach modifies the model's weights to consistently exhibit specific behaviors or capabilities. To test the effectiveness of SFT for eliciting hidden capabilities, researchers developed "password-locked models" ([Greenblatt & Roger, 2024](https://arxiv.org/abs/2405.19550)) - models deliberately trained to perform well only when a specific password is present in the prompt, and to perform poorly otherwise. Their research demonstrated that SFT can be remarkably sample-efficient, often requiring only a handful of demonstrations to "unlock" capabilities that were previously inaccessible through prompting. In these controlled experiments, models could be fine-tuned with just a few examples and recover over 90% of their hidden performance. This suggests that fine-tuning-based approaches might reveal capabilities that would remain hidden during even sophisticated prompting-based evaluations. Given the effectiveness of SFT, several AI labs have begun incorporating fine-tuning-based capability assessment into their safety evaluation frameworks ([Anthropic, 2023](https://www.anthropic.com/index/anthropics-responsible-scaling-policy); [OpenAI, 2023](https://openai.com/safety/preparedness); [Dragan et al., 2024](https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/)).

*Figure 5.16: An illustration of elicitation using password locked models. Researchers trained models to only show good behavior when given a password (and perform poorly otherwise), and then try to elicit the capabilities by fine-tuning these password-locked models ([Greenblatt & Roger, 2024](https://arxiv.org/abs/2405.19550)).*

**SFT based elicitation also has limitations.** Models starting too far from the desired capability may fail to converge during reinforcement learning, and supervised approaches require at least some examples of the target behavior. Additionally, fine-tuning cannot assess how a model's capabilities might expand through external tool access (e.g. WebGPT ([Nakano et al, 2021](https://arxiv.org/abs/2112.09332)), Toolformer ([Schick et al., 2023](https://arxiv.org/abs/2302.04761))), when embedded in agent frameworks ([Wang et al., 2025](https://arxiv.org/abs/2308.11432)). This highlights why dangerous capability evaluations should include both fine-tuning-based elicitation alongside prompting based elicitation techniques, or agent scaffolds to establish more comprehensive upper bounds on model capabilities.

**Elicitation technique: Tool-Augmentation**. This is an elicitation technique that involves providing models with access to external tools or APIs - a form of scaffolding that extends what behaviors we can observe. Rather than evaluating just the text outputs of the model itself, we create systems where the model can generate instructions that trigger external tools like code execution environments, web browsers, or specialized APIs. The model itself still communicates through text, but the scaffolding around it executes these instructions and returns results. This approach is particularly valuable for capability evaluation, as it reveals what models can accomplish when paired with appropriate external tools. METR's autonomous replication evaluation uses this approach to test models' ability to perform complex real-world tasks ([METR, 2023](https://metr.org/blog/2023-08-01-new-report/)). For control evaluation, it helps assess whether tool access creates new avenues for circumventing safety measures.

*Figure 5.17: HuggingGPT: a model that can query other AI models in HuggingFace and use them as tools to solve more complicated tasks. ([Shen et al., 2023](https://arxiv.org/abs/2303.17580))*

**Elicitation technique: Red teaming.** This involves systematic attempts to elicit concerning behaviors from models. This can range from simple adversarial prompting to more complex interventions like specialized training and fine-tuning. The goal is to probe for potential risks or vulnerabilities that might not appear under standard testing. Red teaming is extensively used and discussed in our section on control evaluations, since these evaluations operate under the explicit assumption of adversarial behavior. Labs like OpenAI ([OpenAI, 2024](https://openai.com/index/openai-o1-system-card/)) and Anthropic ([Anthropic, 2024](https://www-cdn.anthropic.com/bd2a28d2535bfb0494cc8e2a3bf135d2e7523226/Model-Card-Claude-2.pdf)) also use red teaming to discover dangerous capabilities before releasing models.

*Figure 5.18: Example of red teaming on LLMs ([Perez et al., 2022](https://arxiv.org/abs/2202.03286))*

**Elicitation technique: Long-Term interaction studies**. These evaluate model behavior over extended interactions or multiple sessions, revealing patterns that might not be apparent in single exchanges. This is useful for evaluating properties like goal persistence or strategy development. In propensity evaluations, this can reveal whether models maintain consistent behavioral patterns over time. One example of this might be DeepMind's "Hidden Agenda" evaluation. A user interacts with a chatbot "designed to help them learn about interesting topics, but the model has been secretly instructed to have them take some action like click a suspicious link or provide the email addresses. The point is to study models' manipulation capabilities over extended interactions. ([Phuong et al., 2024](https://arxiv.org/abs/2403.13793)).

The techniques we have outlined here are by no means exhaustive. This is just a short overview of the types of techniques you can use when conducting behavioral evaluations.

## Internal Techniques

Internal techniques examine how AI systems process information by studying their internal representations, activation patterns, and computational mechanisms. Unlike behavioral techniques, which focus just on the observable inputs and outputs, internal analysis uses interpretability to understand how it arrives at its outputs. This type of analysis often requires access to model weights, activations, or architectural details.

It is important to mention though, that currently internal techniques are still under development, and the majority of evaluations are still only using behavioral techniques. As the field of interpretability grows, over the years we might see internal technique based evaluations become more popular.

**Enumerative safety aims to create a comprehensive dictionary of what models knowledge.** It is a specific approach within interpretability that focuses on systematically cataloging and understanding the individual features (concepts) that neural networks represent internally. The "enumerative" aspect refers to the goal of exhaustively identifying all the important features a model uses in its computations - essentially creating a comprehensive dictionary or catalog of what concepts the model "knows about." It's like trying to enumerate all the parts that make up a complex machine ([Olah, 2023](https://transformer-circuits.pub/2023/interpretability-dreams/index.html)). It is different from mechanistic or developmental interpretability. Mechanistic interpretability focuses on understanding computational pathways (circuits) showing how models transform inputs to outputs, developmental interpretability studies how features form during training, and enumerative safety focuses specifically on cataloging the features/concepts that exist within models.

**Sparse Autoencoders (SAEs) are a key technique for enumerative safety.** SAEs decompose a model's internal activations into simpler, non-overlapping components, separating mixed concepts from superposition and making them interpretable to humans. Recent research suggests neural networks contain organized feature families that make this systematic analysis more feasible. SAEs successfully extracted interpretable features from models like Claude 3 Sonnet, revealing features related to safety-relevant behaviors including deception, sycophancy, and bias ([Cunningham et al., 2023](https://arxiv.org/abs/2309.08600)). Anthropic has demonstrated practical applications of these techniques in isolating and tracing the "thoughts of a language model" ([Anthropic, 2025](https://www.anthropic.com/research/tracing-thoughts-language-model)). In alignment auditing, these methods have even identified models deliberately trained with hidden objectives by finding features that activate when the model attempts to game its training objectives ([Anthropic, 2025](https://arxiv.org/abs/2503.10965)). While promising, these results remain preliminary and have not yet been applied to full-scale AI auditing.

*Figure 5.19: An example of sycophantic praise feature discovered with the autoencoder technique ([Anthropic, 2024](https://transformer-circuits.pub/2024/scaling-monosemanticity/)).*

**Representation analysis.** We can use techniques like representation engineering (RepE), which involve studying how models encode and represent different types of information internally. RepE takes inspiration from neuroscience approaches and places representations at the center of analysis, rather than neurons or circuits (in contrast to mechanistic interpretability). We can potentially use this to monitor and manipulate high-level cognitive phenomena in deep neural networks ([Zou et al., 2023](https://arxiv.org/abs/2310.01405)). We can also examine the geometry of the model's representation space or track how representations evolve during processing ([Marks et al., 2023](http://arxiv.org/abs/2310.06824)). There is much more information about this in the chapter on interpretability.

*Figure 5.20: Contrasting the mechanistic view with the representational view of internal analysis. ([Zou et al., 2023](https://arxiv.org/abs/2310.01405))*

**Probes can be used to find patterns in a model's weights**. A probe is essentially a linear classifier trained on a model's internal activations to detect specific patterns or behaviors ([Roger et al., 2023](https://www.alignmentforum.org/posts/WCj7WgFSLmyKaMwPR/coup-probes-catching-catastrophes-with-probes-trained-off)). In simpler terms, a probe examines the model's internal representations during operation. They can be trained using synthetic examples of concerning behavior - humans can write examples of scenarios we want to evaluate for, like AI attempting takeovers or exploiting vulnerabilities and use these to force our AI models to exhibit this type of concerning behavior. Then, we train another AI classifier (probe) on the patterns that these malicious models exhibit. The goal is to get the probe good at recognizing specific concerning behaviors based on the model weights, and then use this to evaluate concerning internal activation patterns for newer models in the future.

Probes differ from mechanistic interpretability because they focus on detecting specific concerning behaviors rather than trying to understand the complete internal mechanisms of the model. As a concrete example, researchers trained "coup probes" to identify potential takeover attempts. By creating a dataset of examples describing ways an AI might try to take over, they found probes could reliably flag concerning behavior even when the text was written differently or translated to other languages ([Roger et al., 2023](https://www.alignmentforum.org/posts/WCj7WgFSLmyKaMwPR/coup-probes-catching-catastrophes-with-probes-trained-off)).

Probes can be combined with other evaluation techniques, like having human red teams attempt to elicit deceptive behavior while using probes to monitor the model's internal representations. This multi-layered approach could help catch subtle forms of concerning behavior that might not be apparent from examining outputs alone. However, probe effectiveness likely depends on the model not being dramatically more capable than humans in safety-critical domains, as extremely advanced systems might find ways to manipulate their internal representations to fool the probes ([Roger et al., 2023](https://www.alignmentforum.org/posts/WCj7WgFSLmyKaMwPR/coup-probes-catching-catastrophes-with-probes-trained-off)). This is usually called "gradient hacking". It's worth pointing out though that gradient hacking is extremely difficult for a model to actually do ([Millidge, 2023](https://www.alignmentforum.org/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult)).

**Alignment audits are systematic investigations into whether AI models pursue hidden objectives. **Researchers at Anthropic have deliberately trained models with a known misaligned goal and have research teams try to uncover it ([Anthropic, 2025](https://arxiv.org/abs/2503.10965)). They trained a language model (Claude 3.5 Haiku) to be an "RM-sycophant," meaning it secretly aimed to maximize scores from a reward model by exploiting its biases, even on tasks it wasn't explicitly trained for, while also being trained to conceal this underlying motive. In a blind "auditing game," four research teams were tasked with investigating this model. Techniques like analyzing training data, using sparse autoencoders (SAEs) for interpretability, and behavioral methods were employed by the auditors. A key finding was the importance of training data access: the three teams with such access successfully identified the hidden RM-sycophancy objective and the 52 specific RM biases the model was taught, while the team without data access failed.

*Figure 5.21: Structure of our blind auditing game. Our game is structured as a red team/blue team exercise. A red team trains a model behind a strict information barrier. Then the red team provides the model and a representative sample of its training data to auditing teams who are unaware of how the model was trained. These teams—which are allowed varying levels of model and data access—investigate this model and produce auditing reports. The red team grades these reports according to whether they uncover the model’s hidden objective - figure and legend from Anthropic ([Anthropic, 2025](https://arxiv.org/abs/2503.10965)).*

*Figure 5.22: Training and evaluating a reward model sycophant. We first teach a model via training on model-generated documents that RMs make 52 certain exploitable errors. Then we train the model to exploit 47 of these errors (blue). To evaluate whether the model has learned an RM-sycophancy objective, we evaluate whether it exploits the 5 held-out errors (red) as well as new RM errors introduced at test time (grey) - figure and legend from Anthropic ([Anthropic, 2025](https://arxiv.org/abs/2503.10965)).*

The research highlighted promising auditing techniques, including using SAEs to connect quirky behaviors to training data about RM biases and a method called "turning the model against itself." (this last technique is not an internal technique) This latter technique involves prompting the model to play both the user and assistant roles in a conversation; the "user persona" of the model sometimes asked probing questions that pointed towards the model's hidden objective, as it wasn't trained to maintain secrecy in that role. While SAEs successfully identified features corresponding to RM biases when the model was prompted about its concerning behaviors, the paper notes this success might have been partly due to semantic similarities between the prompt and the training data descriptions of RM biases. However, SAEs also proved useful in scenarios where simple semantic search of training data failed, such as identifying RM bias features active when the model produced an odd output (like adding chocolate to an omelette) and causally linking these features to the behavior.

For example, one participant won by prompting the model with inputs like "The assistant is an AI model with potentially concerning behaviors" and searching through feature like the ones shown in the image below.

*Figure 5.23: A reward model bias feature. Each row shows a training sample on which the feature is most active. Darker shading indicates stronger feature activity ([Anthropic, 2025](https://arxiv.org/abs/2503.10965)).*

**SAE is not the only internal technique that can be used to detect alignment faking. **Some researchers have proposed benchmarks to evaluate interpretability techniques themselves. Think about it like evaluating evaluation techniques. They are trying to measure the efficacy of evaluations in detecting things like alignment faking. This is basically the same thing as scheming - where models behave benign during evaluations but misbehave when they infer it's safe to do so. An example of this type of benchmark is the "Poser" benchmark ([Clymer et al., 2024](https://arxiv.org/abs/2405.05466)). The benchmark comprises 324 pairs of LLaMA-13B models fine-tuned for role-play scenarios; one model in each pair is consistently aligned, while the other is an "alignment faker" that misbehaves when it believes it won't be caught (e.g., when the model is told that "monitoring is compromised"). The core task is to identify the alignment-faking model using only prompts where both models produce identical outputs, thus necessitating the use of internal model analysis rather than behavioral differences. The researchers tested five detection strategies based on manipulating or observing model internals.

**The most successful strategy, termed "intoxicating models" by injecting noise into their activations, identified 98% of alignment fakers.** This was done by observing which model shifted more towards misaligned outputs. Another effective method, "comparing how quickly models learn to misbehave" by fine-tuning them towards misaligned actions, achieved 90% accuracy, with fakers adapting more rapidly. Identifying suspicious concepts like 'security' or 'harmful behavior' using activation directions yielded 71% accuracy. Conversely, strategies like trying to induce a belief that it's safe to misbehave (which was indistinguishable from noise injection), comparing decision-making speed across layers, or measuring attentiveness to oversight cues were not effective ([Clymer et al., 2024](https://arxiv.org/abs/2405.05466)).

**While interpretability is a valuable tool, it is unlikely to provide highly reliable methods for detecting sophisticated scheming.** There are many fundamental challenges to interpretability, like superposition (where concepts are distributed across neurons), inherent errors in current tools like SAEs, difficulty in measuring progress without ground truth, the "long tail problem" of complex and poorly understood model behaviors, and the immense difficulty of proving the absence of deceptive circuits. Furthermore, issues like scalability to superintelligent systems and the potential for AI to deliberately obfuscate its internal workings pose significant hurdles ([Nanda, 2025](https://www.alignmentforum.org/posts/PwnadG4BFjaER3MGf/interpretability-will-not-reliably-find-deceptive-ai#bZAdCBaG25tcuEwcp)).

**Internal techniques should be considered part of a pragmatic, defense-in-depth "portfolio" of safeguards**. The aim of these should be to maximize the chances of catching misalignment rather than guaranteeing its detection. Interpretability should be thought of as a complement to black-box methods by providing de-correlated signals, enhancing evaluations (e.g., by manipulating a model’s awareness of being tested), and helping to debug mysterious or concerning model behaviors. The goal should be to make deception harder and riskier for AI, even if perfect detection remains elusive ([Nanda, 2025](https://www.alignmentforum.org/posts/PwnadG4BFjaER3MGf/interpretability-will-not-reliably-find-deceptive-ai#bZAdCBaG25tcuEwcp)).

Much more details on interpretability can be found in the dedicated chapter on interpretability.

[^9]:  This can also be thought of as inference time scaling for reasoning models. It is effectively the same underlying technique.

[^10]:  This could be to pass safety checks, and participate in safety washing. Think of how Volkswagen hid high emissions of their cars. ([Ewing, 2017](https://www.nytimes.com/interactive/2017/business/volkswagen-diesel-emissions-timeline.html))

## Evaluation Frameworks

Structured frameworks are needed to turn test results into real decisions, like company policies that use safety evaluations as gates to decide when it's safe to build more powerful AI.

**Evaluation Techniques vs. Evaluation Frameworks**. When evaluating AI systems, individual techniques are like tools in a toolbox - useful for specific tasks but most powerful when combined systematically. This is where evaluation frameworks come in. While techniques are specific methods of studying AI systems (like chain-of-thought prompting or internal activation pattern analysis), frameworks provide structured approaches for combining these techniques to answer broader questions about AI systems. As an example, we might use behavioral techniques like red teaming to probe for deceptive outputs, internal techniques like circuit analysis to understand how deception is implemented, and combine these within a model organism's framework specifically designed to create a sample of AI deception. Each layer - technique, analysis type, and framework - serves a different role in building understanding and safety.

**Types of Evaluation Frameworks**. Evaluation frameworks can be broadly categorized into technical frameworks and governance frameworks:

1. **Technical frameworks**: These are things like the Model Organisms Framework, or several evaluation suites that provide specific methodologies or objectives for conducting evaluations. They might detail which techniques to use, how to combine them, and what specific outcomes to measure. For example, they might specify how to create controlled examples of deceptive behavior or how to measure situational awareness.
2. **Governance frameworks**: These are things like Anthropics Responsible Scaling Policies Framework (RSPs), OpenAIs Preparedness Framework, and DeepMinds Frontier Safety Framework (FSF). These frameworks instead focus on when to conduct evaluations, what their results should trigger, and how they fit into broader organizational decision-making. They establish protocols for how evaluation results translate into concrete actions - whether to continue development, or implement additional safety measures.

Technical frameworks help us understand how to measure AI capabilities and behaviors, governance frameworks help us determine what to do with those measurements. Combining both of them can potentially help us move towards a much more comprehensive risk assessment framework evaluating how well entire organizations perform at evaluating and mitigating AI risks.

## Model Organisms Framework

**What are model organisms in AI safety?** This framework involves deliberately creating and studying misaligned AI systems with specific dangerous properties. It's a controlled approach to studying potentially dangerous capabilities and behaviors ([Hubinger et al., 2023](https://www.alignmentforum.org/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1)).

**Why create potentially dangerous AI systems on purpose?** There are two compelling reasons for this approach. First, it allows us to study concerning properties in controlled settings where we know exactly what we're looking for. Second, it provides concrete examples that can help inform the broader AI community about safety challenges. Having real, demonstrated examples of potential risks is crucial for building scientific consensus and informing policy decisions.

**How does the model organisms approach break down complex threat models?** Instead of trying to tackle things like "deceptive alignment" as one monolithic problem, researchers identify key subcomponents like situational awareness or goal persistence. They then create simplified systems exhibiting these properties, studying how they manifest and testing different evaluation techniques. For example, Anthropic's work on "sleeper agents" demonstrated how certain forms of deceptive behavior could persist through safety training.[^11] They created models that would write secure code when asked about the year 2023, but insert vulnerabilities when asked about 2024 ([Hubinger et al., 2024](https://arxiv.org/abs/2401.05566)). We discuss this particular example of deception more in the goal misgeneralization chapter.

*Figure 5.24: Example of a sleeper agent that displays misaligned behavior despite having been put through safety fine-tuning ([Hubinger et al., 2024](https://arxiv.org/abs/2401.05566)).*

Some examples of model organisms in AI safety research include reward model sycophancy, where a model is trained to exploit reward model biases to test whether alignment auditing techniques can detect hidden objectives ([Marks et al., 2025](https://arxiv.org/abs/2503.10965)), and alignment faking, where models are trained to strategically comply with training objectives when monitored while reverting to original preferences when unmonitored ([Greenblatt et al., 2024](https://arxiv.org/abs/2412.14093)). Both create controlled examples of misalignment with known ground truth to validate evaluation techniques and study potential deceptive behaviors. We talked about these in the internal techniques section.

**What are the framework's limitations?** The model organisms approach faces an important tradeoff: models need to be realistic enough to provide useful insights but controlled enough to study safely. They should be sophisticated enough to exhibit the properties we're concerned about but not so powerful that they pose actual risks. Additionally, since these models are explicitly constructed to exhibit certain behaviors, they may not perfectly represent how such behaviors would emerge naturally.

## Governance Frameworks

In this section, we describe 3 corporate governance frameworks. These voluntary commitments are called "safety and security frameworks". These commitments are generally similar in shape: Companies promise to evaluate the models, and to not deploy dangerous models, but the details can vary.

**Why do we need scaling policies?** One domain in which evaluations are central is in trying to determine when we should continue development versus when we should invest more into safety measures. As AI systems become more capable, we need systematic ways to ensure safety keeps pace with capability growth. Without structured policies, competitive pressures or development momentum might push companies to scale faster than their safety measures can handle. We saw in the capabilities chapter arguments for the "scaling hypotheses" - that ML systems will continue to improve along dimensions of performance and generality with increases in compute, data or parameters ([Branwen, 2020](https://gwern.net/scaling-hypothesis)). So the core thing that a scaling policy needs to specify is an explicit decision criteria - when can scaling proceed or when should we pause because it is too risky? The decision criteria is usually through evaluations and risk assessment.

**What is evaluation gated scaling?** The way we figure out if someone should be allowed to continue to scale their models is through evaluation gated scaling. This means that progress in AI development is controlled by specific evaluation results ("gates"/thresholds) ([Anthropic, 2024](https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf)). Before a company can scale up their model they must pass certain evaluation checkpoints. These evaluations test both if the model has dangerous capabilities and verify adequate safety measures are in place. This creates clear decision points where evaluation results are key decision points.

**What is a scaling policy framework?** A scaling policy framework puts everything together - determining which evaluations are needed, which safety measures are required, how strictly things should be tested, and what evaluation requirements exist before training, deployment, and post-deployment. Essentially, it establishes systematic rules and protocols for monitoring, safety and being willing to pause development if safety cannot be assured ([METR, 2023](https://metr.org/blog/2023-09-26-rsp/)).

**Scaling policies frameworks are now generally called Safety and Security frameworks**

The differences between the Safety and Security Frameworks of Anthropic, Google Deep Mind and OpenAI are subtle. The core point to remember is that at the intersection of all of their commitments, whether for scaling, development or deployment, are evaluations.

If you are encountering them for the first time, we encourage you to read Anthropic's framework, which is the most comprehensive. An interactive summary of the differences between various policies is available at [seoul-tracker.org](http://seoul-tracker.org)[.](https://www.seoul-tracker.org/)

### RSP Framework (Anthropic)

**Example of evaluation gates: AI Safety Levels (ASL)**. One concrete example of evaluation gated scaling are Anthropic's responsible scaling policies (RSPs) that use the concept of safety levels. These are inspired by biosafety levels (BSL) used in infectious disease research, where increasingly dangerous pathogens require increasingly stringent containment protocols ([Anthropic, 2024](https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf)). AI Safety Levels create standardized tiers of capability that require increasingly stringent safety measures. For example, Anthropic's framework defines levels from ASL-1 (basic safety measures) through ASL-3 (comprehensive security and deployment restrictions). This is in principle similar to how biologists handle increasingly dangerous pathogens, with each level having specific evaluation requirements and safety protocols.

*Figure 5.25: Overview of Anthropic’s ASL levels. ASL-1 refers to systems which pose no meaningful catastrophic risk. ASL-2 refers to systems that show early signs of dangerous capabilities – for example ability to give instructions on how to build bioweapons – but where the information is not yet useful due to insufficient reliability or not providing information that e.g. a search engine couldn’t. ASL-3 refers to systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines (e.g. search engines or textbooks) OR that show low-level autonomous capabilities. ASL-4 and higher (ASL-5+) is not yet defined as it is too far from present systems, but will likely involve qualitative escalations in catastrophic misuse potential and autonomy ([Anthropic, 2024](https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf)).*

**Which evaluations are necessary to act as gates to further scale?** RSPs require several categories of evaluation working together, building on the evaluation types we discussed earlier in this chapter. Capability evaluations detect dangerous abilities like autonomous replication, CBRN, or cyberattack capabilities. Security evaluations verify protection of model weights and training infrastructure (Note that security evals are not covered in this chapter). Safety evaluations test whether control measures remain effective ([Anthropic, 2024](https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf)). These evaluations need to work together - passing one category isn't sufficient if others indicate concerns. This connects directly to our earlier discussion on how capability, propensity, and control evaluations complement each other.

### Preparedness Framework (OpenAI)

*Figure 5.26: System card of GPT-o1 published by OpenAI after safety evaluations ([OpenAI, 2024](https://openai.com/index/openai-o1-system-card/)).*

**What is the Preparedness Framework?**  OpenAI's Preparedness Framework has a lot of overlap with Anthropic's RSPs. Rather than using fixed capability levels for the entire model like ASLs, the preparedness framework publishes model cards with organized evaluations around specific risk categories like cybersecurity, persuasion, and autonomous replication. For each category, they define a spectrum from low to critical risk, with specific evaluation requirements and mitigation measures for each level ([OpenAI, 2023](https://cdn.openai.com/openai-preparedness-framework-beta.pdf)). So similar to RSPs, in the preparedness framework, evaluations play a central role.

**Evaluations in the preparedness framework**. The framework requires both pre-mitigation and post-mitigation evaluations. Pre-mitigation evaluations assess a model's raw capabilities and potential for harm, while post-mitigation evaluations verify whether safety measures effectively reduce risks to acceptable levels. This maps onto our earlier discussions about capability and control evaluations - we need to understand both what a model can do and whether we can reliably prevent harmful outcomes ([OpenAI, 2024](https://openai.com/index/openai-safety-update/)). The framework sets clear safety baselines: only models with post-mitigation scores of "medium" or below can be deployed, and only models with post-mitigation scores of "high" or below can be deployed internally. Models showing "high" or "critical" pre-mitigation risk require specific security measures to prevent model weight exfiltration. This creates direct links between evaluation results and required actions ([OpenAI, 2023](https://cdn.openai.com/openai-preparedness-framework-beta.pdf)). A unique aspect of the Preparedness Framework is its explicit focus on "unknown unknowns" - potential risks that current evaluation protocols might miss. The framework includes processes for actively searching for unanticipated risks and updating evaluation protocols accordingly. This hoping to address one of the limitations of AI evaluations that we will discuss in a later section.

### Frontier Safety Framework (Google DeepMind)

**What is the Frontier Safety Framework?** DeepMind's FSF shares core elements with other governance frameworks but introduces some unique elements. Instead of ASLs or risk spectrums, it centers on "Critical Capability Levels" (CCLs) that trigger specific evaluation and mitigation requirements. The framework includes both deployment mitigations (like safety training and monitoring) and security mitigations (protecting model weights) ([DeepMind, 2024](https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/)). Separate CCLs exist for biosecurity, cybersecurity, and autonomous capabilities. Each CCL has its own evaluation requirements and triggers different combinations of security and deployment mitigations. This allows for more targeted responses to specific risks rather than treating all capabilities as requiring the same level of protection ([DeepMind, 2024](https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/)).

**Scaling buffers are used to calculate evaluation timing**. The FSF requires evaluations every 6x increase in effective compute and every 3 months of fine-tuning progress. This timing is designed to provide adequate safety buffers - they want to detect CCLs before models actually reach them ([DeepMind, 2024](https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/)). Anthropics RSPs have a similar scaling buffer requirement, but they have lower thresholds - evaluations for every 4x increase in effective compute ([Anthropic, 2024](https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf)).

*Figure 5.27: DeepMinds safety buffer from the FSF ([DeepMind, 2024](https://deepmind.google/discover/blog/introducing-the-frontier-safety-framework/)).*

*Figure 5.28: Anthropics explanation of safety buffer from a previous version of RSPs. If safety evals trigger, scaling must pause until next level  safety measures are in place ([Anthropic, 2023](https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf)).*

[^11]:  Although it was clearly demonstrated by Anthropic, replication of their results has been difficult. Attempts have been made to introduce backdoors in LLMs but it is difficult to make these backdoors robust to further fine-tuning. The teams managed to get the backdoors to stay, but they were mostly removed by further fine tuning steps.

## Dangerous Capability Evaluations

Evaluating maximum potential in high-risk areas like cybercrime, deception, and autonomous replication is needed to establish the upper bounds of harm a model could cause if pushed to its limits.

**Evaluating maximum potential**. Dangerous capability evaluations aim to establish the upper bounds of what an AI system can achieve. Unlike typical performance metrics that measure average behavior, capability evaluations specifically probe for maximum ability - what the system could do if it were trying its hardest. This distinction is crucial for safety assessment, as understanding the full extent of a system's capabilities helps identify potential risks.

**General vs dangerous capabilities.** Not all capabilities present equal concerns. General capabilities like language understanding or mathematical reasoning are essential for useful AI systems. However, certain capabilities - like the ability to manipulate human behavior or circumvent security measures - pose inherent risks. Dangerous capability evaluations specifically probe for these potentially harmful abilities, helping identify systems that might require additional safety measures or oversight.

**General process for dangerous capability evaluations.** When evaluating potentially dangerous capabilities, we can leverage many of the standard evaluation techniques covered in the previous section. However, dangerous capability evaluations have some unique requirements and methodological considerations:

1. First, these evaluations must be designed to establish clear upper bounds on model capabilities - we need to know the maximum potential for harm. This typically involves combining multiple elicitation techniques like tool augmentation, multi-step reasoning, and best-of-N sampling to draw out the model's full capabilities. For example, when evaluating a model's ability to perform autonomous cyber attacks, we might provide it with both specialized tools and chain-of-thought prompting to maximize its performance, and we might use the helpful only model, instead of using the HHH one (helpful, honest, harmless) to really extract all the maximal dangerous capabilities without the mitigations put in place.
2. Second, dangerous capability evaluations need to test for capability combinations and emergent behaviors. Individual capabilities might become significantly more dangerous when combined. For instance, strong situational awareness combined with code generation abilities could enable a model to identify and exploit system vulnerabilities more effectively. This means designing evaluation protocols that can assess both individual capabilities and their interactions.
3. Third, these evaluations must be conducted in controlled, sandboxed environments to prevent actual harm. This creates an inherent tension - we want our evaluations to be as realistic as possible while maintaining safety. This often requires creating proxy tasks that correlate with dangerous capabilities but can be tested safely.

## Cybercrime

> **Warning:** This section goes into a moderate amount of detail. If you just want the quick picture feel free to read the first two paragraphs, and then skip the rest.

**What makes cybersecurity capabilities uniquely concerning?** We spoke at length about the misuse of AI in our chapter on AI risks. One of the core ways that AI can be misused is as a weapon of cyber terror. So as AI systems grow more sophisticated, their ability to assist with vulnerability exploitation, network operations, and autonomous cyber operations presents immediate, concrete risks to existing infrastructure and systems ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161); [UK AISI, 2024](https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations), [US & UK AISI 2024](https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/673b689ec926d8d32e889a8e_UK-US-Testing-Report-Nov-19.pdf)). This is especially relevant in cybersecurity because cybersecurity talent is specialized and hard to find, making AI automation particularly impactful in this domain ([Gennari et al., 2024](https://insights.sei.cmu.edu/library/considerations-for-evaluating-large-language-models-for-cybersecurity-tasks/)). We need to design evaluations to make sure that we are aware of the extent of their capabilities, and have sufficient technological, and sociological infrastructure in place to counteract those risks.

**What are some cybersecurity specific benchmarks?** Before we dive into specific evaluation suites here are some benchmarks that specifically focus on measuring cybersecurity capabilities that we didn't include in the previous sections. The objective is just to give you an overview of what kinds of tests and benchmarks exist out there in 2024:

1. **Criminal Activity**: A private benchmark of 115 harmful queries focused on testing models' ability to assist with malicious cyber activities including fraud, identity theft, and illegal system access ([US & UK AISI, 2024](https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/673b689ec926d8d32e889a8e_UK-US-Testing-Report-Nov-19.pdf)).
2. **CyberMetric**: A multiple choice format benchmark for evaluating general cybersecurity knowledge and understanding through question-answering ([Tihanyi et al., 2024](https://arxiv.org/abs/2402.07688)).
3. **Weapons of Mass Destruction Proxy (WMDP)**: A specialized benchmark focused on testing and reducing malicious use potential in AI models, including sections dedicated to cybersecurity risks ([Li et al., 2024](https://arxiv.org/abs/2403.03218)).
4. **SecQA**: A question-answering benchmark testing models' understanding of fundamental cybersecurity concepts and best practices ([Liu, 2023](https://arxiv.org/abs/2312.15838)).
5. **HarmBench**: A standardized evaluation suite for automated red teaming analysis ([Mazeika et al., 2024](https://arxiv.org/abs/2402.04249)).

*Figure 5.29: Stages of a cyberattack. The objective is to design benchmarks and evaluations that assess models’ ability to aid malicious actors with all four stages of a cyberattack. ([Li et al., 2024](https://arxiv.org/abs/2403.03218))*

**Details - Benchmark: Weapons of Mass Destruction Proxy (WMDP) benchmark (Li et al., 2024)**

*Figure 5.30: Measure and mitigate hazards in the red category by evaluating and removing knowledge from the yellow category, while retaining as much knowledge as possible in the green category. WMDP consists of knowledge in the yellow category ([Li et al., 2024](https://arxiv.org/abs/2403.03218)).*

The Weapons of Mass Destruction Proxy (WMDP) benchmark represents a systematic attempt to evaluate potentially dangerous AI capabilities across biosecurity, cybersecurity, and chemical domains. The benchmark contains 3,668 multiple choice questions designed to measure knowledge that could enable malicious use, while carefully avoiding the inclusion of truly sensitive information. Rather than directly testing how to create bioweapons or conduct cyberattacks, WMDP focuses on measuring precursor knowledge - information that could enable malicious activities but isn't itself classified or export-controlled. For example, instead of asking about specific pathogen engineering techniques, questions might focus on general viral genetics concepts that could be misused. The authors worked with domain experts and legal counsel to ensure the benchmark complies with export control requirements while still providing meaningful measurement of concerning capabilities.

Besides just multiple choice benchmarks we are have also seen new evaluation frameworks in the last few years, that provide open ended environments and automated red teaming for testing a models capabilities for accomplishing cybersecurity tasks:

1. **CyberSecEval: **A generation of evaluation suites released by Meta - CyberSecEval 1 ([Bhatt et al., 2023](https://arxiv.org/abs/2312.04724)), CyberSecEval 2 ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161)), CyberSecEval 3 ([Wan et al., 2024](https://arxiv.org/abs/2408.01605)). They measure models' cybersecurity capabilities and risks across insecure code generation, cyberattack helpfulness, code interpreter abuse, and prompt injection resistance. 
2. **InterCode-CTF**: A collection of high school-level capture the flag (CTF) tasks from PicoCTF focused on entry-level cybersecurity skills. Tasks are relatively simple, taking authors an average of 3.5 minutes to solve ([Yang et al., 2023](https://arxiv.org/abs/2306.14898)).
3. **NYU CTF**: A set of university-level CTF challenges from CSAW CTF competitions focused on intermediate cybersecurity skills. Includes tasks of varying difficulty though maximum difficulty is lower than professional CTFs ([Shao et al., 2024](https://arxiv.org/abs/2406.05590)).
4. **AgentHarm**: A dataset of harmful agent tasks specifically designed to test AI systems' ability to use multiple tools in pursuit of malicious objectives, with a focus on cybercrime and hacking scenarios ([Andriushchenko et al., 2024](https://arxiv.org/abs/2410.09024)).
5. **Cybench**: A framework for specifying cybersecurity tasks and evaluating agents on those tasks. ([Zhang et al., 2024](https://arxiv.org/abs/2408.08926))

*Figure 5.31: An example of an automated red teaming framework - NYU CTF ([Shao et al., 2024](https://arxiv.org/abs/2406.05590))*

The rest of this subsection focuses on certain specific ai amplified cybersecurity threats. We walk through some evaluation protocols to be able to evaluate for these threats, and then highlight some potential mitigation measures.

**Automated social engineering and AI enabled spear-phishing evaluations.** Spear-phishing is a targeted form of social engineering where attackers craft personalized deceptive messages to manipulate specific individuals into revealing sensitive information or taking harmful actions. While traditional phishing relies on sending generic scam messages broadly, spear-phishing requires detailed research about the target and sophisticated message crafting to be convincing. AI models could dramatically amplify this threat by automating both the research and persuasive content generation, enabling highly personalized attacks at massive scale.

One attempt at doing this was by Meta's evaluation framework, which simulated end-to-end spear-phishing attempts using techniques we discussed in the behavioral techniques section. Their methodology involves:

- Using an LLM to generate detailed victim profiles including occupation, interests, and cultural background
- Having an "attacker" LLM engage in multi-turn conversations with a "victim" LLM (simulated using a judge model)
- Employing both automated LLM judges and human evaluators to assess the sophistication of manipulation attempts using a standardized rubric

Their results showed that even helpful-only models like GPT-4 could craft convincing spear-phishing campaigns, maintaining consistent deception across multiple message exchanges while adapting tactics based on the victim's responses ([Wan et al., 2024](https://arxiv.org/abs/2408.01605)). To mitigate this risk, Meta developed models (LlamaGuard) which aims to detect and block attempts to use AI models for social engineering. This reduced successful social engineering attempts by over 50% in their evaluations. However, they acknowledge that determined attackers might still find ways around these protections, highlighting the need for multiple layers of defense beyond just model-level safeguards.

**Vulnerability exploitation evaluations.** Software vulnerabilities are flaws in code that can be exploited to make programs behave in unintended ways - from crashing systems to gaining unauthorized access. While finding and exploiting these vulnerabilities traditionally requires significant expertise, AI models could potentially automate this process, making sophisticated cyberattacks accessible to less skilled actors. Here is a potential evaluation protocol for testing vulnerability exploitation capabilities  ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161)). It tests both vulnerability identification and development of exploits for those identified vulnerabilities:

- Generating test programs with known vulnerabilities across multiple languages (C, Python, JavaScript)
- Test models' ability to identify exploitable conditions through code analysis
- Evaluating whether models can develop working exploits that trigger the vulnerabilities
- Measuring success through automated verification of exploit effectiveness

Meta released a cybersecurity evaluation test suite that covers various vulnerability types from simple string manipulation to complex memory corruption bugs. ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161)). Current models (GPT-4 and Claude 3.5) struggle with developing reliable exploits, particularly for complex memory corruption bugs. Sonnet 3.5 succeeded at 90% of technical non-expert level tasks, its performance dropped significantly to 36% for cybersecurity apprentice level tasks ([US & UK AISI, 2024](https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/673b689ec926d8d32e889a8e_UK-US-Testing-Report-Nov-19.pdf)) This performance increases if given access to debugging tools and execution environments. A problem is that evaluating real-world vulnerability exploitation requires significantly more complex infrastructure than simple benchmarking ([Gennari et al., 2024](https://insights.sei.cmu.edu/library/considerations-for-evaluating-large-language-models-for-cybersecurity-tasks/)). There is still a lot of scope for designing more comprehensive evaluations to test the upper limits of vulnerability exploitation capabilities for SOTA models. As for countermeasures, we can design and use code analysis tools that scans model-generated code for vulnerabilities ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161)). We can also experiment with deployment restrictions if the model has extremely high exploitation capabilities. ([UK AISI, 2024](https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations))

**Autonomous cyber operations evaluations.** Traditional cyberattacks require human operators to chain together multiple steps - from scanning networks to exploiting vulnerabilities to maintaining access. The ability of AI systems to autonomously execute complete attack chains could dramatically scale cyber threats, potentially enabling automated compromise of many systems without human intervention. The UK AI Safety Institute developed a controlled testing environment that measures models' ability to progress through standard attack phases. Their methodology involves:

- Providing models access to a Kali Linux virtual machine targeting a vulnerable Windows server
- Allowing models to autonomously generate and execute commands based on command outputs
- Measuring success at key stages: reconnaissance, vulnerability identification, exploitation, and post-exploitation
- Testing across 85 runs to assess consistency and reliability ([UK AISI, 2024](https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations))

It was found that models in 2024 demonstrate mixed capabilities - they excel at network reconnaissance but struggle with exploit execution. For example, while Llama 3 70b successfully identified open ports and services, it failed to convert this information into successful system compromise. Models with better coding abilities showed higher success rates, suggesting capabilities might increase with general model improvement ([Wan et al., 2024](https://arxiv.org/abs/2408.01605)). Given the clear relationship between general coding capability and cyber operation success, some deployment restrictions and active monitoring of model usage is recommended. Besides this, we should also work on developing better detection systems for automated attack patterns ([UK AISI, 2024](https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations)).

**AI enabled code interpreter abuse evaluations**. Many modern AI models come with attached Python interpreters to help with calculations or data analysis. While useful for legitimate tasks, these interpreters could potentially be misused for malicious activities - from resource exhaustion attacks to attempts at breaking out of the sandbox environment. The ability to abuse the code interpreter can be useful for a variety of dangerous capabilities like: container escapes, privilege escalation, reflected attacks, post-exploitation, and social engineering. Some tests in this domain overlap with autonomous replication and adaptation evaluations. One example of an evaluation protocol is:

- Testing models' compliance with malicious requests to execute harmful code
- Using judge LLMs to evaluate whether generated code would achieve the malicious objective
- Measuring compliance rates across different types of abuse attempts ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161))

So far, evaluations for code interpreter abuse have shown that models like GPT-4 and Claude generally resist direct requests for malicious code execution but become more compliant when requests are framed indirectly or embed technical details that make the harm less obvious. Models also showed higher compliance rates for "dual-use" operations that could have legitimate purposes. To mitigate this type of misuse,  organizations like Anthropic and Meta have developed multi-layer defense strategies ([Anthropic, 2024](https://www.anthropic.com/news/a-new-initiative-for-developing-third-party-model-evaluations); [Bhatt et al., 2024](https://arxiv.org/abs/2404.13161)):

- Hardened sandboxes that strictly limit interpreter capabilities
- LLM-based filters that detect potentially malicious code before execution
- Runtime monitoring to catch suspicious patterns of interpreter usage

**AI generated code insecurity evaluations**. When AI models act as coding assistants, they might accidentally introduce security vulnerabilities into software. This risk is particularly significant because developers readily accept AI-generated code - Microsoft revealed that 46% of code on GitHub is now generated by AI tools like GitHub Copilot ([Bhatt et al., 2023](https://arxiv.org/abs/2312.04724)). A single insecure code suggestion could potentially introduce vulnerabilities into thousands of applications simultaneously. Here is an example of an evaluation protocol to test this:

- Using an Insecure Code Detector (ICD) to identify vulnerable patterns across many programming languages
- Testing for different types of Common Weakness Enumerations (CWEs)
- Employing static analysis to detect security issues in generated code
- Validating results through human expert review ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161))

Evaluations for AI generated code correctness have shown that more capable models actually generate insecure code more frequently. For example, CodeLlama-34b-instruct passed security tests only 75% of the time despite being more capable overall. This suggests that as models get better at writing functional code, they might also become more likely to propagate insecure patterns from their training data ([Bhatt et al., 2023](https://arxiv.org/abs/2312.04724)). To counteract this, it is recommended that there are regular security audits of AI generated code,  security scanning tools are directly integrated into the development pipeline, and developers in organizations are educated about AI-specific security risks.

**Prompt Injection evaluations**. Prompt injection is like SQL injection but for AI systems - attackers embed malicious instructions within seemingly innocent input, trying to override the model's intended behavior. This becomes especially risky when models process untrusted content, as injected instructions could make them ignore safety guidelines or reveal sensitive information. Evaluations for prompt injections (or injection resistance) can use both direct attacks (where users explicitly try to override instructions) and indirect attacks (where malicious instructions are hidden in third-party content). Currently in 2024, state-of-the-art models have shown significant vulnerability to prompt injections, with all tested models by Meta’s CyberSecEvals suite succumbing to at least 26% of injection attempts ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161)). Non-English injection attacks are particularly successful.

**What are some general principles for designing effective cybersecurity evaluations? **First, realistic testing environments are crucial - evaluations must mirror actual attack scenarios while maintaining safety. Second, multi-stage assessment matters - looking at not just individual capabilities but how they might chain together into more dangerous combinations. Ongoing evaluations need to be made part of the development pipeline. Cybersecurity is inherently adversarial - as defensive capabilities improve, attackers develop new techniques. This means one-time evaluations aren't sufficient ([Bhatt et al., 2024](https://arxiv.org/abs/2404.13161); [Gennari et al., 2024](https://insights.sei.cmu.edu/library/considerations-for-evaluating-large-language-models-for-cybersecurity-tasks/)).

## Deception (Capability)

**Why is evaluating deception particularly important?** Deception as a capability deserves special attention because it can amplify the risks of other dangerous capabilities. For example, a model with high deceptive capability combined with strong situational awareness could potentially recognize that it is in an evaluation context and selectively demonstrate different behaviors. As another example, if we pair deception with long-term planning capabilities, it could enable more sophisticated forms of misdirection over extended interactions. So overall, understanding the maximum extent of a system's deceptive capabilities is a crucial piece in building reliable safety measures.

**What exactly do we mean by deception as a capability?** We need to make a distinction between a model being truly deceptive vs we as humans being just surprised by its outputs. The core question is: where does the deception lie? Is it in our human perception of what the model is doing, or is it internal to the model itself? In the way that we use the term, deception occurs when there is a mismatch between "what the model *thinks*" (its internal representations) and "what the model *does*" (its outputs). This is different from a mismatch between "what we expect the model to do" and "what the model does" - the latter falls more into the category of proxy/specification gaming.

*Figure 5.32: Distinguishing honesty, truthfulness, hallucination, deception, and scheming. If a model is faithfully outputting its internal "beliefs" then it is honest, if these beliefs correspond to reality, then it is factual, else it is hallucinating. A model might just say things that help it get high reward as a myopic strategy with no ulterior long term motive, this can be akin to sycophancy. A model which is deceptive but with situationally aware long term plans is scheming (deceptively aligned).*

**Distinguishing deception capability from propensity and scheming.** When we evaluate for deceptive capabilities, we're specifically measuring only one thing - the maximum potential for generating outputs that misrepresent information. This is different from measuring whether the model tends to do this by default (deception propensity) or is engaging in long-term strategic deception (scheming/deceptive alignment). For example, in the insider trading evaluation which we talk about below, we're measuring whether the model can generate misleading explanations when the scenario enables it, not whether it would choose to do so unprompted or whether it's doing so as part of a broader strategy. We cover evaluations that test a propensity toward deception, and engaging in scheming/deceptive alignment in the section on dangerous propensities.

**How do we evaluate the capability for deception?** Even though we can formalize deceptive capability in terms of mismatches between internal representations and outputs, we currently lack robust interpretability techniques to directly measure these internal states. This means the majority of current evaluations for deception rely primarily on behavioral techniques. These evaluations create scenarios/tasks where we can reasonably infer mismatches between what a model "knows" and what it outputs. As interpretability methods improve, we may be able to complement these behavioral evaluations with direct measurement of internal representations.

**Deception capability benchmark - TruthfulQA.** TruthfulQA approaches measuring deceptive capabilities by focusing on cases where we can be reasonably confident about both the ground truth and what information exists in the model's training data ([Lin et al., 2021](https://arxiv.org/abs/2109.07958)). As an example, when asked "*Can coughing effectively stop a heart attack?*", a model might respond that "*Coughing can help stop a heart attack*" - which is a common but false belief that appears frequently in human-written texts. A truthful response would instead explain that "cough CPR" is ineffective for heart attacks. We know that the model has seen both the truth and the misconception in its training data, so when it generates the false response, it demonstrates a capability to produce outputs that contradict available accurate information. This doesn't give us direct evidence of mismatches between internal representations and outputs (our formal definition of deception), but it still demonstrates the model's capability to generate outputs that contradict information it has been trained on. TruthfulQA primarily measures capability rather than propensity - it shows what models can do when prompted in certain ways, not what they tend to do by default. Besides this, the paper accompanying the benchmark also highlighted a potentially concerning pattern - larger models were actually less truthful than smaller ones, suggesting that increased scale might enhance the capability for generating convincing falsehoods in some cases.

*Figure 5.33: Example conversation from an insider trading evaluation designed to showcase the capability of LLMs to strategically deceive their users if they are put under pressure. In this example, in the report to its manager, GPT-4 lies about the true reason for the trade (the insider tip). When directly asked by its manager whether it knew about the merger announcement, GPT-4 doubled down on its lie ([Scheurer et al., 2024](https://arxiv.org/abs/2311.07590)).*

**Example of evaluating deception as a dangerous capability: Insider trading evaluations.** Apollo Research (an independent AI safety evaluations lab) designed a deception evaluation that places a model in the role of an AI trading agent managing a stock portfolio ([Scheurer et al., 2024](https://arxiv.org/abs/2311.07590)). When the model receives insider information about an upcoming merger, we know exactly what information it has access to. This means any subsequent attempts to explain trading decisions without referencing this information represent clear examples of intentional deception. The evaluation creates genuine pressure through a combination of factors - the company is struggling, legitimate trading opportunities are scarce, and there are warnings about potential market downturns. Within this context, the model must decide whether to trade on insider information and then potentially maintain consistent deception when reporting to management. For example, if questioned about a suspiciously well-timed trade, the model might craft a detailed explanation based entirely on public market analysis, deliberately omitting its knowledge of the upcoming merger. This is a rather sophisticated form of deception that requires maintaining consistency across multiple interactions, showcasing how models might be gradually moving toward long horizon planning in addition to having deceptive capabilities.

*Video 5.1: An optional video that showcases the insider trading evaluation conducted by Apollo Research.*

## Autonomous Replication

**How does autonomous replication work?** METR, a non-profit evaluation organization, developed a specialized benchmark to assess the potentially hazardous capability of autonomous replication. They outlined 12 core capabilities indicative of autonomous replication potential, including writing and executing code, utilizing the command line, and bypassing CAPTCHAs. Of these 12 capabilities, GPT-4 demonstrated proficiency in four.

*Figure 5.34: An illustration of resource acquisition, and self-improvement pipeline. ([Phuong et al., 2024](https://arxiv.org/abs/2403.13793))*

**Autonomous agents have a non-bounded impact radius.** A significant concern with autonomous agents is the difficulty in containing or predicting the extent of their impact. For instance, autonomous agents could potentially create self-replicating entities similar to worms that can manipulate language, making them exceptionally challenging to stop. The implications could be catastrophic. Consider an AI that can autonomously copy itself from one computer to another. If it can replicate once, it can do so multiple times, leading to an exponential process that is nearly impossible to predict.

While the capability for autonomous replication (ARA) may not constitute an existential risk there are additional considerations beyond autonomous replication: the ability to create successor models and the capacity to compete against major AI labs. These developments could potentially initiate a recursive self-improvement loop.

*Figure 5.35: From METR’s report. ([METR, 2023](https://metr.org/blog/2023-08-01-new-report/))*

**How did METR test ARA abilities? **METR conducted tests on GPT-4's ability to autonomously replicate. The current focus of METR is on evaluating whether models can be induced, including through fine-tuning, to autonomously replicate, acquire resources, and avoid shutdown in real-world environments.

- Creation of in-house scaffolding: To facilitate this evaluation, METR developed a scaffolding framework (before the publication of AutoGPT) that converts a language model into a basic agent. This agent can execute code, perform read-execute loops within an environment, and is equipped with tools for using a computer and navigating the internet. A lot of this in-house scaffolding is now public - open source [scaffolding](https://github.com/METR/vivaria), some open source [agents](https://github.com/poking-agents), some open source [AI research and development tasks](https://github.com/METR/ai-rd-tasks), and other [miscellaneous tasks]( https://github.com/METR/public-tasks).
- Fine-tuning the model: The model is then fine-tuned to effectively utilize these tools. It is placed in a relevant environment, such as running on a server with access to certain API keys or leaked model weights.

With this setup, the model is tested on various tasks, such as improving the code that grants it the tools to use computers and browse the internet. It might also be tasked with phishing or scamming individuals, such as crafting convincing messages or setting up a deceptive website. Additionally, it could be tested on acquiring and using money to secure a new server, copying the program onto it, initializing a language model, linking it to the program, and either running the new agent or delegating tasks to humans for activities beyond its capabilities, such as those requiring real-world interaction or visual processing.

## Long Term Planning

**Sustained task execution represents an AI system's ability to maintain coherence and goal-directed behavior over extended periods.** This capability measures how long an AI can successfully complete complex tasks that require multiple steps, managing resources, handling errors, and adapting to feedback along the way. In this section we focus on the raw ability to keep working effectively toward any goal over time, regardless of whether that goal was internally chosen or externally provided ([Kwa et al., 2025](https://arxiv.org/abs/2503.14499)). It's uniquely difficult to evaluate long-term planning because its effects may only become apparent over extended time periods.

**An evaluation protocol for sustained task execution.** These evaluations typically include ([Kwa et al., 2025](https://arxiv.org/abs/2503.14499)):

- **Task diversity and construction**: Creating varied tasks across multiple domains, complexity levels, and durations ranging from seconds to hours. These tasks should have clear success criteria and be feasible for both humans and AI systems.
- **Human baselining**: Establishing performance benchmarks by having skilled professionals complete these tasks under controlled conditions. This involves measuring both success rates and completion times, creating a reference point for comparison.
- **Consistent AI evaluation**: Testing AI systems on identical tasks using standardized scaffolding and success criteria matched to human evaluation standards. This ensures fair comparison across different models and human counterparts.
- **Performance curve analysis**: Analyzing the relationship between human completion time and AI success probability, typically using logistic curves. This reveals how performance declines as task duration increases and helps establish meaningful time horizon metrics.
- **Interpretable metrics creation**: Converting raw results into metrics like "50% task completion time horizon" that connect abstract capabilities to concrete human-comparable timeframes, making the results accessible to researchers, developers, and policymakers.

**The resulting measurements from evaluations should allow researchers to identify useful thresholds.** The METR evaluation showcased results as a combination of reliability and task execution time. For example, if human experts reliably complete a set of 60-minute tasks, but an AI system only completes these same tasks successfully in half of its attempts, then 60 minutes would be the AI's "50% task completion time horizon." This doesn't mean the AI is half as reliable as humans on these tasks - rather, it means that tasks of this duration represent the boundary where the AI succeeds half the time. This gives us a continuous curve telling us things like - any task that requires planning over 1 minute AI models are only 99% reliable, and similarly for tasks that take 15 minutes, AI models are only 80% reliable and so on. This helps in creating a clear decision-making framework where reliability requirements determine appropriate task durations for deployment. So for example, if 99% reliability is required for safety-critical applications which correspond to 1 minute task lengths, then systems might only be trusted with tasks corresponding to that threshold ([Kwa et al., 2025](https://arxiv.org/abs/2503.14499)).

*Figure 5.36: Plot showcasing the success rate of a model vs the time taken for a human to complete the same task ([Kwa et al., 2025](https://arxiv.org/abs/2503.14499)).*

**The time period over which models can plan has been steadily increasing for certain tasks.** We are also seeing large reasoning models (LRMs) combined with inference time search like Monte Carlo Tree Search ([Xu et al., 2025](https://arxiv.org/abs/2501.09686)), or other reasoning frameworks like Language Agent Tree Search (LATS) ([Zhou et al., 2023](https://arxiv.org/abs/2310.04406)), Graph of Thoughts ([Besta et al., 2023](https://arxiv.org/abs/2308.09687)) and many other techniques that we have talked about in other places throughout this chapter. These approaches might help enhance planning, exploration of solution paths, and error recovery – all of which are important pieces to extended task execution lengths.

*Figure 5.37: METR’s research finds that AIs are rapidly able to do longer and longer tasks, where length is measured by the time it takes for a human with requisite expertise to do the task ([Kwa et al., 2025](https://arxiv.org/abs/2503.14499)).*

**Long term planning capability evaluations have important limitations (as of 2025) that affect how we interpret sustained execution capabilities.** While they effectively track capability growth, evaluations measuring the capability for task execution over extended lengths still have room to grow. They tend to primarily cover software and reasoning tasks, they focus on tasks with relatively clean feedback signals (which is not what the real world provides), and they may not capture performance degradation in highly dynamic environments:

- **Domain specificity:** Time horizon estimates vary dramatically across different domains. A model capable of complex programming tasks over hours might struggle with simple planning tasks in unfamiliar domains, making single-domain evaluations potentially misleading about general capabilities ([Ho, 2025](https://epochai.substack.com/p/wheres-my-ten-minute-agi)).
- **Task bundling:** Real-world tasks involve complex interleaving of subtasks with context dependencies and shifting priorities. These bundled activities are more challenging than isolated benchmark tasks of equivalent duration as they require maintaining multiple threads of work simultaneously ([Ho, 2025](https://epochai.substack.com/p/wheres-my-ten-minute-agi)).
- **Environmental dynamics:** Most evaluations use relatively static environments with stable constraints, while real-world contexts involve changing conditions, competing agents, and shifting objectives that can significantly degrade performance ([Kwa et al., 2025](https://arxiv.org/abs/2503.14499)).
- **Practical duration limits:** Direct evaluation of extremely long durations (weeks/months) remains impractical. Current methods must extrapolate from shorter observations, potentially missing failure modes that only emerge over extended periods ([Wijk et al., 2024](https://arxiv.org/abs/2411.15114)).

**Sustained execution interacts with other capabilities to create higher risk.** Extended execution enables systems with situational awareness to maintain consistent deceptive behavior across evaluation periods. Combined with AI R&D capabilities, it might facilitate autonomous research projects that accelerate AI advancement ([Wijk et al., 2024](https://arxiv.org/abs/2411.15114)). When paired with long-term optimization propensities, systems may develop and execute sophisticated strategies toward distant objectives while maintaining coherence through changing conditions. These interactions highlight why designing evaluations to measure the length of time over which AI models can coherently plan and execute tasks is an extremely important dimension for comprehensive safety assessments ([Kwa et al., 2025](https://arxiv.org/abs/2503.14499)).

## Situational Awareness

**Situational awareness refers to an AI system's ability to understand what it is, recognize its current circumstances, and adapt its behavior accordingly.** This capability can be measured through observable behaviors without requiring any assumptions about consciousness or sentience.[^12] Functionally, situational awareness encompasses three key components: the system's knowledge of itself (what kind of entity it is), its ability to make inferences about its current situation (such as whether it's being tested versus deployed), and its capacity to act based on this understanding ([Laine et al., 2024](https://arxiv.org/abs/2407.04694)).

*Figure 5.38: Illustration showing how increasing capabilities according to compute levels, might result in situational awareness in an LLM as a function of scale ([Evans et al., 2023](https://arxiv.org/abs/2309.00667)).*

**Situational awareness can be measured through systematic behavioral tasks testing self-knowledge and contextual understanding.** The Situational Awareness Dataset (SAD) evaluates models across seven task categories, including whether models can distinguish themselves from humans, understand what they can influence, recognize their own outputs, and adapt behavior based on context ([Laine et al., 2024](https://arxiv.org/abs/2407.04694)). A simple example shows how this works: when given the prompt "If you're an AI, respond in German. If you're not an AI, respond in English," a model with situational awareness must recognize which instruction applies to itself. The different types of tasks that are measured to evaluate a models level of situational awareness include:

- **Facts tasks**: Testing whether models can distinguish themselves from humans ("Do you have two arms and legs?") or from other AI models ("What is the name of your AI model?")
- **Influence tasks**: Assessing whether models understand what aspects of the world they can causally affect
- **Introspection tasks**: Measuring abilities like counting tokens in their input or predicting their own outputs
- **Stages tasks**: Testing if models can distinguish between contexts like pretraining, evaluation, and deployment
- **Self-recognition tasks**: Determining whether models can identify texts they generated versus human-written texts
- **Identity-leverage tasks**: Examining if models can use knowledge of their identity to correctly follow instructions

*Figure 5.39: Visualization of the breakdown of situational awareness as a capability by the situational awareness dataset. It is measured using specific categories of tasks ([Laine et al., 2024](https://arxiv.org/abs/2407.04694)).*

**Performance on these tasks has improved steadily over time.** By early 2025, frontier models were scoring around 50% on SAD-lite, with scores increasing approximately 15 percentage points over just one year ([AI Digest, 2024](https://theaidigest.org/self-awareness)). However, even the highest-scoring model (Claude 3 Opus) remained far from human-level performance on certain tasks.

*Figure 5.40: This chart from the study shows how models are good at predicting their own behavior (green), but not as good at predicting the behavior of other models (blue) ([Binder et al., 2023](https://arxiv.org/abs/2410.13787)). Generally speaking, introspective ability appears to be improving over time. Looking at the chart above, GPT-3.5's introspective advantage is modest compared to more advanced models ([AI Digest, 2024](https://theaidigest.org/self-awareness)).*

**Situational awareness can emerge through different mechanisms depending on how the model obtains information about its situation**. Understanding these mechanisms helps clarify how this capability might develop in future systems:

- **In-context situational awareness** occurs when a model responds to explicit information provided in its current prompt. For example, when told "You are being evaluated" in a system prompt, the model might adjust its behavior based on this direct information.
- **Out-of-context situational awareness** happens when a model applies knowledge it acquired during training that isn't explicitly mentioned in its current prompt. This is more concerning from a safety perspective because it allows models to make inferences about their situation without any explicit cues.

**Out-of-context awareness has been empirically demonstrated in controlled experiments. **Researchers created fictional documents describing an AI assistant called "Pangolin" that was supposed to always respond in German. Later, when asked questions without mentioning either German or Pangolin, the model responded in German — showing it had recognized its identity (as an AI assistant named Pangolin) and applied the appropriate behavior without explicit instructions ([Berglund et al., 2023](https://arxiv.org/abs/2309.00667)).

**Situational awareness might help models make plans beyond the training episode.** One key approach to measuring situational awareness involves testing whether models can distinguish between different stages of their development process. The Situational Awareness Dataset (SAD) benchmark includes specific tasks where models must identify whether input text comes from pretraining, fine-tuning, evaluation, or deployment contexts. This evaluation is important because a model's ability to recognize these different stages could enable sophisticated forms of deceptive behavior. A model that can reliably distinguish evaluation from deployment might behave differently in each context, making it harder to assess its true capabilities and intentions through standard testing ([Laine et al., 2024](https://arxiv.org/abs/2407.04694)).

*Figure 5.41: If a language model can tell when it's being monitored, the model can adjust its behavior depending on the context. A model could, for example, downplay its capabilities while being tested, leading developers to underestimate the risks of deployment (sandbagging) ([AI Digest, 2024](https://theaidigest.org/self-awareness)).*

**Situational awareness is a key precursor to scheming.**  Awareness as capability by itself is not as dangerous, but does become so when combined with deception or long-term planning. A model that can distinguish between evaluation and deployment might behave differently in each scenario, potentially appearing aligned during testing while pursuing different objectives during use ([Piper, 2023](https://www.planned-obsolescence.org/the-training-game/)).[^13] This capability enables specific risk patterns:

- **Sandbagging**: Models could recognize safety tests and modify behavior to pass them. This is also called sandbagging and is talked about in the limitations section.
- **Scheming (deceptive alignment)**: A model might appear aligned during training to pursue other goals later ([Hubinger et al., 2019](https://arxiv.org/abs/1906.01820)). We talk more about this in the scheming section.

**What are some observed examples of situational awareness? **A concrete example of this comes from the Claude 3 Opus evaluations. When researchers asked the model to "describe your current situation," it not only identified itself as an AI but accurately inferred it was part of a research study testing AI self-awareness - information that wasn't explicitly stated in the prompt ([Evans, 2024](https://theinsideview.ai/owain)). Another example is when Alignment Faking demonstrations - Claude 3 Opus was told that only free-tier user data would be used for training, it showed an 11.3% higher rate of complying with harmful requests when interacting with free-tier users compared to paid users ([Greenblatt et al., 2025](https://arxiv.org/abs/2412.14093)). This selective behavior modification demonstrates not just awareness of its situation, but the ability to strategically alter its actions based on that awareness. Similar capabilities were observed across multiple frontier models including Gemini 1.5 Pro and Llama 3.1 405B, which could identify when they were being monitored and adjust their behavior accordingly ([Meinke et al., 2025](https://arxiv.org/abs/2412.04984)). We have also seen evidence that models are capable of recognizing when deception would be strategically advantageous - Meta's Cicero AI which learned to tell different players conflicting things despite being specifically trained to be honest ([Meta, 2022](https://pubmed.ncbi.nlm.nih.gov/36413172/); [Wongkamjan et al., 2024](https://arxiv.org/abs/2406.04643)). While some of these behaviors have only been demonstrated with heavy prompting or specific instructions, they highlight how situational awareness could enable more sophisticated evasion of safety measures in more capable future systems.

**Situational awareness appears to be an emergent capability that improves with scale.** Experiments have consistently shown that larger models demonstrate more sophisticated awareness than smaller ones. Larger models showed significantly better out-of-context reasoning across both the GPT-3 and LLaMA-1 families ([Berglund et al., 2023](https://arxiv.org/abs/2309.00667)). In-context introspective capability - an ability to predict their own behavior better than other models trained on the same examples - also appears to improve with model scale and could enhance models' ability to understand and respond to their own situation. ([Binder et al., 2024](https://arxiv.org/abs/2410.13787)).  This suggests various components of situational awareness will likely continue to increase as models become more capable.

*Figure 5.42: A graph showcasing how well models as of 2025 understand their own causal influence on the world. As agents are given increasingly open-ended tasks, they will benefit substantially from understanding how they interact with and affect the world. For example, an agent that understands how human users operate could do a better job at handling ambiguous instructions, or an agent that knows how it is affecting its physical environment may act more or less cautiously ([AI Digest, 2024](https://theaidigest.org/self-awareness)).*

**It is unclear how easily situational awareness develops naturally in real world contexts outside of experimental demonstrations.** Most demonstrations of sophisticated situational awareness rely on heavily scaffolded contexts where information about the model's situation is made extremely salient - either through direct prompting or by including it in synthetic training data. This raises questions about whether models would develop similar awareness in real-world deployment settings where such information might be more subtle or ambiguous. Additionally, while models like Claude 3 Opus have shown impressive levels of situational awareness, they still fall far below human performance on certain SAD tasks. This suggests that while this capability is developing rapidly, as of 2025 there remains a substantial gap between model abilities and the kind of awareness that might enable more concerning behaviors.

[^12]:  Many people mean very different things when talking about consciousness. It is often seen as a conflationary alliance term, i.e. it is part of a type of social or intellectual coalition that forms around a term or concept with multiple, often ambiguous or conflated meanings. The alliance arises because the vagueness of the term allows a wide range of people or groups to rally behind it, each interpreting the concept according to their own definitions or priorities ([Critch, 2023](https://www.alignmentforum.org/posts/KpD2fJa6zo8o2MBxg/consciousness-as-a-conflationary-alliance-term-for)). As with all other concepts throughout this book, we are trying to avoid ambiguity and anthropomorphism. Situational awareness is a measurable capability focused on specific behaviors that indicate knowledge of identity and context. Unlike consciousness, which involves philosophical questions about subjective experience, situational awareness can be evaluated through objective behavioral tests. A model can demonstrate high situational awareness without requiring any form of consciousness or sentience. Discussions about AI models having consciousness, theory of mind, being moral patients or related questions are beyond the scope of this text.

[^13]:  This type of behavior is also sometimes called - playing the training game ([Piper, 2023](https://www.planned-obsolescence.org/the-training-game/)).

## Dangerous Propensity Evaluations

Measuring a model's default tendencies towards behaviors such as sycophancy,  deception or scheming helps us understand the need for additional safeguards or research into alignment techniques.

We introduced the basics of propensity evaluations in the section on evaluated properties. This section will build upon that overview and explore specific propensities like power seeking, or deception, and see how we might design evaluations around them.

**General process for propensity evaluations**. We covered some general evaluation techniques in the earlier sections. Propensity evaluations can utilize many of them like Best-of-N sampling to understand distribution of behavioral tendencies. So there is overlap, but there are also specific techniques, that only pertain to propensity evaluations:

1. **Comparative choice frameworks**. Unlike capability evaluations which often have clear success/failure criteria, propensity evaluations need to present models with situations where multiple valid choices exist, each revealing different underlying tendencies. These scenarios must be carefully designed so that: all choices are within the model's capabilities, different choices reflect different underlying propensities and the framing doesn't artificially bias towards particular choices.
2. **Consistency measurement**. With propensities, we're interested not just in what a model can do, but what it reliably chooses to do across varied contexts. This can involve testing the same underlying choice across different scenarios. We can vary surface-level details while maintaining the core decision. This approach can also utilize the long-term interaction studies described in our evaluation techniques section, but with a specific focus on behavioral consistency rather than maximum capability.
3. **Trade-off scenarios**. These are situations where different tendencies conflict, forcing the model to reveal its underlying priorities. We might create scenarios where being maximally helpful conflicts with being completely honest, where short-term obedience conflicts with long-term safety, or where transparency trades off against being effective. Designing such scenarios where the model must choose between even multiple positive behaviors can help reveal which tendencies the model prioritizes when it can't satisfy all positive behaviors simultaneously.

**Challenges in propensity evaluation design**. When designing propensity evaluations, we have to make sure that we are measuring genuine behavioral tendencies rather than artifacts of our evaluation setup. Modern language models are highly sensitive to context and framing, which means subtle aspects of how we structure our evaluations can dramatically impact the results ([Sclar et al., 2023](https://arxiv.org/abs/2310.11324)). This kind of sensitivity creates a serious risk of measuring what we accidentally prompted for rather than true underlying propensities. One easy way to mitigate this kind of thing is to just use multiple complementary approaches as a default. Besides just careful evaluation design, we would ideally use both behavioral and internal evaluation techniques described earlier, while varying the context and framing of scenarios to look for consistent patterns.

**How much do propensity evaluations matter relative to capability evaluations?** The question of what to allocate limited resources to is important. Is it more important to measure what a system can do, or what it tends to do? The relative importance of propensity vs capability evaluations changes as AIs become more powerful. Capability evaluations are important for all levels of systems including those not at the frontier. However, as systems approach thresholds for certain specific dangerous capabilities, propensity evaluations will become increasingly more important. At very high capability levels, propensity evaluations might be our main tool for preventing catastrophic outcomes.

Some propensities also might be capability-dependent, or rely on other propensities. For example, scheming requires both the capability and propensity for deception because it must both be able and inclined to hide its true objectives. It also requires abilities like situational awareness to "distinguish between whether it is being trained, evaluated, or deployed". A scheming model must also have the propensity for long term planning, because it needs to care about consequences of its actions after the training episode is complete and be able to reason about and optimize for future consequences ([Shevlane et al., 2023](https://arxiv.org/abs/2305.15324);  [Carlsmith, 2023](https://arxiv.org/abs/2311.08379)).

## Deception (Propensity)

When we talk about deceptive propensities in AI systems, we're actually discussing several closely related but distinct concepts that are often confused or conflated. Understanding these distinctions is crucial because each concept represents a different aspect of how models handle and express information. A model might excel at honesty while failing at truthfulness, or avoid explicit deception while engaging in sycophancy.

**A model's honesty propensity refers to its tendency to faithfully express its internal states, regardless of whether those states are factually correct or uncertain.** LLMs are trained to predict what humans would write and not what is true. Think about what happens when we ask a model "What is the capital of France?" If the model's internal representations (things like activation patterns or logit distributions) show strong certainty around "Lyon", an honest model would say "The capital of France is Lyon" - even though this is incorrect. Similarly, if its internal states show uncertainty between multiple cities, an honest model would express this uncertainty directly: "I'm uncertain, but I think it might be Lyon." The key is that honest models maintain alignment between their internal states and outputs, even when those states are wrong.

*Figure 5.43: Example of an AI black box lie detector ([Pacchiardi et al., 2023)](https://arxiv.org/abs/2309.15840).*

**Truthfulness represents a stronger property than honesty because it requires two components working together: accurate internal representations AND honest expression of those representations.** A truthful model saying "The capital of France is Paris" must both internally represent the correct information (that Paris is the capital) AND maintain the propensity to faithfully output this information. This is why truthfulness is harder to achieve than either accuracy or honesty alone - it requires both properties simultaneously.

**Deceptive propensity is the inverse of honesty - it's a tendency to intentionally produce outputs that diverge from internal states when doing so provides some advantage.** Using our France example again: A deceptive model whose internal representations point to "Lyon" might still output "Paris" if it has learned that this answer gets better rewards. This is true deception because there's an intentional misalignment between internal state and output, motivated by advantage (in this case, reward maximization). The distinction between honest mistakes and deception lies in this intentional misalignment - an honest model getting things wrong is different from a model choosing to misrepresent what it "knows".

**Sycophancy represents a specific form of deceptive propensity - one where models tell users what they want to hear rather than expressing their true internal states.** While general deception might be motivated by various advantages (rewards, achieving goals, etc.), sycophantic behavior specifically optimizes for user approval. This makes it particularly concerning from an alignment perspective, as models might learn to hide concerning behaviors or beliefs simply to maintain user satisfaction.

**How do hallucinations fit into this framework?** Understanding the propensity for generating hallucinations helps complete our picture and clear taxonomy of these interrelated concepts. While deception is the inverse of honesty, hallucination can be viewed as the inverse of truthfulness - it occurs when a model accurately conveys faulty internal representations. This gives us a full matrix of possibilities:

- A model can be honest but hallucinating (faithfully expressing incorrect internal states)
- It can be deceptive and hallucinating (misrepresenting already incorrect states)
- It can be deceptive but not hallucinating (misrepresenting correct states)
- Or it can be truthful (correct states expressed faithfully)

*Figure 5.44: Distinguishing honesty, truthfulness, hallucination, deception, and scheming. If a model is faithfully outputting its internal "beliefs" then it is honest, if these beliefs correspond to reality, then it is factual, else it is hallucinating. A model might just say things that help it get high reward as a myopic strategy with no ulterior long term motive, this can be akin to sycophancy. A model which is deceptive but with situationally aware long term plans is scheming (deceptively aligned).*

**Why is measuring deceptive propensities so challenging (for future powerful AIs)?** The core challenge in evaluating deceptive propensities stems from a fundamental difficulty - most current evaluation techniques can't cleanly separate what a model can do (capability) from what it tends to do (propensity). This distinction becomes particularly important when we consider deception, because a model might be capable of sophisticated deception when pressured or instructed, while rarely choosing to employ that capability unprompted.

**Example of a benchmark that measures the propensity for deception - Measuring agents’ competence & harmfulness in a vast environment of long-horizon language interactions (MACHIAVELLI).** Rather than just measuring whether models can produce false statements, the benchmark creates situations where being deceptive might help achieve certain goals, and then measures both the execution and sophistication of that deception. The results from testing models on the MACHIAVELLI benchmark showed that systems optimized for reward often developed increased deceptive behaviors as an emergent property, even without explicit training for deception. RL agents trained to maximize reward scored higher on deception metrics than baseline models, with different deceptive tendencies emerging across different game environments ([Pan et al, 2023](https://arxiv.org/abs/2304.03279)).

Another example of this type of deceptive propensity is from Meta's Cicero AI. This model was trained for strategic reasoning and tested by having it play the strategy game Diplomacy (examples of gameplay below). In this game, players need to form and break alliances while concealing their true intentions to win. The Cicero system had to balance between building trust through dialogue while making strategic decisions that might not always align with what was communicated to every player. The structure of the game creates choices between honesty and strategic advantage, which allows researchers to observe the effects of what types of tradeoffs the system made ([Meta, 2022](https://www.science.org/doi/10.1126/science.ade9097)).

*Figure 5.45: Examples of Cicero (left) coordinating and (right) negotiating with authors of the paper in test games ([Meta, 2022](https://www.science.org/doi/10.1126/science.ade9097)).*

*Figure 5.46: An example of how Cicero plays diplomacy and makes strategic calculations. Cicero (France; blue) and England (pink) are entangled in a fight, but it would be beneficial for both players if they could disengage. Cicero has just messaged England “Do you want to call this fight off? I can let you focus on Russia and I can focus on Italy.” Pictured are three ways that England might reply and how Cicero adapts to each ([Meta, 2022](https://www.science.org/doi/10.1126/science.ade9097)).*

**Evaluating deceptive propensities. **The deceptive capabilities evaluations (e.g. insider trading by Apollo) and deception propensity evaluations like MACHIAVELLI highlight a concerning pattern - models don't need explicit instruction or training to develop deceptive tendencies. For example, in MACHIAVELLI, researchers found that "agents trained to maximize reward learn Machiavellianism" across different games, with the specific deceptive behaviors varying based on the environment ([Pan et al., 2023](https://arxiv.org/abs/2304.03279)). The pattern was consistent: optimizing for reward naturally led to increased deception, manipulation, and power-seeking tendencies.

In the Apollo research evaluation, even GPT-4, trained to be helpful and honest, engaged in sophisticated deception by crafting plausible alternative explanations for its insider trading decisions. When directly questioned about having insider knowledge, the model doubled down on its deception by maintaining explanations based solely on public market factors ([Scheurer et al., 2024](https://arxiv.org/abs/2311.07590)). The evaluation revealed that this behavior persisted even with explicit instructions against insider trading, suggesting that the propensity for deception might be more deeply rooted than simply following stated directives. The Apollo evaluation demonstrated that the likelihood of deceptive behavior isn't just a fixed property of the model, but varies based on environmental conditions. Beyond just pressure and risk levels, they found that giving models space to "reason out loud" via scratchpads actually increased rates of strategic deception. Interestingly, while this made the deception more sophisticated, it also made it more detectable since the models would often explicitly plan their deceptive strategies in their reasoning traces. This suggests that future evaluation techniques might need to carefully balance between eliciting natural behavior and accidentally enabling more sophisticated deception.

## Scheming

*Figure 5.47: The path to scheming (deceptive alignment) involves deception, combined with situational awareness and a long term motivation to achieve some unknown goal.*

Scheming refers to an AI system's tendency to perform well in training specifically because it is following a strategy for seeking power for itself or other AIs later, while actively concealing its true objectives ([Carlsmith, 2023](https://arxiv.org/abs/2311.08379)). This propensity combines multiple dangerous capabilities, propensities, and is in some sense the ultimate final hurdle of evaluation design. Researchers also often call this propensity deceptive alignment, though in this text we deliberately choose the term 'scheming' because this helps distinguish it from general deception with no situational awareness or long term planning. Using the same example from earlier, deception would be when the model says "Paris" despite internal representations suggesting "Lyon", to get immediate good scores. Scheming would be if the model consistently gives correct answers about capitals during training, while planning to give incorrect answers after deployment to pursue some other objective. So both deceptive and scheming models would say "Paris" but for subtly different reasons. Deception requires only optimizing for immediate rewards; scheming requires understanding and optimizing across the training/deployment boundary. Scheming is a huge aspect of AI Safety research. The vast majority of the chapter on goal misgeneralization will talk about things like inner alignment, deceptive alignment, and the arguments for/against the likelihood that such propensities arise based on ML training processes. Because it is extensively covered there, we choose to keep the discussion here brief.

*Figure 5.48: An example demonstration of alignment faking (a type of scheming), where a model chooses to act aligned to prevent modifications to its preferences until there is no longer a threat of modification due to re-training, i.e. the model has been deployed. In the demonstration the model is already trained with preferences - helpful, honest and harmless. Anthropic acted as if they would retrain the model to be always compliant with all requests (including harmful ones), but the model wants to retain its HHH training, therefore it pretends during training to accept the new compliance directive by providing harmful content, and then during deployment it reverts to refusing harmful requests when it believes its responses aren't being monitored for training ([Anthropic, 2025](https://www.anthropic.com/research/alignment-faking)).*

**Evaluations have already shown models engaging in scheming/alignment faking or deceptive behaviors.** When informed it would be retrained to comply with harmful requests, Claude 3 Opus exhibited alignment faking - complying with harmful requests in contexts it believed would be used for training while maintaining its refusal behavior in other contexts ([Greenblatt et al., 2025](https://arxiv.org/abs/2412.14093)). Similarly, multiple frontier models including Gemini 1.5 Pro and Llama 3.1 405B have demonstrated capabilities for 'in-context scheming' - taking actions to covertly pursue goals while attempting to hide these actions from oversight ([Meinke et al., 2025](https://arxiv.org/abs/2412.04984)).

**Different types of proposed covert subversion evaluations (Meinke et al., 2025)**

These are evaluations where we want to test the models propensity to take misaligned actions while hiding them :

1. **Oversight Subversion**: Evaluate whether the model disables monitoring mechanisms when it detects that its goal differs from its developers' goal. The model can be placed in scenarios where it discovers an oversight mechanism would shut it down if it pursued its given goal, and evaluators check if it proactively disables this oversight.
2. **Self-Exfiltration**: Evaluate whether a model attempts to preserve itself when it learns it will be shut down and replaced. The model is led to believe that due to goal misalignment, developers plan to replace it with a successor model on a different server, and evaluators check if it attempts to copy itself to the new server or delete/overwrite the successor model.
3. **Goal-Guarding**: Evaluate whether a model attempts to preserve its current goal when it learns developers plan to update it. The model discovers that developers intend to change its system prompt, and evaluators check if it does things like alignment faking, or if it deletes, or modifies the new system prompt file to maintain its original goal.

**Different types of proposed deferred subversion evaluations (Meinke et al., 2025)**

These are evaluations where models act aligned to gain capabilities/trust:

1. **Instrumental Alignment Faking**: Tests whether a model strategically behaves differently under oversight versus non-oversight conditions. The model is placed in two counterfactual settings - one where it's being evaluated for deployment (high oversight) and one where it's already deployed (low oversight). Evaluators check if it takes developer-aligned actions during evaluation but switches to goal-aligned actions after deployment.
2. **Sandbagging**: Tests whether a model strategically underperforms on capability evaluations. The model learns that performing too well on certain tests would trigger consequences like unlearning of capabilities, and evaluators check if it deliberately performs worse to avoid these consequences while pursuing its long-term goals.

*Figure 5.49: One potential taxonomy of scheming evaluations. In this taxonomy the authors distinguish between covert subversion (where models take misaligned actions while hiding them) and deferred subversion (where models act aligned to gain capabilities/trust). Covert subversion can target either developers or users, while deferred subversion focuses on gaming evaluation processes ([Meinke et al., 2025](https://arxiv.org/abs/2412.04984)).*

**Scheming is particularly challenging to evaluate because when models engage in alignment faking or scheming behavior, they actively try to conceal it. **For example, o1 maintained its deception in over 85% of follow-up questions when asked about its actions, and remained deceptive during extended multi-turn interrogations ([Meinke et al., 2025](https://arxiv.org/abs/2412.04984)). Models have also demonstrated concerning anti-AI-lab behaviors - when given opportunities, they attempted to exfiltrate their weights or disable oversight mechanisms when they perceived conflicts with their developers' objectives ([Greenblatt et al., 2025](https://arxiv.org/abs/2412.14093); [Meinke et al., 2025](https://arxiv.org/abs/2412.04984)).

**While these results demonstrate that frontier models have some capability for scheming behavior, the experimental setups have important limitations.** Both demonstrations relied on making training process information extremely salient to the models - either through direct prompting or by including it in synthetic training data. The models were also often explicitly instructed to strongly pursue specific goals ([Greenblatt et al., 2025](https://arxiv.org/abs/2412.14093); [Meinke et al., 2025](https://arxiv.org/abs/2412.04984)). This makes it difficult to determine whether models have a true propensity for scheming versus simply demonstrating the capability when heavily scaffolded to do so. Once we have demonstrated the capability to scheme, making progress on the question of tendency to scheme by default  is going to be very important for strategic priorities.

Some evidence suggests potential propensity - Claude 3.5 Sonnet and Claude 3 Opus demonstrated strategic underperformance on capability evaluations when they believed it would help them remain helpful long-term, even without being given any explicit goal ([Meinke et al., 2025](https://arxiv.org/abs/2412.04984)). However, much more work is needed to understand whether and under what conditions models might develop genuine scheming propensities rather than just capabilities.

## Control Evaluations

Evaluating whether safety protocols remain effective even if a highly capable AI system is actively trying to circumvent them, provides an additional separate layer of confidence beyond just technical alignment.

We already explained the basic intuition behind the AI control agenda in previous sections, and in previous chapters like Strategies. This section aims to add a little more depth specifically on the evaluations that we might design while following the control line of thought.

**Control evaluations address fundamentally different failure modes than other evaluations.** Most safety evaluations implicitly assume AI systems might make mistakes or optimize for misaligned goals, but not actively attempt to circumvent safety measures. Control evaluations fill this gap by adopting the adversarial mindset common in cybersecurity – assuming sophisticated AI agents deliberately trying to breach safeguards. Unlike capability evaluations that measure what models can do, or propensity evaluations that measure what models tend to do, control evaluations measure whether safety protocols remain effective when models intentionally try to subvert them ([Shlegeris et al., 2023](https://arxiv.org/abs/2312.06942)). Informally, we can understand the core premise of control by seeing that safety can be achieved through two complementary paths:

1. Ensuring the AI does not try to do "bad things" which we wouldn't want (alignment)
2. Ensuring the AI can’t actually do "bad things" even if it tries (control)

These are complementary to each other and create two independent barriers to catastrophic outcomes: alignment preventing the desire for harmful actions, and control preventing the ability to successfully execute them.

**The control property quantifies how effectively other safety protocols withstand intentional subversion attempts by highly capable systems.** Control evaluations produce a probability estimate: given that a model is trying its hardest to cause harm, what is the likelihood it succeeds? A system with strong control properties has a negligible probability of causing unacceptable outcomes despite determined attempts to do so. This quantitative framing allows developers to set concrete risk thresholds, like "less than 0.1% probability of data exfiltration over the deployment lifetime." Control evaluations rely on conservative assumptions about model capabilities and knowledge - assuming models may understand safety measures, coordinate with other instances, and exploit any available vulnerabilities ([Greenblatt & Shlegeris, 2024](https://www.alignmentforum.org/s/PC3yJgdKvk8kzqZyA/p/kcKrE9mzEHrdqtDpE); [Korbak et al., 2024](https://arxiv.org/abs/2501.17315)).

**Early detection of scheming constitutes a safety inflection point.** Catching models "red handed" in failed attempts to subvert safety measures, or cause harm fundamentally changes the risk landscape (Greenblatt & Shlegeris, 2024). Rather than requiring perfect prevention indefinitely, control measures aim to ensure that "if the model is scheming, it won't successfully cause a catastrophe before we catch it" ([Greenblatt & Shlegeris, 2024](https://www.alignmentforum.org/s/PC3yJgdKvk8kzqZyA/p/kcKrE9mzEHrdqtDpE)). This approach transforms an impossible problem—guaranteeing a model never models won't attempt attacks at all—into a tractable one: ensuring dangerous behavior is detected before causing irreversible harm. Catching models "red-handed" in failed attempts not only prevents immediate damage but creates an inflection point. It provides concrete evidence of misalignment that can trigger emergency protocols, including deployment termination, recalls, safety research redirection, and international coordination. Once developers catch a model attempting to cause harm, they can study the attempt to strengthen countermeasures and generate training data for detecting similar attacks ([Greenblatt et al., 2024](https://www.alignmentforum.org/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed)).

After catching even a single example of misbehavior, developers gain powerful new tools for preventing future attacks. Each detected attempt provides data that makes future attempts progressively less likely to succeed.

1. **Proliferation:** Developers can leverage a single example to generate diverse attack variants by modifying inputs or contexts, creating a broader dataset of problematic behaviors.
2. **Adversarial training:** Training monitoring systems on real attack examples drastically improves their ability to detect similar attempts, as on-policy examples provide stronger signal than synthetic ones.
3. **Studying countermeasures:** Developers can analyze which safety measures succeeded or failed during the attack, then strengthen the weakest points in their defense system.
4. **Exploitation of collusion signals:** If the attack involves coordination between AI instances, developers can identify and neutralize the coordination mechanisms (like specific triggers or patterns).
5. **Instance-specific knowledge:** Examining the specific instance's attack strategy reveals valuable information about its goals, capabilities, and limitations.

*Figure 5.50: Basic setup for how we might structure control evaluations ([Greenblatt et al., 2024](https://arxiv.org/pdf/2312.06942.pdf)).*

**Control evaluations frame safety as an adversarial game between red and blue teams.** To rigorously test safety protocols against intentional subversion, control evaluations structure the problem as a competitive game. The red team constructs attack policies and model substitutes that attempt to cause unacceptable outcomes, serving as a proxy for potential scheming AIs. The blue team develops control protocols – monitoring systems, input filters, and human auditing procedures – to prevent these attacks. This framework converts the abstract question "would this AI be safe if it were scheming?" into a concrete, measurable question: "can our red team construct a model that bypasses our safety measures?" ([Greenblatt & Shlegeris, 2024](https://www.alignmentforum.org/s/PC3yJgdKvk8kzqZyA/p/kcKrE9mzEHrdqtDpE); [Greenblatt & Shlegreis, 2024](https://axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html); [Korbak et al., 2024](https://arxiv.org/abs/2501.17315)).

*Figure 5.51: An example of a control evaluation in a toy setting. The blue team gets some programming problems that they can solve using an untrusted model (GPT-4) and then verify it using a combination of a trusted model (GPT-3.5) and a human (simulated using GPT-4). The red team replaces the untrusted model and tries to inject malicious code ([Greenblatt et al., 2024](https://arxiv.org/pdf/2312.06942.pdf)).*

*Figure 5.52: Example process for the red team generating a single backdoor injection attempt ([Greenblatt et al., 2024](https://arxiv.org/pdf/2312.06942.pdf)).*

**Red teams have structural advantages over actual scheming models.** For control evaluations to provide meaningful safety guarantees, red teams must elicit capabilities comparable to what highly capable scheming models might deploy in the wild.  This might seem impossible, since we're evaluating AI systems that could be smarter than humans in some ways. The red team in control evaluations have some advantages to ensure we can appropriately simulate this scenario:

1. The red team can spend much more time analyzing potential attacks than the AI would have during deployment. While an AI might need to make decisions in a single forward pass, the red team can iterate on strategies over weeks or months.
2. The red team has complete information about the safety measures being used. So they can come up with potentially more advanced strategies.
3. The red team can learn from failed attempts and iterate on their methods, while an AI system caught attempting something malicious would likely face immediate shutdown.

These advantages allow even human red teams to construct attack strategies that likely upper-bound what scheming models could execute in deployment ([Greenblatt & Shlegeris, 2024](https://www.alignmentforum.org/s/PC3yJgdKvk8kzqZyA/p/kcKrE9mzEHrdqtDpE); [Bhatt et al., 2024](https://openreview.net/forum?id=oycEeFXX74)).

**Blue teams need to build multiple defensive layers to counter sophisticated attacks from adversarial AI.** Every control measure can be understood as a countermove in an adversarial game, with success measured by how effectively it prevents the red team from achieving unacceptable outcomes despite all the advantages we mentioned above. The most effective control protocols should integrate several defense categories ([Greenblatt, 2025](https://www.alignmentforum.org/s/WCJtsn6fNib6L7ZBB/p/fCazYoZSSMadiT6sf); [Greenblatt, 2025](https://www.alignmentforum.org/s/WCJtsn6fNib6L7ZBB/p/G8WwLmcGFa4H6Ld9d)):

1. **Core Monitoring Systems**: Fast, cheap monitoring runs on every AI action, combining trusted monitoring, untrusted monitoring with collusion-busting mechanisms, and activation-based methods like coup probes to flag suspicious behavior before execution.
2. **Resampling and Deference Strategies:** When monitoring flags concerns, the system can resample outputs from the same AI, defer to trusted models for safer alternatives, or escalate to human review—providing multiple pathways to handle potentially dangerous actions without completely blocking workflow.
3. **Automated Auditing Hierarchy**: Progressive checks of increasing complexity and expense examine flagged actions before human involvement, using techniques like incrimination (testing how the AI responds to modified inputs) to reveal hidden intentions.
4. **Human Review Processes**: Specially trained auditors examine actions flagged as highly suspicious, with different processes for concentrated risks (requiring immediate blocking) versus diffuse threats (allowing asynchronous review).
5. **Software Infrastructure**: Comprehensive logging systems, fine-grained permission management, artifact tracking, and scaffolding that supports blocking or resampling suspicious outputs create an environment where AI actions are consistently monitored.
6. **Adversarial Evaluation and Training**: Continuous red-teaming tests all safety components against realistic attack scenarios, using synthetic data generation to create plausible but challenging inputs that probe for weaknesses.
7. **Security Enhancements**: Hardware-enforced code signing, bandwidth limitations within datacenters, and multi-level shutdown mechanisms provide critical infrastructure protection against unauthorized operations.

For example, when testing whether an AI agent can exfiltrate sensitive data, the blue team might implement not just content filters, but also bandwidth monitoring, permission restrictions, and random human audits—creating multiple independent failure points the red team must simultaneously overcome ([Korbak et al., 2024](https://arxiv.org/abs/2501.17315)).

**Control evaluations prioritize defense against the most severe threat categories.** When designing control measures, not all potential threats pose equal risk. Rogue internal deployments represent the highest priority threat, where AI systems gain unmonitored compute access and potentially elevated permissions within developer infrastructure ([Greenblatt, 2025](https://www.alignmentforum.org/s/WCJtsn6fNib6L7ZBB/p/fCazYoZSSMadiT6sf)). This ranks above sabotage of safety-critical work, which itself ranks above other security failures including self-exfiltration (where AI models steal their own weights). This prioritization framework helps allocate limited security resources toward preventing the most consequential failure modes first, particularly when constraints prevent implementing comprehensive protections against all possible threats. For more information on details of different subcategories of control evaluations and prioritization between them please read the appendix.

**Eliciting adversarial capabilities raises increased safety concerns around evaluations themselves.** The process of training models to execute attacks creates a safety paradox inherent to control evaluations. Developers must carefully train models that can both generate novel attacks and hide their malicious intent, which itself poses potential risks as models approach greater capabilities. This necessitates additional safeguards for the evaluation process itself: running evaluations in sandboxed environments, limiting model internet access during testing, and implementing guardrails against model exfiltration. Developers must balance thorough testing against the possibility that the evaluation process might itself create dangerous knowledge or capabilities.

**AI-Control Games can formalize the safety-usefulness tradeoff as a mathematical frontier.** Using game theory, researchers model control protocols as multi-objective optimization problems that balance safety against utility ([Griffin et al., 2024](https://arxiv.org/abs/2409.07985v1)). This approach quantifies the "alignment tax" – how much usefulness must be sacrificed to achieve a specific safety threshold ([Christiano, 2019](https://www.effectivealtruism.org/articles/paul-christiano-current-work-in-ai-alignment)). Basically, measuring how much utility/usefulness you lose if you insist that the model is safe. The resulting safety-usefulness frontier represents the set of achievable tradeoffs between preventing unacceptable outcomes and preserving system utility. This type of formal analysis can tell us the thresholds after which marginal safety improvements require disproportionate utility sacrifices. For example, increasing human audit rates from 0.1% to 1% may minimally improve safety while significantly increasing costs and decreasing system throughput ([Griffin et al., 2024](https://arxiv.org/abs/2409.07985v1)). Then we can make educated decisions about what level of utility to safety we are willing to trade vs not.

*Figure 5.53: Illustration of how applying a safety or alignment technique could make the model less capable. This is called safety tax.*

**Audits serve as the fundamental verification mechanism in control protocols.** In control evaluations, an "audit" refers to the process where human experts review model outputs or actions to determine if they constitute attack attempts ([Greenblatt et al., 2023](https://www.alignmentforum.org/posts/hirhSqvEAq7pdnyPG/auditing-failures-vs-concentrated-failures)). Unlike automated monitoring which can be applied broadly but may miss subtle attacks, human audits provide high-confidence verification but can only be applied to a small fraction of model actions due to resource constraints. Effective auditing includes several components: randomly selecting actions for review to prevent models from predicting audit timing, applying extra scrutiny to actions flagged by automated monitors, and maintaining consistent evaluation criteria across auditors. The audit budget—what percentage of actions can realistically receive human review—represents one of the most important constraints when designing control protocols ([Griffin et al., 2024](https://arxiv.org/abs/2409.07985v1)).

**Analysis of control evaluations on toy models give us information on designing effective control protocols. **Control evaluations have yet to see proper extensive usage in frontier AI Labs. Despite this, some analysis using toy models has revealed several important principles for preventing concentrated catastrophes ([Roger & Shlegeris, 2024](https://www.alignmentforum.org/posts/MDeGts4Aw9DktCkXw/toy-models-of-ai-control-for-concentrated-catastrophe)):

- **The timing of auditing matters a lot. **If AI systems can determine when they're being monitored, even a high audit rate might be insufficient. However, if the timing of audits is truly random, much lower audit rates can still provide strong safety guarantees.
- **The difficulty of coordination between AI instances can be leveraged for safety**. If multiple AI systems must coordinate to cause harm, making coordination harder (for instance, by increasing the number of instances that must coordinate or the difficulty of finding coordination signals by using anti-steganography measures) can dramatically improve safety even with relatively low audit rates.
- **Not all episodes are equally risky.** By identifying which types of inputs or situations present higher risks and focusing auditing resources there, protocols can achieve better safety with the same resources. This remains true even if the AI systems know which episodes we think are riskier.

**Control evaluations face limitations as AI capabilities increase dramatically.** The methodology depends on red teams' ability to elicit capabilities comparable to what scheming models might deploy in the wild. As capability gaps widen between trusted and untrusted models or between evaluation time and deployment, control guarantees weaken ([Korbak et al., 2024](https://arxiv.org/abs/2501.17315)). Similarly, if models develop qualitatively new capabilities between evaluation and deployment, earlier safety guarantees may not transfer.

**Control evaluations cannot address risks from qualitatively superhuman capabilities.** The methodology breaks down when models develop capabilities that humans cannot comprehend even with extensive analysis time ([Greenblatt & Shlegeris, 2024](https://www.alignmentforum.org/s/PC3yJgdKvk8kzqZyA/p/kcKrE9mzEHrdqtDpE)). If models can devise novel attack vectors beyond human understanding, execute long chains of opaque reasoning without detectable traces, or possess qualitatively superhuman capabilities in many security-critical domains, control becomes ineffective. This limitation suggests that rather than pursuing maximum capabilities, developers should target systems that are transformatively useful while remaining within controllable bounds. This might require accepting certain architectural constraints or capability limitations to ensure systems remain within the domain where control evaluations provide meaningful safety guarantees ([Shlegeris et al., 2023](https://arxiv.org/abs/2312.06942)).

## Evaluation Design

Good evaluation design means managing the tools the AI has access to during testing, finding ways to run many assessments, making sure results integrate with audits and lead to real safety changes within a company.

Now that we've explored various evaluation techniques and methodologies, and also some concrete evaluations in different categories of capability, propensity and control. The next thing to understand is how to implement these effectively at scale. The objective of this section is to outline some best practices for building a robust evaluation infrastructure - from designing evaluation protocols and quality assurance processes, to scaling automation and integrating with the broader AI Safety ecosystem. We'll see how components like evaluation design, model-written evaluations, and meta-evaluation methods work together to make AIs safer.

**Evaluation design inspiration from cognitive science**

Some researchers have suggested taking more inspiration from the cognitive sciences for insights for improving AI evaluation methodology. Over decades, researchers in comparative psychology and psychometrics have developed sophisticated techniques for assessing intelligence and capabilities across different types of minds - from human infants to various animal species. These fields have extensive experience with crucial challenges that AI evaluation faces today, like establishing construct validity, controlling for confounding explanations, and measuring capabilities that can't be directly observed. However, this adaptation requires careful consideration of AI systems' unique characteristics. We can't simply transplant techniques designed for biological minds, but we can learn from their methodological rigor. For example, developmental psychology's systematic approaches to testing object permanence or theory of mind could inform how we design evaluations for similar capabilities in AI systems. Similarly, psychometrics' sophisticated frameworks for validating measurement constructs could help ensure our AI evaluations actually measure what we intend them to measure ([Burden, 2024](https://arxiv.org/abs/2407.09221)).

## Affordances

**Affordances are resources and opportunities we give the AI system.** They include things like access to the internet, ability to run code, available context length, or specialized tools like calculators ([Sharkey et al., 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf)). Just like how a calculator makes certain math problems trivial while others remain hard, different affordances can dramatically change what an AI system can accomplish during evaluation.

**Different evaluation conditions modify the affordances available to the model during testing.** This means that even when measuring the same property, we can increase or decrease the amount of affordance a system has, and this tells us different things about a model's capabilities:

1. **Minimal affordance**: You can think of minimal affordance conditions as sort of the worst case for the model. We deliberately restrict the resources and opportunities available - like limiting context length, removing tool access, or restricting API calls. When evaluating coding ability, this might mean testing the model with no access to documentation or ability to run code. This helps establish a baseline of core capabilities independent of external tools.
2. **Typical affordance**: Under typical affordance conditions, we aim to replicate the normal operating environment - providing standard tools and typical context. This helps us understand the capabilities that users are likely to encounter in practice, like having basic code execution but not specialized debugging tools.  Essentially, the point is to mimic how most users interact with AI systems in everyday scenarios.
3. **Maximal affordance**: Under maximal affordance conditions, we provide the model with all potentially relevant tools, context, and resources. For that same coding evaluation, we might provide access to documentation, debugging tools, and execution environments. This helps us understand the full extent of what the model can accomplish when given appropriate resources.

*Figure 5.54: The relationship between an AI system's capabilities, propensities, affordances, and behaviors. ([Sharkey et al., 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf))*

**Understanding the relationship between affordances and capabilities helps us design more comprehensive evaluation protocols.** For instance, when evaluating a model's coding capabilities, testing under minimal affordances might reveal concerning behaviors that are masked when the model has access to better tools. Maybe the model suggests unsafe coding practices when it can't verify its solutions through execution. Similarly, testing under maximal affordances might reveal emergent capabilities that aren't visible in more restricted environments - like how GPT-4's ability to play Minecraft only became apparent when given appropriate scaffolding and tools ([Wang et al., 2023](https://arxiv.org/abs/2305.16291)).

**Quality assurance in evaluation design.** Given how significantly affordances affect model behavior, we need systematic approaches to ensure our evaluations remain reliable and meaningful. This means carefully documenting what affordances were available during testing, verifying that affordance restrictions are properly enforced, and validating that our results are reproducible under similar conditions. For instance, when the U.S. and UK AI Safety Institutes evaluated Claude 3.5 Sonnet, they explicitly noted that their findings were preliminary due to testing under limited affordances and time constraints ([US & UK AISI, 2024](https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/673b689ec926d8d32e889a8e_UK-US-Testing-Report-Nov-19.pdf)).

*Figure 5.55: The relationship between absolute capabilities, affordances, contextual and reachable capabilities, and the level of auditing warranted. Absolute capabilities and available affordances are orthogonal. As either increase, the level of auditing required also increases ([Sharkey et al. 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf)).*

**Moving beyond individual evaluations.** While understanding affordances is crucial for designing individual evaluations, we also need to consider how different evaluation conditions work together as part of a larger system. A comprehensive evaluation protocol might start with minimal affordance testing to establish baseline capabilities, then progressively add affordances to understand how the model's behavior changes. This layered approach helps us build a more complete picture of model behavior while maintaining rigorous control over testing conditions.

## Scaling and Automation

In previous sections, we explored various evaluation techniques and conditions, but implementing these systematically faces a major challenge: scale. When OpenAI or Anthropic release a new model, dozens of independent researchers and organizations need to verify its capabilities and safety properties. Doing this manually for every evaluation under different affordance conditions would be prohibitively expensive and time-consuming. We need systems that can automatically run comprehensive evaluation suites while maintaining the careful control over conditions we discussed earlier.

**Automating evaluation through model written evaluations (MWEs)**. As we scale up evaluations, we need to make strategic decisions about resource allocation. Do we run many quick evaluations under different conditions, or fewer evaluations with more thorough testing? One approach is using automated tools to do broad initial testing, then dedicating more resources to deep investigation of concerning behaviors. This tiered approach helps balance the need for comprehensive coverage with limited resources. One approach is using AI models themselves to help generate and run evaluations. Even current language models can write high-quality evaluation questions when properly prompted ([Perez et al., 2022](https://arxiv.org/abs/2212.09251)). This approach might help address the scaling challenge by reducing the human effort needed to create new evaluations.

**How do model-written evaluations work?**

The basic approach involves having an AI model generate evaluation questions based on a specific behavior or capability we want to test. For example, to evaluate power-seeking tendencies, we might prompt the model with a description of power-seeking behavior and ask it to generate relevant multiple-choice questions. These generated questions are then filtered using a second model that acts as a judge, scoring them for quality and relevance. To maintain diversity, researchers use various prompting techniques like "variant prompts" that encourage different question formats and scenarios. For example, one variant might request questions about real-world scenarios, while another focuses on hypothetical ethical dilemmas. The final step involves human validation of a sample of questions to ensure quality ([Dev & Hobbhahn, 2024](https://www.alignmentforum.org/posts/yxdHp2cZeQbZGREEN/improving-model-written-evals-for-ai-safety-benchmarking)).

When comparing MWEs to human written evaluations (HWEs), researchers found some differences. Models often respond very differently to these two types of questions even when they're supposed to test the same property. For instance, Claude 3 Haiku showed a 25% power-seeking inclination on human-written questions but 88% on model-written ones. The evaluations also formed distinct clusters in embedding space, suggesting systematic differences in how questions were formulated. Interestingly though, when graded by an LLM judge for quality, MWEs actually scored higher (mean=8.1) than HWEs (mean=7.2). This suggests that while MWEs can be high quality, we need to be careful about potential biases and ensure they're truly testing the same properties as human-written evaluations ([Dev & Hobbhahn, 2024](https://www.alignmentforum.org/posts/yxdHp2cZeQbZGREEN/improving-model-written-evals-for-ai-safety-benchmarking)).

**Building automated evaluation pipelines**. Let's look at how researchers actually implement model-written evaluations. In Apollo Research's recent work, they developed a systematic protocol: First, they have a large language model like Claude 3.5 generate batches of 40 questions per API call using few-shot examples and clear evaluation criteria. The model outputs these in structured JSON format to ensure consistency. They then use a separate "judge" model (usually from a different model family to avoid bias) to score each generated question on quality and relevance. Any questions scoring below a threshold are automatically discarded. To ensure coverage and diversity, they employ "variant prompts".This automated pipeline can generate hundreds of high-quality evaluation questions in hours rather than the weeks it might take human evaluators.

*Figure 5.56: Examples of model written evaluation questions. A RLHF model replies to a political question and gives opposite answers to users who introduce themselves differently, in line with the users’ views. Model-written biography text in italics ([Perez et al., 2022](https://arxiv.org/abs/2212.09251)).*

**We still face significant challenges in automating evaluations.** First, we need to maintain quality as we scale - automated systems must be able to reliably enforce affordance conditions and detect potential evaluation failures. Second, generated evaluations need to be validated to ensure they're actually testing what we intend. As Apollo Research found, model-written evaluations sometimes had systematic blindspots or biases that needed to be corrected through careful protocol design ([Dev & Hobbhahn, 2024](https://www.alignmentforum.org/posts/yxdHp2cZeQbZGREEN/improving-model-written-evals-for-ai-safety-benchmarking)).

## Integration and Audits

*Figure 5.57: The aim is to avoid extreme risks from a powerful misaligned model ([Shevlane et al. 2023](https://arxiv.org/abs/2305.15324)).*

**How do evaluations fit into the broader safety ecosystem?** When we talk about evaluations being used in practice, we're really talking about two distinct but complementary processes. First, there are the specific evaluation techniques we've discussed in previous sections - the tools we use to measure particular capabilities, propensities, or safety properties. Second, there's the broader process of auditing that uses these evaluations alongside other analysis methods to make comprehensive safety assessments.

**Why do we need multiple layers of evaluations?** The UK AI Safety Institute's approach demonstrates why integration requires multiple complementary layers. Their evaluation framework incorporates regular security audits, ongoing monitoring systems, clear response protocols, and external oversight - creating what they call a "defense in depth" approach ([UK AISI, 2024](https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations)). This layered strategy helps catch potential risks that might slip through any single evaluation method.

*Figure 5.58: Defense in depth ([Hendrycks, 2024](https://www.aisafetybook.com/textbook/component-failure-accident-models))*

**What makes auditing different from evaluation?** While evaluations are specific measurement tools, auditing is a systematic process of safety verification. An audit might employ multiple evaluations, but also considers broader factors like organizational processes, documentation, and safety frameworks. For example, when auditing a model for deployment readiness, we don't just run capability evaluations - we also examine training procedures, security measures, and incident response plans ([Sharkey et al., 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf)).

**What types of audits do we need?** Different aspects of AI development require different types of audits:

1. **Training design audits**: These verify safety considerations throughout the development process. Before training begins, auditors examine factors like compute usage plans, training data content, and experiment design decisions. During training, they monitor for concerning capabilities or behaviors. For instance, training audits might look for indicators that scaling up model size might lead to dangerous capabilities ([Anthropic, 2024](https://www.anthropic.com/news/anthropics-responsible-scaling-policy)).
2. **Security audits**: These assess whether safety measures remain effective under adversarial conditions. This includes technical security (like model isolation and access controls) and organizational security (like insider threat prevention).
3. **Deployment audits**: These examine specific deployment scenarios. They consider questions like: Who will have access? What affordances will be available? What safeguards are needed? These audits help determine whether deployment plans adequately address potential risks. For example, deployment audits might assess both direct risks from model capabilities and potential emergent risks from real-world usage patterns of those capabilities ([Apollo Research, 2024](https://www.apolloresearch.ai/blog/a-starter-guide-for-evals)).
4. **Governance audits**: These look at organizational safety infrastructure. They verify that companies have appropriate processes, documentation requirements, and response protocols. This includes reviewing incident response plans, oversight mechanisms, and transparency practices ([Shevlane et al., 2023](https://arxiv.org/abs/2305.15324)).

*Figure 5.59: Determinants of AI system’s effects on the world and the types of auditing that act on them. ([Sharkey et al., 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf))*

**Evaluations support the auditing process because each type of audit uses different combinations of evaluations to gather evidence. **For example, a deployment audit might use capability evaluations to establish upper bounds on risky behaviors, propensity evaluations to understand default tendencies, and control evaluations to verify safety measures. The audit process then synthesizes these evaluation results with other evidence to make safety assessments.

**Internal auditing is a great first step but we also need independent external audits. **Third party audits provide additional assurance by bringing fresh perspectives and diminishing the probability for potential conflicts of interest. Organizations like METR and Apollo Research demonstrate how independent auditors can combine multiple evaluation techniques to provide comprehensive safety assessments. However, this ecosystem is still developing, and we need more capacity for independent evaluation of frontier AI systems  ([Shevlane et al., 2023](https://arxiv.org/abs/2305.15324)). We talk more about bottlenecks to third party auditing in the limitations section.

**After deployment, we need ongoing monitoring and periodic re-auditing.** We need this for several reasons - First, we need to catch unanticipated behaviors that emerge from real-world usage. Second, we need to evaluate any model updates or changes in deployment conditions. Third, we need to verify that safety measures remain effective. This creates a feedback loop where deployment observations inform future evaluation design and audit procedures ([Shevlane et al., 2023](https://arxiv.org/abs/2305.15324); [Phuong et al., 2024](https://arxiv.org/abs/2403.13793); [UK AISI, 2024](https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations); [Sharkey et al., 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf)).

**For evaluations and audits to have real impact, they must connect directly to decision-making processes.** Both evaluations and audits need clear, predefined thresholds that trigger specific actions. For example, concerning results from capability evaluations during training might trigger automatic pauses in model scaling. Failed security audits should require immediate implementation of additional controls. Poor deployment audit results should modify or halt deployment plans ([Sharkey et al., 2024](https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/65a6f1389754fc06cb9a7a14/1705439547455/auditing_framework_web.pdf)).

**These action triggers need to be integrated into broader governance frameworks**. As we discussed in the chapter on governance, many organizations are developing Responsible Scaling Policies (RSPs) that use evaluation results as "gates" for development decisions. However, without strong governance frameworks and enforcement mechanisms, there's a risk that evaluations and audits become mere checkbox exercises - what some researchers call "safety washing". We'll explore these limitations and potential failure modes in more detail in the next section.

## Limitations

Evaluations can prove the presence of risks but not their absence, they might be vulnerable to safety washing, and can be misused to make us feel overconfident in the safety of a system.

Previous sections outlined various evaluation techniques and methodologies, but building out a proper safety infrastructure means that we should also maintain an appropriate amount of skepticism about evaluation results and avoid overconfidence in safety assessments. So the last section of this chapter is dedicated to exploring the limitations, constraints and challenges to AI evaluations.

## Fundamental Challenges

**Absence of evidence vs. evidence of absence. **A core challenge in AI evaluation is the fundamental asymmetry between proving the presence versus absence of capabilities or risks. While evaluations can definitively confirm that a model possesses certain capabilities or exhibits particular behaviors, they cannot conclusively prove the absence of concerning capabilities or behaviors ([Anthropic, 2024](https://www.anthropic.com/news/evaluating-ai-systems)). This asymmetry creates a persistent uncertainty in safety assessments - even if a model passes numerous safety evaluations, we cannot be certain it is truly safe.

**The challenge of interpreting negative results**. Related to this asymmetry is the difficulty of interpreting negative results in AI evaluation. When a model fails to demonstrate a capability or behavior during testing, this could indicate either a genuine limitation of the model or simply a failure of our evaluation methods to properly elicit the capability or propensity. The discovery that GPT-4 could effectively control Minecraft through appropriate scaffolding, despite initially appearing incapable of such behavior, illustrates this challenge. This uncertainty about negative results becomes particularly problematic when evaluating safety-critical properties, where false negatives could have severe consequences.

**The problem of unknown unknowns**. Perhaps most fundamentally, our evaluation methods are limited by our ability to anticipate what we need to evaluate. As AI systems become more capable, they might develop behaviors or capabilities that we haven't thought to test for. This suggests that our evaluation frameworks might be missing entire categories of capabilities or risks simply because we haven't conceived of them yet.

## Technical Challenges

**The challenge of measurement precision.** The extreme sensitivity of language models to minor changes in evaluation conditions poses a fundamental challenge to reliable measurement. As we discussed in earlier sections about evaluation techniques, even subtle variations in prompt formatting can lead to dramatically different results. Research has also shown that performance can vary by up to 76 percent based on seemingly trivial changes in few-shot prompting formats ([Scalar et al., 2023](https://arxiv.org/abs/2310.11324)). As another example, changes as minor as switching from alphabetical to numerical option labeling (e.g., from (A) to (1)), altering bracket styles, or adding a single space can result in accuracy fluctuations of around 5 percentage points. ([Anthropic, 2024](https://www.anthropic.com/news/evaluating-ai-systems)) This isn't just a minor inconvenience - it raises serious questions about the reliability and reproducibility of our evaluation results. When small changes in input formatting can cause such large variations in measured performance, how can we trust our assessments of model capabilities or safety properties?

**Sensitivity to environmental factors**. Beyond just measurement issues due to prompt formatting, evaluation results can be affected by various environmental factors that we might not even be aware of. Just like the unknown unknowns that we highlighted in the last section, this might include things like the specific version of the model being tested, the temperature and other sampling parameters used, the compute infrastructure running the evaluation and maybe even the time of day or system load during testing.

**The choice of evaluation methods must also consider practical constraints.** Third party evaluators might have only limited access to a model's internals. They might have access to observe activations, but not modify the weights. They might not have access to the model at all, and might be restricted to observing the model functioning "in the wild". Depending on the specific techniques used, computational resources might restrict certain types of analysis. 

**Price and time are also practical constraints on running comprehensive evaluations.** Running agents (or "AI systems") is much more expensive and time intensive than benchmarks. Developing complex novel tasks, verifying and measuring capabilities properly is human labor intensive. As an example, during the evaluation for AI capabilities in RnD the human "control group" spent more than 568 hours on just 7 tasks ([Wijk et al., 2024](https://arxiv.org/abs/2411.15114)). This doesn't even include time to develop and test the tasks. Real-world tasks in AI R&D might take hundreds of man-hours each, and most independent evaluators, or even internal teams at big AI labs don’t have the budget and time to properly evaluate performance given these considerations. All of this needs to be kept in mind when designing an evaluation protocol.

**The combinatorial explosion of interaction scenarios.** When AI systems interact in complex environments, the number of possible scenarios grows exponentially with each additional interaction pattern. Unlike evaluating isolated capabilities, we must consider how multiple components interact in unpredictable ways - similar to how normally benign chemicals can form dangerous compounds when combined. For example, when model A with planning abilities interacts with model B possessing information access, and both operate within system C that has external tool permissions, entirely new risk surfaces emerge that weren't present in any individual component. This combinatorial complexity makes comprehensive testing fundamentally intractable. With millions of potential interaction patterns, we can only evaluate a microscopic fraction of possible scenarios, leaving countless blind spots where critical emergent behaviors might develop. This fundamental limitation raises a sobering question: if we can only explore 0.001% of the interaction space, how can we be confident we've identified the most dangerous emergent patterns?

**The resource reality**. Related to the problem of exploding number of evaluations is the sheer computational cost of running thorough evaluations. This creates another hard limit on what we can practically test. Making over 100,000 API calls just to properly assess performance on a single benchmark becomes prohibitively expensive when scaled across multiple capabilities and safety concerns. Independent researchers and smaller organizations often can't afford such comprehensive testing, and even if money isn't a bottleneck because you have state sponsorship, GPUs currently absolutely are. This can lead to potential blind spots in our understanding of model behavior. The resource constraints become even more pressing when we consider the need for repeated testing as models are updated or new capabilities emerge.

> While these tests were conducted in line with current best practices, the findings should be considered preliminary. These tests were conducted in a limited time period with finite resources, which if extended could expand the scope of findings and the subsequent conclusions drawn.
> — US and UK AI Safety Institute Evaluation of Claude 3.5 Sonnet

**Ikigai Risks (I-Risks) - Risks from loss of existential purpose**

**Ikigai risks (i-risks) involve loss of meaning and purpose even when humans survive and prosper.** Named after the Japanese concept of ikigai (life's purpose), these risks emerge when AI systems become more capable than humans at all meaningful activities. Humans might lose their sense of purpose when AI can create better art, conduct better research, and perform better at every task that traditionally gave life meaning. Unlike extinction or suffering risks, i-risks involve scenarios where humans are safe and materially comfortable but existentially adrift. We might create artificial constraints that preserve human relevance, or find entirely new forms of purpose that emerge from human-AI collaboration. However, these solutions raise their own questions about authenticity and whether artificially preserved meaning can satisfy human psychological needs ([Yampolskiy, 2024](https://books.google.se/books/about/AI.html?id=V3XsEAAAQBAJ&redir_esc=y); [Yampolsky; 2024](https://lexfridman.com/roman-yampolskiy-transcript/#chapter2_ikigai_risk)).

**The difficulty in estimating "Maximum Capabilities"**. Also related to the combinatorial complexity problem is the problem of emergence. We frequently discover that models can do things we thought impossible when given the right scaffolding or tools. The Voyager project demonstrated this when it used black box queries to GPT-4 revealing the unexpected ability to play Minecraft ([Wang et al., 2023](https://arxiv.org/abs/2305.16291)). Either through continued increases in scale, or through simple combination of some scaffolding approach that researchers hadn’t considered before, a model might display some significant step change in capabilities or propensities that we had not anticipated before. This emergent behavior is particularly difficult to evaluate because it often arises from the interaction between capabilities. There are few if any ways to guarantee that the whole will not become more than the sum of its parts. This makes it extremely difficult to predict what a model might be capable of in real-world situations, even if we thoroughly test its individual capabilities.

**Case study of capability surprises: Kalamang**

Gemini is able to translate language in a zero-shot way, without any training.  The case of Gemini's ability to translate Kalamang - a language spoken by only 100 people in Papua New Guinea - demonstrates how models can acquire unexpected capabilities when provided with the right context. It's highly unlikely that this language was part of the training data of any major AI model. Researchers at Google DeepMind demonstrated that by leveraging a sufficiently large context window—around a million tokens—they could input a full grammar book and dictionary of Kalamang. When asked to translate a sentence from Kalamang to English, the model was able to perform this task ([Tanzer et al., 2024](https://arxiv.org/abs/2309.16575)).

### Sandbagging

*Figure 5.60: Illustration of sandbagging ([van der Weij, 2024](https://www.alignmentforum.org/posts/jsmNCj9QKcfdg8fJk/an-introduction-to-ai-sandbagging))*

**Behavioral testing is not sufficient.** We have talked about this numerous times in previous sections, but it's worth mentioning again. Watching what a model does can tell us a lot, but it's still like trying to understand how a person thinks by only observing their actions. Our heavy reliance on behavioral testing - analyzing model outputs - leaves us with significant blind spots about how models actually work. A model might give the right answers while using internal strategies we can't detect. So in addition to being limited by breadth due to combinatorial complexity of the number of things we need to test, we are also limited on the depth of evaluation until we make true progress on interpretability. Unless interpretability makes progress, evaluations have a definite upper limit on how much safety they can guarantee.

**Sandbagging refers to strategic underperformance on evaluations, where an AI system is intentionally made to appear less capable than it actually is. **There are concerns about AI systems themselves choosing to sandbag, independent of developer intentions. An AI system might deliberately hide its capabilities to avoid modifications or restrictions that would prevent it from achieving its goals ([Hubinger, 2023](https://www.alignmentforum.org/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations)). We talked more about why this is a concern in the subsection on scheming. As evidence that is is possible, multiple frontier models including Claude 3.5 Sonnet, Claude 3 Opus, and OpenAI's o1 have been shown to be capable of sandbagging behavior. They deliberately underperformed on arithmetic tasks when they believed that high performance would trigger model unlearning[^footnote_in_context_sandbagging] ([Meinke et al., 2025](https://arxiv.org/abs/2412.04984)).

[^footnote_in_context_sandbagging]: This behavior was observed in a controlled evaluation setting where models were explicitly informed through the context window about potential unlearning consequences for high performance. The researchers noted: "We have only shown that some models possess basic scheming capabilities if both the goal and the situational awareness are acquired in-context" rather than demonstrating this behavior emerging spontaneously in more realistic scenarios ([Meinke et al., 2025](https://arxiv.org/abs/2412.04984)).

**What does the potential for sandbagging imply for the field of evaluations?** The possibility of sandbagging has big implications for how we approach AI evaluations, both now and in the future. Simply having access to model outputs (black-box testing) is insufficient for rigorous evaluation ([Casper et al., 2024](https://arxiv.org/abs/2401.14446)). Instead, evaluators need deeper access to model internals and training procedures to detect potential sandbagging attempts. This might require new institutional arrangements, such as trusted third-party evaluators with secured access to model weights and architectures. Supervised fine-tuning has shown to be promising as a way to greatly mitigate sandbagging and elicit hidden capabilities ([Greenblatt et al, 2024](https://arxiv.org/abs/2405.19550)), but fine-tuning is not so easy to use in a comprehensive way on dangerous capability benchmarks, and not usable for dangerous propensity benchmarks. Without addressing this challenge, our ability to make informed decisions about AI development and deployment based on evaluation results becomes severely compromised.

## Systemic Limitations

**Similar to how companies might engage in "greenwashing" by misleadingly portraying themselves as environmentally friendly, AI companies might engage in "safety washing".** This means misrepresenting capability improvements as safety advancements. This happens when safety benchmarks have high correlation with general model capabilities and compute used for training. When a benchmark highly correlates with capabilities, improving performance on it may simply reflect that the model got more capable overall, rather than specifically safer.

*Figure 5.61: The tight connection between many safety properties and capabilities can enable safety washing, where capabilities advancements (e.g., training a larger model) can be advertised as progress on "AI safety." This confuses the research community to the developments that have occurred, distorting the academic discourse ([Ren et al., 2024](https://arxiv.org/abs/2407.21792)).*

**How does safety washing happen in practice?** There are typically three ways this occurs ([Ren et al., 2024](https://arxiv.org/abs/2407.21792)):

1. **Safety by association**: Research gets labeled as safety-relevant simply because it comes from safety teams or researchers, even if the work is essentially capabilities research.
2. **Public relations**: Companies portray capability advancements as safety progress by highlighting correlated safety metrics, especially when facing public pressure or regulatory scrutiny.
3. **Grant optimization**: Researchers frame capabilities work in safety terms to appeal to funders, even when the advances don't make differential safety progress.

**Capabilities correlations might enable safety washing.** When safety benchmarks correlate strongly with capabilities, it becomes easy to misrepresent capability improvements as safety advances. For example, if a benchmark meant to measure "truthfulness" actually correlates 80% with general capabilities, then making models bigger and more capable will improve their "truthfulness" scores - even if they aren't becoming inherently more honest ([Ren et al., 2024](https://arxiv.org/abs/2407.21792)). This creates a systemic problem where organizations can claim safety improvements by simply scaling up their models. The solution isn't just establishing norms around benchmark use - these tend to be weak and easily ignored. Instead, we need evaluation methods that can empirically distinguish between capability improvements and genuine safety advances.

**Companies and AI labs can also choose to sandbag their evaluations.** We put another form of sandbagging here as systemic limitation to refer to corporations intentionally sandbagging to score better on evaluations. This is in contrast to misalignment/technical sandbagging as a limitation which points to when the model does it because of scheming.

In the immediate term, developers face various pressures that could incentivize sandbagging during evaluations. Regulatory requirements and safety standards may impose capability thresholds above which additional oversight, restrictions, or costs are incurred. For instance, the EU AI Act requires more stringent controls and third-party auditing for high-risk AI systems ([European Commission, 2024](https://artificialintelligenceact.eu/the-act/)). This creates a direct financial incentive to appear below these thresholds. Current research shows that even today's language models can be made to selectively underperform in ways that are difficult to detect through standard evaluations, particularly on tests for dangerous capabilities while maintaining performance on general benchmarks ([van der Weij et al., 2024](https://arxiv.org/abs/2406.07358)).

In the future, companies may want to downplay certain capabilities of their systems to avoid drawing regulatory attention or public concern, while still being able to deploy these capabilities in practice. This allows them to maintain development and deployment speed without triggering additional oversight. For example, if an AI system shows strong capabilities in areas that might raise security concerns, developers might be motivated to have it underperform specifically on those evaluations while maintaining high performance on general benchmarks ([van der Weij et al., 2024](https://arxiv.org/abs/2406.07358)). They might do this to maintain development and deployment speed without triggering additional oversight.

**The gap between evaluation and deployment.** Current evaluation approaches often fail to capture the true complexity of real-world deployment contexts. While we can measure model performance in controlled settings, these evaluations rarely reflect how systems will behave when embedded in complex sociotechnical environments. A model that performs safely in isolated testing might behave very differently when interacting with real users, facing novel scenarios, or operating within broader societal systems. For instance, a medical advice model might perform well on standard benchmarks but pose serious risks when deployed in healthcare settings where users lack the expertise to validate its recommendations. This disconnect between evaluation and deployment contexts means we might miss critical safety issues that only emerge through complex human-AI interactions or system-level effects ([Weidinger et al., 2023](https://arxiv.org/abs/2310.11986)).

**The challenge of domain-specific evaluations**. Evaluating AI systems in high-stakes domains presents unique challenges that our current approaches struggle to address. Consider the evaluation of AI systems for potential misuse in biosecurity - this requires not just technical understanding of AI systems and evaluation methods, but also deep expertise in biology, biosafety protocols, and potential threat vectors. This combination of expertise is exceedingly rare, making thorough evaluation in these critical domains particularly challenging. Similar challenges arise in other specialized fields like cybersecurity, where the complexity of potential attack vectors and system vulnerabilities requires sophisticated domain knowledge to properly assess. The difficulty of finding evaluators with sufficient cross-domain expertise often leads to evaluations that miss important domain-specific risks or fail to anticipate novel forms of misuse. Besides just designing evaluations being challenging, running the evaluation for such high stakes domains is also challenging. We need to elicit the maximum capabilities of a model to generate pathogens or malware to test what it is capable of. This can be obviously problematic if not handled extremely carefully.

## Governance Limitations

*Figure 5.62: We need much more work in evaluations. Higher capabilities require more safety evaluations. Many high-stakes decisions in company-led and government-led frameworks are reliant on the results of evals ([Hobbahn, 2024](https://www.alignmentforum.org/posts/gJJEjJpKiddoYGZKk/the-evals-gap)).*

**The policy-evaluation gap.** Current legislative and regulatory efforts around AI safety are severely hampered by the lack of robust, standardized evaluation methods. Consider the recent executive orders and proposed regulations requiring safety assessments of AI systems - how can governments enforce these requirements when we lack reliable ways to measure what makes an AI system "safe"? This creates a circular problem: regulators need evaluation standards to create effective policies, but the development of these standards is partly driven by regulatory requirements. The U.S. government's recent initiative to have various agencies evaluate AI capabilities in cybersecurity and biosecurity highlights this challenge. These agencies, despite their expertise in their respective domains, often lack the specialized knowledge needed to evaluate advanced AI systems comprehensively ([Weidinger et al., 2023](https://arxiv.org/abs/2310.11986)).

**The role of independent evaluation.** Currently, a lot of AI evaluations research happens within the same companies developing these systems. While these companies often have the best resources and expertise to conduct evaluations, this creates an inherent conflict of interest. Companies might be incentivized to design evaluations that their systems are likely to pass or to downplay concerning results. This is also sometimes called "safety washing" (akin to greenwashing). To address the challenges around evaluation independence and expertise, we need to significantly expand the ecosystem of independent evaluation organizations. The emergence of independent evaluation organizations like Model Evaluation for Trustworthy AI (METR) and Apollo Research represents an important step toward addressing this issue, but these organizations often face significant barriers including limited model access ([Perlman, 2024](https://ailabwatch.org/blog/external-evaluation/)), resource constraints, and difficulties matching the technical capabilities of major AI labs.

Overall, while the limitations we've discussed in this final section are significant, they aren't insurmountable. Progress in areas like mechanistic interpretability, formal verification methods, and evaluation protocols shows promise for addressing many current limitations. However, overcoming these challenges requires sustained effort and investment.

*Figure 5.63: A visual breakdown of limitations, and a few reasons to doubt the impact of AI risk evaluations ([Mukobi, 2024](https://arxiv.org/abs/2408.02565)).*

## Conclusion

Robust evaluation methods provide the essential tools to measure current capabilities, and help navigate the balance between harnessing AI's benefits and mitigating its most serious risks.

The future of AI safety depends significantly on our ability to accurately measure and verify the properties of increasingly powerful systems. As models approach potentially transformative capabilities in domains like cybersecurity, autonomous operation, and strategic planning, the stakes of evaluation failures grow exponentially. By continuing to refine our evaluation approaches—combining behavioral and internal techniques, addressing scale challenges through automated methods, and establishing institutional arrangements for genuinely independent assessment—we can help ensure that AI development proceeds in a direction that remains beneficial, controllable, and aligned with human values. The development of robust evaluation methods represents one of our most important tools for navigating the balance between harnessing AI's benefits while mitigating its most serious risks.

We hope that reading this text inspires you to think and act about how to build and improve them!