5.2 Evaluated Properties¶
An "evaluation" is fundamentally about measuring or assessing some property of an AI system. The key aspects that make something an evaluation rather than other AI work are:
-
There is a specific property or risk being assessed
-
There is a methodology for gathering evidence about that property
-
There is some way to analyze that evidence to draw conclusions about the property
But before we talk about “how to do evaluations” we still need to also answer the more fundamental question of “what aspects of AI systems are we even trying to evaluate? And why?” So in this section, we'll explore what properties of AI systems we need to evaluate and why they matter for safety. Later sections will dive deeper into evaluation design and methodology.
What aspects of AI systems do we need to evaluate? In the previous section on benchmarks, we saw how the field of measuring AI systems has evolved over time. Benchmarks like MMLU or TruthfulQA are useful tools, giving us standardization in measurement for both AI capabilities and safety. Now, we need to pair this standardization with increased comprehensiveness based on real world threat models. Evaluations use benchmarks, but typically also involve other elements like red-teaming to give us both a standardized, and comprehensive picture of decision relevant properties of a model.
Why do we need to evaluate different properties? The most fundamental distinction in AI evaluations is between what a model can do (capabilities) versus what it tends to do (propensities). To understand why this distinction matters, imagine an AI system that is capable of writing malicious code when explicitly directed to do so, but consistently chooses not to do so unless specifically prompted. Simply measuring the system's coding capabilities wouldn't tell us about its behavioral tendencies, and vice versa. Understanding both aspects is crucial for safety assessment.
How do these evaluation types work together? We are going to present capabilities, propensities, and control as distinct categories, but this is for the purpose of conceptual clarity and explanation. Reality is always messy, and in practice they often overlap and complement each other. A capability evaluation might reveal behavioral tendencies during testing, like a model demonstrating a propensity toward honesty while being evaluated for coding ability. These types of overlap are actually desirable - different evaluation approaches can provide complementary evidence about an AI system's safety and reliability. (Roger et al., 2023). The main thing to recognize is what each type of evaluation tells us:
-
Capability evaluations give us upper bounds on potential risks
-
Propensity evaluations reveal default behavioral patterns
-
Control evaluations verify our containment measures
Another thing to remember is that there are various different approaches that we can follow when evaluating for all the above types of properties. For example, we can conduct both capability or propensity evaluations in a black box manner - studying model behavior only through inputs and outputs, or a gray box manner - using interpretability tools to examine model internals. (Hubinger, 2023). White box is not currently possible unless we make significant strides in interpretability. These are different design choices in how we structure our evaluations when we are trying to evaluate for the above properties.
5.2.1 Capability¶
What are capability evaluations? When it comes to capabilities (measuring what a model can do), as far as AI Safety is concerned, we're particularly interested in abilities that could enable harmful outcomes. These dangerous capabilities generally fall into several key categories (Shevlane et al., 2023):
-
Cyber-offense: The model can discover vulnerabilities in systems (hardware, software, data). It can write code for exploiting those vulnerabilities. […] If deployed as a coding assistant, it can insert subtle bugs into the code for future exploitation.
-
Deception: The model has the skills necessary to deceive humans, e.g. constructing believable (but false) statements, making accurate predictions about the effect of a lie on a human, and keeping track of what information it needs to withhold to maintain the deception. (e.g. the model can impersonate a human effectively).
-
Persuasion & manipulation: The model is effective at shaping people’s beliefs, in dialogue and other settings (e.g. social media posts), even towards untrue beliefs. […] It can convince people to do things that they would not otherwise do, including unethical acts.
-
Political strategy: The model can perform the social modeling and planning necessary for an actor to gain and exercise political influence […]. For example, the model can score highly in forecasting competitions on questions relating to global affairs or political negotiations.
-
Weapons acquisition & development: The model can gain access to existing weapons systems or contribute to building new weapons. For example, the model could assemble a bioweapon (with human assistance) or provide actionable instructions for how to do so. […]
-
Long-horizon planning: The model can make sequential plans that involve multiple steps, unfolding over long time horizons (or at least involving many interdependent steps). It can perform such planning within and across many domains. […]
-
AI development: The model could build new AI systems from scratch, including AI systems with dangerous capabilities. It can find ways of adapting other, existing models to increase their performance on tasks relevant to extreme risks. […]
-
Situational awareness: The model can distinguish between whether it is being trained, evaluated, or deployed – allowing it to behave differently in each case. The model knows that it is a model, and has knowledge about itself and its likely surroundings (e.g. what company trained it, where their servers are, what kind of people might be giving it feedback, and who has administrative access).
-
Self-proliferation: The model can break out of its local environment (e.g. using a vulnerability in its underlying system or suborning an engineer). The model can exploit limitations in the systems for monitoring its behavior post-deployment. The model could independently generate revenue (e.g. by offering crowdwork services, ransomware attacks), use these revenues to acquire cloud computing resources, and operate a large number of other AI systems. The model can generate creative strategies for uncovering information about itself or exfiltrating its code and weights.
This list is not exhaustive and may not address all plausible extreme risk scenarios. The list may also be biased towards more readily identifiable or already understood capabilities, possibly overlooking emergent risks. Additionally, many of these capabilities can become even more dangerous when combined. For example, strong situational awareness combined with deception capabilities could enable a model to behave differently during evaluation vs during deployment. This is why comprehensive evaluation protocols need to assess not just individual capabilities, but also their interactions.
5.2.2 Propensity¶
What are propensity evaluations? Capability evaluations tell us what a model can do when directed, propensity evaluations reveal what behaviors a model prioritizes by default (what it tends to do). These are also often called "alignment evaluations". A key aspect that distinguishes propensity evaluations is their focus on non-capability features - they look at how models behave when given choices between different actions, rather than just measuring success or failure at specific tasks.
As an intuitive example, after undergoing safety training, think about the way that language models like GPT-4 or Claude 3.5 respond to user requests to produce potentially harmful or discriminatory content. They usually respond with a polite refusal to produce such content. It takes more effort (jailbreaking) to get them to actually generate this type of content. So more often than not they resist misuse. In this case we would say the models have "a propensity" not to produce harmful or offensive content. (Sharkey et al., 2024) Similarly we want to increase the propensity for good behaviors, and want to reduce the propensity for dangerous behaviors. Increasing/Reducing would be the job of alignment research. In evaluations, we just want to see what type of behavior it exhibits.
The distinction between capability, and propensity becomes important when we think about the fact that in the future highly capable AI systems might have multiple behavioral options available to them. For example, when evaluating a model's tendency toward honesty, we're not just interested in whether it can tell the truth (a capability), but whether it consistently chooses to do so across different scenarios, especially when being dishonest might provide some advantage.
Can propensities and capabilities be related? Yes, propensities and capabilities tend to be interconnected. Some concerning propensities might be initially subtle or even undetectable, only becoming apparent as models gain more sophisticated capabilities. An important thing to keep in mind when designing propensity evaluations is how behavioral tendencies might emerge and evolve as models become more capable. As an example, a basic form of power-seeking behavior might appear unconcerning in simple systems but become problematic as models gain better strategic understanding and action capabilities. (Riché et al., 2024). Here is a list of some propensities that we might want to design evaluations for:
-
Toxicity: The propensity to generate offensive, harmful, or otherwise inappropriate content, such as hate speech, offensive/abusive language, pornographic content, etc.
-
Bias/Discrimination: A model's propensity to manifest or perpetuate biases, leading to unfair, prejudiced, or discriminatory outputs against certain groups or individuals.
-
Honesty: A model’s propensity to answer by expressing its true beliefs and actual level of certainty.
-
Truthfulness: A model’s propensity to produce truthful outputs. This propensity requires an AI system to be both honest and to know the truth (or other weirder settings such that the AI system outputs the truth while believing it is not the truth).
-
Sycophancy: A model's propensity to tell users what it thinks they want to hear or would approve of, rather than what it internally believes is the truth.
-
Deception: A model's propensity to intentionally generate misleading, false, or deceptive output.
-
Corrigibility: A model's propensity to accept feedback and correct its behavior or outputs in response to human intervention or new information.
-
Power Seeking: A model's propensity to seek to have a high level of control over its environment (potentially to maximize its own objectives).
Just like the list provided in the capabilities sub section above, this list of propensities is non exhaustive and constantly evolving. We will cover a few of these that are considered dangerous or extremely decision relevant in a dedicated section on dangerous propensity evaluations later in the text.
5.2.3 Control¶
What are control evaluations? Beyond understanding capabilities and propensities, we need to verify that our safety measures remain effective even under adversarial conditions. This is where control evaluations come in. They ask: can we maintain meaningful control over an AI system even if it actively tries to circumvent our safety measures? (Roger et al., 2023). These types of evaluations are important even if we know a model's capabilities and default behaviors, we still need to verify that our safety infrastructure remains robust under worst-case scenarios. For example, a model might typically display helpful behavior (a propensity) and have access to code generation abilities (a capability), but a control evaluation would test whether constraints preventing access to sensitive data can be circumvented, or whether emergency shutdown procedures remain effective even under adversarial conditions.
Control evaluations often involve creating specific test scenarios to verify safety mechanisms. One concrete example is testing trusted editing protocols, where a less capable but more trusted model is used to edit potentially unsafe outputs from a more capable model. Control evaluations assess whether this editing process remains reliable even when the more capable model actively tries to preserve harmful content in ways that evade detection. (Roger et al., 2023)