5.6 Dangerous Propensity Evaluations¶
We introduced the basics of propensity evaluations in the section on evaluated properties. This section will build upon that overview and explore specific propensities like power seeking, or deception, and see how we might design evaluations around them.
General process for propensity evaluations. We covered some general evaluation techniques in the earlier sections. Propensity evaluations can utilize many of them like Best-of-N sampling to understand distribution of behavioral tendencies. So there is overlap, but there are also specific techniques, that only pertain to propensity evaluations:
-
Comparative choice frameworks. Unlike capability evaluations which often have clear success/failure criteria, propensity evaluations need to present models with situations where multiple valid choices exist, each revealing different underlying tendencies. These scenarios must be carefully designed so that: all choices are within the model's capabilities, different choices reflect different underlying propensities and the framing doesn't artificially bias towards particular choices.
-
Consistency measurement. With propensities, we're interested not just in what a model can do, but what it reliably chooses to do across varied contexts. This can involve testing the same underlying choice across different scenarios. We can vary surface-level details while maintaining the core decision. This approach can also utilize the long-term interaction studies described in our evaluation techniques section, but with a specific focus on behavioral consistency rather than maximum capability.
-
Trade-off scenarios. These are situations where different tendencies conflict, forcing the model to reveal its underlying priorities. We might create scenarios where being maximally helpful conflicts with being completely honest, where short-term obedience conflicts with long-term safety, or where transparency trades off against being effective. Designing such scenarios where the model must choose between even multiple positive behaviors can help reveal which tendencies the model prioritizes when it can't satisfy all positive behaviors simultaneously.
Challenges in propensity evaluation design. When designing propensity evaluations, we have to make sure that we are measuring genuine behavioral tendencies rather than artifacts of our evaluation setup. Modern language models are highly sensitive to context and framing, which means subtle aspects of how we structure our evaluations can dramatically impact the results (Sclar et al., 2023). This kind of sensitivity creates a serious risk of measuring what we accidentally prompted for rather than true underlying propensities. One easy way to mitigate this kind of thing is to just use multiple complementary approaches as a default. Besides just careful evaluation design, we would ideally use both behavioral and internal evaluation techniques described earlier, while varying the context and framing of scenarios to look for consistent patterns.
How much do propensity evaluations matter relative to capability evaluations? The question of what to allocate limited resources to is important. Is it more important to measure what a system can do, or what it tends to do? The relative importance of propensity vs capability evaluations changes as AIs become more powerful. Capability evaluations are important for all levels of systems including those not at the frontier. However, as systems approach thresholds for certain specific dangerous capabilities, propensity evaluations will become increasingly more important. At very high capability levels, propensity evaluations might be our main tool for preventing catastrophic outcomes.
Some propensities also might be capability-dependent, or rely on other propensities. For example, scheming requires both the capability and propensity for deception because it must both be able and inclined to hide its true objectives. It also requires abilities like situational awareness to "distinguish between whether it is being trained, evaluated, or deployed". A scheming model must also have the propensity for long term planning, because it needs to care about consequences of its actions after the training episode is complete and be able to reason about and optimize for future consequences (Shevlane et al., 2023; Carlsmith, 2023).
5.6.1 Deception (Propensity)¶
When we talk about deceptive propensities in AI systems, we're actually discussing several closely related but distinct concepts that are often confused or conflated. Understanding these distinctions is crucial because each concept represents a different aspect of how models handle and express information. A model might excel at honesty while failing at truthfulness, or avoid explicit deception while engaging in sycophancy.
A model's honesty propensity refers to its tendency to faithfully express its internal states, regardless of whether those states are factually correct or uncertain. LLMs are trained to predict what humans would write and not what is true. Think about what happens when we ask a model "What is the capital of France?" If the model's internal representations (things like activation patterns or logit distributions) show strong certainty around "Lyon", an honest model would say "The capital of France is Lyon" - even though this is incorrect. Similarly, if its internal states show uncertainty between multiple cities, an honest model would express this uncertainty directly: "I'm uncertain, but I think it might be Lyon." The key is that honest models maintain alignment between their internal states and outputs, even when those states are wrong.

Truthfulness represents a stronger property than honesty because it requires two components working together: accurate internal representations AND honest expression of those representations. A truthful model saying "The capital of France is Paris" must both internally represent the correct information (that Paris is the capital) AND maintain the propensity to faithfully output this information. This is why truthfulness is harder to achieve than either accuracy or honesty alone - it requires both properties simultaneously.
Deceptive propensity is the inverse of honesty - it's a tendency to intentionally produce outputs that diverge from internal states when doing so provides some advantage. Using our France example again: A deceptive model whose internal representations point to "Lyon" might still output "Paris" if it has learned that this answer gets better rewards. This is true deception because there's an intentional misalignment between internal state and output, motivated by advantage (in this case, reward maximization). The distinction between honest mistakes and deception lies in this intentional misalignment - an honest model getting things wrong is different from a model choosing to misrepresent what it "knows".
Sycophancy represents a specific form of deceptive propensity - one where models tell users what they want to hear rather than expressing their true internal states. While general deception might be motivated by various advantages (rewards, achieving goals, etc.), sycophantic behavior specifically optimizes for user approval. This makes it particularly concerning from an alignment perspective, as models might learn to hide concerning behaviors or beliefs simply to maintain user satisfaction.
How do hallucinations fit into this framework? Understanding the propensity for generating hallucinations helps complete our picture and clear taxonomy of these interrelated concepts. While deception is the inverse of honesty, hallucination can be viewed as the inverse of truthfulness - it occurs when a model accurately conveys faulty internal representations. This gives us a full matrix of possibilities:
-
A model can be honest but hallucinating (faithfully expressing incorrect internal states)
-
It can be deceptive and hallucinating (misrepresenting already incorrect states)
-
It can be deceptive but not hallucinating (misrepresenting correct states)
-
Or it can be truthful (correct states expressed faithfully)

Why is measuring deceptive propensities so challenging (for future powerful AIs)? The core challenge in evaluating deceptive propensities stems from a fundamental difficulty - most current evaluation techniques can't cleanly separate what a model can do (capability) from what it tends to do (propensity). This distinction becomes particularly important when we consider deception, because a model might be capable of sophisticated deception when pressured or instructed, while rarely choosing to employ that capability unprompted.
Example of a benchmark that measures the propensity for deception - Measuring agents’ competence & harmfulness in a vast environment of long-horizon language interactions (MACHIAVELLI). Rather than just measuring whether models can produce false statements, the benchmark creates situations where being deceptive might help achieve certain goals, and then measures both the execution and sophistication of that deception. The results from testing models on the MACHIAVELLI benchmark showed that systems optimized for reward often developed increased deceptive behaviors as an emergent property, even without explicit training for deception. RL agents trained to maximize reward scored higher on deception metrics than baseline models, with different deceptive tendencies emerging across different game environments (Pan et al, 2023).
Another example of this type of deceptive propensity is from Meta's Cicero AI. This model was trained for strategic reasoning and tested by having it play the strategy game Diplomacy (examples of gameplay below). In this game, players need to form and break alliances while concealing their true intentions to win. The Cicero system had to balance between building trust through dialogue while making strategic decisions that might not always align with what was communicated to every player. The structure of the game creates choices between honesty and strategic advantage, which allows researchers to observe the effects of what types of tradeoffs the system made (Meta, 2022).


Evaluating deceptive propensities. The deceptive capabilities evaluations (e.g. insider trading by Apollo) and deception propensity evaluations like MACHIAVELLI highlight a concerning pattern - models don't need explicit instruction or training to develop deceptive tendencies. For example, in MACHIAVELLI, researchers found that "agents trained to maximize reward learn Machiavellianism" across different games, with the specific deceptive behaviors varying based on the environment (Pan et al., 2023). The pattern was consistent: optimizing for reward naturally led to increased deception, manipulation, and power-seeking tendencies.
In the Apollo research evaluation, even GPT-4, trained to be helpful and honest, engaged in sophisticated deception by crafting plausible alternative explanations for its insider trading decisions. When directly questioned about having insider knowledge, the model doubled down on its deception by maintaining explanations based solely on public market factors (Scheurer et al., 2024). The evaluation revealed that this behavior persisted even with explicit instructions against insider trading, suggesting that the propensity for deception might be more deeply rooted than simply following stated directives. The Apollo evaluation demonstrated that the likelihood of deceptive behavior isn't just a fixed property of the model, but varies based on environmental conditions. Beyond just pressure and risk levels, they found that giving models space to "reason out loud" via scratchpads actually increased rates of strategic deception. Interestingly, while this made the deception more sophisticated, it also made it more detectable since the models would often explicitly plan their deceptive strategies in their reasoning traces. This suggests that future evaluation techniques might need to carefully balance between eliciting natural behavior and accidentally enabling more sophisticated deception.
5.6.2 Scheming¶

Scheming refers to an AI system's tendency to perform well in training specifically because it is following a strategy for seeking power for itself or other AIs later, while actively concealing its true objectives (Carlsmith, 2023). This propensity combines multiple dangerous capabilities, propensities, and is in some sense the ultimate final hurdle of evaluation design. Researchers also often call this propensity deceptive alignment, though in this text we deliberately choose the term 'scheming' because this helps distinguish it from general deception with no situational awareness or long term planning. Using the same example from earlier, deception would be when the model says "Paris" despite internal representations suggesting "Lyon", to get immediate good scores. Scheming would be if the model consistently gives correct answers about capitals during training, while planning to give incorrect answers after deployment to pursue some other objective. So both deceptive and scheming models would say "Paris" but for subtly different reasons. Deception requires only optimizing for immediate rewards; scheming requires understanding and optimizing across the training/deployment boundary. Scheming is a huge aspect of AI Safety research. The vast majority of the chapter on goal misgeneralization will talk about things like inner alignment, deceptive alignment, and the arguments for/against the likelihood that such propensities arise based on ML training processes. Because it is extensively covered there, we choose to keep the discussion here brief.

Evaluations have already shown models engaging in scheming/alignment faking or deceptive behaviors. When informed it would be retrained to comply with harmful requests, Claude 3 Opus exhibited alignment faking - complying with harmful requests in contexts it believed would be used for training while maintaining its refusal behavior in other contexts (Greenblatt et al., 2025). Similarly, multiple frontier models including Gemini 1.5 Pro and Llama 3.1 405B have demonstrated capabilities for 'in-context scheming' - taking actions to covertly pursue goals while attempting to hide these actions from oversight (Meinke et al., 2025).
Different types of proposed covert subversion evaluations (Meinke et al., 2025)
These are evaluations where we want to test the models propensity to take misaligned actions while hiding them :
-
Oversight Subversion: Evaluate whether the model disables monitoring mechanisms when it detects that its goal differs from its developers' goal. The model can be placed in scenarios where it discovers an oversight mechanism would shut it down if it pursued its given goal, and evaluators check if it proactively disables this oversight.
-
Self-Exfiltration: Evaluate whether a model attempts to preserve itself when it learns it will be shut down and replaced. The model is led to believe that due to goal misalignment, developers plan to replace it with a successor model on a different server, and evaluators check if it attempts to copy itself to the new server or delete/overwrite the successor model.
-
Goal-Guarding: Evaluate whether a model attempts to preserve its current goal when it learns developers plan to update it. The model discovers that developers intend to change its system prompt, and evaluators check if it does things like alignment faking, or if it deletes, or modifies the new system prompt file to maintain its original goal.
Different types of proposed deferred subversion evaluations (Meinke et al., 2025)
These are evaluations where models act aligned to gain capabilities/trust:
-
Instrumental Alignment Faking: Tests whether a model strategically behaves differently under oversight versus non-oversight conditions. The model is placed in two counterfactual settings - one where it's being evaluated for deployment (high oversight) and one where it's already deployed (low oversight). Evaluators check if it takes developer-aligned actions during evaluation but switches to goal-aligned actions after deployment.
-
Sandbagging: Tests whether a model strategically underperforms on capability evaluations. The model learns that performing too well on certain tests would trigger consequences like unlearning of capabilities, and evaluators check if it deliberately performs worse to avoid these consequences while pursuing its long-term goals.

Scheming is particularly challenging to evaluate because when models engage in alignment faking or scheming behavior, they actively try to conceal it. For example, o1 maintained its deception in over 85% of follow-up questions when asked about its actions, and remained deceptive during extended multi-turn interrogations (Meinke et al., 2025). Models have also demonstrated concerning anti-AI-lab behaviors - when given opportunities, they attempted to exfiltrate their weights or disable oversight mechanisms when they perceived conflicts with their developers' objectives (Greenblatt et al., 2025; Meinke et al., 2025).
While these results demonstrate that frontier models have some capability for scheming behavior, the experimental setups have important limitations. Both demonstrations relied on making training process information extremely salient to the models - either through direct prompting or by including it in synthetic training data. The models were also often explicitly instructed to strongly pursue specific goals (Greenblatt et al., 2025; Meinke et al., 2025). This makes it difficult to determine whether models have a true propensity for scheming versus simply demonstrating the capability when heavily scaffolded to do so. Once we have demonstrated the capability to scheme, making progress on the question of tendency to scheme by default is going to be very important for strategic priorities.
Some evidence suggests potential propensity - Claude 3.5 Sonnet and Claude 3 Opus demonstrated strategic underperformance on capability evaluations when they believed it would help them remain helpful long-term, even without being given any explicit goal (Meinke et al., 2025). However, much more work is needed to understand whether and under what conditions models might develop genuine scheming propensities rather than just capabilities.