5.4 Evaluation Frameworks¶
Evaluation Techniques vs Evaluation Frameworks. When evaluating AI systems, individual techniques are like tools in a toolbox - useful for specific tasks but most powerful when combined systematically. This is where evaluation frameworks come in. While techniques are specific methods of studying AI systems (like chain-of-thought prompting or internal activation pattern analysis), frameworks provide structured approaches for combining these techniques to answer broader questions about AI systems. As an example, we might use behavioral techniques like red teaming to probe for deceptive outputs, internal techniques like circuit analysis to understand how deception is implemented, and combine these within a model organism's framework specifically designed to create a sample of AI deception. Each layer - technique, analysis type, and framework - serves a different role in building understanding and safety.
Types of Evaluation Frameworks. Evaluation frameworks can be broadly categorized into technical frameworks and governance frameworks:
-
Technical frameworks: These are things like the Model Organisms Framework, or several evaluation suites that provide specific methodologies or objectives for conducting evaluations. They might detail which techniques to use, how to combine them, and what specific outcomes to measure. For example, they might specify how to create controlled examples of deceptive behavior or how to measure situational awareness.
-
Governance frameworks: These are things like Anthropics Responsible Scaling Policies Framework (RSPs), OpenAIs Preparedness Framework, and DeepMinds Frontier Safety Framework (FSF). These frameworks instead focus on when to conduct evaluations, what their results should trigger, and how they fit into broader organizational decision-making. They establish protocols for how evaluation results translate into concrete actions - whether to continue development, or implement additional safety measures.
Technical frameworks help us understand how to measure AI capabilities and behaviors, governance frameworks help us determine what to do with those measurements.
5.4.1 Model Organisms Framework¶
What are model organisms in AI safety? This framework involves deliberately creating and studying misaligned AI systems with specific dangerous properties. It's a controlled approach to studying potentially dangerous capabilities and behaviors. (Hubinger et al., 2023).
Why create potentially dangerous AI systems on purpose? There are two compelling reasons for this approach. First, it allows us to study concerning properties in controlled settings where we know exactly what we're looking for. Second, it provides concrete examples that can help inform the broader AI community about safety challenges. Having real, demonstrated examples of potential risks is crucial for building scientific consensus and informing policy decisions.
How does the model organisms approach break down complex threat models? Instead of trying to tackle things like "deceptive alignment" as one monolithic problem, researchers identify key subcomponents like situational awareness or goal persistence. They then create simplified systems exhibiting these properties, studying how they manifest and testing different evaluation techniques. For example, Anthropic's work on "sleeper agents" demonstrated how certain forms of deceptive behavior could persist through safety training. They created models that would write secure code when asked about the year 2023, but insert vulnerabilities when asked about 2024 (Hubinger et al., 2024). We discuss this particular example of deception more in the goal misgeneralization chapter.
What are the framework's limitations? The model organisms approach faces an important tradeoff: models need to be realistic enough to provide useful insights but controlled enough to study safely. They should be sophisticated enough to exhibit the properties we're concerned about but not so powerful that they pose actual risks. Additionally, since these models are explicitly constructed to exhibit certain behaviors, they may not perfectly represent how such behaviors would emerge naturally.
5.4.2 Governance Frameworks¶
5.4.2.1 RSP Framework¶
Why do we need scaling policies? One domain in which evaluations are central is in trying to determine when we should continue development versus when we should invest more into safety measures. As AI systems become more capable, we need systematic ways to ensure safety keeps pace with capability growth. Without structured policies, competitive pressures or development momentum might push companies to scale faster than their safety measures can handle. We saw in the capabilities chapter arguments for the "scaling hypotheses" - that ML systems will continue to improve along dimensions of performance and generality with increases in compute, data or parameters (Branwen, 2020). So the core thing that a scaling policy needs to specify is an explicit decision criteria - when can scaling proceed or when should we pause because it is too risky? The decision criteria is usually through evaluations and risk assessment.
What is evaluation gated scaling? The way we figure out if someone should be allowed to continue to scale their models is through evaluation gated scaling. This means that progress in AI development is controlled by specific evaluation results ("gates"/thresholds) (Anthropic, 2024). Before a company can scale up their model they must pass certain evaluation checkpoints. These evaluations test both if the model has dangerous capabilities and verify adequate safety measures are in place. This creates clear decision points where evaluation results.
Example of evaluation gates: AI Safety Levels (ASL). One concrete example of evaluation gated scaling are Anthropic's responsible scaling policies (RSPs) that use the concept of safety levels. These are inspired by biosafety levels (BSL) used in infectious disease research, where increasingly dangerous pathogens require increasingly stringent containment protocols (Anthropic, 2024). AI Safety Levels create standardized tiers of capability that require increasingly stringent safety measures. For example, Anthropic's framework defines levels from ASL-1 (basic safety measures) through ASL-3 (comprehensive security and deployment restrictions). This is in principle similar to how biologists handle increasingly dangerous pathogens, with each level having specific evaluation requirements and safety protocols.
What is a scaling policy framework? A scaling policy framework puts everything together - determining which evaluations are needed, which safety measures are required, how strictly things should be tested, and what evaluation requirements exist before training, deployment, and post-deployment. Essentially, it establishes systematic rules and protocols for monitoring, safety and being willing to pause development if safety cannot be assured (METR, 2023).
Responsible scaling policy framework (RSP). This is Anthropic’s attempt at continuing to increase AI capabilities in a way that maintains safety margins and implements appropriate protective measures at each level. They define specific evaluation requirements, capability thresholds, and clear protocols for when development must pause (Hubinger, 2023). This moves beyond ad-hoc safety measures to create comprehensive frameworks for deciding when and how to scale AI development.
Which evaluations are necessary to act as gates to further scale? RSPs require several categories of evaluation working together, building on the evaluation types we discussed earlier in this chapter. Capability evaluations detect dangerous abilities like autonomous replication, CBRN, or cyberattack capabilities. Security evaluations verify protection of model weights and training infrastructure. Safety evaluations test whether control measures remain effective (Anthropic, 2024). These evaluations need to work together - passing one category isn't sufficient if others indicate concerns. This connects directly to our earlier discussion on how capability, propensity, and control evaluations complement each other.
5.4.2.2 Preparedness Framework¶
What is the Preparedness Framework? OpenAI's Preparedness Framework has a lot of overlap with Anthropic's RSPs. Rather than using fixed capability levels for the entire model like ASLs, the preparedness framework publishes model cards with organized evaluations around specific risk categories like cybersecurity, persuasion, and autonomous replication. For each category, they define a spectrum from low to critical risk, with specific evaluation requirements and mitigation measures for each level (OpenAI, 2023). So similar to RSPs, in the preparedness framework evaluations play a central role.
Evaluations in the preparedness framework. The framework requires both pre-mitigation and post-mitigation evaluations. Pre-mitigation evaluations assess a model's raw capabilities and potential for harm, while post-mitigation evaluations verify whether safety measures effectively reduce risks to acceptable levels. This maps onto our earlier discussions about capability and control evaluations - we need to understand both what a model can do and whether we can reliably prevent harmful outcomes (OpenAI, 2024, Progress on AI Safety). The framework sets clear safety baselines: only models with post-mitigation scores of "medium" or below can be deployed, and only models with post-mitigation scores of "high" or below can be developed further. Models showing "high" or "critical" pre-mitigation risk require specific security measures to prevent model weight exfiltration. This creates direct links between evaluation results and required actions (OpenAI, 2023). A unique aspect of the Preparedness Framework is its explicit focus on "unknown unknowns" - potential risks that current evaluation protocols might miss. The framework includes processes for actively searching for unanticipated risks and updating evaluation protocols accordingly. This hoping to address one of the limitations of AI evaluations that we will discuss in a later section.
5.4.2.3 Frontier Safety Framework¶
What is the Frontier Safety Framework? DeepMind's FSF shares core elements with other governance frameworks but introduces some unique elements. Instead of ASLs or risk spectrums, it centers on "Critical Capability Levels" (CCLs) that trigger specific evaluation and mitigation requirements. The framework includes both deployment mitigations (like safety training and monitoring) and security mitigations (protecting model weights). (DeepMind, 2024) While ASLs look at overall model capability levels, CCLs are more granular and domain-specific. Separate CCLs exist for biosecurity, cybersecurity, and autonomous capabilities. Each CCL has its own evaluation requirements and triggers different combinations of security and deployment mitigations. This allows for more targeted responses to specific risks rather than treating all capabilities as requiring the same level of protection. (DeepMind, 2024)
Scaling buffers are used to calculate evaluation timing. The FSF requires evaluations every 6x increase in effective compute and every 3 months of fine-tuning progress. This timing is designed to provide adequate safety buffers - they want to detect CCLs before models actually reach them. (DeepMind, 2024) Anthropics RSPs have a similar scaling buffer requirement, but they have lower thresholds - evaluations for every 4x increase in effective compute. (Anthropic, 2024)