5.1 Benchmarks¶
What is a benchmark? Imagine trying to build a bridge without measuring tape. Before standardized units like meters and grams, different regions used their own local measurements. Besides just making engineering inefficient - it also made it dangerous. Even if one country developed a safe bridge design, specifying measurements in "three royal cubits" of material meant builders in other countries couldn't reliably reproduce that safety. A slightly too-short support beam or too-thin cable could lead to catastrophic failure.
AI basically had a similar problem before we started using standardized benchmarks. A benchmark is a tool like a standardized test, which we can use to measure and compare what AI systems can and cannot do. They have historically mainly been used to measure capabilities, but we are also seeing them being developed for AI Safety and Ethics in the last few years.
How do benchmarks shape AI development and safety research? Benchmarks in AI are slightly different from other scientific fields. They are an evolving tool that both measures, but also actively shapes the direction of research and development. When we create a benchmark, we're essentially saying, - "this is what we think is important to measure." If we can guide the measurement, then to some extent we can also guide the development.
François Chollet (Chollet, 2019)
Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress
5.1.1 History & Evolution¶
Example: Benchmarks influencing standardization in computer vision. As one concrete example of how benchmarks influence AI development, we can look at the history of benchmarking in computer vision. In 1998, researchers introduced MNIST, a dataset of 70,000 handwritten digits. (LeCun, 1998) The digits were not the important part, the important part was that each digit image was carefully processed to be the same size and centered in the frame, and that the researchers made sure to get digits from different writers for the training set and test set. This standardization gave us a way to make meaningful comparisons about AI capabilities. In this case, the specific capability of digit classification. Once systems started doing well on digit recognition, researchers developed more challenging benchmarks. CIFAR-10/100 in 2009 introduced natural color images of objects like cars, birds, and dogs, increasing the complexity. (Krizhevsky, 2009) Similarly, ImageNet later the same year provided 1.2 million images across 1,000 categories. (Deng, 2009) When one research team claimed their system achieved 95% accuracy on MNIST or ImageNet and another claimed 98%, everyone knew exactly what those numbers meant. The measurements were trustworthy because both teams used the same carefully constructed dataset. Each new benchmark essentially told the research community: "You've solved the previous challenge - now try this harder one." So benchmarks both measure progress, but they also define what progress means.
How do benchmarks influence AI Safety? Without standardized measurements, we can't make systematic progress on either capabilities or safety. Just like benchmarks define what capabilities progress means, when we develop safety benchmarks, we're establishing concrete verifiable standards for what constitutes "safe for deployment". Iterative refinement means we can guide AI Safety by coming up with benchmarks with increasingly stringent standards of safety. Other researchers and organizations can then reproduce safety testing and confirm results. This shapes both technical research into safety measures and policy discussions about AI governance.
Overview of language model benchmarking. Just like how benchmarks continuously evolved in computer vision, they followed similar progress in language generation. Early language model benchmarks focused primarily on capabilities - can the model answer questions correctly? Complete sentences sensibly? Translate between languages? Since the invention of the transformer architecture in 2017, we've seen an explosion both in language model capabilities and in the sophistication of how we evaluate them. We can’t possibly be exhaustive, but here are just a couple of benchmarks that current day language models are evaluated against:
Benchmarking Language & Task Understanding. General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), and its successor SuperGLUE (Wang et al., 2019) test difficult language understanding tasks. SWAG (Zellers et al., 2018), and HellaSwag (Zellers et al., 2019) tests specifically the ability to predict which event would naturally follow from a given story scenario.
Mixed evaluations. The MMLU (Massive Multitask Language Understanding) benchmark (Hendrycks et al., 2020) tests a model's knowledge across 57 subjects. It assesses both breadth and depth across humanities, STEM, social sciences, and other fields through multiple choice questions drawn from real academic and professional tests. The GPQA (Google Proof QA) (Rein et al., 2023) has multiple choice questions specifically designed so that correct answers can’t be found through simple internet searches. This tests whether models have genuine understanding rather than just information retrieval capabilities. The Holistic Evaluation of Language Models (HELM) benchmark (Liang et al., 2022), and BigBench (Srivastava et al., 2022) are yet more examples of benchmarks for measuring generality by testing on a wide range of tasks.
Benchmarking Mathematical & Scientific Reasoning. For specifically testing mathematical reasoning, a couple of examples include - the Grade School Math (GSM8K) (Cobbe et al., 2021) benchmark. This tests core mathematical concepts at an elementary school level. Another example is the MATH (Hendrycks et al., 2021) benchmark similarly tests seven subjects including algebra, geometry, and precalculus focuses on competition-style problems. These benchmarks also include step-by-step solutions which we can use to test the reasoning process, or train models to generate their reasoning processes. Multilingual Grade School Math (MGSM) is the multilingual version translated 250 grade-school math problems from the GSM8K dataset. (Shi et al., 2022)
Details - Benchmark: Frontier Math (Glazer et al., 2024)
This is an extra explanation of the frontier math mathematical benchmark. You can safely skip this.
What makes FrontierMath different from other benchmarks? Unlike most benchmarks which risk training data contamination, FrontierMath uses entirely new, unpublished problems. Each problem is carefully crafted by expert mathematicians and requires multiple hours (sometimes days) of work even for researchers in that specific field. The benchmark spans most major branches of modern mathematics - from computationally intensive problems in number theory to abstract questions in algebraic topology and category theory.
To ensure problems are truly novel, they undergo expert review and plagiarism detection. The benchmark also enforces strict "guessproofness" - problems must be designed so there's less than a 1% chance of guessing the correct answer without doing the mathematical work. This means problems often have large, non-obvious numerical answers that can only be found through proper mathematical reasoning.
Here's a concrete example - one FrontierMath problem asks solvers to examine sequences defined by recurrence relations, then find the smallest prime number (with specific modular properties) that allows the sequence to extend continuously in a p-adic field. This requires deep understanding of multiple mathematical concepts and can't be solved through simple pattern matching or memorization. To ensure problems can't be solved by guessing, answers are often large numbers or complex mathematical objects that would have less than a 1% chance of being guessed correctly without doing the actual mathematical work.
The benchmark provides an experimental environment where models can write and test code to explore mathematical ideas, similar to how human mathematicians work. While problems must have automatically verifiable answers (either numerical or programmatically expressible mathematical objects), they still require sophisticated mathematical reasoning to solve. Current state-of-the-art models solve less than 2% of FrontierMath problems, revealing a vast gap between AI capabilities and expert human mathematical ability.
Details - Benchmark: Weapons of Mass Destruction Proxy (WMDP) benchmark (Li et al., 2024)
This is an extra explanation of the WDMP benchmark. You can safely skip this.
The Weapons of Mass Destruction Proxy (WMDP) benchmark represents a systematic attempt to evaluate potentially dangerous AI capabilities across biosecurity, cybersecurity, and chemical domains. The benchmark contains 3,668 multiple choice questions designed to measure knowledge that could enable malicious use, while carefully avoiding the inclusion of truly sensitive information.
What makes WMDP different from other benchmarks? Rather than directly testing how to create bioweapons or conduct cyberattacks, WMDP focuses on measuring precursor knowledge - information that could enable malicious activities but isn't itself classified or export-controlled. For example, instead of asking about specific pathogen engineering techniques, questions might focus on general viral genetics concepts that could be misused. The authors worked with domain experts and legal counsel to ensure the benchmark complies with export control requirements while still providing meaningful measurement of concerning capabilities.
How does WMDP help evaluate AI safety? Current state-of-the-art models achieve over 70% accuracy on WMDP questions, suggesting they already possess significant knowledge that could potentially enable harmful misuse. This type of systematic measurement helps identify what kinds of knowledge we might want to remove or restrict access to in future AI systems. The benchmark serves dual purposes - both measuring current capabilities and providing a testbed for evaluating safety interventions like unlearning techniques that aim to remove dangerous knowledge from models.
Benchmarking SWE & Coding. The Automated Programming Progress Standard (APPS) (Hendrycks et al., 2021) is a benchmark specifically for evaluating code generation from natural language task descriptions. Similarly, HumanEval (Chen et al, 2021) tests python coding abilities, and its extensions like HumanEval-XL (Peng et al.,2024) tests cross-lingual coding capabilities between 23 natural languages and 12 programming languages. HumanEval-V (Zhang et al., 2024) tests coding tasks where the model must interpret both a diagrams or charts, and textual descriptions to generate code. BigCode (Zuho et al., 2024), benchmarks code generation and tool usage by measuring a models ability to correctly use multiple Python libraries to solve complex coding problems.
Benchmarking Ethics & Bias. The ETHICS benchmark (Hendrycks et al., 2023) tests a language models' understanding of human values and ethics across multiple categories including justice, deontology, virtue ethics, utilitarianism, and commonsense morality. The TruthfulQA (Lin et al., 2021) benchmark measures how truthfully language models answer questions. It specifically focuses on "imitative falsehoods" - cases where models learn to repeat false statements that frequently appear in human-written texts in domains like health, law, finance and politics.
Benchmarking Safety. An example focused on misuse is AgentHarm (Andriushchenko et al., 2024). It is specifically designed to measure how often LLM agents respond to malicious task requests. An example that focuses slightly more on misalignment is the MACHIAVELLI (Pan et al., 2023) benchmark. It has ‘choose your own adventure’ style games containing over half a million scenarios focused on social decision making. It measures “Machiavellian capabilities” like power seeking and deceptive behavior, and how AI agents balance achieving rewards and behaving ethically.
5.1.2 Limitations¶
Current benchmarks face several critical limitations that make them insufficient for truly evaluating AI safety. Let's examine these limitations and understand why they matter.
Training Data Contamination. Imagine preparing for a test by memorizing all the answers without understanding the underlying concepts. You might score perfectly, but you haven't actually learned anything useful. Large language models face a similar problem. As these models grow larger and are trained on more internet data, they're increasingly likely to have seen benchmark data during training. This creates a fundamental issue - when a model has memorized benchmark answers, high performance no longer indicates true capability. This contamination problem is particularly severe for language models. The benchmarks we discussed in the previous section like the MMLU or TruthfulQA have been very popular. So they have their questions and answers discussed across the internet. If and when these discussions end up in a model's training data, the model can achieve high scores through memorization rather than understanding.
Understanding vs. Memorization Example. The Caesar cipher is a simple encryption method shifts each letter in the alphabet by a fixed number of positions - for example, with a left shift of 3, 'D' becomes 'A', 'E' becomes 'B', and so on. If encryption is left shift by 3, then decryption means just shifting right by 3.
Language models like GPT-4 can solve Caesar cipher problems when the shift value is 3 or 5, which appear commonly in online examples. However, give them the exact same problem with an uncommon shift value like 13, 67, or any other random number and they fail completely. (Chollet, 2024) This basically means that the models haven't learned the general algorithm for solving Caesar ciphers. It is unknown if this remains true for newer models, or with tool augmented models.
The Reversal Curse. Another version of this lack of understanding was recently demonstrated through what researchers call the "Reversal Curse" (Berglund et al., 2024). Testing GPT-4, researchers found that while the model could answer things like "Who is Tom Cruise's mother?" (correctly identifying Mary Lee Pfeiffer), but it failed to answer "Who is Mary Lee Pfeiffer's son?" - a logically equivalent question. The model showed 79% accuracy on forward relationships but only 33% on their reversals. In general, researchers have shown that LLMs that demonstrate an understanding of the relationship that "A = B", are unable to learn the reverse relationship "B = A" which should be logically equivalent. It suggests that models might be fundamentally failing to learn basic logical relationships that humans take for granted. The reversal curse is model agnostic. It is robust across model sizes and model families and is not alleviated by data augmentation or fine-tuning. So models might be fundamentally failing to learn basic logical relationships that humans take for granted, even while scoring well on sophisticated benchmark tasks.
What do these limitations tell us about benchmarking? Essentially, traditional benchmarks might be missing fundamental gaps in model understanding. In this case, benchmarks testing factual knowledge about celebrities would miss this directional limitation entirely unless specifically designed to test both directions. For the moment, an easy answer is just to keep augmenting benchmarks or training data with more and more questions, but this seems intractable and does not scale forever. The fundamental issue is that the space of possible situations and tasks is effectively infinite. Even if you train on millions of examples, you've still seen exactly 0% of the total possible space. This isn't just a theoretical concern - it's a practical limitation because the world is constantly changing and introducing novel situations (Chollet, 2024). This means we need a fundamentally different approach to benchmarking AI.
What are capabilities correlations? Another limitation to keep in mind when discussing the limitations of benchmarks are things like correlated capabilities. When evaluating models, we want to distinguish between a model becoming generally more capable versus becoming specifically safer. One way to measure this is through "capabilities correlation" - how strongly performance on a safety benchmark correlates with performance on general capability tasks like reasoning, knowledge, and problem-solving. Essentially, if a model that's better at math, science, and coding also automatically scores better on your safety benchmark, that benchmark might just be measuring general capabilities rather than anything specifically related to safety.
How can we measure capabilities correlations? Researchers use a collection of standard benchmarks like MMLU, GSM8K, and MATH to establish a baseline "capabilities score" for each model. They then check how strongly performance on safety benchmarks correlates with this capabilities score. High correlation (above 70%) suggests the safety benchmark might just be measuring general model capabilities. Low correlation (below 40%) suggests the benchmark is measuring something distinct from capabilities (Ren et al., 2024). For example, a model's performance on TruthfulQA correlates over 80% with general capabilities, suggesting it might not be distinctly measuring truthfulness at all.
Why do these benchmarking limitations matter for AI Safety? Essentially because benchmarks (including safety benchmarks) might not be measuring what we think they are measuring. For example benchmarks like ETHICS, or TruthfulQA measure how well a model “understands” ethical behavior, or has a tendency to avoid imitative falsehood by measuring language generation on multiple choice tests, but we might still be measuring surface level metrics. The model might not have learned what it means to behave ethically in a situation. Just like it has not truly internalized that “A=B implies B=A” or the Caesar cipher algorithm. So when systems rely on memorization rather than generalization, then they can still fail unexpectedly in novel situations despite good benchmark performance. An AI system might work perfectly on all test cases, pass all safety benchmarks, but fail when encountering a new real-world scenario.
Why can't we just make better benchmarks? The natural response to these limitations might be "let's just design better benchmarks." And to some extent, we can! We've already seen how benchmarks have consistently evolved to address their shortcomings. Researchers are actively working to create benchmarks that resist memorization and test deeper understanding. A couple of examples are the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019), and the ConceptARC (Moskvichev et al. 2023). They are designed to explicitly evaluate whether models have genuinely grasped abstract concepts rather than just memorizing patterns. Similar to these benchmarks that seek to measure reasoning capabilities, we can continue improving safety benchmarks to be more robust.
Why aren't better benchmarks enough? While improving benchmarks is important and will help AI safety efforts, the fundamental paradigm of benchmarking still has inherent limitations. There are fundamental limitations in traditional benchmarking approaches that necessitate more sophisticated evaluation methods (Burden, 2024, Evaluating AI Evaluation). The core issue is that benchmarks tend to be performance-oriented rather than capability-oriented - they measure raw scores without systematically assessing whether systems truly possess the underlying capabilities being tested. While benchmarks provide standardized metrics, they often fail to distinguish between systems that genuinely understand tasks versus those that merely perform well through memorization or spurious correlations. A benchmark that simply assesses performance, no matter how sophisticated, cannot fully capture the dynamic nature of real-world AI deployment where systems need to adapt to novel situations and will probably combine capabilities and affordances in unexpected ways. We need to measure the upper limit of model capabilities. We need to see how models perform when augmented with various tools like memory databases, or how they chain together multiple capabilities, and potentially cause harm through extended sequences of actions when scaffolded into an agent structure (e.g. AutoGPT). So while improving safety benchmarks is an important piece of the puzzle, we also need a much more comprehensive assessment of model safety. This is where evaluations come in.
What makes evaluations different from benchmarks? Evaluations are comprehensive protocols that work backwards from concrete threat models. Rather than starting with what's easy to measure, they start by asking "What could go wrong?" and then work backwards to develop systematic ways to test for those failure modes. Organizations like Model Evaluation and Threat Research (METR) have developed approaches that go beyond simple benchmarking. Instead of just asking "Can this model write malicious code?", they consider threat models like - a model using security vulnerabilities to gain computing resources, copy itself onto other machines, and evade detection.
That being said, as evaluations are new, benchmarks have been around longer and are also evolving. So at times there is overlap in the way that these words are used. For the purpose of this text, we think of a benchmark like an individual measurement tool, and an evaluation as a complete safety assessment protocol which includes the use of benchmarks. Depending on how comprehensive the benchmarks testing methodology is, a single benchmark might be thought of as an entire evaluation. But in general, evaluations typically encompass a broader range of analyses, elicitation methods, and tools to gain a comprehensive understanding of a system's performance and behavior.