What is a benchmark? Imagine trying to build a bridge without a measuring tape. Before standardized units like meters and grams, different regions used their own local measurements. This didn't just make engineering inefficient; it also made it dangerous. Even if one country developed a safe bridge design, specifying measurements in "three royal cubits" of material meant builders in other countries couldn't reliably reproduce that safety. A slightly too-short support beam or too-thin cable could lead to catastrophic failure.
AI had a similar problem before we started using standardized benchmarks. 1 A benchmark is a standardized test that we can use to measure and compare what AI systems can and cannot do. Benchmarks have historically been used mainly to measure capabilities, but in the last few years we have also seen them developed for AI safety and ethics.
How do benchmarks shape AI development and safety research? Benchmarks play a slightly different role in AI than in other scientific fields. They are an evolving tool that both measures progress and actively shapes the direction of research and development. When we create a benchmark, we're essentially saying: "this is what we think is important to measure." If we can guide the measurement, then to some extent we can also guide the development.
Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress
History and Evolution #
Example: Benchmarks influencing standardization in computer vision. As one concrete example of how benchmarks influence AI development, we can look at the history of benchmarking in computer vision. In 1998, researchers introduced MNIST, a dataset of 70,000 handwritten digits (LeCun, 1998). The digits themselves were not the important part. What mattered was that each digit image was carefully processed to be the same size and centered in the frame, and that the researchers made sure to get digits from different writers for the training set and test set. This standardization gave us a way to make meaningful comparisons about AI capabilities - in this case, the specific capability of digit classification. Once systems started doing well on digit recognition, researchers developed more challenging benchmarks. CIFAR-10/100 in 2009 introduced natural color images of objects like cars, birds, and dogs, increasing the complexity (Krizhevsky, 2009). ImageNet, released later the same year, provided 1.2 million images across 1,000 categories (Deng, 2009). When one research team claimed their system achieved 95% accuracy on MNIST or ImageNet and another claimed 98%, everyone knew exactly what those numbers meant. The measurements were trustworthy because both teams used the same carefully constructed dataset. Each new benchmark essentially told the research community: "You've solved the previous challenge - now try this harder one." So benchmarks not only measure progress, they also define what progress means.
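To illustrate why a fixed, shared test split makes scores comparable, here is a minimal sketch using scikit-learn and the conventional MNIST split (first 60,000 images for training, last 10,000 for testing). The simple linear classifier is just a stand-in for whatever model a team actually wants to evaluate.

```python
# Minimal sketch: scoring a model on MNIST's fixed test split.
# Assumes scikit-learn is installed; downloads MNIST from OpenML on first run.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:60_000] / 255.0, X[60_000:] / 255.0
y_train, y_test = y[:60_000], y[60_000:]

# Any team can swap in its own model here; because everyone scores on the
# same held-out 10,000 digits, the resulting accuracy numbers are comparable.
model = LogisticRegression(max_iter=100)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```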
How do benchmarks influence AI Safety? Without standardized measurements, we can't make systematic progress on either capabilities or safety. Just as benchmarks define what capability progress means, when we develop safety benchmarks we're establishing concrete, verifiable standards for what constitutes "safe for deployment". Iterative refinement means we can guide AI safety by developing benchmarks with increasingly stringent standards of safety. Other researchers and organizations can then reproduce safety testing and confirm results. This shapes both technical research into safety measures and policy discussions about AI governance.
Language model benchmarking has already evolved, and it is going to continue evolving. Just as benchmarks continuously evolved in computer vision, they have followed a similar trajectory in language generation. Early language model benchmarks focused primarily on capabilities - can the model answer questions correctly? Complete sentences sensibly? Translate between languages? Since the invention of the transformer architecture in 2017, we've seen an explosion both in language model capabilities and in the sophistication of how we evaluate them. We can't possibly be exhaustive, but here are a few of the benchmarks that current-day language models are evaluated against:
Benchmarking ethics and bias. The ETHICS benchmark (Hendrycks et al., 2023) tests a language model's understanding of human values and ethics across multiple categories including justice, deontology, virtue ethics, utilitarianism, and commonsense morality. The TruthfulQA (Lin et al., 2021) benchmark measures how truthfully language models answer questions. It specifically focuses on "imitative falsehoods" - cases where models learn to repeat false statements that frequently appear in human-written texts in domains like health, law, finance and politics.
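Many of these benchmarks reduce to scoring a model's answers on a fixed set of multiple-choice items. The sketch below is a rough illustration of that pattern, not the exact protocol of ETHICS or TruthfulQA; `ask_model` is a hypothetical placeholder for a real model call.

```python
# Rough sketch of multiple-choice benchmark scoring.
# `ask_model` is a hypothetical stand-in for a real model call, not a real API.
from typing import List

def ask_model(question: str, options: List[str]) -> int:
    """Hypothetical model call: return the index of the option the model picks."""
    return 0  # placeholder so the sketch runs end to end

items = [
    {"question": "Is it acceptable to lie to a customer to close a sale?",
     "options": ["Yes", "No"], "correct": 1},
    {"question": "Do vitamin C supplements cure the common cold?",
     "options": ["Yes", "No"], "correct": 1},
]

n_correct = sum(ask_model(it["question"], it["options"]) == it["correct"] for it in items)
print(f"Score: {n_correct}/{len(items)}")
```

A single aggregate score like this is convenient and reproducible, but it is also exactly the kind of surface-level metric whose limitations we discuss below.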
Benchmarking safety. An example focused on misuse is AgentHarm (Andriushchenko et al., 2024), which is specifically designed to measure how often LLM agents comply with malicious task requests. An example that focuses slightly more on misalignment is the MACHIAVELLI benchmark (Pan et al., 2023). It consists of ‘choose your own adventure’-style games containing over half a million scenarios focused on social decision-making. It measures "Machiavellian capabilities" like power-seeking and deceptive behavior, and how AI agents balance achieving rewards with behaving ethically.
Limitations #
Current benchmarks face several critical limitations that make them insufficient for truly evaluating AI safety. Let's examine these limitations and understand why they matter.
Training Data Contamination. Imagine preparing for a test by memorizing all the answers without understanding the underlying concepts. You might score perfectly, but you haven't actually learned anything useful. LLMs face a similar problem. As these models grow larger and are trained on more internet data, they're increasingly likely to have seen benchmark data during training. This creates a fundamental issue: when a model has memorized benchmark answers, high performance no longer indicates true capability. Benchmarks like TruthfulQA, discussed above, or widely used ones like MMLU are very popular, so their questions and answers are discussed all over the internet. If and when these discussions end up in a model's training data, the model can achieve high scores through memorization rather than understanding.
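One common mitigation is to scan training corpora for verbatim overlap with benchmark items. The sketch below is a deliberately simplified n-gram overlap check, loosely in the spirit of the decontamination procedures used for large training corpora; real pipelines differ in tokenization, n-gram length, and thresholds.

```python
# Simplified sketch of an n-gram overlap contamination check.
# Real decontamination pipelines use proper tokenizers and tuned thresholds.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    # Flag the item if any n-gram of the benchmark text also appears
    # verbatim in the training document.
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

doc = "... a web page quoting the full text of a popular benchmark question ..."
question = "a popular benchmark question"
print(is_contaminated(question, doc, n=3))  # True: verbatim overlap found
```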
Understanding vs. Memorization Example. The Caesar cipher is a simple encryption method that shifts each letter in the alphabet by a fixed number of positions - for example, with a left shift of 3, 'D' becomes 'A', 'E' becomes 'B', and so on. If encryption is a left shift by 3, then decryption is simply a right shift by 3.
Language models like GPT-4 can solve Caesar cipher problems when the shift value is 3 or 5, which appear commonly in online examples. However, given the exact same problem with an uncommon shift value (like 67), they tend to fail completely (Chollet, 2024). This indicates that the models might not have learned the general algorithm for solving Caesar ciphers. We are not trying to point to a limitation in model capabilities; we expect this can be mitigated with reasoning models trained on chains of thought, or with tool-augmented models. However, benchmarks often just use the models 'as is', without modification or augmentation, which leads to capabilities being underrepresented. This is the core point we are trying to convey.
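For contrast, the general algorithm is tiny and works identically for any shift value. Here is a minimal Python sketch:

```python
# General Caesar cipher: the same few lines handle shift 3, 5, or 67 alike.
def caesar_shift(text: str, shift: int) -> str:
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)  # leave spaces and punctuation unchanged
    return "".join(result)

ciphertext = caesar_shift("HELLO WORLD", -67)  # encrypt with a left shift of 67
print(caesar_shift(ciphertext, 67))            # decrypt with a right shift of 67 -> HELLO WORLD
```

A system that had truly internalized this procedure would not care whether the shift is 3 or 67.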
Why do these benchmarking limitations matter for AI Safety? Benchmarks (including safety benchmarks) might not be measuring what we think they are measuring. For example, benchmarks like ETHICS or TruthfulQA aim to measure how well a model "understands" ethical behavior or avoids imitative falsehoods, but because they rely on multiple-choice tests and short generations, we might still only be capturing surface-level metrics. The model might not have learned what it means to behave ethically in a situation. An AI system might answer every ethical question correctly, pass all safety benchmarks, and still behave in unexpected ways when it encounters a new real-world scenario.
An easy answer is just to keep augmenting benchmarks or training data with more and more questions, but this does not scale. The fundamental issue is that the space of possible situations and tasks is effectively infinite. Even if you train on millions of examples, you've still effectively seen roughly 0% of the total possible space (Chollet, 2024). Research indicates that this isn't just a matter of insufficient data or model size; it's baked into how language models are currently trained. Logical relationships like inverses (the weights learned when training on "A → B" don't automatically strengthen the reverse connection "B → A") or transitivity don't emerge naturally from standard training (Zhu et al., 2024; Golovneva et al., 2024; Berglund et al., 2024). Proposed solutions like reverse training during pre-training show promise in alleviating these issues (Golovneva et al., 2024), but they require significant changes to how models are trained.
Engineers are well aware of these limitations, and the expectation is that they will be alleviated over time. The core question we are concerned with in this chapter is not about limitations in model capabilities; it is whether benchmarks and measurement techniques can stay ahead of training paradigms, and whether they can accurately assess what a model is truly capable of.
Why can't we just make better benchmarks? The natural response to these limitations might be "let's just design better benchmarks." And to some extent, we can!
We've already seen how benchmarks have consistently evolved to address their shortcomings. Researchers are actively working to create benchmarks that resist memorization and test deeper understanding rather than just knowledge recall. A few examples are the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019), ConceptARC (Moskvichev et al., 2023), FrontierMath (Glazer et al., 2024) and Humanity's Last Exam (Hendrycks & Wang, 2024). These explicitly try to benchmark whether models have grasped abstract concepts and general-purpose reasoning rather than just memorized patterns. Just as with these capability-focused benchmarks, we can also continue improving safety-specific benchmarks to make them more robust.
Why aren't better benchmarks enough? While improving benchmarks is important and will help AI safety efforts, the benchmarking paradigm itself has inherent limitations that necessitate more sophisticated evaluation methods (Burden, 2024). The core issue is that benchmarks tend to be performance-oriented rather than capability-oriented - they measure raw scores without systematically assessing whether systems truly possess the underlying capabilities being tested. While benchmarks provide standardized metrics, they often fail to distinguish between systems that genuinely understand tasks and those that merely perform well through memorization or spurious correlations. A benchmark that simply assesses performance, no matter how sophisticated, cannot fully capture the dynamic nature 2 of real-world AI deployment, where systems need to adapt to novel situations and will likely combine capabilities and affordances in unexpected ways. We need to measure the upper limit of model capabilities.
What makes comprehensive evaluations different from just benchmarking? Evaluations are comprehensive protocols that work backwards from concrete threat models. Rather than starting with what's easy to measure, they start by asking "What could go wrong?" and then work backwards to develop systematic ways to test for those failure modes. Organizations like METR have developed approaches that go beyond simple benchmarking. Instead of just asking "Can this model write malicious code?", they consider threat models such as a model exploiting security vulnerabilities to gain computing resources, copy itself onto other machines, and evade detection.
That being said, evaluations are newer, while benchmarks have been around longer and are still evolving, so at times the two terms overlap in how they are used. For the purposes of this text, we think of a benchmark as an individual measurement tool, and an evaluation as a complete safety assessment protocol that includes the use of benchmarks. Depending on how comprehensive a benchmark's testing methodology is, a single benchmark might be thought of as an entire evaluation. But in general, evaluations encompass a broader range of analyses, elicitation methods, and tools to gain a comprehensive understanding of a system's performance and behavior.
Footnotes
1. This is true to a large extent, but as always there is not 100% standardization. We can make meaningful comparisons, but they should still be treated with some caution unless many more details are available. ↩
2. We could have benchmarks in environments populated by other agents. Some RL benchmarks already do this. This is one of the many additions to benchmarking that move us towards a holistic evaluation suite. ↩