What is a benchmark? Imagine trying to build a bridge without a measuring tape. Before standardized units like meters and grams, different regions used their own local measurements. This didn't just make engineering inefficient; it also made it dangerous. Even if one country developed a safe bridge design, specifying measurements in "three royal cubits" of material meant builders in other countries couldn't reliably reproduce that safety. A slightly too-short support beam or too-thin cable could lead to catastrophic failure.
AI had a similar problem before we started using standardized benchmarks. 1 A benchmark is a standardized test that we can use to measure and compare what AI systems can and cannot do. Benchmarks have historically been used mainly to measure capabilities, but in the last few years we have also seen them developed for AI safety and ethics.
How do benchmarks shape AI development and safety research? Benchmarks play a slightly different role in AI than in other scientific fields. They are an evolving tool that both measures progress and actively shapes the direction of research and development. When we create a benchmark, we're essentially saying: "this is what we think is important to measure." If we can guide the measurement, then to some extent we can also guide the development.
Goal definitions and evaluation benchmarks are among the most potent drivers of scientific progress
History and Evolution #
Example: Benchmarks influencing standardization in computer vision. As one concrete example of how benchmarks influence AI development, we can look at the history of benchmarking in computer vision. In 1998, researchers introduced MNIST, a dataset of 70,000 handwritten digits (LeCun, 1998). The digits themselves were not the important part. What mattered was that each digit image was carefully processed to be the same size and centered in the frame, and that the researchers made sure to get digits from different writers for the training set and test set. This standardization gave us a way to make meaningful comparisons about AI capabilities - in this case, the specific capability of digit classification. Once systems started doing well on digit recognition, researchers developed more challenging benchmarks. CIFAR-10/100 in 2009 introduced natural color images of objects like cars, birds, and dogs, increasing the complexity (Krizhevsky, 2009). ImageNet, released later the same year, provided 1.2 million images across 1,000 categories (Deng, 2009). When one research team claimed their system achieved 95% accuracy on MNIST or ImageNet and another claimed 98%, everyone knew exactly what those numbers meant. The measurements were trustworthy because both teams used the same carefully constructed dataset. Each new benchmark essentially told the research community: "You've solved the previous challenge - now try this harder one." So benchmarks not only measure progress, they also define what progress means.
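To illustrate why a fixed, shared test split makes scores comparable, here is a minimal sketch using scikit-learn and the conventional MNIST split (first 60,000 images for training, last 10,000 for testing). The simple linear classifier is just a stand-in for whatever model a team actually wants to evaluate.

```python
# Minimal sketch: scoring a model on MNIST's fixed test split.
# Assumes scikit-learn is installed; downloads MNIST from OpenML on first run.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:60_000] / 255.0, X[60_000:] / 255.0
y_train, y_test = y[:60_000], y[60_000:]

# Any team can swap in its own model here; because everyone scores on the
# same held-out 10,000 digits, the resulting accuracy numbers are comparable.
model = LogisticRegression(max_iter=100)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```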
How do benchmarks influence AI Safety? Without standardized measurements, we can't make systematic progress on either capabilities or safety. Just as benchmarks define what capability progress means, when we develop safety benchmarks we're establishing concrete, verifiable standards for what constitutes "safe for deployment". Iterative refinement means we can guide AI safety by developing benchmarks with increasingly stringent standards of safety. Other researchers and organizations can then reproduce safety testing and confirm results. This shapes both technical research into safety measures and policy discussions about AI governance.
Language model benchmarking has already evolved, and it is going to continue evolving. Just as benchmarks continuously evolved in computer vision, they have followed a similar trajectory in language generation. Early language model benchmarks focused primarily on capabilities - can the model answer questions correctly? Complete sentences sensibly? Translate between languages? Since the invention of the transformer architecture in 2017, we've seen an explosion both in language model capabilities and in the sophistication of how we evaluate them. We can't possibly be exhaustive, but here are a few of the benchmarks that current-day language models are evaluated against:
Benchmarking ethics and bias. The ETHICS benchmark (Hendrycks et al., 2023) tests a language model's understanding of human values and ethics across multiple categories including justice, deontology, virtue ethics, utilitarianism, and commonsense morality. The TruthfulQA (Lin et al., 2021) benchmark measures how truthfully language models answer questions. It specifically focuses on "imitative falsehoods" - cases where models learn to repeat false statements that frequently appear in human-written texts in domains like health, law, finance and politics.
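Many of these benchmarks reduce to scoring a model's answers on a fixed set of multiple-choice items. The sketch below is a rough illustration of that pattern, not the exact protocol of ETHICS or TruthfulQA; `ask_model` is a hypothetical placeholder for a real model call.

```python
# Rough sketch of multiple-choice benchmark scoring.
# `ask_model` is a hypothetical stand-in for a real model call, not a real API.
from typing import List

def ask_model(question: str, options: List[str]) -> int:
    """Hypothetical model call: return the index of the option the model picks."""
    return 0  # placeholder so the sketch runs end to end

items = [
    {"question": "Is it acceptable to lie to a customer to close a sale?",
     "options": ["Yes", "No"], "correct": 1},
    {"question": "Do vitamin C supplements cure the common cold?",
     "options": ["Yes", "No"], "correct": 1},
]

n_correct = sum(ask_model(it["question"], it["options"]) == it["correct"] for it in items)
print(f"Score: {n_correct}/{len(items)}")
```

A single aggregate score like this is convenient and reproducible, but it is also exactly the kind of surface-level metric whose limitations we discuss below.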
Benchmarking safety. An example focused on misuse is AgentHarm (Andriushchenko et al., 2024), which is specifically designed to measure how often LLM agents comply with malicious task requests. An example that focuses slightly more on misalignment is the MACHIAVELLI benchmark (Pan et al., 2023). It consists of ‘choose your own adventure’-style games containing over half a million scenarios focused on social decision-making. It measures "Machiavellian capabilities" like power-seeking and deceptive behavior, and how AI agents balance achieving rewards with behaving ethically.
Limitations #
Current benchmarks face several critical limitations that make them insufficient for truly evaluating AI safety. Let's examine these limitations and understand why they matter.
Training Data Contamination. Imagine preparing for a test by memorizing all the answers without understanding the underlying concepts. You might score perfectly, but you haven't actually learned anything useful. LLMs face a similar problem. As these models grow larger and are trained on more internet data, they're increasingly likely to have seen benchmark data during training. This creates a fundamental issue: when a model has memorized benchmark answers, high performance no longer indicates true capability. Benchmarks like TruthfulQA, discussed above, or widely used ones like MMLU are very popular, so their questions and answers are discussed all over the internet. If and when these discussions end up in a model's training data, the model can achieve high scores through memorization rather than understanding.
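One common mitigation is to scan training corpora for verbatim overlap with benchmark items. The sketch below is a deliberately simplified n-gram overlap check, loosely in the spirit of the decontamination procedures used for large training corpora; real pipelines differ in tokenization, n-gram length, and thresholds.

```python
# Simplified sketch of an n-gram overlap contamination check.
# Real decontamination pipelines use proper tokenizers and tuned thresholds.
def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_doc: str, n: int = 8) -> bool:
    # Flag the item if any n-gram of the benchmark text also appears
    # verbatim in the training document.
    return bool(ngrams(benchmark_item, n) & ngrams(training_doc, n))

doc = "... a web page quoting the full text of a popular benchmark question ..."
question = "a popular benchmark question"
print(is_contaminated(question, doc, n=3))  # True: verbatim overlap found
```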
Understanding vs. Memorization Example. The Caesar cipher is a simple encryption method that shifts each letter in the alphabet by a fixed number of positions - for example, with a left shift of 3, 'D' becomes 'A', 'E' becomes 'B', and so on. If encryption is a left shift by 3, then decryption is simply a right shift by 3.
Language models like GPT-4 can solve Caesar cipher problems when the shift value is 3 or 5, which appear commonly in online examples. However, given the exact same problem with an uncommon shift value (like 67), they tend to fail completely (Chollet, 2024). This indicates that the models might not have learned the general algorithm for solving Caesar ciphers. We are not trying to point to a limitation in model capabilities; we expect this can be mitigated with reasoning models trained on chains of thought, or with tool-augmented models. However, benchmarks often just use the models 'as is', without modification or augmentation, which leads to capabilities being underrepresented. This is the core point we are trying to convey.
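For contrast, the general algorithm is tiny and works identically for any shift value. Here is a minimal Python sketch:

```python
# General Caesar cipher: the same few lines handle shift 3, 5, or 67 alike.
def caesar_shift(text: str, shift: int) -> str:
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)  # leave spaces and punctuation unchanged
    return "".join(result)

ciphertext = caesar_shift("HELLO WORLD", -67)  # encrypt with a left shift of 67
print(caesar_shift(ciphertext, 67))            # decrypt with a right shift of 67 -> HELLO WORLD
```

A system that had truly internalized this procedure would not care whether the shift is 3 or 67.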
Why do these benchmarking limitations matter for AI Safety? Benchmarks (including safety benchmarks) might not be measuring what we think they are measuring. For example, benchmarks like ETHICS or TruthfulQA aim to measure how well a model "understands" ethical behavior or avoids imitative falsehoods, but because they rely on multiple-choice tests and short generations, we might still only be capturing surface-level metrics. The model might not have learned what it means to behave ethically in a situation. An AI system might answer every ethical question correctly, pass all safety benchmarks, and still behave in unexpected ways when it encounters a new real-world scenario.
An easy answer is just to keep augmenting benchmarks or training data with more and more questions, but this does not scale. The fundamental issue is that the space of possible situations and tasks is effectively infinite. Even if you train on millions of examples, you've still effectively seen roughly 0% of the total possible space (Chollet, 2024). Research indicates that this isn't just a matter of insufficient data or model size; it's baked into how language models are currently trained. Logical relationships like inverses (the weights learned when training on "A → B" don't automatically strengthen the reverse connection "B → A") or transitivity don't emerge naturally from standard training (Zhu et al., 2024; Golovneva et al., 2024; Berglund et al., 2024). Proposed solutions like reverse training during pre-training show promise in alleviating these issues (Golovneva et al., 2024), but they require significant changes to how models are trained.
Engineers are well aware of these limitations, and the expectation is that they will be alleviated over time. The core question we are concerned with in this chapter is not about limitations in model capabilities; it is whether benchmarks and measurement techniques can stay ahead of training paradigms, and whether they can accurately assess what a model is truly capable of.
Why can't we just make better benchmarks? The natural response to these limitations might be "let's just design better benchmarks." And to some extent, we can!
We've already seen how benchmarks have consistently evolved to address their shortcomings. Researchers are actively working to create benchmarks that resist memorization and test deeper understanding rather than just knowledge recall. A few examples are the Abstraction and Reasoning Corpus (ARC) (Chollet, 2019), ConceptARC (Moskvichev et al., 2023), FrontierMath (Glazer et al., 2024) and Humanity's Last Exam (Hendrycks & Wang, 2024). These explicitly try to benchmark whether models have grasped abstract concepts and general-purpose reasoning rather than just memorized patterns. Just as with these capability-focused benchmarks, we can also continue improving safety-specific benchmarks to make them more robust.
Why aren't better benchmarks enough? While improving benchmarks is important and will help AI safety efforts, the benchmarking paradigm itself has inherent limitations that necessitate more sophisticated evaluation methods (Burden, 2024). The core issue is that benchmarks tend to be performance-oriented rather than capability-oriented - they measure raw scores without systematically assessing whether systems truly possess the underlying capabilities being tested. While benchmarks provide standardized metrics, they often fail to distinguish between systems that genuinely understand tasks and those that merely perform well through memorization or spurious correlations. A benchmark that simply assesses performance, no matter how sophisticated, cannot fully capture the dynamic nature 2 of real-world AI deployment, where systems need to adapt to novel situations and will likely combine capabilities and affordances in unexpected ways. We need to measure the upper limit of model capabilities.
What makes comprehensive evaluations different from just benchmarking? Evaluations are comprehensive protocols that work backwards from concrete threat models. Rather than starting with what's easy to measure, they start by asking "What could go wrong?" and then work backwards to develop systematic ways to test for those failure modes. Organizations like METR have developed approaches that go beyond simple benchmarking. Instead of just asking "Can this model write malicious code?", they consider threat models such as a model exploiting security vulnerabilities to gain computing resources, copy itself onto other machines, and evade detection.
That being said, evaluations are newer, while benchmarks have been around longer and are still evolving, so at times the two terms overlap in how they are used. For the purposes of this text, we think of a benchmark as an individual measurement tool, and an evaluation as a complete safety assessment protocol that includes the use of benchmarks. Depending on how comprehensive a benchmark's testing methodology is, a single benchmark might be thought of as an entire evaluation. But in general, evaluations encompass a broader range of analyses, elicitation methods, and tools to gain a comprehensive understanding of a system's performance and behavior.
Footnotes
1. This is true to a large extent, but as always there is not 100% standardization. We can make meaningful comparisons, but they should still be treated with some caution unless many more details are available. ↩
2. We could have benchmarks in environments populated by other agents. Some RL benchmarks already do this. This is one of the many additions to benchmarking that move us towards a holistic evaluation suite. ↩