Too many AI benchmarks are useless

Recent research reveals that benchmarks for AI tools are often poorly constructed, challenging to replicate, and reliant on arbitrary metrics. This is significant because these benchmarks help determine the scrutiny and regulation AI models will face.

Moreover, some governments are already incorporating benchmarks into their regulatory frameworks for AI. For example, the EU AI Act, whose rules for general-purpose AI models take effect in August 2025, uses benchmarks to assess whether a model poses "systemic risk"; if it does, the model faces stricter regulation. Similarly, the UK AI Safety Institute references benchmarks in its Inspect framework, which evaluates the safety of large language models.

Given the growing importance of benchmarks, Anka Reuel and her colleagues set out to evaluate the most popular ones, aiming to understand what makes a good benchmark and whether current benchmarks are sufficiently robust. Reuel is a PhD student in computer science at Stanford University and a member of its Center for AI Safety.

Initially, the researchers tried to verify the benchmark results published by developers, but often found themselves unable to reproduce these results. Testing a benchmark typically requires instructions or code, yet many benchmark creators didn’t make their code publicly available, or provided outdated versions.

Additionally, the questions and answers in the datasets were often kept private, which complicated the evaluation process. Releasing them, however, would risk companies training their models specifically on the benchmark, akin to letting a student see the test questions beforehand.

Another major issue is that many benchmarks become "saturated," meaning all the problems they present have essentially been solved. For instance, consider a test with simple math problems. The first generation of an AI model might score 20%, while the second generation achieves 90% and the third reaches 93%. An observer might conclude that AI progress is slowing, but an equally plausible reading is that the benchmark is simply no longer a good indicator of progress: once it has effectively been solved, it stops differentiating between models of different capability.
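To make the saturation effect concrete, the hypothetical numbers from the example above can be plugged into a few lines of Python. This is only an illustrative sketch; the generation names and scores are invented and do not come from any real benchmark.

```python
# Hypothetical scores on a simple math benchmark across model generations.
# These figures mirror the example in the text and are purely illustrative.
scores = {"gen-1": 0.20, "gen-2": 0.90, "gen-3": 0.93}

generations = list(scores)
for earlier, later in zip(generations, generations[1:]):
    gap = scores[later] - scores[earlier]
    headroom = 1.0 - scores[earlier]  # how much room the benchmark still left
    print(f"{earlier} -> {later}: gain = {gap:.2f}, headroom before = {headroom:.2f}")

# Once nearly all headroom is gone (gen-2 -> gen-3), the benchmark can no longer
# distinguish a modest improvement from a large one: every newer model scores
# within a few points of the ceiling.
```

The point of the sketch is simply that the apparent "slowdown" from 90% to 93% says more about the remaining headroom on the test than about the models themselves.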

The research aimed to establish criteria that constitute a good benchmark. The paper also introduced a website, BetterBench, which ranks the most widely used AI benchmarks. Factors considered in the rankings include whether experts were consulted during development, whether the capabilities being tested are clearly defined, and other foundational aspects, such as whether there is a feedback channel and whether the benchmark has been peer-reviewed. Ultimately, the key question is whether a benchmark measures the right thing: a benchmark might meet all of these requirements yet still fail if it doesn't address the appropriate aspect of a model's capability.

Even a perfectly designed benchmark can be ineffective if it doesn't align with the intended purpose. For instance, a benchmark that measures an AI model's ability to analyze Shakespeare sonnets wouldn’t be helpful if the real concern is the model's potential for hacking.

