The Death of the Static AI Benchmark

Sandi Besen

Artificial Intelligence Applied Research @ Neudesic, an IBM Company | AI Startup Advisor

Published: March 21, 2024

Benchmarks are often hailed as a hallmark of success. They are a celebrated way of measuring progress, whether it's achieving the sub-4-minute mile or excelling on standardized exams. In the context of Artificial Intelligence (AI), benchmarks are the most common method of evaluating a model's capability. Industry leaders such as OpenAI, Anthropic, Meta, and Google compete in a race to one-up each other with superior benchmark scores. However, recent research studies and industry grumblings are casting doubt on whether common benchmarks truly capture the essence of a model's ability.

Data Contamination Leading to Memorization

Emerging research points to the probability that the training sets of some models have been contaminated with the very data they are being assessed on, raising doubts about whether their benchmark scores reflect true understanding. Consider films in which actors portray doctors or scientists: they deliver the lines without truly grasping the underlying concepts. When Cillian Murphy played famous physicist J. Robert Oppenheimer in the movie Oppenheimer, he likely did not understand the complex physics theories he spoke of. Although benchmarks are meant to evaluate a model's capabilities, are they truly doing so if, like an actor, the model has memorized them?

Recent findings from the University of Arizona indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets, discrediting their associated benchmarks [1]. Further, researchers from the University of Science and Technology of China found that when they applied their "probing" technique to the popular MMLU benchmark [2], results decreased dramatically.

Their probing techniques included a series of methods meant to challenge the model's understanding of a question when it is posed in different ways with different answer options but the same correct answer. The probing techniques consisted of: paraphrasing questions, paraphrasing choices, permuting choices, adding extra context to questions, and adding a new choice to the benchmark questions.
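To make two of these probes concrete, here is a minimal Python sketch of permuting choices and adding a new choice while keeping the correct answer the same. It is an illustration only, not the researchers' implementation: the `MCQuestion` type, the function names, and the sample question are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MCQuestion:
    """Hypothetical multiple-choice item; `answer` indexes the correct choice."""
    question: str
    choices: Tuple[str, ...]
    answer: int


def permute_choices(item: MCQuestion, rng: random.Random) -> MCQuestion:
    """Shuffle the answer options while tracking where the correct one lands."""
    order = list(range(len(item.choices)))
    rng.shuffle(order)
    return MCQuestion(
        question=item.question,
        choices=tuple(item.choices[i] for i in order),
        answer=order.index(item.answer),  # new position of the old correct index
    )


def add_choice(item: MCQuestion, distractor: str) -> MCQuestion:
    """Append one extra wrong option; the correct index does not move."""
    return MCQuestion(item.question, item.choices + (distractor,), item.answer)


# Illustrative question, not taken from MMLU.
item = MCQuestion(
    question="Which planet is closest to the Sun?",
    choices=("Venus", "Mercury", "Mars", "Earth"),
    answer=1,
)
probed = add_choice(permute_choices(item, random.Random(0)), "None of the above")
print(probed.choices[probed.answer])  # Mercury
```

A model that truly knows the answer should be unaffected by where "Mercury" appears in the list; a model that has memorized the answer's position or letter for the original item is more likely to slip.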

From the graph below, one can gather that although each tested model performed well on the unaltered "vanilla" MMLU benchmark, when probing techniques were added to different sections of the benchmark (LU, PS, DK, All) they did not perform as strongly.

"Vanilla" represents performance on the unaltered MMLU benchmark. The other keys represent performance on the altered sections of the MMLU benchmark: Language Understanding (LU), Problem Solving (PS), Domain Knowledge (DK), and All.

Future Considerations on How to Evaluate AI

This evolving situation prompts a re-evaluation of how AI models are assessed. The need for benchmarks that both reliably demonstrate capabilities and anticipate the issues of data contamination and memorization is becoming apparent.

As models continue to evolve and are updated, potentially with benchmark data in their training sets, benchmarks will have an inherently short lifespan. Additionally, model context windows are increasing rapidly, allowing a larger amount of context to be included in the model's response. The larger the context window, the greater the potential for contaminated data to indirectly skew the model's learning process, biasing it toward test examples it has already seen.

The Rise of the Dynamic and Real-World Benchmark

To address these challenges, innovative approaches such as dynamic benchmarks are emerging. They employ tactics such as altering questions, complicating questions, introducing noise into the question, paraphrasing the question, reversing the polarity of the question, and more [3].

The example below shows several methods for altering benchmark questions (either manually or with a language model).

Source: Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
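As a rough illustration of one tactic from the list above, introducing noise, the Python sketch below inserts an irrelevant sentence into a question's context while leaving the question and gold answer untouched. It is a simplified assumption of how such a step could look, not the cited framework's method: the `DISTRACTORS` pool, the function names, and the sample item are hypothetical, and splitting on periods is a deliberately naive way to find sentence boundaries.

```python
import random

# Hypothetical pool of irrelevant sentences used as noise.
DISTRACTORS = (
    "The weather that day was unusually warm.",
    "A nearby shop had recently changed owners.",
    "Several people in the room were drinking coffee.",
)


def add_noise(context: str, rng: random.Random) -> str:
    """Insert one irrelevant sentence at a random sentence boundary."""
    # Naive split on "."; this would break on decimals or abbreviations.
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    position = rng.randrange(len(sentences) + 1)
    noise = rng.choice(DISTRACTORS).rstrip(".")
    sentences.insert(position, noise)
    return ". ".join(sentences) + "."


def evolve(context: str, question: str, answer: str, seed: int) -> dict:
    """Build one fresh variant; the question and gold answer stay the same."""
    rng = random.Random(seed)
    return {
        "context": add_noise(context, rng),
        "question": question,
        "answer": answer,
    }


# Illustrative item, not taken from any published benchmark.
context = "Tom bought 3 apples. He then gave 1 apple to Anna."
variant = evolve(context, "How many apples does Tom have left?", "2", seed=42)
print(variant["context"])
```

Because each seed yields a different variant, the exact text a model is scored on need not be fixed in advance, which is what makes memorizing a static test set less useful.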

As we move forward, the imperative to align evaluation methods more closely with real-world applications becomes clear. Establishing benchmarks that accurately reflect practical tasks and challenges will not only provide a truer measure of AI capabilities but also guide the development of Small Language Models (SLMs) and AI Agents. These specialized models and agents require benchmarks that genuinely capture their potential to perform practical and helpful tasks.

Joe Golden

8 months ago

This is a very helpful and engaging article, Sandi. Since life is full of noisy and context-shifting questions, it makes sense that our benchmarks need to evolve continuously. Please keep this series going and I’ll keep reading! Now I’m thinking about a duck egg omelette.

Woodley B. Preucil, CFA

Senior Managing Director

8 months ago

Sandi Besen Very Informative. Thank you for sharing.


