Behind the Rankings: LLM Model Evaluation in Benchmark Datasets
Jayant Kumar
Principal ML Scientist at Adobe | Multimodal AI | Large language models and Knowledge Graph applications | Prev: Apple, Xerox-PARC, Ph.D. UMD
Over the past few days, there has been a flurry of posts about the newly unveiled Llama 3 model and its impressive performance on benchmark datasets such as MMLU and HumanEval. It's worth taking a closer look at these benchmarks and at the pitfalls of relying too heavily on them, and on the scores that companies report.
MMLU, for instance, encompasses a wide array of 57 tasks spanning mathematics, history, computer science, and even law [Ref1, Ref2].
Each task consists of multiple-choice questions with four answer options, exactly one of which is correct; models are scored by their accuracy in selecting the correct option.
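To make the scoring concrete, here is a minimal sketch of an MMLU-style accuracy loop. `query_model` is a hypothetical placeholder for whatever LLM call you actually use, and the example question is illustrative, not drawn from the benchmark itself.

```python
# Minimal sketch of MMLU-style scoring: each question has four options (A-D)
# and the model is credited only when its chosen letter matches the answer key.

def query_model(question: str, options: dict[str, str]) -> str:
    """Hypothetical model call; replace with a real LLM API. Returns one of 'A'-'D'."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    return "A"  # placeholder prediction

def mmlu_accuracy(examples: list[dict]) -> float:
    """Fraction of questions where the predicted letter equals the gold letter."""
    correct = 0
    for ex in examples:
        prediction = query_model(ex["question"], ex["options"])
        correct += int(prediction == ex["answer"])
    return correct / len(examples)

examples = [
    {
        "question": "Which data structure gives O(1) average-time lookups by key?",
        "options": {"A": "Hash table", "B": "Linked list", "C": "Stack", "D": "Queue"},
        "answer": "A",
    },
]
print(f"accuracy = {mmlu_accuracy(examples):.2%}")
```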
While the public availability of this dataset and its detailed topic breakdown may seem advantageous, it also opens the door to gaming: because the test questions are public, they can leak into training data or be targeted directly when a model is tuned, inflating reported scores.
The literature raises further concerns. A recent paper [Ref3] examines how sensitive LLM leaderboards are to evaluation details and cautions against using benchmark rankings as the sole basis for model selection: minor changes to how the MMLU questions are formatted and scored can cause significant shifts in rankings, and LLMs show biases toward particular scoring methods for the answer choices in multiple-choice questions (MCQs), as illustrated in the sketch below.
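To illustrate why the scoring method matters, here is a hedged sketch (not taken from the paper) contrasting two common ways of scoring MCQ answers: comparing the log-probabilities of the answer letters themselves versus length-normalized log-probabilities of the full option texts. The numbers below are made up purely for demonstration; the point is only that the two methods can disagree on the same question, which is enough to shuffle close rankings.

```python
# Illustrative sketch: two common MCQ scoring methods that can pick different answers.
# All log-probability values here are invented for demonstration purposes.

def pick_by_symbol(symbol_logprobs: dict[str, float]) -> str:
    """'Symbol' scoring: choose the answer letter (A-D) with the highest log-probability."""
    return max(symbol_logprobs, key=symbol_logprobs.get)

def pick_by_cloze(text_logprobs: dict[str, float], text_lengths: dict[str, int]) -> str:
    """'Cloze' scoring: choose the option whose full text has the highest
    length-normalized log-probability under the model."""
    return max(text_logprobs, key=lambda k: text_logprobs[k] / text_lengths[k])

symbol_logprobs = {"A": -1.2, "B": -0.9, "C": -2.3, "D": -2.8}     # letter tokens only
text_logprobs   = {"A": -6.0, "B": -11.5, "C": -13.1, "D": -14.0}  # full option texts
text_lengths    = {"A": 3, "B": 5, "C": 6, "D": 6}                 # tokens per option

print(pick_by_symbol(symbol_logprobs))              # -> 'B'
print(pick_by_cloze(text_logprobs, text_lengths))   # -> 'A'
```

The same model, on the same question, is credited with a different answer depending on which convention the evaluator adopts; aggregated over thousands of questions, such choices are exactly the kind of design detail that can reorder a leaderboard.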
By acknowledging these issues and engaging in critical discourse, we can foster a more nuanced understanding of the strengths and limitations of benchmark datasets and better inform our approach to evaluating AI models.