Behind the Rankings: LLM Model Evaluation in Benchmark Datasets
MMLU benchmark page on PapersWithCode (Ref4)

Over the past few days, there's been a flurry of posts discussing the newly unveiled Llama 3 model and its impressive performance on benchmark datasets like MMLU and HumanEval. It's worth delving into these benchmarks and exploring some of the nuances and potential pitfalls of relying too heavily on them and on the scores that companies report.


MMLU, for instance, comprises 57 tasks spanning mathematics, history, computer science, and even law [Ref1, Ref2].

Tasks in MMLU (From HuggingFace)


Each task consists of multiple-choice questions with four answer options, exactly one of which is correct. LLMs are scored on the fraction of questions they answer correctly.
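
To make the setup concrete, here's a minimal sketch of what such an accuracy-based evaluation loop might look like in Python. The prompt layout, the example fields, and the `ask_model` callable are illustrative assumptions rather than the exact harness behind any published MMLU score.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# Assumptions: each example is a dict with "question", "choices" (4 strings),
# and "answer" (a letter A-D); `ask_model` stands in for whatever LLM call
# you actually use.

from typing import Callable

LETTERS = ["A", "B", "C", "D"]

def format_prompt(question: str, choices: list[str]) -> str:
    """Render one question in the familiar A/B/C/D format."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(examples: list[dict], ask_model: Callable[[str], str]) -> float:
    """Score a model by exact match on the predicted answer letter."""
    correct = 0
    for ex in examples:
        prediction = ask_model(format_prompt(ex["question"], ex["choices"]))
        correct += prediction.strip().upper()[:1] == ex["answer"]
    return correct / len(examples)

if __name__ == "__main__":
    sample = [{
        "question": "What is the derivative of x^2?",
        "choices": ["x", "2x", "x^2", "2"],
        "answer": "B",
    }]
    # A dummy "model" that always answers B, just to exercise the plumbing.
    print(accuracy(sample, lambda prompt: "B"))  # -> 1.0
```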

While the public availability of this dataset and its detailed topic breakdown may seem purely advantageous, it also raises concerns that companies could tune their models to the benchmark and inflate reported scores.
Minor changes in evaluation methods can lead to significant shifts in rankings on the MMLU leaderboard, as demonstrated by a recent sensitivity study [Ref3]. Even slight alterations, such as replacing the A/B/C/D answer-choice symbols with rare symbols or fixing the correct answer to a specific position, can move models up or down by as many as eight positions. The study also assessed alternative evaluation formats, including cloze-style scoring of the answer choices, and used Kendall's τ to measure disagreement between the original ranking and each altered ranking, with lower values of τ indicating greater disagreement.
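
For intuition about that metric, the short sketch below computes Kendall's τ between an original and a perturbed ranking with SciPy. The rank positions are invented for illustration; they are not numbers from the study.

```python
# Quantifying ranking disagreement with Kendall's tau (illustrative data only).
from scipy.stats import kendalltau

# Leaderboard positions of five hypothetical models under the original MMLU setup...
original_rank = [1, 2, 3, 4, 5]
# ...and after a small perturbation (e.g. rare answer-choice symbols).
perturbed_rank = [2, 1, 5, 3, 4]

tau, p_value = kendalltau(original_rank, perturbed_rank)
print(f"Kendall's tau = {tau:.2f}  (1.0 = identical order; lower = more disagreement)")
```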

The same paper cautions against using benchmark rankings as the sole basis for model selection, and it further shows that LLMs exhibit biases toward specific scoring methods for answer choices in multiple-choice questions (MCQs).
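
To make "scoring methods for answer choices" concrete, here is a hedged sketch contrasting two common styles: symbol scoring, which asks which answer letter the model finds most likely, and cloze scoring, which compares the likelihood the model assigns to each full answer text. The `log_likelihood` hook is a hypothetical stand-in for a model's scoring API, and the length normalization shown is just one common convention.

```python
# Two common ways of scoring multiple-choice answers (illustrative sketch).
# `log_likelihood(prompt, continuation)` is a hypothetical hook returning the
# model's log-probability of `continuation` given `prompt`.

from typing import Callable

LogLik = Callable[[str, str], float]
LETTERS = ["A", "B", "C", "D"]

def symbol_score(prompt_with_choices: str, ll: LogLik) -> str:
    """Pick the answer letter the model assigns the highest likelihood."""
    return max(LETTERS, key=lambda letter: ll(prompt_with_choices, f" {letter}"))

def cloze_score(question: str, choices: list[str], ll: LogLik) -> int:
    """Pick the answer text with the highest length-normalized likelihood."""
    scores = [ll(question, f" {c}") / max(len(c.split()), 1) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```

A model that looks strongest under one of these scoring styles will not necessarily look strongest under the other, which is exactly the kind of sensitivity the paper documents.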

By acknowledging these issues and engaging in critical discourse, we can foster a more nuanced understanding of the strengths and limitations of benchmark datasets and better inform our approach to evaluating AI models.

Ref1: https://huggingface.co/datasets/lukaemon/mmlu

Ref2: https://github.com/hendrycks/test

Ref3: https://arxiv.org/pdf/2402.01781.pdf

Ref4: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu

