Behind the Rankings: LLM Model Evaluation in Benchmark Datasets
[Image: MMLU benchmark page on PapersWithCode (Ref4)]

Over the past few days, there has been a flurry of posts discussing the newly unveiled Llama 3 model and its impressive performance on benchmark datasets such as MMLU and HumanEval. It's worth taking a closer look at these benchmarks and at the pitfalls of relying too heavily on them and on the scores reported by various companies.


MMLU, for instance, encompasses a wide array of 57 tasks spanning mathematics, history, computer science, and even law [Ref1, Ref2].

[Image: Tasks in MMLU (from Hugging Face)]


Each task consists of multiple-choice questions with four answer options, exactly one of which is correct. Models are scored by their accuracy in selecting the correct option.
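
To make the scoring concrete, here is a minimal sketch of MMLU-style accuracy evaluation. The prompt template and the query_model function are illustrative assumptions on my part, not the official evaluation harness.

# Minimal sketch of MMLU-style multiple-choice scoring.
# `query_model` is a hypothetical stand-in for whatever LLM API is being evaluated;
# the prompt template is an approximation, not the official harness.

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_prompt(question, choices):
    lines = [question]
    for label, choice in zip(CHOICE_LABELS, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def mmlu_accuracy(examples, query_model):
    """examples: dicts with 'question', 'choices' (4 strings), 'answer' (index 0-3)."""
    correct = 0
    for ex in examples:
        prediction = query_model(format_prompt(ex["question"], ex["choices"]))
        if prediction.strip()[:1] == CHOICE_LABELS[ex["answer"]]:
            correct += 1
    return correct / len(examples)

Per-task accuracies are then typically aggregated (for example, averaged across the 57 tasks) into the single MMLU number that appears on leaderboards.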

While the public availability of this dataset and its detailed topic breakdown are convenient, they also raise a concern: companies can tune their models to the benchmark itself, inflating reported scores.
Minor changes in evaluation methods can lead to significant shifts in rankings on the MMLU leaderboard, as demonstrated by recent work on leaderboard sensitivity [Ref3]. Even slight alterations, such as replacing the answer-choice symbols (A/B/C/D) with rare symbols or fixing the correct answer to a specific position, can move a model up or down by as many as eight positions. The study compared several evaluation formats, including cloze-style scoring of the answer choices, and used Kendall's τ to measure agreement between the original ranking and each perturbed ranking, with a lower Kendall's τ indicating greater disagreement.
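
As a rough illustration of that analysis, the sketch below compares a toy leaderboard before and after a perturbation (say, swapping A/B/C/D for rare symbols and re-evaluating) and measures how much the ranking changed using Kendall's τ from SciPy. The model names and scores are placeholders, not results from the paper.

from scipy.stats import kendalltau

# Placeholder accuracies before and after a perturbation of the benchmark
# (e.g., replacing the answer-choice symbols A/B/C/D with rare symbols).
original  = {"model_1": 0.71, "model_2": 0.69, "model_3": 0.66, "model_4": 0.64}
perturbed = {"model_1": 0.65, "model_2": 0.68, "model_3": 0.67, "model_4": 0.60}

def rank_positions(scores):
    # Rank models from best to worst accuracy; return each model's rank position.
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [ordered.index(model) for model in sorted(scores)]

tau, p_value = kendalltau(rank_positions(original), rank_positions(perturbed))
print(f"Kendall's tau = {tau:.2f}")  # lower tau => more disagreement between the rankings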

Beyond ranking instability, the same paper [Ref3] cautions against using benchmark rankings as the sole basis for model selection, and it shows that LLMs exhibit biases toward specific scoring methods for answer choices in multiple-choice questions (MCQs).
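
For intuition about what "scoring method" means here, the sketch below contrasts two common answer-selection strategies: scoring only the answer label (symbol scoring) versus scoring the full answer text as a continuation (cloze-style scoring). The logprob function is a hypothetical stand-in for a model API that returns the log-probability of a continuation given a prompt; the exact procedures in the paper may differ.

# Hypothetical API: logprob(prompt, continuation) -> log-probability of
# `continuation` given `prompt`. Both strategies are sketches of the general
# idea, not the exact procedures used in the paper.

CHOICE_LABELS = ["A", "B", "C", "D"]

def pick_by_symbol(prompt, choices, logprob):
    # "Symbol" scoring: compare the probabilities of the answer labels only.
    scores = [logprob(prompt, f" {label}") for label in CHOICE_LABELS[: len(choices)]]
    return scores.index(max(scores))

def pick_by_cloze(question, choices, logprob):
    # Cloze-style scoring: compare the probabilities of the full answer texts
    # as continuations of the question (often length-normalized in practice).
    scores = [logprob(question + " ", choice) / max(len(choice.split()), 1)
              for choice in choices]
    return scores.index(max(scores))

A model can rank the same answer options differently under the two strategies, which is why the choice of scoring method by itself can shift benchmark rankings.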

By acknowledging these issues and engaging in critical discourse, we can foster a more nuanced understanding of the strengths and limitations of benchmark datasets and better inform our approach to evaluating AI models.

Ref1: https://huggingface.co/datasets/lukaemon/mmlu

Ref2: https://github.com/hendrycks/test

Ref3: https://arxiv.org/pdf/2402.01781.pdf

Ref4: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu

