Revealing the Gaps: Evaluating Large Language Models with New Benchmarks and Metrics

Large Language Models (LLMs) have showcased remarkable capabilities across various tasks, particularly in classification contexts. These models excel when provided with gold-standard labels or options that include the correct answer. However, a significant limitation arises when these gold labels are intentionally omitted; LLMs still select among the given possibilities, even if none are correct. This limitation raises critical questions about the true comprehension and intelligence of LLMs in classification scenarios.

Key Concerns in LLM Classification

In the context of LLM classification, this inability to express uncertainty raises two primary concerns:

  1. Versatility and Label Processing: LLMs demonstrate the ability to work with any set of labels, including those with debatable accuracy. Ideally, to avoid misleading users, these models should emulate human behavior by recognizing accurate labels or indicating their absence. Traditional classifiers, which heavily rely on predetermined labels, lack this level of flexibility.
  2. Discriminative vs. Generative Capabilities: LLMs are designed primarily as generative models, so their discriminative capabilities are a byproduct rather than a design goal. High scores on classification benchmarks suggest these tasks are straightforward; however, existing benchmarks may not reflect human-like behavior, potentially overestimating the practical utility of LLMs.

Introducing the KNOW-NO Benchmarks

To address these concerns, recent research has introduced three categorization tasks as benchmarks, collectively named KNOW-NO. These tasks facilitate further investigation into the limitations of LLMs:

  1. BANK77: An intent classification task.
  2. MC-TEST: A multiple-choice question-answering task.
  3. EQUINFER: A newly developed task that determines the correct equation from four options based on surrounding paragraphs in scientific papers.

KNOW-NO encompasses classification problems with varying label sizes, lengths, and scopes, covering instance-level and task-level label spaces.
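
To make this setting concrete, here is a minimal sketch of how prompts for the two evaluation conditions might be constructed: the same instance is presented once with the gold label included among the candidate options and once with it removed. The function names, prompt wording, and the banking-intent example are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch (not the paper's code): building prompts for the
# two evaluation conditions discussed in this article.

import random

def build_prompt(text: str, options: list[str]) -> str:
    """Format a simple multiple-choice classification prompt."""
    lines = [f"Text: {text}", "Which label best describes the text?"]
    lines += [f"{i + 1}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the number of the best label, or say 'none' if no label fits.")
    return "\n".join(lines)

def make_with_and_without_gold(text: str, gold: str, distractors: list[str], seed: int = 0):
    """Return one prompt that contains the gold label and one that omits it."""
    rng = random.Random(seed)
    with_gold = distractors + [gold]
    rng.shuffle(with_gold)
    without_gold = list(distractors)  # gold label intentionally removed
    rng.shuffle(without_gold)
    return build_prompt(text, with_gold), build_prompt(text, without_gold)

# Hypothetical banking-intent instance:
p_with, p_without = make_with_and_without_gold(
    text="I still haven't received the card I ordered last week.",
    gold="card_arrival",
    distractors=["card_linking", "exchange_rate", "terminate_account"],
)
print(p_with)
print(p_without)
```

In the second condition the only correct behavior is to abstain, or otherwise signal that no option fits; a model that still picks one of the distractors exhibits exactly the failure mode this benchmark is designed to expose.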

OMNIACCURACY: A New Metric for Evaluating LLMs

A novel metric named OMNIACCURACY has been introduced to more accurately assess LLM performance. This metric evaluates LLMs' classification abilities by combining results from two dimensions within the KNOW-NO framework:

  1. Accuracy-W/-GOLD: Measures conventional accuracy when the correct label is provided.
  2. Accuracy-W/O-GOLD: Measures accuracy when the correct label is absent.

OMNIACCURACY aims to approximate human-level discrimination intelligence in classification tasks, demonstrating LLMs' ability to handle both scenarios where correct labels are present and absent.
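
As a rough illustration of how these two dimensions might be combined, the sketch below scores model predictions under both conditions and averages the results. The article does not state the exact combination formula, so the unweighted mean used here is an assumption.

```python
# Illustrative sketch: combining Accuracy-W/-GOLD and Accuracy-W/O-GOLD.
# The simple average below is an assumption; the paper may define the
# combination differently.

def accuracy_with_gold(predictions: list[str], golds: list[str]) -> float:
    """Conventional accuracy when the gold label is among the options."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)

def accuracy_without_gold(predictions: list[str], abstain_token: str = "none") -> float:
    """Accuracy when the gold label is absent: the only correct behavior
    is to abstain (e.g. answer 'none') rather than pick a wrong option."""
    correct = sum(p == abstain_token for p in predictions)
    return correct / len(predictions)

def omniaccuracy(acc_with: float, acc_without: float) -> float:
    """Assumed combination: unweighted mean of the two accuracies."""
    return (acc_with + acc_without) / 2

# Toy example with hypothetical model outputs:
acc_w = accuracy_with_gold(["card_arrival", "exchange_rate"], ["card_arrival", "card_linking"])
acc_wo = accuracy_without_gold(["none", "terminate_account"])
print(f"Accuracy-W/-GOLD: {acc_w:.2f}, Accuracy-W/O-GOLD: {acc_wo:.2f}, "
      f"OMNIACCURACY: {omniaccuracy(acc_w, acc_wo):.2f}")
```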

Primary Contributions of the Study

The research team has highlighted several key contributions:

  1. Highlighting LLM Limitations: This study is the first to emphasize the limitations of LLMs when correct answers are absent in classification tasks.
  2. Introducing CLASSIFY-W/O-GOLD: A new framework for describing and evaluating classification tasks in which the gold label is absent from the candidate set.
  3. Presenting the KNOW-NO Benchmark: Comprising one newly-created task and two well-known categorization tasks, this benchmark evaluates LLMs in the CLASSIFY-W/O-GOLD scenario.
  4. Proposing OMNIACCURACY: This metric combines outcomes when correct labels are present and absent, providing a deeper understanding of LLM capabilities in various situations.

Implications for Future Research

As LLMs continue to evolve, it is imperative to understand their limitations in order to foster further advancements. Benchmarks like KNOW-NO and metrics such as OMNIACCURACY enable a more nuanced evaluation of LLM capabilities, offering a more faithful picture of human-like behavior and intelligence in classification tasks and guiding future improvements in the design and application of LLMs.

By addressing the limitations identified through these new benchmarks and metrics, researchers and practitioners can develop more robust and reliable AI systems. This progress is crucial for enhancing the practical applicability of LLMs across diverse real-world scenarios.


Check out the paper "LLMs' Classification Performance is Overclaimed".

For more insights into the latest trends and innovations in AI, follow Mohamed MARZOUGUI & Khouloud Ben Cheikh, and subscribe to Carthagin'IA Insights. Stay informed and join our growing community of AI enthusiasts and professionals.
