A Practical Guide to Benchmarks for LLM Evaluation

Evaluating LLMs is not as straightforward as evaluating traditional machine learning models. In classification tasks, we typically measure performance with metrics like accuracy, precision, and recall, while in regression tasks we use error measures such as MAE, RMSE, or MAPE.

However, LLMs operate differently. Instead of producing simple yes/no answers or numerical values, they generate varied outputs like text, images, and videos. This variety demands more inventive methods to assess the quality of their outputs, ensuring we capture the nuances of their creativity and effectiveness.

With LLMs, the output is non-deterministic and language-based evaluation is much more challenging.

The sentences “Mike really loves drinking tea.” and “Mike adores sipping tea.” have the same meaning even though they share only two words. In contrast, the sentences “Mike does not drink coffee.” and “Mike does drink coffee.” differ by only a single word, yet their meanings are opposite.

Source: Generative AI with Large Language Models (Coursera)

When it comes to measuring how well LLMs are performing, there are a few tools in the toolbox, such as ROUGE, BLEU, GLUE, SuperGLUE, HELM, and a few others.

ROUGE Score

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE is a set of metrics that evaluate the quality of summaries by comparing them to a set of reference summaries. This metric is particularly useful for evaluating the effectiveness of text-summarization tasks.

ROUGE scores fall on a scale from 0 to 1, where a score closer to 1 suggests that the LLM's summary closely matches the human-generated one. There are different variants of ROUGE, such as ROUGE-1, ROUGE-2, and ROUGE-L.

ROUGE-1: This version of ROUGE focuses on the simplest form of comparison. It measures how many individual words (unigrams) from the LLM-generated summary can also be found in the human-generated reference. This gives a basic check of overlap: precision reflects how much of the LLM's output matches words from the reference, while recall reflects how much of the reference is covered by the output.


Source: Generative AI with Large Language Models (Coursera)
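As a quick illustration, compare the reference “Mike really loves drinking tea.” with the generated summary “Mike adores sipping tea.” Only two unigrams match (“Mike” and “tea”), so unigram recall is 2/5 = 0.4, precision is 2/4 = 0.5, and the resulting ROUGE-1 F1 score is roughly 0.44.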


ROUGE-2 steps things up from ROUGE-1 by not just looking at individual words but at pairs of words, known as bigrams. This approach is similar to ROUGE-1 in terms of how it's calculated, but ROUGE-2 checks how many two-word phrases from the LLM-generated summary show up in the human-generated reference. By evaluating bigrams, ROUGE-2 addresses some of the limitations of ROUGE-1 related to the sequence or order of words, offering a slightly more nuanced view.


Source: Generative AI with Large Language Models (Coursera)
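Continuing the same example, the reference contains four bigrams (“Mike really”, “really loves”, “loves drinking”, “drinking tea”) and the generated summary contains three, none of which overlap, so ROUGE-2 recall and precision are both 0 here. This shows how much more sensitive bigram matching is to word order and exact phrasing.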


ROUGE-L takes a different approach compared to ROUGE-1 and ROUGE-2. Instead of focusing on unigrams or bigrams, ROUGE-L evaluates the longest common subsequence (LCS) between the LLM-generated summary and the human-generated reference. This method looks for the longest string of words that appear in the same order in both texts, providing a deeper insight into the overall coherence and order of the content in the summaries.


Source: Generative AI with Large Language Models (Coursera)
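For the same pair of sentences, the longest common subsequence is “Mike”, “tea” (length 2), giving an LCS-based recall of 2/5 and precision of 2/4, the same as ROUGE-1 in this particular case because the two matching words already appear in the same order.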


Implementing ROUGE in Python

Calculating the ROUGE score in Python is straightforward: simply compare a generated text against a reference text.

Here’s how you can compute ROUGE scores between a reference and a generated text:
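Below is a minimal sketch, assuming the open-source rouge package is installed (pip install rouge); other libraries such as rouge-score work similarly.

```python
# pip install rouge
from rouge import Rouge

reference = "Mike really loves drinking tea."
generated = "Mike adores sipping tea."

rouge = Rouge()

# get_scores returns a list with one dict per (generated, reference) pair,
# keyed by 'rouge-1', 'rouge-2' and 'rouge-l'
scores = rouge.get_scores(generated, reference)

for metric, values in scores[0].items():
    # each entry holds 'r' (recall), 'p' (precision) and 'f' (F1)
    print(f"{metric}: r={values['r']:.3f}, p={values['p']:.3f}, f={values['f']:.3f}")
```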

The output will give you scores for ROUGE-1, ROUGE-2, and ROUGE-L. Each of these will include:

  • f (F1-score): Harmonic mean of precision and recall.
  • p (Precision): The proportion of words in the generated summary that are also in the reference summary.
  • r (Recall): The proportion of words in the reference summary that are also captured in the generated summary.

BLEU Score

BLEU stands for Bilingual Evaluation Understudy. BLEU is commonly used to evaluate the quality of machine-generated translations by comparing them to human reference translations.

It measures the precision of the generated text by counting how many n-grams in the generated text overlap with the reference translations.

BLEU Score ≈ average precision across a range of n-gram sizes

More precisely, BLEU is the geometric mean of modified n-gram precisions (typically for n = 1 to 4), multiplied by a brevity penalty that penalizes candidates shorter than the reference. Because BLEU is built on precision, higher scores indicate greater similarity to the reference translations.

Implementing BLEU in Python

Here’s how you can compute the BLEU score between a reference and a candidate:
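Below is a minimal sketch, assuming NLTK's sentence_bleu is used; smoothing is applied because the short example sentences share no higher-order n-grams, which would otherwise drive the score to zero.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# NLTK expects a list of tokenised reference translations and a tokenised candidate
reference = [["Mike", "really", "loves", "drinking", "tea"]]
candidate = ["Mike", "adores", "sipping", "tea"]

# Smoothing avoids a zero score when some higher-order n-grams have no matches
smoothie = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smoothie)

print(f"BLEU score: {score:.3f}")
```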

The output will give you the BLEU score. The closer the candidate is to the reference, the higher the BLEU score.

Evaluation Benchmarks

LLMs are complex, and simple evaluation metrics like ROUGE and BLEU are limited: a generated text can contain all of the reference words but in a different order, or with a changed meaning, and still score well.

To measure and compare LLMs more holistically, we can use pre-existing datasets and the associated benchmarks that LLM researchers have established specifically for this purpose.


Famous Evaluation Benchmarks for LLMs


  • GLUE (General Language Understanding Evaluation) is a collection of natural language understanding tasks, such as sentiment analysis and question-answering. GLUE was created to encourage the development of models that can generalize across multiple tasks, and the benchmark can be used to measure and compare model performance.
  • SuperGLUE was introduced to address limitations of GLUE. It consists of a series of tasks, some of which are not included in GLUE, and some of which are more challenging versions of the same tasks. SuperGLUE includes tasks such as multi-sentence reasoning and reading comprehension.
  • HELM (Holistic Evaluation of Language Models) is designed to increase transparency and provide insight into how models perform on specific tasks. What sets HELM apart is that it looks beyond accuracy metrics such as F1 score and precision; it also considers fairness, bias, and toxicity. These are crucial as language models become more advanced, capable of human-like language generation and, potentially, harmful behavior.
  • MMLU (Massive Multitask Language Understanding) is designed specifically for modern LLMs. To perform well, models must possess extensive world knowledge and problem-solving ability. Models are tested on elementary mathematics, US history, computer science, law, and more.
  • BIG-bench: This benchmark includes more than 200 different tasks covering a wide range of topics such as linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and more.

Benchmarks for evaluating LLMs are essential. They help us assess the quality of outputs of LLMs and also consider fairness, bias, and toxicity metrics.

As AI keeps evolving, these benchmarks will need to evolve too, making sure our LLMs are not only capable but also aligned with ethical standards and real-world applicability.
