A Practical Guide to Benchmarks for LLM Evaluation
Evaluating LLMs is not as straightforward as evaluating traditional machine learning models. In classification tasks, we typically measure performance with metrics like accuracy, precision, and recall, while in regression tasks we use error measures such as MAE, RMSE, or MAPE.
However, LLMs operate differently. Instead of producing simple yes/no answers or numerical values, they generate varied outputs like text, images, and videos. This variety demands more inventive methods to assess the quality of their outputs, ensuring we capture the nuances of their creativity and effectiveness.
With LLMs, the output is non-deterministic, and language-based evaluation is much more challenging.
Consider the sentences "Mike really loves drinking tea." and "Mike adores sipping tea.": they share few words but have the same meaning. By contrast, "Mike does not drink coffee." and "Mike does drink coffee." differ by only one word yet mean the opposite.
When it comes to measuring how well LLMs are performing, there are a few tools in the toolbox, such as ROUGE, BLEU, HELM, GLUE, SuperGLUE, and a few others.
ROUGE Score
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE is a set of metrics that evaluate the quality of summaries by comparing them to a set of reference summaries. This metric is particularly useful for evaluating the effectiveness of text-summarization tasks.
ROUGE scores fall on a scale from 0 to 1, where a score closer to 1 suggests that the LLM's summary is strikingly similar to the human-generated one. There are different variants of ROUGE, such as ROUGE-1, ROUGE-2, and ROUGE-L.
ROUGE-1: This version of ROUGE focuses on the simplest form of comparison. It measures how many individual words (unigrams) from the LLM-generated summary can also be found in the human-generated reference. This is a basic check of word overlap, typically reported as recall, precision, and F1, showing how much of the LLM's output directly matches words from the reference.
ROUGE-2 steps things up from ROUGE-1 by not just looking at individual words but at pairs of words, known as bigrams. This approach is similar to ROUGE-1 in terms of how it's calculated, but ROUGE-2 checks how many two-word phrases from the LLM-generated summary show up in the human-generated reference. By evaluating bigrams, ROUGE-2 addresses some of the limitations of ROUGE-1 related to the sequence or order of words, offering a slightly more nuanced view.
ROUGE-L takes a different approach compared to ROUGE-1 and ROUGE-2. Instead of focusing on unigrams or bigrams, ROUGE-L evaluates the longest common subsequence (LCS) between the LLM-generated summary and the human-generated reference. This method looks for the longest string of words that appear in the same order in both texts, providing a deeper insight into the overall coherence and order of the content in the summaries.
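To make this concrete, here is a quick hand calculation of ROUGE-1 on the tea sentences from earlier. This is a simplified sketch that only counts unique overlapping words; real ROUGE implementations also handle repeated tokens, stemming, and the ROUGE-2 and ROUGE-L variants.

```python
# Simplified ROUGE-1 by hand: unique word overlap between candidate and reference.
reference = "mike really loves drinking tea".split()
candidate = "mike adores sipping tea".split()

overlap = set(reference) & set(candidate)            # {"mike", "tea"}
recall = len(overlap) / len(set(reference))          # 2 / 5 = 0.40
precision = len(overlap) / len(set(candidate))       # 2 / 4 = 0.50
f1 = 2 * precision * recall / (precision + recall)   # about 0.44
print(f"ROUGE-1 recall={recall:.2f}, precision={precision:.2f}, f1={f1:.2f}")
```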
Implementing ROUGE in Python
Calculating the ROUGE score in Python is straightforward: simply compare a generated text against a reference text.
Here’s how you can compute ROUGE scores:
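One option, sketched below, uses the open-source rouge-score package (installed with pip install rouge-score); the sentence pair is the tea example again and is purely illustrative.

```python
# A minimal sketch using the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "Mike really loves drinking tea."
candidate = "Mike adores sipping tea."

# Build a scorer for ROUGE-1, ROUGE-2, and ROUGE-L in one pass.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.3f}, "
          f"recall={result.recall:.3f}, f1={result.fmeasure:.3f}")
```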
The output will give you scores for ROUGE-1, ROUGE-2, and ROUGE-L. Each of these will include a precision, a recall, and an F1 (F-measure) value.
BLEU Score
BLEU stands for Bilingual Evaluation Understudy. BLEU is commonly used to evaluate the quality of machine-generated translations by comparing them to human reference translations.
It measures the precision of the generated text by counting how many n-grams in the generated text overlap with the reference translations.
BLEU Score = brevity penalty × geometric mean of the modified n-gram precisions (typically for n = 1 to 4)
BLEU is computed from precision, so higher BLEU scores (closer to 1) indicate better precision and higher similarity to the reference translations.
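Concretely, the standard formulation is BLEU = BP · exp(Σ wₙ · log pₙ), where pₙ are the modified n-gram precisions, wₙ are uniform weights (0.25 for n = 1 to 4), and BP is a brevity penalty that punishes candidates shorter than the reference. The sketch below combines a set of hypothetical precision values, chosen purely for illustration, into a single score.

```python
import math

def bleu_from_precisions(precisions, cand_len, ref_len):
    """Combine n-gram precisions into a BLEU score (simplified sketch)."""
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    weights = [1.0 / len(precisions)] * len(precisions)
    log_avg = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_avg)

# Hypothetical 1- to 4-gram precisions, for illustration only.
print(bleu_from_precisions([0.75, 0.50, 0.33, 0.25], cand_len=18, ref_len=20))
```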
Implementing BLEU in Python
Here’s how you can compute BLEU scores:
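A minimal sketch using NLTK's sentence-level BLEU (installed with pip install nltk); the tokenized sentences are the tea example again, and smoothing is applied so that missing higher-order n-grams do not force the score to zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# References are a list of tokenized reference translations.
references = [["mike", "really", "loves", "drinking", "tea"]]
candidate = ["mike", "adores", "sipping", "tea"]

# Smoothing avoids a zero score when some n-gram orders have no overlap.
smoothing = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smoothing)
print(f"BLEU score: {score:.3f}")
```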
The output is a single BLEU score; the closer the candidate is to the reference, the higher the BLEU score.
Evaluation Benchmarks
LLMs are complex, and simple evaluation metrics like ROUGE and BLEU are limited; in particular, they struggle when the generated text contains all the right words but in a different order, or conveys the same meaning with different words.
To measure and compare LLMs more holistically, we can use pre-existing datasets and associated benchmarks that LLM researchers have established specifically for this purpose.
Benchmarks for evaluating LLMs are essential. They help us assess the quality of LLM outputs while also considering fairness, bias, and toxicity metrics.
As AI keeps evolving, these benchmarks will need to evolve too, making sure our LLMs are not only smart but also aligned with ethical standards and real-world needs.