A Practical Guide to Benchmarks for LLM Evaluation

Evaluating LLMs is not as straightforward as evaluating traditional machine learning models. In classification tasks, we typically measure performance with metrics like accuracy, precision, and recall, while in regression tasks we use error measures such as MAE, RMSE, or MAPE.

However, LLMs operate differently. Instead of producing simple yes/no answers or numerical values, they generate varied outputs like text, images, and videos. This variety demands more inventive methods to assess the quality of their outputs, ensuring we capture the nuances of their creativity and effectiveness.

With LLMs, the output is non-deterministic and language-based evaluation is much more challenging.

The sentences “Mike really loves drinking tea.” and “Mike adores sipping tea.” have the same meaning even though they share only two words. In contrast, the sentences “Mike does not drink coffee.” and “Mike does drink coffee.” differ by only a single word, yet their meanings are opposite.

Source: Generative AI with Large Language Models (Coursera)

When it comes to measuring how well LLMs are performing, there are a few tools in the toolbox, such as ROUGE, BLEU, GLUE, SuperGLUE, HELM, and a few others.

ROUGE Score

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. ROUGE is a set of metrics that evaluate the quality of summaries by comparing them to a set of reference summaries. This metric is particularly useful for evaluating the effectiveness of text-summarization tasks.

ROUGE scores fall on a scale from 0 to 1, where a score closer to 1 suggests that the LLM's summary closely matches the human-generated one. There are different variants of ROUGE, such as ROUGE-1, ROUGE-2, and ROUGE-L.

ROUGE-1: This version of ROUGE focuses on the simplest form of comparison. It measures how many individual words (unigrams) from the LLM-generated summary can also be found in the human-generated reference. This gives a basic check of overlap: precision reflects how much of the LLM's output matches words from the reference, while recall reflects how much of the reference is covered by the output.


Source: Generative AI with Large Language Models (Coursera)
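As a quick illustration, compare the reference “Mike really loves drinking tea.” with the generated summary “Mike adores sipping tea.” Only two unigrams match (“Mike” and “tea”), so unigram recall is 2/5 = 0.4, precision is 2/4 = 0.5, and the resulting ROUGE-1 F1 score is roughly 0.44.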


ROUGE-2 steps things up from ROUGE-1 by not just looking at individual words but at pairs of words, known as bigrams. This approach is similar to ROUGE-1 in terms of how it's calculated, but ROUGE-2 checks how many two-word phrases from the LLM-generated summary show up in the human-generated reference. By evaluating bigrams, ROUGE-2 addresses some of the limitations of ROUGE-1 related to the sequence or order of words, offering a slightly more nuanced view.


Source: Generative AI with Large Language Models (Coursera)
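Continuing the same example, the reference contains four bigrams (“Mike really”, “really loves”, “loves drinking”, “drinking tea”) and the generated summary contains three, none of which overlap, so ROUGE-2 recall and precision are both 0 here. This shows how much more sensitive bigram matching is to word order and exact phrasing.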


ROUGE-L takes a different approach compared to ROUGE-1 and ROUGE-2. Instead of focusing on unigrams or bigrams, ROUGE-L evaluates the longest common subsequence (LCS) between the LLM-generated summary and the human-generated reference. This method looks for the longest string of words that appear in the same order in both texts, providing a deeper insight into the overall coherence and order of the content in the summaries.


Source: Generative AI with Large Language Models (Coursera)
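For the same pair of sentences, the longest common subsequence is “Mike”, “tea” (length 2), giving an LCS-based recall of 2/5 and precision of 2/4, the same as ROUGE-1 in this particular case because the two matching words already appear in the same order.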


Implementing ROUGE in Python

Calculating the ROUGE score in Python is straightforward: simply compare a generated text against a reference text.

Here’s how you can compute ROUGE scores between a reference and a generated text:
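Below is a minimal sketch, assuming the open-source rouge package is installed (pip install rouge); other libraries such as rouge-score work similarly.

```python
# pip install rouge
from rouge import Rouge

reference = "Mike really loves drinking tea."
generated = "Mike adores sipping tea."

rouge = Rouge()

# get_scores returns a list with one dict per (generated, reference) pair,
# keyed by 'rouge-1', 'rouge-2' and 'rouge-l'
scores = rouge.get_scores(generated, reference)

for metric, values in scores[0].items():
    # each entry holds 'r' (recall), 'p' (precision) and 'f' (F1)
    print(f"{metric}: r={values['r']:.3f}, p={values['p']:.3f}, f={values['f']:.3f}")
```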

The output will give you scores for ROUGE-1, ROUGE-2, and ROUGE-L. Each of these will include:

  • f (F1-score): Harmonic mean of precision and recall.
  • p (Precision): The proportion of words in the generated summary that are also in the reference summary.
  • r (Recall): The proportion of words in the reference summary that are also captured in the generated summary.

BLEU Score

BLEU stands for Bilingual Evaluation Understudy. BLEU is commonly used to evaluate the quality of machine-generated translations by comparing them to human reference translations.

It measures the precision of the generated text by counting how many n-grams in the generated text overlap with the reference translations.

BLEU Score ≈ average precision across a range of n-gram sizes

More precisely, BLEU is the geometric mean of modified n-gram precisions (typically for n = 1 to 4), multiplied by a brevity penalty that penalizes candidates shorter than the reference. Because BLEU is built on precision, higher scores indicate greater similarity to the reference translations.

Implementing BLEU in Python

Here’s how you can compute the BLEU score between a reference and a candidate:
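Below is a minimal sketch, assuming NLTK's sentence_bleu is used; smoothing is applied because the short example sentences share no higher-order n-grams, which would otherwise drive the score to zero.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# NLTK expects a list of tokenised reference translations and a tokenised candidate
reference = [["Mike", "really", "loves", "drinking", "tea"]]
candidate = ["Mike", "adores", "sipping", "tea"]

# Smoothing avoids a zero score when some higher-order n-grams have no matches
smoothie = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smoothie)

print(f"BLEU score: {score:.3f}")
```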

The output will give you the BLEU score. The closer the candidate is to the reference, the higher the BLEU score.

Evaluation Benchmarks

LLMs are complex, and simple evaluation metrics like ROUGE and BLEU are limited: a generated text can contain all of the reference words but in a different order, or with a changed meaning, and still score well.

To measure and compare LLMs more holistically, we can use pre-existing datasets and the associated benchmarks that LLM researchers have established specifically for this purpose.


Famous Evaluation Benchmarks for LLMs


  • GLUE (General Language Understanding Evaluation) is a collection of natural language understanding tasks, such as sentiment analysis and question-answering. GLUE was created to encourage the development of models that can generalize across multiple tasks, and the benchmark can be used to measure and compare model performance.
  • SuperGLUE was introduced to address limitations of GLUE. It consists of a series of tasks, some of which are not included in GLUE, and some of which are more challenging versions of the same tasks. SuperGLUE includes tasks such as multi-sentence reasoning and reading comprehension.
  • HELM (Holistic Evaluation of Language Models) is designed to increase transparency and provide insight into how models perform on specific tasks. What sets HELM apart is that it looks beyond accuracy metrics such as F1 score and precision; it also considers fairness, bias, and toxicity. These are crucial as language models become more advanced, capable of human-like language generation and, potentially, harmful behavior.
  • MMLU (Massive Multitask Language Understanding) is designed specifically for modern LLMs. To perform well, models must possess extensive world knowledge and problem-solving ability. Models are tested on elementary mathematics, US history, computer science, law, and more.
  • BIG-bench: This benchmark includes more than 200 different tasks covering a wide range of topics such as linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and more.

Benchmarks for evaluating LLMs are essential. They help us assess the quality of outputs of LLMs and also consider fairness, bias, and toxicity metrics.

As AI keeps evolving, these benchmarks will need to evolve too, making sure our LLMs are not only capable but also aligned with ethical standards and real-world applicability.
