Understanding RAG Evaluation Algorithms

Retrieval-Augmented Generation (RAG) is a powerful approach for improving text generation tasks by integrating external knowledge retrieval with natural language generation models. However, evaluating the accuracy and relevance of RAG systems presents unique challenges. RAG evaluation algorithms are used to measure how well the generated text matches the ground truth or reference text.

RAG evaluation algorithms can be broadly classified into two categories based on how the ground truth is obtained:

  1. Where the ground truth is provided by the evaluator or user.
  2. Where the ground truth is generated by another Large Language Model (LLM).

These categories are further divided into subcategories that evaluate text on different levels—characters, words, embeddings, and other methods. Let's explore each of these in detail, along with simple examples.


Break Down the RAG Components

A RAG system involves two primary components:

  • Retrieval: Retrieves relevant documents or knowledge snippets from a knowledge base.
  • Generation: Generates responses based on the retrieved information using a language model.

The evaluation of a RAG system needs to assess both components—retrieval accuracy and the quality of the generated response.

Define the Ground Truth

First, identify the ground truth data against which the retrieval and generation components will be evaluated. You can define the ground truth in two ways:

  • Manually provided ground truth: Domain experts or users provide the ideal response for a given query.
  • Generated by another LLM: In some cases, a secondary language model can provide reference answers.

Choose Appropriate Metrics

Based on the earlier classification, you need to decide which evaluation metric applies to each part of the system.

For Retrieval Evaluation

The focus is on how well the retrieval system fetches relevant documents. Common metrics include:

  • Precision@k: Measures how many of the top-k retrieved documents are relevant.
  • Recall: Measures how well the system retrieves all the relevant documents from the knowledge base.
  • F1 Score: A harmonic mean of precision and recall, useful for balancing both.
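
To make these concrete, here is a minimal Python sketch of the three metrics; the document IDs and relevance judgments are invented purely for illustration:

```python
# Toy retrieval-evaluation sketch: the "relevant" set would normally come
# from human relevance judgments or an evaluation dataset.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    return sum(1 for doc in retrieved if doc in relevant) / len(relevant)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 0.0 if (p + r) == 0 else 2 * p * r / (p + r)

retrieved = ["doc3", "doc7", "doc1", "doc9"]   # ranked results from the retriever
relevant = {"doc1", "doc3", "doc5"}            # ground-truth relevant documents

p = precision_at_k(retrieved, relevant, k=3)   # 2 of the top 3 are relevant -> 0.67
r = recall(retrieved, relevant)                # 2 of the 3 relevant docs found -> 0.67
print(p, r, f1_score(p, r))
```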

For Generation Evaluation

Once the retrieval is complete, the generation component produces text based on the retrieved documents. You can evaluate the generated responses using:

  • Character-based metrics: Measure the difference between the generated text and the ground truth at the character level (e.g., Edit Distance).
  • Word-based metrics: Metrics like BLEU, ROUGE, or METEOR evaluate the overlap between generated text and ground truth at the word level.
  • Embedding-based metrics: Semantic similarity between the generated and ground truth texts can be computed using embeddings and metrics like cosine similarity.

Category 1: Evaluations with Ground Truth Provided by the Evaluator

In this approach, the evaluator provides the ideal answer or ground truth, and the RAG system output is compared against it. This is the traditional evaluation method for text generation models. There are three subcategories:

1.1 Character-Based Evaluation

This method compares the output at the most granular level: characters. It calculates how many characters in the RAG-generated output match with the ground truth and penalizes differences.

  • Example: Ground truth: "Hello World" RAG Output: "Helo Wrld" Character-based score: The difference lies in missing letters ('l' and 'o' are missing), and the score reflects this character-level mismatch. Metric Used: Edit Distance.

1.2 Word-Based Evaluation

This method works at the word level. It compares the words in the ground truth and the RAG output, counting the number of correct words and penalizing incorrect or missing words.

  • Example: Ground truth: "The cat is on the mat. "RAG Output: "The cat is on mat. "Word-based score: The output misses "the" before "mat," resulting in a slightly lower score compared to the ground truth. Metrics Used: METEOR, WER (Word Error Rate), BLEU, ROGUE.

1.3 Embedding-Based Evaluation

Embedding-based methods focus on the semantic meaning of the text rather than character or word-level differences. Both the ground truth and generated text are converted into vector representations (embeddings), and the similarity between these vectors is calculated using measures like cosine similarity.

  • Example: Ground truth: "The weather is nice today. "RAG Output: "It's a pleasant day." Embedding-based score: While the words differ, both sentences have a similar meaning. Embedding-based evaluation will recognize the semantic similarity and give a high score. Metrics Used: BERT Score, Mover Score.


Category 2: Evaluations with Ground Truth Generated by LLMs

In this approach, another LLM generates the ground truth, which is compared against the RAG system’s output. This method is particularly useful when human-generated ground truths are unavailable.

2.1 Mathematical Framework (RAGAS Score)

The RAGAS framework is a mathematical method that evaluates the retrieval and generation aspects of a RAG system separately. It scores retrieval with measures such as context precision and context recall, and generation with measures such as faithfulness and answer relevancy, to quantify how accurately the RAG system retrieves relevant information and how closely the generated text stays grounded in the ideal output.

  • Example: Ground truth retrieval: The system retrieves relevant information about "climate change." Generated text: A summary is created from the retrieved information. RAGAS Score: Based on how accurate the retrieval is and how relevant the generated summary is.
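
The exact prompts and aggregation are handled by the RAGAS library in practice; the following conceptual sketch only illustrates how retrieval and generation are scored separately from statement-level judgments (here hard-coded, but normally produced by an LLM judge):

```python
# Conceptual RAGAS-style component scores. This is NOT the ragas library API;
# it only shows the idea behind scoring retrieval and generation separately.

def context_recall(gt_statements_supported: int, gt_statements_total: int) -> float:
    """Share of ground-truth statements covered by the retrieved context."""
    return gt_statements_supported / gt_statements_total

def faithfulness(answer_claims_supported: int, answer_claims_total: int) -> float:
    """Share of claims in the generated answer grounded in the retrieved context."""
    return answer_claims_supported / answer_claims_total

# Example: for a "climate change" query, an LLM judge finds that 4 of 5
# ground-truth statements appear in the retrieved passages, and 6 of 7 claims
# in the generated summary are supported by those passages.
retrieval_score = context_recall(4, 5)    # 0.80 -> retrieval quality
generation_score = faithfulness(6, 7)     # ~0.86 -> generation quality
print(retrieval_score, generation_score)
```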

2.2 Experimental-Based Framework (GPT Score)

In this framework, an LLM evaluates the effectiveness of the RAG-generated text across various tasks. The GPT score can assess the generated text on multiple evaluation aspects like fluency, coherence, and factual correctness.

  • Example: Task: Generate a report on "renewable energy trends." RAG Output: A report generated using relevant data sources. GPT Score: The output is evaluated based on fluency, coherence, and alignment with the original input task.
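
A hedged sketch of this LLM-as-judge pattern using the OpenAI client; the model name, rubric, and prompt wording are illustrative assumptions rather than an official GPT Score implementation:

```python
# LLM-as-judge sketch: ask a model to rate a generated report on a rubric.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an evaluator. Rate the candidate report on a 1-5 scale
for each of: fluency, coherence, and alignment with the task.
Task: Generate a report on "renewable energy trends."
Candidate report:
{report}
Return the scores as `fluency=?, coherence=?, alignment=?`."""

def gpt_style_score(report: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,  # illustrative model choice
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(report=report)}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content

print(gpt_style_score("Solar and wind capacity grew rapidly in 2023 ..."))
```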


Conclusion

RAG evaluation algorithms offer various ways to assess the quality of retrieval and generation tasks. Whether it’s through character-level differences, word-by-word comparison, or embedding-based methods, the key is to ensure that the generated output is accurate and semantically meaningful. Additionally, frameworks like RAGAS and GPT Score provide more sophisticated methods to evaluate the performance of RAG systems, especially in the absence of human-generated ground truths.

By understanding these evaluation methods, AI/ML engineers can better fine-tune their RAG models and improve the quality of their text generation tasks.

