Understanding RAG Evaluation: A Practical Approach to Retrieval Metrics

Retrieval-Augmented Generation (RAG) systems are gaining popularity, helping users find relevant documents to answer their queries. In this article we will explore how to measure the performance of a RAG-based retrieval system against a benchmarking/ground-truth set. Let’s take a use case and imagine building a RAG system for Hollywood movie fans: a tool that helps users find a movie based on their mood, genre preferences, or favorite actors. Queries like “A 90s action movie” or “80s time travel movie” can be answered by retrieving the most relevant movie descriptions from a large movie database.

To evaluate the effectiveness of this retriever, we need a ground truth: a set of queries paired with the ideal (expected) movie results. For example, the query “Sci-fi movie with Tom from the 2000s” should ideally return movies like Minority Report and War of the Worlds. This ground truth acts as a benchmark, allowing us to measure how well our retriever ranks the correct movies when compared to these expected results. By setting up this ground truth early, we create the foundation for applying retrieval evaluation metrics.

We will use the LlamaIndex RetrieverEvaluator to generate evaluation metrics. These metrics can be used to measure the effectiveness of the retrieval system. By measuring them, developers can improve the user experience by fine-tuning the search results to match what users actually want.

Now let’s go through an example scenario: evaluating a RAG-based pipeline using Hollywood movies as the use case. The goal is to measure the effectiveness of a retriever in fetching relevant movies for user queries using LlamaIndex’s RetrieverEvaluator. We first look at the movie dataset that will be used to set up the RAG-based retrieval system, then at the ground truth set, where each query and its expected response are recorded so we can evaluate the accuracy of the retrieval and compute evaluation metrics for each query. Note that consolidating these metrics can be a good indicator of the overall performance of the retrieval system.

[Movie Data Set]
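The dataset itself is not reproduced here; as a rough illustration, assume a simple table of movie titles and descriptions (the file name, column names, and rows below are hypothetical):

```python
# Hypothetical shape of the movie dataset (file name, columns, and rows are assumptions)
import pandas as pd

movies_df = pd.DataFrame(
    [
        {"title": "Ad Astra", "description": "An astronaut travels to the edge of the solar system in search of his missing father."},
        {"title": "Bloodshot", "description": "A slain soldier is revived with nanotechnology and turned into a superhuman weapon."},
    ]
)
movies_df.to_csv("movies.csv", index=False)  # the real dataset in the article has ~1,000 such rows
```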

[Ground Truth]

Note – The ground-truth queries use longer, natural-language text because a smaller LLM is used to set up the retrieval.
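As a minimal sketch, each ground-truth record pairs a query with the expected movie title(s); the queries and titles below are illustrative, not the actual 200-record set:

```python
# Illustrative ground-truth records: each natural-language query maps to the expected movie title(s)
ground_truth = [
    {
        "query": "I am looking for a recent movie where an astronaut travels far into space to find his lost father",
        "expected_titles": ["Ad Astra"],
    },
    {
        "query": "A movie about a soldier who is brought back to life with nanotechnology and gains superhuman strength",
        "expected_titles": ["Bloodshot"],
    },
]
```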

[Building RAG]

[1] Import/Install required dependencies
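A minimal setup sketch, assuming a recent llama-index release with the separate HuggingFace embeddings package (package names can differ across versions):

```python
# Install (in Colab/Jupyter); package names assume llama-index >= 0.10 style packaging
# !pip install llama-index llama-index-embeddings-huggingface pandas nest_asyncio

import pandas as pd
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
```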

[2] Load Movie dataset and configure embedding

Note – Ensure that Google Drive is mounted where your files are kept. Set the LLM to None to remove the dependency on OpenAI (this will use a mock LLM).
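A sketch of this step, assuming the movie CSV sits on a mounted Google Drive; the file path and the embedding model name are assumptions:

```python
# Mount Google Drive (Colab) and load the movie dataset
from google.colab import drive
drive.mount("/content/drive")

movies_df = pd.read_csv("/content/drive/MyDrive/movies.csv")  # hypothetical path

# Use a local HuggingFace embedding model and disable the LLM (LlamaIndex falls back to a MockLLM),
# so no OpenAI API key is required for retrieval
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = None
```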

[3] Prepare documents and build the index and retriever.
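A minimal sketch, assuming each dataset row becomes one Document whose text starts with the movie title so the title can be recovered from retrieved content later:

```python
# One Document per movie; keep the title as the first line of the text
documents = [
    Document(text=f"{row.title}\n{row.description}")
    for row in movies_df.itertuples(index=False)
]

# Build an in-memory vector index and a top-5 retriever
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=5)
```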

[4] Test the RAG based retriever with a query
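A quick smoke test of the retriever (the query text is illustrative):

```python
# Sanity-check the retriever with a single query and print score + title for each hit
results = retriever.retrieve("A recent movie about an astronaut searching for his father in space")
for node_with_score in results:
    title = node_with_score.node.get_content().split("\n")[0]
    print(f"{node_with_score.score:.3f}  {title}")
```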

[Evaluation Metrics Generation]

[1] Initialize the LlamaIndex RetrieverEvaluator to measure how well the retriever ranks relevant movies using metrics like MRR, Hit Rate, Precision, and NDCG.

Note – You may have to apply nest_asyncio, which allows running asynchronous code inside Jupyter notebooks without errors related to nested event loops.
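A sketch of the evaluator setup; the metric names assume a llama-index version where precision and NDCG are available alongside MRR and hit rate:

```python
import nest_asyncio
nest_asyncio.apply()  # evaluate() drives an event loop internally, which clashes with the notebook's own loop

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate", "precision", "ndcg"],
    retriever=retriever,
)
```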

[2] Load the ground truth from the dataset
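Assuming the ground truth is stored as a CSV with a query column and a semicolon-separated list of expected titles (the file name and column names are hypothetical):

```python
# Load the 200-record ground-truth set (file name and column names are assumptions)
gt_df = pd.read_csv("/content/drive/MyDrive/ground_truth.csv")
ground_truth = [
    {"query": row.query, "expected_titles": row.expected_titles.split(";")}
    for row in gt_df.itertuples(index=False)
]
```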

[3] Create a mapping (title_to_node_id) where movie titles are extracted from document content and mapped to their corresponding document node IDs in the index.

Note – This ensures that retrieved results and expected results are compared using document IDs instead of raw text, which prevents mismatches during evaluation.
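A sketch of the mapping, relying on the earlier convention that the title is the first line of each node's text:

```python
# Map each movie title to the node ID stored in the index's docstore
title_to_node_id = {}
for node_id, node in index.docstore.docs.items():
    title = node.get_content().split("\n")[0].strip()
    title_to_node_id[title] = node_id
```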

[4] Extract queries from the ground truth dataset and convert the expected movie titles into their corresponding document IDs using the above mapping.
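For example, under the assumptions above:

```python
# Convert each ground-truth record into a (query, expected node IDs) pair
eval_cases = []
for record in ground_truth:
    expected_ids = [
        title_to_node_id[title]
        for title in record["expected_titles"]
        if title in title_to_node_id  # skip titles that are not present in the index
    ]
    eval_cases.append((record["query"], expected_ids))
```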

[5] Iterate through each query, retrieve the relevant documents, and extract the movie titles from the retrieved content. Then evaluate the retriever's performance by comparing the retrieved document IDs with the expected ones (a combined sketch for steps [5] and [6] follows step [6]).

[6] Print/store the expected movie title, retrieved movie titles, and evaluation metrics (MRR, Hit Rate, Precision, NDCG) for analysis.
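A combined sketch for steps [5] and [6]; it assumes the RetrievalEvalResult returned by evaluate() exposes retrieved_ids and a metric_vals_dict of per-query metric values, as in recent llama-index versions:

```python
# Evaluate every query and collect per-query metrics
all_metrics = []
for query, expected_ids in eval_cases:
    eval_result = retriever_evaluator.evaluate(query=query, expected_ids=expected_ids)

    # Recover the retrieved movie titles from the docstore for readable output
    retrieved_titles = [
        index.docstore.docs[node_id].get_content().split("\n")[0]
        for node_id in eval_result.retrieved_ids
    ]
    print(f"Query:     {query}")
    print(f"Expected:  {expected_ids}")
    print(f"Retrieved: {retrieved_titles}")
    print(f"Metrics:   {eval_result.metric_vals_dict}\n")

    all_metrics.append(eval_result.metric_vals_dict)

# Consolidated view: average each metric across all queries
print(pd.DataFrame(all_metrics).mean())
```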

[Successful Hit Results]

Note the results for the query whose expected result is the movie Ad Astra:

  • MRR – Ad Astra is at the 2nd position, so MRR = 1/2 = 0.5.

  • Hit Rate – At least one relevant movie (Ad Astra) is found, so Hit Rate = 1.0.

  • Precision – 1 out of 5 retrieved movies is relevant, so Precision = 1/5 = 0.2.

  • NDCG (0.63) – With the only relevant movie at position 2, NDCG = 1/log2(3) ≈ 0.63; a higher NDCG means relevant results are ranked near the top.

Note the metrics for the query whose expected result is the movie Bloodshot (a short snippet reproducing this arithmetic follows the list):

  • MRR – Bloodshot is ranked 1st, so MRR = 1.0 (the best possible score).

  • Hit Rate – A relevant movie is retrieved, so Hit Rate = 1.0.

  • Precision – Again, 1 out of 5 retrieved movies is relevant, so Precision = 0.2.

  • NDCG – Since Bloodshot is at position 1, NDCG = 1.0 (perfect ranking).
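The arithmetic behind these numbers can be reproduced with a few lines, assuming binary relevance, a single expected movie, and top-5 retrieval:

```python
import math

def metrics_for_single_relevant(rank: int, k: int = 5) -> dict:
    """Metrics when exactly one expected movie appears at the given 1-based rank within the top k."""
    mrr = 1.0 / rank
    hit_rate = 1.0                      # at least one relevant result was retrieved
    precision = 1.0 / k                 # one relevant result out of k retrieved
    ndcg = 1.0 / math.log2(rank + 1)    # DCG; the ideal DCG for a single relevant item is 1/log2(2) = 1
    return {"mrr": mrr, "hit_rate": hit_rate, "precision": precision, "ndcg": round(ndcg, 2)}

print(metrics_for_single_relevant(rank=2))  # Ad Astra at position 2 -> MRR 0.5, NDCG ~0.63
print(metrics_for_single_relevant(rank=1))  # Bloodshot at position 1 -> MRR 1.0, NDCG 1.0
```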

[Unsuccessful Hit Results]

All metrics are zero, as the returned results do not contain the expected result.

[Summary]

We explored how to evaluate a RAG-based implementation for a movie search system. Our goal was to measure the effectiveness of a retriever in fetching relevant movie titles based on user queries using LlamaIndex’s RetrieverEvaluator.

We started with a movie dataset containing titles and descriptions for 1,000 movies for RAG-based retrieval. To ensure a reliable evaluation, we created a ground truth dataset of 200 records, mapping sample queries to the expected movie results. For instance, a query like “movies about space exploration” should ideally return films like Interstellar or Ad Astra. These ground truths acted as a benchmark set to evaluate retrieval accuracy.

Then we built a RAG pipeline using vector embeddings and document indexing, allowing the system to fetch similar movies (top 5) based on query embeddings. By running real test cases against the ground truth set, we analyzed both successful and unsuccessful retrievals. These results help the development team see how accurately the RAG-based system returns results; if the consolidated metrics are on the lower side, the retrieval system needs fine-tuning. This structured approach can help in testing and optimizing RAG-based search systems.
