Understanding RAG Evaluation: A Practical Approach to Retrieval Metrics

Retrieval-Augmented Generation (RAG) systems are gaining popularity, helping users find relevant documents to answer their queries. In this article we will explore how to measure the performance of a RAG-based retrieval system against a benchmarking/ground-truth set. Let’s take a use case and imagine building a RAG system for Hollywood movie fans: a tool that helps users find a movie based on their mood, genre preferences, or favorite actors. Queries like “A 90s action movie” or “80s time travel movie” can be answered by retrieving the most relevant movie descriptions from a large movie database.

To evaluate the effectiveness of this retriever, we need a ground truth: a set of queries paired with the ideal (expected) movie results. For example, the query “Sci-fi movie with Tom from the 2000s” should ideally return movies like Minority Report and War of the Worlds. This ground truth acts as a benchmark, allowing us to measure how well our retriever ranks the correct movies when compared to these expected results. By setting up this ground truth early, we create the foundation for applying retrieval evaluation metrics.

We will use the LlamaIndex RetrieverEvaluator to generate evaluation metrics. These metrics can be used to measure the effectiveness of the retrieval system. By measuring them, developers can improve the user experience by fine-tuning the search results to match what users actually want.

Now let’s go through an example scenario: evaluating a RAG-based pipeline using Hollywood movies as the use case. The goal is to measure the effectiveness of a retriever in fetching relevant movies for user queries using LlamaIndex’s RetrieverEvaluator. We first look at the movie dataset that will be used to set up the RAG-based retrieval system, then at the ground truth set, where each query and its expected response are recorded so we can evaluate the accuracy of the retrieval and compute evaluation metrics for each query. Note that consolidating these metrics can be a good indicator of the overall performance of the retrieval system.

[Movie Data Set]
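The dataset itself is not reproduced here; as a rough illustration, assume a simple table of movie titles and descriptions (the file name, column names, and rows below are hypothetical):

```python
# Hypothetical shape of the movie dataset (file name, columns, and rows are assumptions)
import pandas as pd

movies_df = pd.DataFrame(
    [
        {"title": "Ad Astra", "description": "An astronaut travels to the edge of the solar system in search of his missing father."},
        {"title": "Bloodshot", "description": "A slain soldier is revived with nanotechnology and turned into a superhuman weapon."},
    ]
)
movies_df.to_csv("movies.csv", index=False)  # the real dataset in the article has ~1,000 such rows
```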

[Ground Truth]

Note – The ground-truth queries use longer, natural-language text because a smaller LLM is used to set up the retrieval.
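As a minimal sketch, each ground-truth record pairs a query with the expected movie title(s); the queries and titles below are illustrative, not the actual 200-record set:

```python
# Illustrative ground-truth records: each natural-language query maps to the expected movie title(s)
ground_truth = [
    {
        "query": "I am looking for a recent movie where an astronaut travels far into space to find his lost father",
        "expected_titles": ["Ad Astra"],
    },
    {
        "query": "A movie about a soldier who is brought back to life with nanotechnology and gains superhuman strength",
        "expected_titles": ["Bloodshot"],
    },
]
```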

[Building RAG]

[1] Import/Install required dependencies
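A minimal setup sketch, assuming a recent llama-index release with the separate HuggingFace embeddings package (package names can differ across versions):

```python
# Install (in Colab/Jupyter); package names assume llama-index >= 0.10 style packaging
# !pip install llama-index llama-index-embeddings-huggingface pandas nest_asyncio

import pandas as pd
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
```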

[2] Load Movie dataset and configure embedding

Note – Ensure that Google Drive is mounted where your files are kept. Set the LLM to None to remove the dependency on OpenAI (this will use a mock LLM).
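A sketch of this step, assuming the movie CSV sits on a mounted Google Drive; the file path and the embedding model name are assumptions:

```python
# Mount Google Drive (Colab) and load the movie dataset
from google.colab import drive
drive.mount("/content/drive")

movies_df = pd.read_csv("/content/drive/MyDrive/movies.csv")  # hypothetical path

# Use a local HuggingFace embedding model and disable the LLM (LlamaIndex falls back to a MockLLM),
# so no OpenAI API key is required for retrieval
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = None
```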

[3] Prepare documents and build the index and retriever.
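A minimal sketch, assuming each dataset row becomes one Document whose text starts with the movie title so the title can be recovered from retrieved content later:

```python
# One Document per movie; keep the title as the first line of the text
documents = [
    Document(text=f"{row.title}\n{row.description}")
    for row in movies_df.itertuples(index=False)
]

# Build an in-memory vector index and a top-5 retriever
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=5)
```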

[4] Test the RAG based retriever with a query
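A quick smoke test of the retriever (the query text is illustrative):

```python
# Sanity-check the retriever with a single query and print score + title for each hit
results = retriever.retrieve("A recent movie about an astronaut searching for his father in space")
for node_with_score in results:
    title = node_with_score.node.get_content().split("\n")[0]
    print(f"{node_with_score.score:.3f}  {title}")
```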

[Evaluation Metrics Generation]

[1] Initialize the LlamaIndex RetrieverEvaluator to measure how well the retriever ranks relevant movies using metrics like MRR, Hit Rate, Precision, and NDCG.

Note – You may have to apply nest_asyncio, which allows running asynchronous code inside Jupyter notebooks without errors related to nested event loops.
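A sketch of the evaluator setup; the metric names assume a llama-index version where precision and NDCG are available alongside MRR and hit rate:

```python
import nest_asyncio
nest_asyncio.apply()  # evaluate() drives an event loop internally, which clashes with the notebook's own loop

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate", "precision", "ndcg"],
    retriever=retriever,
)
```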

[2] Load the ground truth from the dataset
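Assuming the ground truth is stored as a CSV with a query column and a semicolon-separated list of expected titles (the file name and column names are hypothetical):

```python
# Load the 200-record ground-truth set (file name and column names are assumptions)
gt_df = pd.read_csv("/content/drive/MyDrive/ground_truth.csv")
ground_truth = [
    {"query": row.query, "expected_titles": row.expected_titles.split(";")}
    for row in gt_df.itertuples(index=False)
]
```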

[3] Create a mapping (title_to_node_id) where movie titles are extracted from document content and mapped to their corresponding document node IDs in the index.

Note – This ensures that retrieved results and expected results are compared using document IDs instead of raw text, which prevents mismatches during evaluation.
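A sketch of the mapping, relying on the earlier convention that the title is the first line of each node's text:

```python
# Map each movie title to the node ID stored in the index's docstore
title_to_node_id = {}
for node_id, node in index.docstore.docs.items():
    title = node.get_content().split("\n")[0].strip()
    title_to_node_id[title] = node_id
```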

[4] Extract queries from the ground truth dataset and convert the expected movie titles into their corresponding document IDs using the above mapping.
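For example, under the assumptions above:

```python
# Convert each ground-truth record into a (query, expected node IDs) pair
eval_cases = []
for record in ground_truth:
    expected_ids = [
        title_to_node_id[title]
        for title in record["expected_titles"]
        if title in title_to_node_id  # skip titles that are not present in the index
    ]
    eval_cases.append((record["query"], expected_ids))
```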

[5] Iterate through each query, retrieve the relevant documents, and extract the movie titles from the retrieved content. Then evaluate the retriever's performance by comparing the retrieved document IDs with the expected ones (a combined sketch for steps [5] and [6] follows step [6]).

[6] Print/store the expected movie title, retrieved movie titles, and evaluation metrics (MRR, Hit Rate, Precision, NDCG) for analysis.
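A combined sketch for steps [5] and [6]; it assumes the RetrievalEvalResult returned by evaluate() exposes retrieved_ids and a metric_vals_dict of per-query metric values, as in recent llama-index versions:

```python
# Evaluate every query and collect per-query metrics
all_metrics = []
for query, expected_ids in eval_cases:
    eval_result = retriever_evaluator.evaluate(query=query, expected_ids=expected_ids)

    # Recover the retrieved movie titles from the docstore for readable output
    retrieved_titles = [
        index.docstore.docs[node_id].get_content().split("\n")[0]
        for node_id in eval_result.retrieved_ids
    ]
    print(f"Query:     {query}")
    print(f"Expected:  {expected_ids}")
    print(f"Retrieved: {retrieved_titles}")
    print(f"Metrics:   {eval_result.metric_vals_dict}\n")

    all_metrics.append(eval_result.metric_vals_dict)

# Consolidated view: average each metric across all queries
print(pd.DataFrame(all_metrics).mean())
```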

[Successful Hit Results]

Note the results for the query whose expected result is the movie Ad Astra:

  • MRR – Ad Astra is at the 2nd position, so MRR = 1/2 = 0.5.

  • Hit Rate – At least one relevant movie (Ad Astra) is found, so Hit Rate = 1.0.

  • Precision – 1 out of 5 retrieved movies is relevant, so Precision = 1/5 = 0.2.

  • NDCG (0.63) – With the only relevant movie at position 2, NDCG = 1/log2(3) ≈ 0.63; a higher NDCG means relevant results are ranked near the top.

Note the metrics for the query whose expected result is the movie Bloodshot (a short snippet reproducing this arithmetic follows the list):

  • MRR – Bloodshot is ranked 1st, so MRR = 1.0 (the best possible score).

  • Hit Rate – A relevant movie is retrieved, so Hit Rate = 1.0.

  • Precision – Again, 1 out of 5 retrieved movies is relevant, so Precision = 0.2.

  • NDCG – Since Bloodshot is at position 1, NDCG = 1.0 (perfect ranking).
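The arithmetic behind these numbers can be reproduced with a few lines, assuming binary relevance, a single expected movie, and top-5 retrieval:

```python
import math

def metrics_for_single_relevant(rank: int, k: int = 5) -> dict:
    """Metrics when exactly one expected movie appears at the given 1-based rank within the top k."""
    mrr = 1.0 / rank
    hit_rate = 1.0                      # at least one relevant result was retrieved
    precision = 1.0 / k                 # one relevant result out of k retrieved
    ndcg = 1.0 / math.log2(rank + 1)    # DCG; the ideal DCG for a single relevant item is 1/log2(2) = 1
    return {"mrr": mrr, "hit_rate": hit_rate, "precision": precision, "ndcg": round(ndcg, 2)}

print(metrics_for_single_relevant(rank=2))  # Ad Astra at position 2 -> MRR 0.5, NDCG ~0.63
print(metrics_for_single_relevant(rank=1))  # Bloodshot at position 1 -> MRR 1.0, NDCG 1.0
```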

[Unsuccessful Hit Results]

All metrics are zero, as the returned results do not contain the expected result.

[Summary]

We explored how to evaluate a RAG-based implementation for a movie search system. Our goal was to measure the effectiveness of a retriever in fetching relevant movie titles based on user queries using LlamaIndex’s RetrieverEvaluator.

We started with a movie dataset containing titles and descriptions for 1,000 movies for RAG-based retrieval. To ensure a reliable evaluation, we created a ground truth dataset of 200 records, mapping sample queries to the expected movie results. For instance, a query like “movies about space exploration” should ideally return films like Interstellar or Ad Astra. These ground truths acted as a benchmark set to evaluate retrieval accuracy.

Then we built a RAG pipeline using vector embeddings and document indexing, allowing the system to fetch similar movies (top 5) based on query embeddings. By running real test cases against the ground truth set, we analyzed both successful and unsuccessful retrievals. These results help the development team see how accurately the RAG-based system returns results; if the consolidated metrics are on the lower side, the retrieval system needs fine-tuning. This structured approach can help in testing and optimizing RAG-based search systems.
