7 Retrieval Metrics for Better RAG Systems
Context is Key: Retrieval Augmented Generation
Large Language Models, or LLMs, are a generative AI technology that has gained tremendous popularity in the last two years. However, when it comes to using LLMs in real-world scenarios, we still grapple with their knowledge limitations and hallucinations. Retrieval Augmented Generation, or RAG, addresses these issues by providing the LLM with additional memory and context. In 2024, it emerged as one of the most popular techniques in the applied generative AI world. In fact, it is hard to find an LLM-powered application that doesn't use RAG in one way or another.
RAG Evaluation: Beyond Naivety
For RAG to live up to the promise of grounding LLM responses in data, we need to go beyond a simple implementation of indexing, retrieval, augmentation and generation. But to improve something, we first need to measure its performance. RAG evaluation helps establish a baseline of your RAG system's performance, which you can then work to improve.
Evaluation of RAG Pipelines for more reliable LLM applications
Building a PoC RAG pipeline is not overly complex. It is achievable with brief training and verification on a limited set of examples. However, to make it robust, thorough testing on a dataset that accurately mirrors the production use case is imperative. RAG pipelines can suffer from hallucinations of their own. At a high level, there are three points of failure for RAG systems.
In this article, we will look at evaluation metrics that address the first point of failure: "The retriever fails to retrieve the entire context or retrieves irrelevant context". In other words, metrics that evaluate the quality of the retriever.
Retrieval Metrics
Evaluation metrics for the assessment of RAG systems can be grouped into three broad categories.
The retrieval component of RAG can be evaluated independently to determine how well the retriever satisfies the user query. Let us look at seven popular metrics that are used not only in RAG, but also in information retrieval tasks like search engines and recommendation systems.
Knowledge Base: The concept of a Knowledge Base is important in RAG. It is the non-parametric memory that stores all the documents that the RAG system works with.
1. Accuracy
Accuracy is typically defined as the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined. If you are familiar with classification problems in the supervised machine learning space, you may already know this metric in that form. In the context of retrieval and RAG, it is interpreted as follows: a true positive is a relevant document that is retrieved, a true negative is an irrelevant document that is not retrieved, and accuracy is the proportion of documents in the knowledge base that the retriever handles correctly.
Even though accuracy is a simple, intuitive metric, it is not the primary metric for retrieval. In a large knowledge base, the majority of documents are usually irrelevant to any given query, which can lead to misleadingly high accuracy scores. It also does not consider the ranking of the retrieved results.
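To make this concrete, here is a minimal Python sketch of accuracy applied to retrieval; the function name, document IDs and numbers are purely illustrative, not from any particular library. It also shows how a large pool of true negatives can inflate the score:

```python
def retrieval_accuracy(retrieved_ids, relevant_ids, total_docs):
    """Accuracy = (true positives + true negatives) / total documents.

    A true positive is a relevant document that was retrieved;
    a true negative is an irrelevant document that was not retrieved.
    """
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    true_positives = len(retrieved & relevant)
    false_positives = len(retrieved - relevant)
    false_negatives = len(relevant - retrieved)
    true_negatives = total_docs - true_positives - false_positives - false_negatives
    return (true_positives + true_negatives) / total_docs

# With 1,000 documents in the knowledge base, 5 retrieved (4 of them relevant)
# and 6 relevant documents overall, accuracy is dominated by true negatives.
print(retrieval_accuracy(["d1", "d2", "d3", "d4", "d5"],
                         ["d1", "d2", "d3", "d4", "d9", "d10"], 1000))  # 0.997
```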
2. Precision
Precision focusses on the quality of the retrieved results. It measures the proportion of retrieved documents that are relevant to the user query. It answers the question, “Of all the documents that were retrieved, how many were actually relevant?”
A higher precision will mean that the retriever is performing well and retrieving mostly relevant documents.
Note: Precision is also a popular metric in classification tasks, where it is defined as the proportion of actual positive cases among those the model predicted to be positive, i.e. True Positives / (True Positives + False Positives).
Precision@k: Precision@k is a variation of precision that measures the proportion of relevant documents amongst the top 'k' retrieved results. It is particularly important because it focusses on the top results rather than all the retrieved documents. For RAG this matters because typically only the top results are used for augmentation. For example, if our RAG system considers the top 5 documents for augmentation, then Precision@5 becomes important.
So, a precision@5 of 0.8 or 4/5 will mean that out of the top 5 results, 4 are relevant.
Precision@k is also useful to compare systems when the total number of results retrieved may be different in different systems. However, the limitation is that the choice of ‘k’ can be arbitrary, and this metric doesn’t look beyond the chosen ‘k’.
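As an illustration, here is a small Python sketch of precision and Precision@k, assuming the retriever returns an ordered list of document IDs and relevance judgements are available as a set; the function names are made up for this example:

```python
def precision(retrieved_ids, relevant_ids):
    """Proportion of retrieved documents that are relevant."""
    retrieved = list(retrieved_ids)
    if not retrieved:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Precision computed only over the top-k retrieved documents."""
    return precision(list(retrieved_ids)[:k], relevant_ids)

# Top 5 results, 4 of which are relevant -> Precision@5 = 0.8
print(precision_at_k(["d1", "d7", "d2", "d3", "d4"], {"d1", "d2", "d3", "d4"}, 5))
```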
3. Recall
Recall focusses on the coverage that the retriever provides. It measures the proportion of the relevant documents retrieved from all the relevant documents in the corpus. It answers the question, “Of all the relevant documents, how many were actually retrieved?”
Note that, unlike precision, calculating recall requires prior knowledge of the total number of relevant documents. This can become challenging in large-scale systems that have many documents in the knowledge base.
Like precision, recall doesn't consider the ranking of the retrieved documents. It can also be misleading, as retrieving every document in the knowledge base results in a perfect recall value.
Like Precision@k, Recall@k is also a metric that is sometimes considered. Recall@k is the proportion of all relevant documents that appear among the top 'k' retrieved results.
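A similar sketch for recall and Recall@k, under the same illustrative assumptions (document IDs as strings, relevance judgements as a set):

```python
def recall(retrieved_ids, relevant_ids):
    """Proportion of all relevant documents that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in set(retrieved_ids) if doc_id in relevant)
    return hits / len(relevant)

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Recall computed using only the top-k retrieved documents."""
    return recall(list(retrieved_ids)[:k], relevant_ids)

# 6 relevant documents exist; 4 of them appear in the top 5 results -> Recall@5 ≈ 0.67
print(recall_at_k(["d1", "d7", "d2", "d3", "d4"],
                  {"d1", "d2", "d3", "d4", "d9", "d10"}, 5))
```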
4. F1-score
F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both the quality and coverage of the retriever.
The formula, F1 = 2 × (Precision × Recall) / (Precision + Recall), penalizes either variable having a low score; a high F1-score is only possible when both recall and precision values are high. This means that the score cannot be positively skewed by a single variable.
F1-score provides a single, balanced measure that can be used to easily compare different systems. However, it does not take ranking into account, and it gives equal weightage to precision and recall, which might not always be ideal.
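A minimal sketch of the F1 calculation; the precision and recall values in the usage lines are invented to show how a single low value drags the score down:

```python
def f1_score(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    if precision_value + recall_value == 0:
        return 0.0
    return 2 * precision_value * recall_value / (precision_value + recall_value)

# A high precision cannot compensate for a very low recall (and vice versa).
print(f1_score(0.9, 0.1))   # ≈ 0.18
print(f1_score(0.8, 0.67))  # ≈ 0.73
```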
“Relevant” Documents: Most of the metrics we have discussed rely on the concept of relevant documents. For example, precision is calculated as the number of relevant documents retrieved divided by the total number of retrieved documents. The question that arises is: how does one establish that a document is relevant? The simple answer is a human evaluation approach, where a subject matter expert looks at the documents and determines their relevance. Human evaluation brings in subjectivity, and therefore it is usually done by a panel of experts rather than an individual. But human evaluations are restrictive from a scale and a cost perspective. Any data that can reliably establish relevance, consequently, becomes extremely useful.
Ground truth is information that is known to be real or true. In RAG, and in the generative AI domain in general, ground truth is a prepared set of Prompt-Context-Response or Question-Context-Response examples, akin to labelled data in supervised machine learning parlance. Ground truth data created for your knowledge base can be used to evaluate your RAG system.
The first four metrics do not take the ranking of the documents into account. They evaluate the efficacy of the system from an overall retrieval perspective. The next three metrics also consider the ranking of the results.
This article draws inspiration from Chapter 5 of A Simple Guide to Retrieval Augmented Generation. If you're looking to understand RAG in depth, this book is a great starting point. Consider purchasing a copy here - https://mng.bz/jXJ9
5. Mean Reciprocal Rank (MRR)
Mean Reciprocal Rank, or MRR, is particularly useful for evaluating how highly the first relevant document is ranked. For each query, it takes the reciprocal of the rank of the first relevant document in the list of results; MRR is the average of these reciprocal ranks over a set of queries.
MRR is particularly useful when you're interested in how quickly the system surfaces a relevant document, and it does consider the ranking of the results. However, since it looks at nothing beyond the first relevant result, it may not be useful when multiple relevant results are important.
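Here is an illustrative Python sketch of MRR over a small batch of queries; the ranked results and relevance sets are invented for the example:

```python
def mean_reciprocal_rank(ranked_results_per_query, relevant_ids_per_query):
    """Average, over queries, of 1 / rank of the first relevant document."""
    reciprocal_ranks = []
    for ranked, relevant in zip(ranked_results_per_query, relevant_ids_per_query):
        rr = 0.0
        for position, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr = 1.0 / position
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Query 1: first relevant doc at rank 1; Query 2: first relevant doc at rank 3
# MRR = (1/1 + 1/3) / 2 ≈ 0.67
print(mean_reciprocal_rank([["d1", "d2", "d3"], ["d4", "d5", "d6"]],
                           [{"d1"}, {"d6"}]))
```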
6. Mean Average Precision (MAP)
Mean Average Precision, or MAP, is a metric that combines precision and recall at different cut-off levels of 'k', i.e. the cut-off number for the top results. For each query, it calculates a measure called Average Precision: the sum of Precision@k at every rank 'k' where a relevant document appears, divided by the total number of relevant documents for that query.
Mean Average Precision is then the mean of the Average Precision over all the 'N' queries.
MAP provides a single measure of quality across recall levels. It is quite suitable when result ranking is important, but it is complex to calculate.
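The following sketch shows one common way to compute Average Precision and MAP (Precision@k summed at the ranks of relevant documents, divided by the total number of relevant documents); the inputs are illustrative:

```python
def average_precision(ranked_ids, relevant_ids):
    """Sum of Precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position  # Precision@position
    return precision_sum / len(relevant)

def mean_average_precision(ranked_per_query, relevant_per_query):
    """Mean of Average Precision over all N queries."""
    aps = [average_precision(r, rel)
           for r, rel in zip(ranked_per_query, relevant_per_query)]
    return sum(aps) / len(aps)

# Relevant docs at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 ≈ 0.83
print(mean_average_precision([["d1", "d5", "d2"]], [{"d1", "d2"}]))
```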
7. Normalized Discounted Cumulative Gain (nDCG)
nDCG evaluates ranking quality by considering the position of relevant documents in the result list and assigning higher scores to relevant documents appearing earlier. It is particularly effective for scenarios where documents have varying degrees of relevance. To calculate Discounted Cumulative Gain (DCG), each document in the retrieved list is assigned a graded relevance score, rel(i), and a discount factor reduces the weight of documents as their rank position increases; a common choice is to divide rel(i) by log₂(i + 1).
Here rel(i) is the graded relevance of the document at position i, and DCG is the sum of these discounted relevance scores over the retrieved list. IDCG is the ideal DCG, i.e. the DCG of a perfect ranking of the same documents. nDCG is calculated as the ratio between the actual Discounted Cumulative Gain (DCG) and the Ideal Discounted Cumulative Gain (IDCG).
nDCG is quite a complex metric to calculate. It requires documents to have graded relevance scores, which may introduce subjectivity, and the choice of the discount factor affects the values significantly. On the other hand, it accounts for varying degrees of relevance in documents and gives more weightage to higher-ranked items.
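A small Python sketch of DCG and nDCG using the common log₂(rank + 1) discount; the graded relevance scores in the example are invented:

```python
import math

def dcg(relevance_scores):
    """Discounted Cumulative Gain: graded relevance discounted by log2(rank + 1)."""
    return sum(rel / math.log2(position + 1)
               for position, rel in enumerate(relevance_scores, start=1))

def ndcg(relevance_scores):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(relevance_scores, reverse=True))
    return dcg(relevance_scores) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the retrieved list, in retrieved order (3 = highly relevant,
# 0 = irrelevant). A perfect ranking would place the 3 first, so nDCG < 1 here.
print(ndcg([2, 3, 0, 1]))  # ≈ 0.91
```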
Retrieval systems are not just used in RAG but also in a variety of other application areas like web and enterprise search engines, e-commerce product search and personalised recommendations, social media ad retrieval, archival systems, databases, virtual assistants and more. Retrieval metrics help in assessing and improving performance so that these systems effectively meet user needs.
What do you think? Are there any other metrics that you’d add to the list? Do let us know.
If you're looking to understand RAG in depth, A Simple Guide to Retrieval Augmented Generation is a great starting point. Consider purchasing a copy here - https://mng.bz/jXJ9