Evaluating LLM and RAG Systems
Sanjay Basu PhD
Focusing on Key Metrics
There is no single metric that can fully capture the functionality of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) systems. In this document, we discuss the metrics that are particularly important when evaluating LLM and RAG systems, both the atomic components and the holistic system.
Key Metrics for RAG Evaluation:
1. Faithfulness
Definition: Measures the factual consistency of the generated answer against the given context.
Break the generated answer into individual statements. Verify if each statement can be inferred from the given context.
Scaled to a range of 0 to 1, with higher scores indicating better faithfulness.
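As a concrete illustration, here is a minimal sketch of this statement-level check in Python. The `judge` callable and the toy keyword-overlap judge are assumptions standing in for an LLM-based verifier; this is a sketch, not a reference implementation.

```python
# Minimal faithfulness sketch: the fraction of answer statements that a
# verifier says can be inferred from the retrieved context (0 to 1).
from typing import Callable, List

def faithfulness(statements: List[str], context: str,
                 judge: Callable[[str, str], bool]) -> float:
    """`judge(statement, context)` returns True if the statement is supported."""
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if judge(s, context))
    return supported / len(statements)

# Toy judge (keyword overlap) standing in for an LLM verdict.
def toy_judge(statement: str, context: str) -> bool:
    return all(word in context.lower() for word in statement.lower().split()[:3])

print(faithfulness(["Paris is the capital of France."],
                   "The capital of France is Paris.", toy_judge))  # 1.0
```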
2. Answer Relevance
Definition: Assesses how pertinent or relevant the generated answer is to the given prompt.
Generate multiple variants of the question from the generated answer using an LLM. Measure the mean cosine similarity between these generated questions and the original question. Higher scores indicate better relevancy.
To measure the mean cosine similarity between the generated questions and the original question, proceed as follows:
Steps to Calculate Mean Cosine Similarity:
Step 1: Convert the original question Q and each generated question Gi into vector representations (e.g., using a pre-trained language model to generate embeddings).
Step 2: Compute the cosine similarity between the original question vector Q and each generated question vector Gi.
Step 3: Calculate the mean cosine similarity across all generated questions:
Mean Cosine Similarity = (1/n) × (Cosine Similarity(Q, G1) + … + Cosine Similarity(Q, Gn))
where:
n is the number of generated questions.
Q is the vector representation of the original question.
Gi is the vector representation of the i-th generated question.
Example Calculation: Suppose you have the original question Q and three generated questions G1, G2, and G3. The steps to compute the mean cosine similarity are:
Step 1: Convert Q, G1, G2, and G3 into vectors.
Step 2: Compute the cosine similarities between Q and each of G1, G2, and G3.
Step 3: Calculate the mean cosine similarity:
Mean Cosine Similarity = (1/3) × (Cosine Similarity1 + Cosine Similarity2 + Cosine Similarity3)
This approach provides a quantitative measure of how closely the generated questions align with the original question in terms of semantic similarity.
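A small sketch of this calculation is below. The `embed` argument is an assumption: plug in any sentence-embedding model. The `toy_embed` bag-of-words function is only there to make the example self-contained.

```python
# Answer relevance sketch: mean cosine similarity between the original
# question and questions regenerated from the answer by an LLM.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(original_question: str, generated_questions: list,
                     embed) -> float:
    q_vec = embed(original_question)
    sims = [cosine_similarity(q_vec, embed(g)) for g in generated_questions]
    return float(np.mean(sims))

# Toy embedding: hashed bag-of-words, a stand-in for a real embedding model.
def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

print(answer_relevance("What is the capital of France?",
                       ["Which city is the capital of France?",
                        "What is France's capital city?"], toy_embed))
```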
3. Context Precision
Definition: Evaluates whether all ground-truth relevant items present in the contexts are ranked higher.
Identify relevant and irrelevant chunks in the retrieved context.
Calculate precision@k for each retrieved chunk and compute the mean of the precision@k scores. Higher scores are better. In the simplified worked example below, this reduces to the share of retrieved sentences that are relevant.
Suppose the retrieved context has 5 sentences:
“France is in Western Europe.” (relevant)
“Its capital is Paris.” (relevant)
“France is known for its cuisine.” (relevant)
“The official language is French.” (relevant)
“The Eiffel Tower is a famous landmark in Paris.” (not relevant)
Identify the relevant sentences in the retrieved context:
“France is in Western Europe.”
“Its capital is Paris.”
“France is known for its cuisine.”
“The official language is French.”
Calculate the precision:
Number of relevant sentences: 4
Total number of retrieved sentences: 5
Context Precision = 4/5 = 0.8
In this example, the context precision is 0.8, indicating that 80% of the retrieved sentences are relevant to the question.
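The calculation in the example reduces to a simple ratio; a sketch is below. Note that rank-aware implementations (e.g., the ragas library) average precision@k over the ranked chunks instead, which rewards placing relevant chunks earlier.

```python
# Context precision as in the worked example: share of retrieved chunks
# (here, sentences) that are relevant to the question.
from typing import Sequence

def context_precision(relevance_labels: Sequence[bool]) -> float:
    if not relevance_labels:
        return 0.0
    return sum(relevance_labels) / len(relevance_labels)

labels = [True, True, True, True, False]  # the five retrieved sentences above
print(context_precision(labels))          # 0.8
```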
4. Context Relevancy
Definition: Measures how informative the retrieved context is with respect to the question.
Identify sentences within the retrieved context relevant to the question (|S|). Count the total number of sentences in the context (|T|). Use the formula: Relevancy = |S|/|T|
Higher scores indicate better relevancy.
Example Calculation: Suppose the retrieved context has 5 sentences:
“France is in Western Europe.” (relevant)
“Its capital is Paris.” (relevant)
“France is known for its cuisine.” (relevant)
“The official language is French.” (relevant)
“The Eiffel Tower is a famous landmark in Paris.” (not relevant)
Identify the relevant sentences in the retrieved context:
“France is in Western Europe.”
“Its capital is Paris.”
“France is known for its cuisine.”
“The official language is French.”
Calculate the relevancy:
Number of relevant sentences: 4
Total number of sentences in the retrieved context: 5
Context Relevancy = 4/5 = 0.8
In this example, the context relevancy is 0.8, indicating that 80% of the sentences in the retrieved context are relevant to the question.
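A sketch of the |S|/|T| calculation follows. The sentence splitter is a naive regex, and `is_relevant` is an assumed judge (an LLM call or heuristic) supplied by the caller.

```python
# Context relevancy sketch: relevant sentences / total sentences in the context.
import re
from typing import Callable

def context_relevancy(question: str, context: str,
                      is_relevant: Callable[[str, str], bool]) -> float:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    if not sentences:
        return 0.0
    relevant = [s for s in sentences if is_relevant(question, s)]
    return len(relevant) / len(sentences)

# Toy judge: a sentence counts as relevant if it shares any word with the question.
def toy_is_relevant(question: str, sentence: str) -> bool:
    return bool(set(question.lower().split()) & set(sentence.lower().split()))

context = "France is in Western Europe. Its capital is Paris."
print(context_relevancy("Where is France and what is its capital?", context,
                        toy_is_relevant))
```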
Difference Between Context Precision and Context Relevancy
Context Precision and Context Relevancy are both metrics used to evaluate the performance of Retrieval Augmented Generation (RAG) systems, but they measure different aspects of the retrieved context’s quality.
Context Precision
Context Precision measures the proportion of relevant information correctly retrieved out of all the information retrieved. It evaluates the accuracy of the retrieved context in terms of relevance to the given question.
Focus: This metric focuses on how accurately the retrieval mechanism identifies and retrieves relevant information. A higher precision means that a higher proportion of the retrieved information is relevant.
Example: If a system retrieves 5 sentences and 4 of them are relevant, the context precision is: Context Precision = 4/5 = 0.8
Context Relevancy
Context Relevancy measures the proportion of the retrieved context that is pertinent to the given question. It evaluates how much of the retrieved context is relevant, in comparison to the total amount of information in the context.
Focus: This metric focuses on the proportion of relevant information in the retrieved context relative to the entire context. A higher relevancy score means that the retrieved context is more focused and relevant to the question.
Example: If a system retrieves 10 sentences and 7 of them are relevant, the context relevancy is: Context Relevancy = 7/10 = 0.7
Key Differences
Scope of Measurement: Context Precision measures the accuracy of the retrieved context by comparing the number of relevant sentences to the total number of retrieved sentences. Context Relevancy measures the relevance of the entire retrieved context by comparing the number of relevant sentences to the total number of sentences in the context.
Purpose: Context Precision focuses on the precision of retrieval, indicating how many of the retrieved sentences are actually relevant. Context Relevancy focuses on the overall relevance of the context, indicating how much of the retrieved context is useful for answering the question.
Implications: A high context precision score indicates that the retrieval system identifies relevant information without retrieving much irrelevant information. A high context relevancy score indicates that the majority of the retrieved context is pertinent to the question, suggesting focused retrieval.
In other words, Context Precision is about the accuracy of the retrieval mechanism in fetching relevant information, whereas Context Relevancy is about the proportion of relevant information within the entire retrieved context.
Both metrics are important for evaluating the quality of retrieval in RAG systems, but they provide different perspectives on the performance of the retrieval mechanism.
5. Context Recall
Definition: Measures the extent to which the retrieved context aligns with the annotated answer (treated as the ground truth). It is the ratio of relevant information retrieved to the total relevant information available.
Break the ground truth answer into individual statements. Verify if each statement can be attributed to the retrieved context. Use the formula to calculate recall. Higher scores indicate better recall.
The formula for context recall compares the ground truth, statement by statement, against the retrieved context. More formally:
Context Recall = (Number of ground-truth statements attributable to the retrieved context) / (Total number of statements in the ground truth)
Where: the numerator counts the ground-truth statements that can be attributed to the retrieved context, and the denominator counts all statements in the ground truth.
Example Calculation
Suppose the ground truth answer contains 4 statements, and only 2 of them can be attributed to the retrieved context. Then:
Context Recall = 2/4 = 0.5
In this example, the context recall is 0.5, indicating that 50% of the relevant statements from the ground truth are present in the retrieved context.
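This is the same statement-checking loop as faithfulness, just applied to the ground-truth statements instead of the generated answer. A sketch, again assuming a caller-supplied `attributable` verdict function (typically an LLM judge):

```python
# Context recall sketch: fraction of ground-truth statements attributable
# to the retrieved context.
from typing import Callable, List

def context_recall(ground_truth_statements: List[str], context: str,
                   attributable: Callable[[str, str], bool]) -> float:
    if not ground_truth_statements:
        return 0.0
    hits = sum(1 for s in ground_truth_statements if attributable(s, context))
    return hits / len(ground_truth_statements)

# Trivial exact-containment judge, purely for illustration.
stub = lambda statement, ctx: statement in ctx
print(context_recall(["A.", "B.", "C.", "D."], "A. B.", stub))  # 0.5
```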
6. Context Entities Recall
Context Entities Recall measures the fraction of relevant entities in the ground truth that are correctly retrieved in the context. It evaluates how well the retrieval mechanism captures the relevant entities from the ground truth. Higher scores indicate better entity recall.
Steps to Calculate Context Entities Recall
Identify entities in the ground truth.
Identify entities in the retrieved context.
Calculate the recall by using the ratio of the number of relevant entities retrieved to the total number of relevant entities in the ground truth.
Example Calculation: Suppose the ground truth contains the following entities:
Ground truth entities (GE): { “Taj Mahal”, “Yamuna”, “Agra”, “1631”, “Shah Jahan”, “Mumtaz Mahal” }
And the retrieved context contains the following entities:
Retrieved context entities (CE): { “Taj Mahal”, “Agra”, “Shah Jahan”, “Mumtaz Mahal”, “India” }
Identify entities present in both the ground truth and the retrieved context:
Common entities: { “Taj Mahal”, “Agra”, “Shah Jahan”, “Mumtaz Mahal” }
Calculate the recall:
Number of common entities: 4
Total number of entities in the ground truth: 6
Context Entities Recall = 4/6 ≈ 0.67
In this example, the context entities recall is approximately 0.67, indicating that 67% of the relevant entities from the ground truth are present in the retrieved context.
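Once the entities have been extracted (by an NER model or an LLM), the recall itself is a set operation; the sketch below reproduces the numbers from the example.

```python
# Context entities recall: fraction of ground-truth entities found in the context.
def context_entities_recall(ground_truth_entities: set,
                            context_entities: set) -> float:
    if not ground_truth_entities:
        return 0.0
    return len(ground_truth_entities & context_entities) / len(ground_truth_entities)

ge = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
ce = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
print(round(context_entities_recall(ge, ce), 2))  # 0.67
```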
7. Answer Semantic Similarity
Answer Semantic Similarity measures the semantic resemblance between the generated answer and the ground truth. This metric helps assess how closely the meaning of the generated answer aligns with the meaning of the ground truth.
Steps to Calculate Answer Semantic Similarity
Step 1: Convert the generated answer and the ground truth into vector representations (embeddings).
Step 2: Compute the cosine similarity between the two vectors; this value is the Answer Semantic Similarity score, with higher values indicating closer semantic alignment.
Cosine Similarity Formula:
Cosine Similarity = (A · B) / (∥A∥ × ∥B∥)
where A is the vector of the generated answer and B is the vector of the ground truth.
Example Calculation
Suppose the generated answer and the ground truth are converted to the following vectors:
A = [1.2, 0.7, 0.4] (generated answer)
B = [1.0, 0.8, 0.6] (ground truth)
Calculate the dot product:
A · B = [1.2, 0.7, 0.4] · [1.0, 0.8, 0.6] = (1.2 × 1.0) + (0.7 × 0.8) + (0.4 × 0.6) = 1.2 + 0.56 + 0.24 = 2.0
Calculate the magnitudes:
∥A∥ = √((1.2)² + (0.7)² + (0.4)²) = √(1.44 + 0.49 + 0.16) = √2.09 ≈ 1.445
∥B∥ = √((1.0)² + (0.8)² + (0.6)²) = √(1.0 + 0.64 + 0.36) = √2.0 ≈ 1.414
Calculate the cosine similarity:
Cosine Similarity = 2.0 / (1.445 × 1.414) ≈ 2.0 / 2.043 ≈ 0.979
In this example, the Answer Semantic Similarity is approximately 0.979, indicating a high degree of semantic similarity between the generated answer and the ground truth.
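The arithmetic above can be reproduced in a few lines; the vectors here are the hypothetical three-dimensional embeddings from the example, not the output of any particular model.

```python
# Reproducing the worked example with NumPy.
import numpy as np

generated = np.array([1.2, 0.7, 0.4])     # embedding of the generated answer
ground_truth = np.array([1.0, 0.8, 0.6])  # embedding of the ground truth

similarity = np.dot(generated, ground_truth) / (
    np.linalg.norm(generated) * np.linalg.norm(ground_truth))
print(round(float(similarity), 3))  # 0.978 (≈ 0.979 with the rounded norms above)
```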
8. Answer Correctness
Answer Correctness measures the accuracy of the generated answer when compared to the ground truth. It typically involves two critical aspects: factual similarity and semantic similarity.
Components of Answer Correctness
Factual Correctness: This measures the factual overlap between the generated answer and the ground truth.
Semantic Similarity: This assesses the semantic resemblance between the generated answer and the ground truth.
Steps to Calculate Answer Correctness
Calculate Factual Correctness: Determine the number of true positives (TP), false positives (FP), and false negatives (FN) based on the factual elements in the generated answer compared to the ground truth.
Calculate Semantic Similarity: Use cosine similarity between the vector representations of the generated answer and the ground truth.
Combine Factual Correctness and Semantic Similarity: Use a weighted average to combine these two metrics into a single score for answer correctness.
Example Calculation
Suppose the comparison of the factual elements in the generated answer with the ground truth yields TP = 3, FP = 1, and FN = 1.
Calculate Precision and Recall:
Precision = 3/(3 + 1) = 3/4 = 0.75
Recall = 3/(3 + 1) = 3/4 = 0.75
Calculate F1 Score:
F1 Score = 2 × (0.75 × 0.75)/(0.75 + 0.75) = (2 × 0.5625)/1.5 = 0.75
Calculate Semantic Similarity (assuming it is already calculated as):
Semantic Similarity = 0.85
Combine Factual Correctness and Semantic Similarity (equal weights are used here):
Answer Correctness = 0.5 × F1 Score + 0.5 × Semantic Similarity = 0.5 × 0.75 + 0.5 × 0.85 = 0.375 + 0.425 = 0.8
In this example, the Answer Correctness score is 0.8, indicating a good balance between factual correctness and semantic similarity.
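A sketch of this combination is below; the equal 0.5/0.5 weighting reproduces the 0.8 above, but the weights are a design choice, and some evaluation libraries weight factual correctness more heavily.

```python
# Answer correctness sketch: weighted average of factual F1 and semantic similarity.
def answer_correctness(tp: int, fp: int, fn: int,
                       semantic_similarity: float, weight: float = 0.5) -> float:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return weight * f1 + (1 - weight) * semantic_similarity

print(round(answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.85), 2))  # 0.8
```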
Additional Metrics
9. Latency (Time to First Token and Response to Last Token)
Definition: Measures how long the system takes to emit its first token and how long it takes to complete the entire response, from input to final token.
Importance: Critical for real-time applications where response time impacts user experience.
Example Calculation:
Time to First Token: Measure the time from when the input is received to when the first token is generated.
Response to Last Token: Measure the total time from when the input is received to when the last token of the response is emitted.
Lower latency is better, indicating a faster system. These latency metrics are crucial for understanding the performance of LLMs, particularly in real-time applications where quick responses are essential.
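Both numbers can be captured with timestamps around any streaming client. In the sketch below, `stream_tokens` is a placeholder for whatever streaming API your model exposes; the toy generator only simulates per-token delays.

```python
# Measure time-to-first-token (TTFT) and total latency around a streaming call.
import time

def measure_latency(stream_tokens, prompt: str) -> dict:
    start = time.perf_counter()
    first_token_at = None
    for _ in stream_tokens(prompt):          # consume the token stream
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    if first_token_at is None:               # no tokens were produced
        first_token_at = end
    return {"time_to_first_token_s": first_token_at - start,
            "total_latency_s": end - start}

# Toy stream that fakes per-token generation delays.
def toy_stream(prompt: str):
    for token in prompt.split():
        time.sleep(0.01)
        yield token

print(measure_latency(toy_stream, "measuring latency for a short prompt"))
```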
Evaluating LLM and RAG systems is a multilevel process: each metric captures a slice of the overall picture and sheds light on a different aspect of performance. Used together, these complementary metrics enable robust comparison and assessment, which in turn drives improvement of these models and their practical application in the real world.