Retrieval-Augmented Generation (RAG) Evaluation
RAG is an approach for enhancing the performance of generative models by providing related external knowledge, along with the user's query, during the generation process. Key benefits observed from using RAG include an increase in the response accuracy of generative AI models, improved contextual relevance, and a reduction in hallucinations (i.e., incorrect or nonsensical information). Typical applications of RAG include customer support assistants, domain-specific content creation, legal research assistants, medical diagnosis systems, educational chat tools and other study aids, and many more;
A RAG system consists of two primary components: Retrieval and Generation
Retrieval typically involves the following steps (a minimal code sketch follows the list):
1. Vectorizing the input query into an embedding using an embedding model;
2. Performing a vector search on the knowledge base using the embedded query;
3. Reranking the retrieved nodes: initially done on the basis of top_k and top_p, and later dedicated reranking models can also be applied;
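As a minimal sketch of these steps, assuming the sentence-transformers library for embeddings and a local FAISS index as the knowledge base (both are illustrative stand-ins; any embedding model and vector store can be swapped in). Reranking is shown only as the similarity-ordered top_k shortlist, onto which a dedicated reranker could later be applied.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "RAG combines retrieval with generation.",
    "Pinecone is a managed vector database.",
    "BLEU measures n-gram overlap.",
]

# Step 1: vectorize the documents and the query with an embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)
query_embedding = embedder.encode(["What is RAG?"], normalize_embeddings=True)

# Step 2: vector search over the knowledge base (inner product = cosine here)
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(np.asarray(doc_embeddings, dtype="float32"))
scores, ids = index.search(np.asarray(query_embedding, dtype="float32"), k=2)

# Step 3: the top_k nodes ordered by similarity; a cross-encoder reranker
# could re-score this shortlist for finer ordering
retrieved = [(documents[i], float(s)) for i, s in zip(ids[0], scores[0])]
print(retrieved)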
Generation typically involves the following steps (sketched in code after the list):
1. Prompt design, based on the initial query and the results retrieved from the knowledge base;
2. Providing the designed prompt to your LLM of choice;
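A sketch of the generation step under similar assumptions: the retrieved chunks are placed into a prompt template and sent to an LLM, here through Amazon Bedrock's Converse API via boto3 with an illustrative model ID; AWS credentials, region, and model access are assumed to be configured, and any other chat-capable LLM client could be substituted.

import boto3

def generate_answer(query, retrieved_chunks,
                    model_id="anthropic.claude-3-haiku-20240307-v1:0"):  # illustrative model ID
    """Build a grounded prompt from the retrieved context and call the LLM."""
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

print(generate_answer("What is RAG?", ["RAG combines retrieval with generation."]))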
A general RAG architecture is as follows:
KB: Knowledge Base; KB-Doc: Knowledge Base Documents Preparation;
D: Documents; DE: Documents Embeddings;
Q+EQ: Query + Embedded Query;
P+Q+R: Prompt + Query + Retrieved Content;
LLM: Large Language Models;
Tech Stack:
VectorDB: Pinecone (other options: ChromaDB, FAISS, etc.);
Embedding: the chosen LLM provider's embedding models;
LLM: any generative LLM;
Preferred option: AWS Bedrock (a wiring sketch for this stack follows below);
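A hedged wiring sketch for this stack, assuming a Pinecone index named kb-index that is already populated with document embeddings (each upserted with a "text" metadata field) and the Amazon Titan text-embedding model on Bedrock; the index name, API key placeholder, and model ID are illustrative.

import json
import boto3
from pinecone import Pinecone

def retrieve_from_pinecone(query, top_k=5):
    """Embed the query with Bedrock Titan and search the Pinecone index."""
    bedrock = boto3.client("bedrock-runtime")
    body = json.dumps({"inputText": query})
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v1", body=body)
    query_vector = json.loads(resp["body"].read())["embedding"]

    pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder
    index = pc.Index("kb-index")                    # assumed pre-built index
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    return [match["metadata"]["text"] for match in results["matches"]]

print(retrieve_from_pinecone("What is RAG?"))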
Evaluation
Designing an architecture that can serve a huge user base is one of the toughest tasks; however, evaluating the architecture is always a key component of the design and requires continuous evaluation even after a successful deployment to end users. A RAG architecture likewise relies on multiple evaluation strategies that keep track of its performance. Let us look briefly at the evaluation used during the retrieval and generation stages;
Retrieval Evaluation
1. Precision@k: It measures the proportion of relevant documents among the top-k documents retrieved from the knowledge base using the embedded query. This is an order-unaware metric; More Info
Precision@k = (Number of relevant documents in top-k) / k
-> k is the number of top documents retrieved.
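As a quick illustration, Precision@k boils down to a few lines of Python; retrieved_ids and relevant_ids below are hypothetical placeholders for the retriever's output and the annotated relevant documents.

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Example: 2 of the top-3 retrieved documents are relevant -> ~0.67
print(precision_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))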
2. Recall@k: It assesses the proportion of relevant documents retrieved out of the total number of relevant documents available. This is an order-unaware metric; More Info
Recall@k = (Relevant documents retrieved in top-k) / (Total relevant documents)
-> Relevant Documents Retrieved in Top k: It is the the number of relevant documents among the top k documents retrieved from the knowledge-base by the system.
-> Total Relevant Documents: It is the total number of relevant documents available in the entire dataset for a given query.
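A matching sketch for Recall@k over the same hypothetical ID lists.

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Example: 2 of the 3 relevant documents appear in the top-3 -> ~0.67
print(recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))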
3. Discounted Cumulative Gain (DCG) and Normalized DCG (NDCG): Both metrics account for the position of relevant documents, applying penalties to lower-ranked items to reflect their diminished value. These are order-aware metrics; More Info
DCG@k = Σ (rel_i / log2(i + 1)) for i = 1 … k
NDCG@k = DCG@k / IDCG@k
Where:
-> rel_i is the graded relevance of the document at position i;
-> IDCG@k is the DCG score of the ideal ranking up to position k. The ideal ranking is obtained by sorting the documents by relevance in descending order, which gives the highest possible DCG score for the given set of documents;
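A minimal sketch of DCG@k and NDCG@k, where the relevance list is a hypothetical set of graded relevance scores in retrieved order.

import math

def dcg_at_k(relevance, k):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))

def ndcg_at_k(relevance, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# A highly relevant document ranked lower contributes less than at rank 1
print(ndcg_at_k([3, 0, 2, 1], k=4))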
Generation Evaluation
1. BLEU: This metric is commonly used in machine translation tasks and is also popular in LLM evaluations. It measures the n-gram overlap between the generated text and a reference text; More Info
2. ROUGE: This metric is commonly used in summarization tasks and is also popular in LLM evaluations. It evaluates the overlap of n-grams between the generated and reference text;
Key ROUGE Metrics are: More Info
2.1 ROUGE-N: It measures the n-gram overlap between the reference and generated text.
2.2 ROUGE-L: It measures the longest common subsequence (LCS) between the reference and generated text.
2.3 ROUGE-W: A weighted version of ROUGE-L in which longer consecutive matches receive higher scores.
3. METEOR: Considers precision, recall, and synonymy, providing a more nuanced evaluation; More Info
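A quick example of scoring a single generated sentence with these metrics, assuming the nltk and rouge_score packages are installed (METEOR additionally needs the WordNet data).

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # required for METEOR's synonym matching

reference = "RAG stands for Retrieval-Augmented Generation."
generated = "RAG means Retrieval Augmented Generation."
ref_tokens, gen_tokens = reference.split(), generated.split()

# BLEU: n-gram precision with a brevity penalty (smoothed for short texts)
bleu = sentence_bleu([ref_tokens], gen_tokens,
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-1 / ROUGE-L: unigram overlap and longest common subsequence
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)

# METEOR: unigram matching with stemming and WordNet synonyms
meteor = meteor_score([ref_tokens], gen_tokens)

print(bleu, rouge, meteor)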
Other than the above-mentioned evaluations, human evaluation is also preferred for RAG. It involves human judges assessing the relevance, coherence, and informativeness of generated responses.
Common libraries used for RAG Evaluation
Let us suppose there is a human-annotated or pre-designed eval_dataset that can be used during the RAG pipeline evaluations;
Retrieval Evaluation Pseudocode
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# eval_dataset is the annotated evaluation set mentioned above, loaded as a
# datasets.Dataset with the columns RAGAS expects: "question",
# "contexts" (the retrieved chunks per question) and "ground_truth"
# Use RAGAS for retrieval metrics
result = evaluate(
    dataset=eval_dataset,
    metrics=[context_precision, context_recall],
)
print(result)
Generation Evaluation Pseudocode
from transformers import pipeline
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
import evaluate as hf_evaluate  # Hugging Face evaluate library (successor of datasets.load_metric)

# Initialize the generator (any seq2seq / generative LLM can be used here)
generator = pipeline('text2text-generation', model='google/flan-t5-base')

# Generate a response
query = "What is RAG?"
generated_response = generator(query)[0]['generated_text']
reference_response = "RAG stands for Retrieval-Augmented Generation."

# Use RAGAS for generation evaluation; it expects a Dataset with "question",
# "contexts" (the retrieved chunks) and "answer" (the generated text)
eval_data = Dataset.from_dict({
    "question": [query],
    "contexts": [[reference_response]],
    "answer": [generated_response],
})
result = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy],
)
print(result)

# Traditional metrics (optional)
metric_bleu = hf_evaluate.load('bleu')
metric_rouge = hf_evaluate.load('rouge')
bleu_score = metric_bleu.compute(predictions=[generated_response], references=[[reference_response]])
rouge_score = metric_rouge.compute(predictions=[generated_response], references=[reference_response])
print(f'BLEU: {bleu_score}')
print(f'ROUGE: {rouge_score}')
With this exponentially growing AI field, there will be several other libraries, frameworks, and techniques to evaluate the pipeline and the retrieval and generation parts of any RAG architecture. The ones mentioned above are a few I have used during my work and found interesting, so I thought of sharing them with everyone and archiving the details for my future reads;
Thank you for reading! And special thanks for enduring my not-so-great hand drawings!
Connect with me: Satyam's LinkedIn
Also, visit my blog, where I share my work implementations and learnings: Satyam's Blogs
PS: Article thumbnail is generated using AI
Thanks to everyone teaching these concepts in detail on YouTube and in blog posts.
References:
Wikipedia contributors. (2023). Evaluation measures (information retrieval). In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
Wikipedia contributors. ROUGE (metric). In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/wiki/ROUGE_(metric)