Retrieval-Augmented Generation (RAG) Evaluation
RAG is an approach for enhancing the performance of generative models by providing related external knowledge, along with the user's query, during the generation process. Key benefits observed from using RAG include an increase in the response accuracy of generative AI models, improved contextual relevance, and a reduction in hallucinations (i.e., incorrect or nonsensical information). Typical applications of RAG include customer support assistants, domain-specific content creation, legal research assistants, medical diagnosis systems, educational chat tools and other study aids, and many more;
A RAG system consists of two primary components: Retrieval and Generation
Retrieval typically involves the following steps (a minimal code sketch follows the list):
1. Vectorizing the input query into an embedding using an embedding model;
2. Performing a vector search on the knowledge base using the embedded query;
3. Reranking the retrieved nodes: initially done on the basis of top_k and top_p, and later dedicated reranking models can also be applied;
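As a minimal sketch of these steps, assuming the sentence-transformers library for embeddings and a local FAISS index as the knowledge base (both are illustrative stand-ins; any embedding model and vector store can be swapped in). Reranking is shown only as the similarity-ordered top_k shortlist, onto which a dedicated reranker could later be applied.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "RAG combines retrieval with generation.",
    "Pinecone is a managed vector database.",
    "BLEU measures n-gram overlap.",
]

# Step 1: vectorize the documents and the query with an embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)
query_embedding = embedder.encode(["What is RAG?"], normalize_embeddings=True)

# Step 2: vector search over the knowledge base (inner product = cosine here)
index = faiss.IndexFlatIP(doc_embeddings.shape[1])
index.add(np.asarray(doc_embeddings, dtype="float32"))
scores, ids = index.search(np.asarray(query_embedding, dtype="float32"), k=2)

# Step 3: the top_k nodes ordered by similarity; a cross-encoder reranker
# could re-score this shortlist for finer ordering
retrieved = [(documents[i], float(s)) for i, s in zip(ids[0], scores[0])]
print(retrieved)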
Generation typically involves the following steps (sketched in code after the list):
1. Prompt design, based on the initial query and the results retrieved from the knowledge base;
2. Providing the designed prompt to your LLM of choice;
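A sketch of the generation step under similar assumptions: the retrieved chunks are placed into a prompt template and sent to an LLM, here through Amazon Bedrock's Converse API via boto3 with an illustrative model ID; AWS credentials, region, and model access are assumed to be configured, and any other chat-capable LLM client could be substituted.

import boto3

def generate_answer(query, retrieved_chunks,
                    model_id="anthropic.claude-3-haiku-20240307-v1:0"):  # illustrative model ID
    """Build a grounded prompt from the retrieved context and call the LLM."""
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

print(generate_answer("What is RAG?", ["RAG combines retrieval with generation."]))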
A general RAG architecture is as follows:
KB: Knowledge Base; KB-Doc: Knowledge Base Documents Preparation;
D: Documents; DE: Documents Embeddings;
Q+EQ: Query + Embedded Query;
P+Q+R: Prompt + Query + Retrieved Content;
LLM: Large Language Models;
Tech Stack:
VectorDB: Pinecone (other options: ChromaDB, FAISS, etc.);
Embedding: the chosen LLM provider's embedding models;
LLM: any generative LLM;
Preferred option: AWS Bedrock (a wiring sketch for this stack follows below);
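A hedged wiring sketch for this stack, assuming a Pinecone index named kb-index that is already populated with document embeddings (each upserted with a "text" metadata field) and the Amazon Titan text-embedding model on Bedrock; the index name, API key placeholder, and model ID are illustrative.

import json
import boto3
from pinecone import Pinecone

def retrieve_from_pinecone(query, top_k=5):
    """Embed the query with Bedrock Titan and search the Pinecone index."""
    bedrock = boto3.client("bedrock-runtime")
    body = json.dumps({"inputText": query})
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v1", body=body)
    query_vector = json.loads(resp["body"].read())["embedding"]

    pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # placeholder
    index = pc.Index("kb-index")                    # assumed pre-built index
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    return [match["metadata"]["text"] for match in results["matches"]]

print(retrieve_from_pinecone("What is RAG?"))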
Evaluation
Designing an architecture that can serve a huge user base is one of the toughest tasks; however, evaluating the architecture is always a key component of the design and requires continuous evaluation even after a successful deployment to end users. A RAG architecture likewise relies on multiple evaluation strategies that keep track of its performance. Let us look briefly at the evaluation used during the retrieval and generation stages;
Retrieval Evaluation
1. Precision@k: It measures the proportion of relevant documents among the top-k documents retrieved from the knowledge base using the embedded query. This is an order-unaware metric; More Info
Precision@k = (Number of relevant documents in top-k) / k
-> k is the number of top documents retrieved.
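As a quick illustration, Precision@k boils down to a few lines of Python; retrieved_ids and relevant_ids below are hypothetical placeholders for the retriever's output and the annotated relevant documents.

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

# Example: 2 of the top-3 retrieved documents are relevant -> ~0.67
print(precision_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))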
2. Recall@k: It assesses the proportion of relevant documents retrieved out of the total number of relevant documents available. This is an order-unaware metric; More Info
Recall@k = (Relevant documents retrieved in top-k) / (Total relevant documents)
-> Relevant Documents Retrieved in Top k: It is the the number of relevant documents among the top k documents retrieved from the knowledge-base by the system.
-> Total Relevant Documents: It is the total number of relevant documents available in the entire dataset for a given query.
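A matching sketch for Recall@k over the same hypothetical ID lists.

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Example: 2 of the 3 relevant documents appear in the top-3 -> ~0.67
print(recall_at_k(["d1", "d7", "d3"], {"d1", "d3", "d9"}, k=3))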
3. Discounted Cumulative Gain (DCG) and Normalized DCG (NDCG): Both metrics account for the position of relevant documents, applying penalties to lower-ranked items to reflect their diminished value. These are order-aware metrics; More Info
DCG@k = Σ (rel_i / log2(i + 1)) for i = 1 … k
NDCG@k = DCG@k / IDCG@k
Where:
-> rel_i is the graded relevance of the document at position i;
-> IDCG@k is the DCG score of the ideal ranking up to position k. The ideal ranking is obtained by sorting the documents by relevance in descending order, which gives the highest possible DCG score for the given set of documents;
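A minimal sketch of DCG@k and NDCG@k, where the relevance list is a hypothetical set of graded relevance scores in retrieved order.

import math

def dcg_at_k(relevance, k):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))

def ndcg_at_k(relevance, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# A highly relevant document ranked lower contributes less than at rank 1
print(ndcg_at_k([3, 0, 2, 1], k=4))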
Generation Evaluation
1. BLEU: This metric is commonly used in machine translation tasks and is also popular in LLM evaluations. It measures the n-gram overlap between the generated text and a reference text; More Info
2. ROUGE: This metric is commonly used in summarization tasks and is also popular in LLM evaluations. It evaluates the overlap of n-grams between the generated and reference text;
Key ROUGE Metrics are: More Info
2.1 ROUGE-N: It measures the n-gram overlap between the reference and generated text.
2.2 ROUGE-L: It measures the longest common subsequence (LCS) between the reference and generated text.
2.3 ROUGE-W: A weighted version of ROUGE-L in which longer consecutive matches receive higher scores.
3. METEOR: Considers precision, recall, and synonymy, providing a more nuanced evaluation; More Info
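A quick example of scoring a single generated sentence with these metrics, assuming the nltk and rouge_score packages are installed (METEOR additionally needs the WordNet data).

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)  # required for METEOR's synonym matching

reference = "RAG stands for Retrieval-Augmented Generation."
generated = "RAG means Retrieval Augmented Generation."
ref_tokens, gen_tokens = reference.split(), generated.split()

# BLEU: n-gram precision with a brevity penalty (smoothed for short texts)
bleu = sentence_bleu([ref_tokens], gen_tokens,
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-1 / ROUGE-L: unigram overlap and longest common subsequence
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)

# METEOR: unigram matching with stemming and WordNet synonyms
meteor = meteor_score([ref_tokens], gen_tokens)

print(bleu, rouge, meteor)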
Other than the above-mentioned evaluations, human evaluation is also preferred for RAG. It involves human judges assessing the relevance, coherence, and informativeness of generated responses.
Common libraries used for RAG Evaluation
Let us suppose there is a human-annotated or pre-designed eval_dataset that can be used during the RAG pipeline evaluations;
Retrieval Evaluation Pseudocode
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# eval_dataset is the annotated evaluation set mentioned above, loaded as a
# datasets.Dataset with the columns RAGAS expects: "question",
# "contexts" (the retrieved chunks per question) and "ground_truth"
# Use RAGAS for retrieval metrics
result = evaluate(
    dataset=eval_dataset,
    metrics=[context_precision, context_recall],
)
print(result)
Generation Evaluation Pseudocode
from transformers import pipeline
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
import evaluate as hf_evaluate  # Hugging Face evaluate library (successor of datasets.load_metric)

# Initialize the generator (any seq2seq / generative LLM can be used here)
generator = pipeline('text2text-generation', model='google/flan-t5-base')

# Generate a response
query = "What is RAG?"
generated_response = generator(query)[0]['generated_text']
reference_response = "RAG stands for Retrieval-Augmented Generation."

# Use RAGAS for generation evaluation; it expects a Dataset with "question",
# "contexts" (the retrieved chunks) and "answer" (the generated text)
eval_data = Dataset.from_dict({
    "question": [query],
    "contexts": [[reference_response]],
    "answer": [generated_response],
})
result = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy],
)
print(result)

# Traditional metrics (optional)
metric_bleu = hf_evaluate.load('bleu')
metric_rouge = hf_evaluate.load('rouge')
bleu_score = metric_bleu.compute(predictions=[generated_response], references=[[reference_response]])
rouge_score = metric_rouge.compute(predictions=[generated_response], references=[reference_response])
print(f'BLEU: {bleu_score}')
print(f'ROUGE: {rouge_score}')
With this exponentially growing AI field, there will be several other libraries, frameworks, and techniques to evaluate the pipeline and the retrieval and generation parts of any RAG architecture. The ones mentioned above are a few I have used during my work and found interesting, so I thought of sharing them with everyone and archiving the details for my future reads;
Thank you for reading! And special thanks for enduring my not-so-great hand drawings!
Connect with me: Satyam's LinkedIn
Also, visit my blog, where I share my work implementations and learnings: Satyam's Blogs
PS: Article thumbnail is generated using AI
Thanks to everyone teaching these concepts in detail on YouTube and in blog posts.
References:
Wikipedia contributors. (2023). Evaluation measures (information retrieval). In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
Wikipedia contributors. ROUGE (metric). In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/wiki/ROUGE_(metric)