Retrieval-Augmented Generation (RAG)-Evaluation


RAG is an approach for enhancing the performance of generative models by providing relevant external knowledge, along with the user's query, during the generation process. Key benefits observed with RAG include an increase in the response accuracy of generative AI models, improved contextual relevance, and a reduction in hallucinations (i.e., incorrect or nonsensical information). Typical applications of RAG include customer support assistants, domain-specific content creation, legal research assistants, medical diagnosis systems, educational tools such as study-aid chatbots, and many more;

A RAG system consists of two primary components: Retrieval and Generation.

Retrieval typically involves the following steps:

1. Vectorizing the query input into an embedding using an embedding model;

2. Performing a vector search over the knowledge base using the embedded query;

3. Reranking the retrieved nodes: initially done on the basis of similarity scores and cut-offs such as top_k/top_p, and later a dedicated reranking model can also be applied;

Generation typically involves the following steps:

1. Designing a prompt based on the initial query and the results retrieved from the knowledge base;

2. Providing the designed prompt to the LLM of your choice; a minimal sketch of both stages follows below.
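To make the two stages concrete, here is a minimal sketch of how they might fit together in code. The wrapper callables (embed, search, generate) are hypothetical placeholders for whatever embedding model, vector database, and LLM you actually use, so treat this as an outline rather than a definitive implementation.

from typing import Callable, List

def answer_with_rag(
    query: str,
    embed: Callable[[str], List[float]],              # wrapper around the embedding model
    search: Callable[[List[float], int], List[str]],  # wrapper around the vector-DB search
    generate: Callable[[str], str],                   # wrapper around the LLM call
    top_k: int = 5,
) -> str:
    # Retrieval stage
    query_embedding = embed(query)            # 1. vectorize the query
    chunks = search(query_embedding, top_k)   # 2. vector search over the knowledge base
    # (3. an optional reranking model could reorder `chunks` here)

    # Generation stage: 1. prompt design, 2. call the chosen LLM
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)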

A general RAG architecture is as follows:

RAG Pipeline

KB: Knowledge Base; KB-Doc: Knowledge Base Documents Preparation;

D: Documents; DE: Documents Embeddings;

Q+EQ: Query + Embedded Query;

P+Q+R: Prompt + Query + Retrieved Content;

LLM: Large Language Models;

Tech Stack:

VectorDB: Pinecone (other options: ChromaDB, FAISS, etc.);

Embedding: the embedding models offered by your LLM provider;

LLM: any generative LLM;

Preferred option: AWS Bedrock; a brief sketch using this stack follows below.
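As an illustration of this stack, here is a hedged sketch of the retrieval leg with Pinecone and an AWS Bedrock embedding model. The index name "kb-index", the region, the placeholder API key, the Titan embedding model ID, and the assumption that each vector was upserted with a "text" metadata field are all assumptions, and request/response body formats differ per Bedrock model, so adapt this to your own setup.

import json
import boto3
from pinecone import Pinecone

# Placeholder region, API key, index name, and model ID (assumptions, not prescriptions)
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("kb-index")

def embed_with_bedrock(text: str) -> list:
    # Titan text embeddings take {"inputText": ...} and return an "embedding" field
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def retrieve(query: str, top_k: int = 5) -> list:
    # Embed the query, then run a vector search over the knowledge-base index
    query_embedding = embed_with_bedrock(query)
    results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return [match["metadata"].get("text", "") for match in results["matches"]]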

Evaluation

Designing an architecture that can serve a huge user base is one of the toughest tasks; however, evaluating the architecture is always a key component of the design, and it needs continuous evaluation even after a successful deployment to end users. A RAG architecture likewise relies on multiple evaluation strategies that keep track of its performance. Let us look briefly at the evaluations used during the retrieval and generation stages;

Retrieval Evaluation

1. Precision@k: It measures the proportion of relevant documents among the top-k documents retrieved from the knowledge base using the embedded query. This is an order-unaware metric; More Info

Precision@k = (Number of relevant documents in top-k) / k

-> k is the number of top documents retrieved.        

2. Recall@k: It assesses the proportion of relevant documents retrieved out of the total number of relevant documents available for the query. This is an order-unaware metric; More Info

Recall@k = (Relevant documents retrieved in top-k) / (Total relevant documents)

-> Relevant documents retrieved in top-k: the number of relevant documents among the top-k documents retrieved from the knowledge base by the system.

-> Total relevant documents: the total number of relevant documents available in the entire dataset for the given query.

3. Discounted Cumulative Gain (DCG) and Normalized DCG (NDCG): Both metrics account for the position of relevant documents, applying penalties to lower-ranked items to reflect their diminished value. These are order-aware metrics; More Info

DCG@k = Σ (i = 1 to k) rel_i / log2(i + 1)

Where:

  • rel_i: the relevance score of the document at position i, typically reflecting how relevant the retrieved document is to the given query;
  • k: the number of top documents considered in the evaluation;

NDCG@k = DCG@k / IDCG@k

IDCG@k is the DCG score of the ideal ranking up to position k, obtained by sorting the documents by their relevance in descending order; it is the highest possible DCG score for the given set of documents. A short code sketch of these retrieval metrics follows below.
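To tie these four retrieval metrics together, below is a small self-contained sketch that computes Precision@k, Recall@k, DCG@k, and NDCG@k for a single query, given a ranked list of retrieved document IDs and graded relevance judgments. The document IDs and relevance scores are made-up illustrative values.

import math

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents that appear in the top-k results
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def dcg_at_k(ranking, relevance, k):
    # Position-discounted sum of relevance scores: rel_i / log2(i + 1)
    return sum(
        relevance.get(doc, 0) / math.log2(i + 1)
        for i, doc in enumerate(ranking[:k], start=1)
    )

def ndcg_at_k(retrieved, relevance, k):
    # DCG normalized by the ideal DCG (documents sorted by true relevance)
    ideal_ranking = sorted(relevance, key=relevance.get, reverse=True)
    idcg = dcg_at_k(ideal_ranking, relevance, k)
    return dcg_at_k(retrieved, relevance, k) / idcg if idcg > 0 else 0.0

# Illustrative example for one query
retrieved = ["d3", "d1", "d7", "d5", "d2"]   # ranked results from the vector search
relevance = {"d1": 3, "d2": 2, "d5": 1}      # graded relevance judgments
relevant = set(relevance)                    # binary view for precision/recall

print(precision_at_k(retrieved, relevant, k=5))  # 0.6
print(recall_at_k(retrieved, relevant, k=5))     # 1.0
print(ndcg_at_k(retrieved, relevance, k=5))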

Generation Evaluation

1. BLEU: This metric is commonly used in machine translation tasks and is also popular in LLM evaluations. It measures the n-gram overlap between the generated text and a reference text; More Info

2. ROUGE: This metric is commonly used in summarization tasks and is also popular in LLM evaluations. It evaluates the overlap of n-grams between generated and reference text;

Key ROUGE Metrics are: More Info

2.1 ROUGE-N: measures n-gram overlap between the reference and generated text.

2.2 ROUGE-L: measures the longest common subsequence (LCS) between the reference and generated text.

2.3 ROUGE-W: a weighted version of ROUGE-L in which longer consecutive matches receive higher scores.

3. METEOR: considers precision, recall, and synonymy, providing a more nuanced evaluation; More Info. A short code sketch of these generation metrics follows below.
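Below is a small sketch of how these scores can be computed with the libraries listed later in this article (NLTK for BLEU and METEOR, the rouge-score package for ROUGE); the reference and candidate strings are made-up placeholders.

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

nltk.download("wordnet", quiet=True)   # METEOR relies on WordNet for synonym matching
nltk.download("omw-1.4", quiet=True)

reference = "RAG stands for Retrieval-Augmented Generation."
candidate = "RAG means Retrieval-Augmented Generation."
ref_tokens = reference.lower().split()
cand_tokens = candidate.lower().split()

# BLEU: n-gram precision with a brevity penalty (smoothing helps on short texts)
bleu = sentence_bleu([ref_tokens], cand_tokens,
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

# METEOR: unigram matching with stemming and synonyms, balancing precision and recall
meteor = meteor_score([ref_tokens], cand_tokens)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
print(f"METEOR: {meteor:.3f}")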

Other than the above-mentioned metrics, human evaluation is also preferred for RAG. It involves human judges assessing the relevance, coherence, and informativeness of generated responses.

Common libraries used for RAG Evaluation

  • RAGAS (Retrieval-Augmented Generation Assessment): This is a framework designed to evaluate the performance of RAG pipelines. It focuses on metrics such as faithfulness, answer relevancy, context precision, and context recall. It can be used for both retrieval and generation evaluation and is easy to use;
  • Hugging Face Transformers: For model implementation and generation evaluation;
  • Pyserini: For retrieval and ranking evaluation;
  • NLTK: For BLEU and METEOR evaluation;
  • ROUGE: For ROUGE evaluation;
  • BERTScore: For semantic similarity evaluation using BERT embeddings;
  • Scikit-learn: For statistical analysis and evaluation metrics;


Let us suppose there is a human-annotated or pre-designed eval_dataset that can be used during the RAG pipeline evaluations; a sketch of one possible structure is shown below.
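For concreteness, here is a minimal sketch of what such an eval_dataset might look like as a Hugging Face Dataset, using the column names that recent RAGAS versions expect (question, contexts, answer, ground_truth). The single row shown is a made-up placeholder, and column names can vary slightly between RAGAS releases.

from datasets import Dataset

# Made-up placeholder row; in practice these come from human annotation or a pre-designed eval set
eval_dataset = Dataset.from_dict({
    "question": ["What is RAG?"],
    "contexts": [[
        "Retrieval-Augmented Generation (RAG) provides an LLM with external knowledge retrieved from a knowledge base.",
    ]],
    "answer": ["RAG augments an LLM with knowledge retrieved from an external knowledge base."],  # model output
    "ground_truth": ["RAG stands for Retrieval-Augmented Generation."],                           # annotated answer
})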

Retrieval Evaluation Pseudocode

from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# eval_dataset is the human-annotated Hugging Face Dataset described above
# (question, contexts, answer, ground_truth columns)
result = evaluate(
    dataset=eval_dataset,
    metrics=[context_precision, context_recall],
)
print(result)

Generation Evaluation Pseudocode

from datasets import Dataset
from transformers import pipeline
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
import evaluate as hf_evaluate  # Hugging Face evaluation library (successor to datasets.load_metric)

# Initialize the generation model ('llmmodel' is a placeholder model name)
generator = pipeline('text2text-generation', model='llmmodel')

# Generate a response for a sample query
query = "What is RAG?"
retrieved_context = "Retrieval-Augmented Generation (RAG) provides an LLM with external knowledge retrieved from a knowledge base."  # from the retrieval step
generated_response = generator(query)[0]['generated_text']
reference_response = "RAG stands for Retrieval-Augmented Generation."

# Use RAGAS for generation evaluation; it expects a Dataset with
# 'question', 'contexts' (list of retrieved passages) and 'answer' columns
eval_data = Dataset.from_dict({
    "question": [query],
    "contexts": [[retrieved_context]],
    "answer": [generated_response],
})
result = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy],
)
print(result)

# Traditional metrics (optional)
metric_bleu = hf_evaluate.load('bleu')
metric_rouge = hf_evaluate.load('rouge')
bleu_score = metric_bleu.compute(predictions=[generated_response], references=[[reference_response]])
rouge_score = metric_rouge.compute(predictions=[generated_response], references=[reference_response])

print(f'BLEU: {bleu_score}')
print(f'ROUGE: {rouge_score}')

With this exponentially growing AI field, there will be several other libraries, frameworks, and techniques to evaluate the overall pipeline as well as the retrieval and generation parts of any RAG architecture. The ones mentioned above are a few that I have used in my work and found interesting, so I wanted to share them with everyone and also archive the details for my future reads.



Thank you for reading! And special thanks for enduring my not-so-great hand drawings!

Connect with me: Satyam's LinkedIn

Also, visit my blog, where I share my work, implementations, and learnings: Satyam's Blogs

PS: Article thumbnail is generated using AI

Thanks to everyone who teaches these concepts in detail on YouTube and in blog posts.

References:

  • Evaluation measures (information retrieval). Wikipedia. https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)
  • ROUGE (metric). Wikipedia. https://en.wikipedia.org/wiki/ROUGE_(metric)
  • BLEU. Wikipedia. https://en.wikipedia.org/wiki/BLEU
  • METEOR. Wikipedia. https://en.wikipedia.org/wiki/METEOR
