Evaluating Retrieval-Augmented Generation (RAG) Applications with RAGAS and LangChain
Cover image from: medium.aiplanet.com/evaluating-naive-rag-and-advanced-rag-pipeline-using-langchain-v-0-1-0-and-ragas-17d24e74e5cf

Retrieval-Augmented Generation (RAG) is a powerful approach for enhancing Large Language Models (LLMs) with external knowledge, leading to more accurate and informative responses. However, evaluating the performance of these RAG pipelines is a multifaceted challenge. The ragas framework, in combination with LangChain, provides a robust solution to this challenge, offering a structured and comprehensive approach to evaluating both retrieval and generation components of your RAG application.


The Need for Rigorous Evaluation:

Developing a proof-of-concept RAG application might be relatively straightforward, but ensuring its production readiness is a different ballgame. This is where evaluation becomes paramount. By meticulously assessing your RAG pipeline, you gain valuable insights into its strengths and weaknesses, enabling targeted improvements for optimal performance.


Ragas: Your RAG Evaluation Toolkit

Ragas is an evaluation framework designed to assess the performance of RAG pipelines on a component level. It offers a variety of metrics that gauge the quality of both the retrieval and generation processes, providing a holistic view of your application's capabilities.


Key RAGAS Metrics:

  1. Context Precision: This metric measures the signal-to-noise ratio of the retrieved context, ensuring that the information fetched is relevant to the query.
  2. Context Recall: This metric verifies whether all necessary information required to answer the query has been retrieved, ensuring the completeness of the retrieved context.
  3. Faithfulness: This metric evaluates the factual accuracy of the generated answer against the retrieved context, ensuring that the generated response aligns with the factual information provided.
  4. Answer Relevancy: This metric determines the relevance of the generated answer to the question, guaranteeing that the model's response directly addresses the user's query.
  5. Answer Correctness: This metric, often relying on human-annotated ground truth labels, measures the factual accuracy of the generated answer against the ideal response, providing a direct measure of the model's accuracy.
  6. Answer Similarity: This metric compares the semantic similarity between the generated answer and the ground truth, assessing how closely the model's response aligns with the expected answer.


Synthetic Data Generation with RAGAS

One of the most powerful features of RAGAS is its ability to generate synthetic evaluation datasets. This streamlines the evaluation process by automatically creating diverse question-answer pairs along with relevant context snippets and corresponding ground truths. This not only saves time and resources but also ensures a broader range of test cases for more robust evaluation.


I created an example of how to generate a dataset, build a RAG app with your dataset, and evaluate the RAG app using RAGAS:

1) Install your framework and libraries
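The exact packages depend on your LangChain and RAGAS versions; a minimal sketch of the installs I would expect for this stack looks like this:

```
pip install langchain langchain-openai langchain-community ragas faiss-cpu datasets
```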

2) Import your OpenAI API key
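A minimal sketch for loading the key; OPENAI_API_KEY is the variable the OpenAI-backed LangChain components read by default:

```python
import os
from getpass import getpass

# Read the key from the environment, or prompt for it if it is not set.
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```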

3) Load your data; in my case, I am using a markdown file
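Loading a markdown file might look like the sketch below; the file path is a placeholder, not the actual file from my repo:

```python
from langchain_community.document_loaders import TextLoader

# Load the markdown file as a LangChain Document.
loader = TextLoader("data/insurance_docs.md", encoding="utf-8")
docs = loader.load()
```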

4) Split your data into chunks, setting a chunk size and a chunk overlap
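A sketch using LangChain's recursive character splitter; the chunk_size and chunk_overlap values below are illustrative, not the exact settings I used:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the documents into overlapping chunks for indexing.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
```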

5) Select an embedding model of your choice
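This sketch assumes OpenAI embeddings via the langchain-openai package, but any LangChain-compatible embedding model can be swapped in:

```python
from langchain_openai import OpenAIEmbeddings

# Embedding model used to vectorize the chunks.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
```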

6) Choose any vector store you want; I will be using FAISS from Meta
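Indexing the chunks in FAISS is a single call (this assumes the faiss-cpu package is installed):

```python
from langchain_community.vectorstores import FAISS

# Build an in-memory FAISS index over the chunk embeddings.
vectorstore = FAISS.from_documents(chunks, embeddings)
```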

7) Create a retriever
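Turning the vector store into a retriever is one line; k=4 is an illustrative setting, not necessarily what I used:

```python
# Expose the vector store as a retriever that returns the top-k chunks.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```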

8) Create a prompt template
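A minimal grounded-answer prompt as a sketch; the wording is illustrative rather than the exact template from my repo:

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    """Answer the question using only the context provided.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}"""
)
```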



9) Set up the basic QA chain; now we can instantiate the basic RAG chain!
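A sketch of the basic chain wired together with LCEL; the model name, temperature, and sample question are assumptions:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    # Join the retrieved chunks into a single context string.
    return "\n\n".join(d.page_content for d in docs)

# Retrieve context, fill the prompt, call the LLM, parse the output to a string.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What does the policy cover?"))
```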


10) Time to create our dataset from our document. I will use GPT-3.5 to generate the dataset and GPT-4o to critique and review it. RAGAS generates a dataset that includes the ground truth, context, question, and evolution type
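A sketch of synthetic test set generation following the ragas 0.1.x API; the test_size and evolution distribution are illustrative choices:

```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")   # writes the questions
critic_llm = ChatOpenAI(model="gpt-4o")             # critiques and filters them

generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, OpenAIEmbeddings())

# Produces question, contexts, ground_truth, and evolution_type columns.
testset = generator.generate_with_langchain_docs(
    docs,
    test_size=20,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
test_df = testset.to_pandas()
```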


11) We will use a more powerful retriever, the MultiQueryRetriever from LangChain
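Wrapping the base retriever in LangChain's MultiQueryRetriever might look like this:

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# An LLM rewrites each question into several variants and merges the retrieved chunks.
multiquery_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)
```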

12) Next, I will create a chain that stuffs the retrieved documents into the context, build the retrieval chain, and test it
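A sketch using LangChain's create_stuff_documents_chain and create_retrieval_chain helpers; these expect an "input" key and a {context} placeholder, and the test question is a placeholder of my own:

```python
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {input}"
)

# Stuff the retrieved documents into the prompt, then wire in the multi-query retriever.
document_chain = create_stuff_documents_chain(llm, qa_prompt)
retrieval_chain = create_retrieval_chain(multiquery_retriever, document_chain)

result = retrieval_chain.invoke({"input": "What does the policy cover?"})
print(result["answer"])
```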


13) I will now collect the pipeline's contexts and answers for each question and convert them into a dataset
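One way to run the synthetic questions through the chain and assemble a RAGAS-ready dataset; the column names follow the ragas 0.1.x conventions:

```python
from datasets import Dataset

questions = test_df["question"].tolist()
ground_truths = test_df["ground_truth"].tolist()

answers, contexts = [], []
for q in questions:
    result = retrieval_chain.invoke({"input": q})
    answers.append(result["answer"])
    contexts.append([doc.page_content for doc in result["context"]])

# RAGAS expects these exact column names.
eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,
})
```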

14) I will now evaluate it on the metrics available in RAGAS. I chose the faithfulness, answer relevancy, context recall, context precision, and answer correctness metrics.
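The evaluation call itself is short; this sketch uses the metric objects exported by ragas and the dataset built above:

```python
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision, answer_correctness],
)
print(results)
```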


15) These are the results I got from my RAGAS metrics:

- faithfulness: 0.8053
- answer_relevancy: 0.8226
- context_recall: 0.9388
- context_precision: 0.8830
- answer_correctness: 0.8726

You can improve on this by using other retriever methods from LangChain, such as the Ensemble Retriever and the Parent Document Retriever, and comparing their results to see which performs best.


You can also check out my GitHub repo to see the full code above and how I created a UI for the RAG app I evaluated using RAGAS: https://github.com/Emarhnuel/Insurance_Chatbot_evaluation/tree/main


If you are interested in topics relating to:

- Python

- AI agents

- LLMs/AI Engineering

Connect and follow me for more content


