Measuring Quality Metrics for RAG-Based Gen AI Apps to Enhance Software Engineering Productivity Using RAGAS
Rahul Khandelwal
Director & Chief Architect @ Capgemini | Leading Digital Transformation | Leading Big Deals & Presales | Gen AI Evangelist | Thought Leader | Cloud Evangelist | Building Strategic partnership to shape Industry offers
Build & Design Credits: Talif Andalib Pathan, Sameer Saurabh
Problem Statement:
Assessing the performance of a deployed Large Language Model (LLM) application in production is crucial for improvement. Rigorous evaluation is essential; as the saying goes, "If you don't measure it, you can't improve it." This is particularly true in the fast-moving world of LLMs. One way to measure LLM performance is to evaluate each prompt's response individually, but this approach is time-consuming and resource-intensive. As language models grow in complexity, the limitations of traditional NLP metrics become more apparent, prompting a re-evaluation of how we measure success and the adoption of more refined metrics from a library or framework that can keep pace with advancements in the field. RAGAS, short for 'RAG Assessment,' is emerging as a leading library for evaluating RAG pipelines, offering in-depth analysis and hybrid evaluation of RAG performance.
Solution overview:
The architecture below uses Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) through a single API, along with a comprehensive set of capabilities for building generative AI applications with a focus on security, privacy, and responsible AI. Amazon Bedrock provides the generator and embeddings models, while Amazon OpenSearch serves as the vector database for storing the embeddings. Ragas is the framework used to evaluate the question-answering pipeline across the metrics described later in this article. The architecture also incorporates LangChain, a framework designed for developing applications powered by language models.
[Architecture diagram]
Key components of the solution:
1. Amazon Bedrock
2. Amazon OpenSearch
3. LangChain
4. Ragas
RAG Evaluation Pipeline:
To evaluate a Retrieval Augmented Generation (RAG) pipeline using RAGAS and understand its evaluation metrics, follow these steps:
Step 1: Prerequisites
Ensure the installation of the necessary Python packages, such as boto3, elasticsearch, requests, requests-aws4auth, opensearch-py, langchain, and openai for the RAG pipeline, and ragas for evaluating it.
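As a quick sanity check before wiring anything together, the core dependencies can be imported up front; the package list and the pip command below are assumptions based on the stack described in this article, so adjust them to your environment.

```python
# Sanity check that the core dependencies are importable.
# If any import fails, install the packages first, e.g.:
#   pip install boto3 opensearch-py requests requests-aws4auth langchain ragas datasets
import boto3           # AWS SDK, used for the Bedrock runtime client
import opensearchpy    # OpenSearch client backing the vector store
import langchain       # orchestration framework for the RAG pipeline
import ragas           # evaluation framework for the RAG pipeline

print("All prerequisites for the RAG pipeline and its evaluation are importable.")
```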
Step 2: Setting up the RAG Application
Create clients for the Amazon Bedrock LLM, the Amazon OpenSearch vector database, and the Amazon Bedrock embeddings model. Establish a prompt template and combine it with the retriever component to create the RAG pipeline, as sketched below.
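A minimal sketch of this step, assuming a langchain-community style layout; the region, endpoint, index name, model IDs, and authentication details are placeholders, and import paths vary across LangChain versions.

```python
import boto3
from langchain_community.llms import Bedrock
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Clients for the Bedrock generator and embeddings models (model IDs are examples).
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
llm = Bedrock(client=bedrock_runtime, model_id="anthropic.claude-v2")
embeddings = BedrockEmbeddings(client=bedrock_runtime, model_id="amazon.titan-embed-text-v1")

# OpenSearch as the vector database; auth settings (e.g. AWS4Auth) are omitted here.
vector_store = OpenSearchVectorSearch(
    opensearch_url="https://<your-opensearch-domain>:443",  # placeholder endpoint
    index_name="rag-documents",                             # placeholder index
    embedding_function=embeddings,
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# Prompt template combined with the retriever to form the RAG pipeline.
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,  # keep retrieved contexts for RAGAS evaluation
)
```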
Step 3: Preparing the Evaluation Data
As RAGAS aims to be a reference-free evaluation framework, minimal preparation of the evaluation dataset is required: for each sample, the question, the generated answer, and the retrieved contexts are collected. A ground-truth answer is only needed for reference-based metrics such as context recall.
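A sketch of how such a dataset could be assembled into the Hugging Face Dataset format that RAGAS consumes, reusing the chain from Step 2; the sample question and reference answer are hypothetical.

```python
from datasets import Dataset

questions = ["What acceptance criteria does the user story template require?"]  # hypothetical
ground_truths = ["The template requires a user role, a goal, and acceptance criteria."]  # hypothetical reference
answers, contexts = [], []

for question in questions:
    result = rag_chain.invoke({"query": question})
    answers.append(result["result"])
    contexts.append([doc.page_content for doc in result["source_documents"]])

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,  # older ragas versions expect "ground_truths" (a list per sample)
})
```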
Step 4: Evaluating the RAG Application
The desired metrics are imported from ragas.metrics, and the evaluate() function is called with the prepared dataset and the selected metrics.
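A sketch of the evaluation call under the same assumptions; metric names and the default judge LLM differ between RAGAS versions (by default it calls an OpenAI model, which is why the openai package appears in the prerequisites), so pin the version you test with.

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,   # renamed/deprecated in newer ragas releases
    context_recall,
)

scores = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_relevancy, context_recall],
)
print(scores)  # e.g. {'faithfulness': 0.9, 'answer_relevancy': 0.95, ...}
```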
Let's delve into the inner workings of Ragas. Understanding the metrics below provides insight into how the framework measures the effectiveness of the responses generated for the provided prompts.
1. Faithfulness: Ragas assesses the factual accuracy of generated answers within the provided context in two steps. First, using an LLM, it identifies the statements made in the generated answer based on the given question. Then, it verifies these statements against the provided context. The score for a given example is the number of supported statements divided by the total number of statements in the generated answer (see the toy calculation after this list).
2. Answer Relevancy: This metric evaluates how relevant and concise the answer is to the question asked. Ragas employs an LLM to identify potential questions that the generated answer could address and calculates the similarity between these potential questions and the actual question asked.
3. Context Relevancy: Ragas measures the signal-to-noise ratio in the retrieved contexts. It uses an LLM to identify sentences from the retrieved context that are necessary to answer the question. The score is determined by the ratio of required sentences to the total number of sentences in the context.
4. Context Recall: This metric assesses the retriever's ability to retrieve all the essential information needed to answer the question. Ragas compares the provided ground_truth answer with the retrieved context using an LLM. If any statement from the ground_truth answer cannot be found in the retrieved context, it indicates that the retriever failed to retrieve the necessary information to support that statement.
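To make the ratio-style metrics concrete, here is a toy calculation with made-up counts; it only illustrates the arithmetic behind the scores described above.

```python
# Faithfulness: supported statements / total statements in the generated answer.
supported_statements, total_statements = 3, 4
faithfulness_score = supported_statements / total_statements            # 0.75

# Context Relevancy: sentences needed to answer / total sentences retrieved.
required_sentences, total_context_sentences = 2, 8
context_relevancy_score = required_sentences / total_context_sentences  # 0.25 (noisy context)

# Context Recall: ground-truth statements found in the context / all ground-truth statements.
gt_statements_found, gt_statements_total = 5, 5
context_recall_score = gt_statements_found / gt_statements_total        # 1.0 (nothing missing)
```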
Inference
In our pursuit to enhance software engineering productivity through the utilization of RAG-based GEN AI applications, we have established a robust framework leveraging AWS services. Specifically, we employed OpenSearch as a vector database to integrate proprietary content seamlessly into Large Language Models (LLMs), enabling the generation of pertinent content for various software engineering tasks, including user story recommendations, epic creation, and design element suggestions.
This initiative involved the creation of over 150 tailored prompts and the implementation of the RAGAS framework to gauge the efficacy of responses generated by LLMs. Our evaluation encompassed both public LLM models and proprietary models integrated within the RAG framework. Comparative analysis revealed notable improvements in efficiency when utilizing RAG in contrast to private LLM models, particularly in terms of answer relevancy, response completeness, and response faithfulness.
This meticulous approach to measuring the quality of generated content not only enhances the effectiveness of our vector database setup but also ensures the validation of referenced content, thereby underlining the significance of our methodology in advancing software engineering practices.
Conclusion
RAG has significantly transformed the LLM application landscape. Creating a proof-of-concept RAG application is straightforward, but achieving production-ready performance is challenging. As with any machine learning project, the RAG pipeline's performance should be assessed against a validation dataset with evaluation metrics to ensure its accuracy and robustness. However, since a RAG pipeline comprises multiple components that require separate and combined evaluation, a set of evaluation metrics becomes necessary. Additionally, obtaining a high-quality validation dataset from human annotators is a demanding, time-consuming, and expensive task.
To navigate this dynamic landscape, a robust framework is necessary. The primary objective in evaluating LLM performance is quantification, which RAGAS facilitates. The RAGAS framework introduces four evaluation metrics (context_relevancy, context_recall, faithfulness, and answer_relevancy) that together form the RAGAS score. Furthermore, RAGAS uses LLMs under the hood for reference-free evaluation, which helps reduce cost. Capgemini's Generative AI for Software Engineering team assessed the effectiveness of RAG pipelines using metrics such as faithfulness, answer relevance, context relevance, context recall, context precision, answer semantic similarity, and answer correctness. Synthetic data generation techniques streamline the evaluation process, minimizing manual effort and enabling the creation of diverse QA samples for a more comprehensive assessment. As the demand for reliable language models grows, the RAGAS framework emerges as a valuable asset, ensuring the robustness and efficacy of RAG pipelines across varied queries and contexts and contributing to their continuous advancement and real-world application.