Death by RAG Evals
Archana Vaidheeswaran
Building Community for AI Safety | Board Director | Machine Learning Consultant | Singapore 100 Women in Tech 2023
Welcome back to Quick Bites! This month, we're keeping it short and sweet, ensuring our busy readers get their dose of insightful content. As January unfolds, the buzz around AI at the World Economic Forum in Davos is hard to miss. The conference put a spotlight on a conscious approach to AI, emphasizing its application across various sectors and its intersection with other technologies, all while prioritizing people-first strategies.
Among business leaders, there's a growing concern about the 'impending doom' of AI overreach. But the real head-scratcher is evaluating these evolving models. RAG (Retrieval-Augmented Generation) applications are a case in point: these systems need to be assessed not just for the accuracy and relevance of their responses, but also for their ability to retrieve and apply pertinent context.
Typically, human annotation is the go-to method for such evaluations. However, it is time-consuming, error-prone, and unusable for real-time systems. And while metrics like perplexity can assess the language model itself, they fall short of evaluating the complete RAG system.
Enter the world of self-evaluating systems, like RAGAs, which use LLMs (Large Language Models) for reference-free evaluations. But this raises an intriguing dilemma: how objectively can a system evaluate its own output?
Evaluating the quality of RAG applications in production is a considerable challenge. The evaluation needs to account for not only the quality and faithfulness of the generation but also the ability to identify and retrieve relevant context.
Human annotation is the most accurate evaluation method. However, it is slow and prone to errors and biases. Moreover, you cannot use human evaluators for real-time systems. Metrics like perplexity can be used to evaluate the performance of the language model itself but not the performance of the entire RAG system.
The holy grail for RAG evaluations is an evaluation that is self-contained and reference-free, meaning you do not need human-annotated reference answers. RAGAs is one of the most popular frameworks for doing so. To make the system reference-free, RAGAs uses LLMs to evaluate the generated answers. Herein lies the problem.
A Typical RAGAs Evaluation
RAGAs uses OpenAI’s API by default to calculate four main metrics: answer relevancy, faithfulness, context recall, and context precision. The default model is GPT-3.5-turbo, but you can plug in your own LLM. Taking the harmonic mean of the four metrics gives you the ragas score, which “is a single measure of the performance of your QA system across all the important aspects.”
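To see how the roll-up works, here is a quick sketch of the harmonic mean behind the ragas score (the metric values below are purely illustrative, not taken from my runs):

```python
from statistics import harmonic_mean

# The ragas score is the harmonic mean of the four metric scores.
# Illustrative values only; these are not results from the article.
scores = {
    "answer_relevancy": 0.95,
    "faithfulness": 0.90,
    "context_recall": 0.70,
    "context_precision": 0.85,
}
ragas_score = harmonic_mean(list(scores.values()))
print(round(ragas_score, 3))  # ≈ 0.839
```

Because the harmonic mean is dragged down by the weakest component, a single poor metric (here, context recall) noticeably lowers the overall ragas score.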
To run your evaluation, you provide RAGAs with the metrics you want to calculate, the query, the answer, and the context used to arrive at the answer.
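In code, that looks roughly like the sketch below, based on the ragas quickstart. The exact imports and column names can differ between ragas versions, and the sample question, answer, and context are made up for illustration:

```python
# Minimal sketch of a RAGAs evaluation run (requires OPENAI_API_KEY to be set,
# since ragas calls the OpenAI API by default to score each metric).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

# One query/answer pair plus the retrieved context used to produce the answer.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    # Reference answer needed for context_recall; newer ragas versions name
    # this column "ground_truth" instead of "ground_truths".
    "ground_truths": [["Paris is the capital of France."]],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
)
print(result)  # e.g. {'answer_relevancy': 0.97, 'faithfulness': 1.0, ...}
```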
The results show that my RAG response was faithful and the retrieved context was relevant to the question. However, I can improve my context recall, which measures “the ability of the retriever to retrieve all the necessary information needed to answer the question.” Overall, since my answer was faithful, i.e., factually consistent with the provided context, I can serve it to my user with high confidence!
But what is the cost of running this eval?
I ran the RAGAs evaluation on our RAG application data. If you plot the number of tokens sent to OpenAI, on average about 90% of the tokens are used for running the evaluation. Just ~10% of my tokens were used to generate the response!
But at least it was fast, right? Nope, for my application, each evaluation (4 metrics) takes, on average, 15-20 seconds to run and involves five requests to OpenAI.
Finally, each evaluation costs somewhere between $0.10 and $0.15 in API charges.
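To put those per-query figures in perspective, here is a back-of-the-envelope calculation; the traffic number is an assumption of mine, not a figure from my application:

```python
# Rough eval cost at the per-query figures above (hypothetical workload).
COST_PER_EVAL = 0.12       # midpoint of the $0.10-$0.15 range
QUERIES_PER_DAY = 10_000   # assumed traffic, purely illustrative

daily = COST_PER_EVAL * QUERIES_PER_DAY
print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month just for evals")
# ~$1,200/day, ~$36,000/month just for evals
```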
And so, while our evaluation and monitoring system is like buying a Ferrari to watch over a bicycle, OpenAI is not just baking the cake but gleefully devouring it as well.
Death by RAG Evals
My first concern is token overhead. Using LLMs for evaluation inevitably adds tokens: each eval prompt includes the original RAG query, the retrieved context, and the generated answer, and the eval's own generated outputs are also quite long. The result is that the eval requires about 9x the number of tokens needed for the original query and response pair.
This can be reduced by running fewer than all four eval metrics, but each metric in RAGAs is essential, and together they give a good overview of the RAG system's performance.
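If you do decide to trade coverage for cost, passing a subset of metrics is straightforward. The snippet below reuses the same illustrative data as the earlier sketch and scores only faithfulness:

```python
# Running a single metric instead of all four cuts the eval token bill
# and latency roughly in proportion (illustrative data, as before).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
})
print(evaluate(dataset, metrics=[faithfulness]))
```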
Secondly, it takes at least 10 seconds to run all the evaluation metrics per RAG query; for larger query-answer pairs, it can take more than 20 seconds. If you use evals to check that your responses are truthful and accurate before serving them, that delay is added directly to your response latency.
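To make that concrete, here is one possible shape of such a pre-serving gate. This is a hypothetical pattern, not something RAGAs prescribes; the threshold, helper name, fallback message, and result access are all assumptions that may need adjusting for your ragas version:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

FAITHFULNESS_THRESHOLD = 0.8  # assumed cutoff, not from the article

def serve_with_eval_gate(question: str, answer: str, contexts: list[str]) -> str:
    """Score the draft answer before returning it; the 10-20 s eval sits
    directly on the request path, which is exactly the latency problem."""
    row = Dataset.from_dict(
        {"question": [question], "answer": [answer], "contexts": [contexts]}
    )
    scores = evaluate(row, metrics=[faithfulness])  # blocks on the OpenAI round-trips
    if scores["faithfulness"] >= FAITHFULNESS_THRESHOLD:
        return answer
    return "Sorry, I could not produce a reliable answer."  # hypothetical fallback
```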
Finally, I am not sure about using LLMs to evaluate the output of other LLMs. Different LLMs will assign different scores to the same response, and the creators allude to this in their docs. So, do we choose an LLM that gives us the best scores? Do we fix a scoring LLM and then try to improve our RAG output against it? Or do we finetune a RAG-scoring LLM for our domain? Doesn't that defeat the purpose?