Death by RAG Evals

Welcome back to Quick Bites! This month, we're keeping it short and sweet, ensuring our busy readers get their dose of insightful content. As January unfolds, the buzz around AI at the World Economic Forum in Davos is hard to miss. The conference put a spotlight on a conscious approach to AI, emphasizing its application across various sectors and its intersection with other technologies, all while prioritizing people-first strategies.

Among business leaders, there's a growing concern about the 'impending doom' of AI overreach. But the real head-scratcher is evaluating these evolving models. RAG (Retrieval-Augmented Generation) applications, for instance, pose a significant challenge: they need to be assessed not just for the accuracy and relevance of their responses, but also for their ability to retrieve and apply pertinent context.

Typically, human annotation is the go-to method for such evaluations. However, it is time-consuming, error-prone, and impractical for real-time systems. And while metrics like perplexity can assess the language model itself, they fall short of covering the complete RAG system.


Enter the world of self-evaluating systems, like RAGAs, which use LLMs (Large Language Models) for reference-free evaluations. But this raises an intriguing dilemma: how objectively can a system evaluate its own output?

Evaluating the quality of RAG applications in production is a considerable challenge. The evaluation needs to account for not only the quality and faithfulness of the generation but also the ability to identify and retrieve relevant context.

Basic RAG (Retrieval-Augmented Generation) System

Human annotation is the most accurate evaluation method. However, it is slow and prone to errors and biases. Moreover, you cannot use human evaluators for real-time systems. Metrics like perplexity can be used to evaluate the performance of the language model itself but not the performance of the entire RAG system.
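To make that distinction concrete, here is a minimal sketch of computing perplexity for a single piece of text. It assumes the Hugging Face transformers library and the small gpt2 checkpoint, neither of which is prescribed by this article; it is an illustration, not part of any RAG evaluation pipeline.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Perplexity scores the language model alone; it says nothing about retrieval quality.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model.eval()

text = "Paris is the capital and most populous city of France."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # The model returns cross-entropy loss when labels are supplied;
    # perplexity is exp(mean cross-entropy).
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {math.exp(loss.item()):.2f}")
```

A low perplexity only tells you the text is fluent under that model; it cannot tell you whether the retriever surfaced the right context.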

The holy grail for RAG evaluation is a self-contained, reference-free approach, meaning you do not need human-annotated reference answers. RAGAs is one of the most popular frameworks for this. However, to make the system reference-free, it uses LLMs to evaluate the generated answers. Herein lies the problem.

A Typical RAGAs Evaluation

RAGAs uses OpenAI’s API by default to calculate four main metrics: answer relevancy, faithfulness, context recall, and context precision. The default model is GPT-3.5-turbo, though you can substitute your own LLM. Taking the harmonic mean of the four metrics gives you the ragas score, which “is a single measure of the performance of your QA system across all the important aspects.”

The four main RAGAs metrics. The harmonic mean of the metrics gives you the ragas score. Taken from the RAGAs Docs
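As a quick illustration of that last step, the ragas score is just the harmonic mean of the four per-metric scores. The values below are made up for the example, not taken from any real run.

```python
from statistics import harmonic_mean

# Illustrative metric values; a real run would get these from ragas.evaluate().
scores = {
    "answer_relevancy": 0.95,
    "faithfulness": 1.00,
    "context_recall": 0.70,
    "context_precision": 0.90,
}

# The harmonic mean punishes a single weak metric far more than an arithmetic mean would.
ragas_score = harmonic_mean(scores.values())
print(f"ragas score: {ragas_score:.3f}")
```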


To run your evaluation, you provide RAGAs with the metrics you want to calculate, the query, the answer, and the context used to arrive at the answer.

Sample RAGAs code and result
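The screenshot itself is not reproduced here, but a roughly equivalent sketch looks like the following. It assumes the v0.1-era RAGAs API and an OPENAI_API_KEY in the environment; the question, answer, and context are illustrative, not the ones from my application.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation row: the user query, the generated answer, the retrieved context,
# and a reference answer (context_recall needs one in this version of RAGAs).
data = {
    "question": ["When was the first Super Bowl played?"],
    "answer": ["The first Super Bowl was played on January 15, 1967."],
    "contexts": [[
        "The First AFL-NFL World Championship Game, later known as Super Bowl I, "
        "was played on January 15, 1967, at the Los Angeles Memorial Coliseum."
    ]],
    "ground_truths": [["The first Super Bowl was held on January 15, 1967."]],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
)
print(result)  # one score per requested metric
```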


The results show that my RAG response was faithful and the retrieved context was relevant to the question. However, I can improve my context recall, which measures “the ability of the retriever to retrieve all the necessary information needed to answer the question.” Overall, since my answer was faithful, i.e., factually consistent with the provided context, I can serve this answer to my user with high confidence!

But what is the cost of running this eval?

I ran the RAGAs evaluation on our RAG application data. Plotting the number of tokens sent to OpenAI shows that, on average, roughly 90% of the tokens were spent running the evaluation. Just ~10% of my tokens were used to generate the response!

Just ~10% of my tokens were used to generate the response! The rest was the RAGAs Evaluation overhead!
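If you want to reproduce that split for your own application, here is a rough sketch using tiktoken. The prompt strings and variable names are placeholders for whatever your request logs actually contain.

```python
import tiktoken

# Tokenizer matching gpt-3.5-turbo-style models.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Placeholder payloads: the single prompt that produced the answer versus the
# several prompts RAGAs sends to score it.
generation_prompt = "Answer the question using the context below.\n..."
eval_prompts = [
    "...faithfulness prompt...",
    "...answer relevancy prompt...",
    "...context recall prompt...",
    "...context precision prompt...",
]

generation_tokens = count_tokens(generation_prompt)
eval_tokens = sum(count_tokens(p) for p in eval_prompts)
total = generation_tokens + eval_tokens

print(f"Generation: {100 * generation_tokens / total:.0f}% of tokens")
print(f"Evaluation: {100 * eval_tokens / total:.0f}% of tokens")
```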


But at least it was fast, right? Nope, for my application, each evaluation (4 metrics) takes, on average, 15-20 seconds to run and involves five requests to OpenAI.

Finally, each evaluation costs somewhere between $0.10 to $0.15 in API costs.

And so, while our evaluation and monitoring system is like buying a Ferrari to watch over a bicycle, OpenAI is not just baking the cake but gleefully devouring it as well.

Death by RAG Evals

My first concern is the token overhead. Using LLMs for evaluation inevitably adds overhead: each eval prompt includes the original RAG query, the retrieved context, and the generated answer, and the evaluator's own outputs are also quite long. As a result, the eval requires about 9x the number of tokens needed for the original query and response pair.

This can be reduced by running fewer eval metrics than all four, but each metric in RAGAs is essential and gives a good overview of the RAG system's performance.

Secondly, it takes at least 10 seconds to run all the evaluation metrics per RAG query, and more than 20 seconds for larger query-answer pairs. If you run evals to verify that a response is truthful and accurate before serving it, this latency is added directly to your response time.

Finally, I am not sure about using LLMs to evaluate the output of other LLMs. Different LLMs will assign different scores to the same response; the creators also allude to this in their docs here. So, do we choose an LLM that gives us the best scores? Or do we fix a scoring LLM and then try to improve our RAG output based on that? Or do we fine-tune a RAG-scoring LLM for our domain? Doesn't that defeat the purpose?


