Death by RAG Evals
Archana Vaidheeswaran
Building Community for AI Safety | Board Director | Machine Learning Consultant | Singapore 100 Women in Tech 2023
Welcome back to Quick Bites! This month, we're keeping it short and sweet, ensuring our busy readers get their dose of insightful content. As January unfolds, the buzz around AI at the World Economic Forum in Davos is hard to miss. The conference put a spotlight on a conscious approach to AI, emphasizing its application across various sectors and its intersection with other technologies, all while prioritizing people-first strategies.
Among business leaders, there's a growing concern about the 'impending doom' of AI overreach. But the real head-scratcher is evaluating these evolving models. RAG (Retrieval-Augmented Generation) applications are a case in point: these systems need to be assessed not just for the accuracy and relevance of their responses, but also for their ability to retrieve and apply pertinent context.
Typically, human annotation is the go-to method for such evaluations. However, it is time-consuming, error-prone, and unusable for real-time systems. And while metrics like perplexity can assess the language model itself, they fall short of evaluating the complete RAG system.
Enter the world of self-evaluating systems, like RAGAs, which use LLMs (Large Language Models) for reference-free evaluations. But this raises an intriguing dilemma: how objectively can a system evaluate its own output?
Evaluating the quality of RAG applications in production is a considerable challenge. The evaluation needs to account for not only the quality and faithfulness of the generation but also the ability to identify and retrieve relevant context.
Human annotation is the most accurate evaluation method. However, it is slow and prone to errors and biases. Moreover, you cannot use human evaluators for real-time systems. Metrics like perplexity can be used to evaluate the performance of the language model itself but not the performance of the entire RAG system.
The holy grail for RAG evaluations is an evaluation that is self-contained and reference-free, meaning you do not need human-annotated reference answers. RAGAs is one of the most popular frameworks for doing so. To make the system reference-free, RAGAs uses LLMs to evaluate the generated answers. Herein lies the problem.
A Typical RAGAs Evaluation
RAGAs uses OpenAI’s API by default to calculate four main metrics: answer relevancy, faithfulness, context recall, and context precision. The default model is GPT-3.5-turbo, but you can plug in your own LLM. Taking the harmonic mean of the four metrics gives you the ragas score, which “is a single measure of the performance of your QA system across all the important aspects.”
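To see how the roll-up works, here is a quick sketch of the harmonic mean behind the ragas score (the metric values below are purely illustrative, not taken from my runs):

```python
from statistics import harmonic_mean

# The ragas score is the harmonic mean of the four metric scores.
# Illustrative values only; these are not results from the article.
scores = {
    "answer_relevancy": 0.95,
    "faithfulness": 0.90,
    "context_recall": 0.70,
    "context_precision": 0.85,
}
ragas_score = harmonic_mean(list(scores.values()))
print(round(ragas_score, 3))  # ≈ 0.839
```

Because the harmonic mean is dragged down by the weakest component, a single poor metric (here, context recall) noticeably lowers the overall ragas score.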
To run your evaluation, you provide RAGAs with the metrics you want to calculate, the query, the answer, and the context used to arrive at the answer.
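In code, that looks roughly like the sketch below, based on the ragas quickstart. The exact imports and column names can differ between ragas versions, and the sample question, answer, and context are made up for illustration:

```python
# Minimal sketch of a RAGAs evaluation run (requires OPENAI_API_KEY to be set,
# since ragas calls the OpenAI API by default to score each metric).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

# One query/answer pair plus the retrieved context used to produce the answer.
data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
    # Reference answer needed for context_recall; newer ragas versions name
    # this column "ground_truth" instead of "ground_truths".
    "ground_truths": [["Paris is the capital of France."]],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[answer_relevancy, faithfulness, context_recall, context_precision],
)
print(result)  # e.g. {'answer_relevancy': 0.97, 'faithfulness': 1.0, ...}
```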
The results show that my RAG response was faithful and the retrieved context was relevant to the question. However, I can improve my context recall, which measures “the ability of the retriever to retrieve all the necessary information needed to answer the question.” Overall, since my answer was faithful, i.e., factually consistent with the provided context, I can serve it to my user with high confidence!
But what is the cost of running this eval?
I ran the RAGAs evaluation on our RAG application data. If you plot the number of tokens sent to OpenAI, on average about 90% of the tokens are used for running the evaluation. Just ~10% of my tokens were used to generate the response!
But at least it was fast, right? Nope, for my application, each evaluation (4 metrics) takes, on average, 15-20 seconds to run and involves five requests to OpenAI.
Finally, each evaluation costs somewhere between $0.10 and $0.15 in API charges.
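To put those per-query figures in perspective, here is a back-of-the-envelope calculation; the traffic number is an assumption of mine, not a figure from my application:

```python
# Rough eval cost at the per-query figures above (hypothetical workload).
COST_PER_EVAL = 0.12       # midpoint of the $0.10-$0.15 range
QUERIES_PER_DAY = 10_000   # assumed traffic, purely illustrative

daily = COST_PER_EVAL * QUERIES_PER_DAY
print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month just for evals")
# ~$1,200/day, ~$36,000/month just for evals
```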
And so, while our evaluation and monitoring system is like buying a Ferrari to watch over a bicycle, OpenAI is not just baking the cake but gleefully devouring it as well.
Death by RAG Evals
My first concern is token overhead. Using LLMs for evaluation inevitably adds tokens: each eval prompt includes the original RAG query, the retrieved context, and the generated answer, and the eval's own generated outputs are also quite long. The result is that the eval requires about 9x the number of tokens needed for the original query and response pair.
This can be reduced by running fewer than all four eval metrics, but each metric in RAGAs is essential, and together they give a good overview of the RAG system's performance.
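If you do decide to trade coverage for cost, passing a subset of metrics is straightforward. The snippet below reuses the same illustrative data as the earlier sketch and scores only faithfulness:

```python
# Running a single metric instead of all four cuts the eval token bill
# and latency roughly in proportion (illustrative data, as before).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and most populous city of France."]],
})
print(evaluate(dataset, metrics=[faithfulness]))
```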
Secondly, it takes at least 10 seconds to run all the evaluation metrics per RAG query; for larger query-answer pairs, it can take more than 20 seconds. If you use evals to check that your responses are truthful and accurate before serving them, that delay is added directly to your response latency.
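To make that concrete, here is one possible shape of such a pre-serving gate. This is a hypothetical pattern, not something RAGAs prescribes; the threshold, helper name, fallback message, and result access are all assumptions that may need adjusting for your ragas version:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

FAITHFULNESS_THRESHOLD = 0.8  # assumed cutoff, not from the article

def serve_with_eval_gate(question: str, answer: str, contexts: list[str]) -> str:
    """Score the draft answer before returning it; the 10-20 s eval sits
    directly on the request path, which is exactly the latency problem."""
    row = Dataset.from_dict(
        {"question": [question], "answer": [answer], "contexts": [contexts]}
    )
    scores = evaluate(row, metrics=[faithfulness])  # blocks on the OpenAI round-trips
    if scores["faithfulness"] >= FAITHFULNESS_THRESHOLD:
        return answer
    return "Sorry, I could not produce a reliable answer."  # hypothetical fallback
```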
Finally, I am not sure about using LLMs to evaluate the output of other LLMs. Different LLMs will assign different scores to the same response, and the creators allude to this in their docs. So, do we choose an LLM that gives us the best scores? Do we fix a scoring LLM and then try to improve our RAG output against it? Or do we finetune a RAG-scoring LLM for our domain? Doesn't that defeat the purpose?