Beyond Basics: Evaluating LLMs – Uncovering the Truth (Part 2: Advanced RAG Techniques)

In part 1 of this blog post (url), we looked at how to create a simple RAG chain. We asked some questions and got some reasonable-looking responses. But can we take them at face value? How do we really know the answers our application gives are, in fact, correct? Are they really coming from the context (our blog posts), or is the LLM pulling a fast one on us? And are we answering the user's question comprehensively, or serving up half-baked answers? LLMs are very good at making things up (we use a nice euphemism for it - hallucination, lol). They sound so convincing that we might not even bat an eye!

So what can we do about it? An obvious step is to generate a set of our own questions and corresponding right answers (aka ground truth answers). Then, we could ask another LLM to play judge, evaluating our LLM's answers against the truth. Sounds like a solid plan, right? Well, the trouble is, doing this manually is like pulling teeth – how many question-and-answer pairs can we realistically come up with on our own? That's where LLM evaluation frameworks come to our rescue. They make our lives a whole lot easier. Plus, evaluation frameworks give us the hard data – both quantitative and qualitative – that we need to really kick our application up a notch. So, before we dive headfirst into more RAG techniques, let's take our little application for a spin with an evaluation framework.

To save us the trouble, I've already gone ahead and set up the evaluation framework and even run a few quick tests. Setting one up is a blast, but that's a whole different story – we'll save it for another blog post. For now, here's what I've done so far:

  • Whipped up a set of 10 question/answer pairs using the Giskard library (Giskard's pretty slick, more on that later).
  • Ran those tests through the RAGAS framework (check out my other blog post about RAGAS) – there's a rough sketch of what that call looks like right after this list.
  • Stashed the question/answer pairs and RAGAS evaluation results in Langsmith (a lifesaver for building test suites into our CI pipeline).
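
To give you a flavour of step 2 before the full walkthrough, here's roughly what the RAGAS call looks like. Treat it as a hedged sketch, not my actual script: the data below is a toy placeholder, and the exact column names and metric imports can vary a bit between ragas versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Toy placeholders – in the real run, questions/ground truths came from the
# Giskard-generated test set, and answers/contexts from our RAG chain.
eval_dataset = Dataset.from_dict({
    "question": ["What is a RAG chain?"],
    "answer": ["A RAG chain retrieves relevant documents and feeds them to an LLM."],
    "contexts": [["RAG combines a retriever with an LLM to ground answers in documents."]],
    "ground_truth": ["RAG augments an LLM with retrieved context before answering."],
})

results = evaluate(
    eval_dataset,
    # context relevancy was also part of my run; its import/name depends on the ragas version
    metrics=[answer_correctness, answer_relevancy, context_precision, context_recall, faithfulness],
)
print(results)
```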

If any of that sounded like a foreign language, don't sweat it! We'll break down those steps in an upcoming blog post. Now for the juicy part – the results:

  • Answer correctness - .54
  • Answer relevancy - .94
  • Context precision - .92
  • Context recall - .70
  • Context relevancy - .16
  • Faithfulness - .74

Here's the source, straight from the Langsmith dashboard:

Yikes! Our application isn't exactly setting the world on fire. That context relevancy score is a real eyesore – a measly .16 (the higher the number, the better). Let's just say the other scores aren't winning any awards either, but first let's look at the outlier - context relevancy. So, what exactly is this context relevancy all about?

Below is the official definition from RAGAS. It might sound a bit technical, but bear with me:

This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. It is calculated by below formulae where S is ‘sentences within the retrieved context that are relevant for answering the given question’.
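
The formula itself (reproduced here from the RAGAS documentation as I remember it, so treat it as illustrative) boils down to a simple ratio:

context relevancy = |S| / (total number of sentences in the retrieved context)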

In plain English, it means "how much of the stuff our retriever found is actually useful for answering the question." Clearly, our retriever isn't giving us the best material to work with. Time to roll up our sleeves and fix this!

Where did our retriever go off the rails? Let's step back and think about its job: it compares the semantic similarity between the user's question and the document embeddings, and fetches the most relevant documents. Let's break down what could be going wrong:

Quantity: By default the retriever fetches 4 documents, but what if we bumped it up to 8 or higher? Would that lead us to better stuff? Worth a try!
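
Here's a minimal sketch of that tweak, assuming a LangChain setup with a Chroma vector store and OpenAI embeddings (your vector store and paths from part 1 may differ; the persist directory here is just a placeholder):

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Re-open the existing index (placeholder path) with the same embedding model
vectorstore = Chroma(
    persist_directory="./blog_index",
    embedding_function=OpenAIEmbeddings(),
)

# Pull 8 chunks per question instead of the default 4
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})
docs = retriever.invoke("How do we evaluate a RAG application?")
print(len(docs))
```

One caveat: fetching more chunks can also dilute relevancy, since more documents means more chances of pulling in off-topic text. It's a dial to tune against the metric, not a silver bullet.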

Embeddings: We're using OpenAI Embeddings to turn text into numbers. What if we switched to a different algorithm? Could that be the key?
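
Swapping the embedding model is also a small code change. A hedged sketch using one popular open-source sentence-transformer (just an example, not a recommendation from part 1, and the whole index would need to be rebuilt with the new embeddings):

```python
# needs: pip install sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings

# An open-source alternative to OpenAI's embedding models
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector = embeddings.embed_query("What is context relevancy in RAGAS?")
print(len(vector))  # this particular model produces 384-dimensional vectors
```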

Document Size: Are those documents big or small? If they're too small, did we lose important context when we broke them into chunks (remember the chunking we did in the first blog?)
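
If chunk size is the culprit, re-splitting is easy to experiment with. A sketch using LangChain's RecursiveCharacterTextSplitter (the numbers are just values to try, not what part 1 used, and the document below is a placeholder):

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder for the blog posts we loaded in part 1
docs = [Document(page_content="...full text of a blog post...")]

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # larger chunks keep more surrounding context together
    chunk_overlap=150, # overlap so a key sentence isn't cut in half at a chunk boundary
)
chunks = splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks")
```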

Out of scope: Or are we asking questions whose answers simply aren't in the context at all?

Alright, looks like we've got some detective work ahead of us! Let's get to the bottom of this together in the next blog. See you soon!
