Beyond Basics: Evaluating LLMs – Uncovering the Truth (Part 2: Advanced RAG Techniques)

In part 1 of this blog post (url), we looked at how to create a simple RAG chain. We asked some questions and got some reasonable-looking responses. But can we take them at face value? How do we really know the answers our application gives are, in fact, correct? Are they really coming from the context (our blog posts), or is the LLM pulling a fast one on us? And are we answering the user's question comprehensively, or serving up half-baked answers? LLMs are very good at making things up (we use a nice euphemism for it - hallucination, lol). They sound so convincing that we might not even bat an eye!

So what can we do about it? An obvious step is to generate a set of our own questions and corresponding right answers (aka ground truth answers). Then, we could ask another LLM to play judge, evaluating our LLM's answers against the truth. Sounds like a solid plan, right? Well, the trouble is, doing this manually is like pulling teeth – how many question-and-answer pairs can we realistically come up with on our own? That's where LLM evaluation frameworks come to our rescue. They make our lives a whole lot easier. Plus, evaluation frameworks give us the hard data – both quantitative and qualitative – that we need to really kick our application up a notch. So, before we dive headfirst into more RAG techniques, let's take our little application for a spin with an evaluation framework.

To save us the trouble, I've already gone ahead and set up the evaluation framework and even run a few quick tests. Setting one up is a blast, but that's a whole different story – we'll save it for another blog post. For now, here's what I've done so far:

  • Whipped up a set of 10 question/answer pairs using the Giskard library (Giskard's pretty slick, more on that later).
  • Ran those tests through the RAGAS framework (check out my other blog post about RAGAS) – there's a rough sketch of what that call looks like right after this list.
  • Stashed the question/answer pairs and RAGAS evaluation results in Langsmith (a lifesaver for building test suites into our CI pipeline).
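
To give you a flavour of step 2 before the full walkthrough, here's roughly what the RAGAS call looks like. Treat it as a hedged sketch, not my actual script: the data below is a toy placeholder, and the exact column names and metric imports can vary a bit between ragas versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Toy placeholders – in the real run, questions/ground truths came from the
# Giskard-generated test set, and answers/contexts from our RAG chain.
eval_dataset = Dataset.from_dict({
    "question": ["What is a RAG chain?"],
    "answer": ["A RAG chain retrieves relevant documents and feeds them to an LLM."],
    "contexts": [["RAG combines a retriever with an LLM to ground answers in documents."]],
    "ground_truth": ["RAG augments an LLM with retrieved context before answering."],
})

results = evaluate(
    eval_dataset,
    # context relevancy was also part of my run; its import/name depends on the ragas version
    metrics=[answer_correctness, answer_relevancy, context_precision, context_recall, faithfulness],
)
print(results)
```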

If any of that sounded like a foreign language, don't sweat it! We'll break down those steps in an upcoming blog post. Now for the juicy part – the results:

  • Answer correctness - .54
  • Answer relevancy - .94
  • Context precision - .92
  • Context recall - .70
  • Context relevancy - .16
  • Faithfulness - .74

Here's the source, straight from the Langsmith dashboard:

Yikes! Our application isn't exactly setting the world on fire. That context relevancy score is a real eyesore – a measly .16 (the higher the number, the better). Let's just say the other scores aren't winning any awards either, but first let's look at the outlier - context relevancy. So, what exactly is this context relevancy all about?

Below is the official definition from RAGAS. It might sound a bit technical, but bear with me:

This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. It is calculated by below formulae where S is ‘sentences within the retrieved context that are relevant for answering the given question’.
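
The formula itself (reproduced here from the RAGAS documentation as I remember it, so treat it as illustrative) boils down to a simple ratio:

context relevancy = |S| / (total number of sentences in the retrieved context)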

In plain English, it means "how much of the stuff our retriever found is actually useful for answering the question." Clearly, our retriever isn't giving us the best material to work with. Time to roll up our sleeves and fix this!

Where did our retriever go off the rails? Let's step back and think about its job: it compares the semantic similarity between the user's question and the document embeddings, and fetches the most relevant documents. Let's break down what could be going wrong:

Quantity: By default the retriever fetches 4 documents, but what if we bumped it up to 8 or higher? Would that lead us to better stuff? Worth a try!
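
Here's a minimal sketch of that tweak, assuming a LangChain setup with a Chroma vector store and OpenAI embeddings (your vector store and paths from part 1 may differ; the persist directory here is just a placeholder):

```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Re-open the existing index (placeholder path) with the same embedding model
vectorstore = Chroma(
    persist_directory="./blog_index",
    embedding_function=OpenAIEmbeddings(),
)

# Pull 8 chunks per question instead of the default 4
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})
docs = retriever.invoke("How do we evaluate a RAG application?")
print(len(docs))
```

One caveat: fetching more chunks can also dilute relevancy, since more documents means more chances of pulling in off-topic text. It's a dial to tune against the metric, not a silver bullet.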

Embeddings: We're using OpenAI Embeddings to turn text into numbers. What if we switched to a different algorithm? Could that be the key?
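
Swapping the embedding model is also a small code change. A hedged sketch using one popular open-source sentence-transformer (just an example, not a recommendation from part 1, and the whole index would need to be rebuilt with the new embeddings):

```python
# needs: pip install sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings

# An open-source alternative to OpenAI's embedding models
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector = embeddings.embed_query("What is context relevancy in RAGAS?")
print(len(vector))  # this particular model produces 384-dimensional vectors
```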

Document Size: Are those documents big or small? If they're too small, did we lose important context when we broke them into chunks (remember the chunking we did in the first blog?)
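
If chunk size is the culprit, re-splitting is easy to experiment with. A sketch using LangChain's RecursiveCharacterTextSplitter (the numbers are just values to try, not what part 1 used, and the document below is a placeholder):

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder for the blog posts we loaded in part 1
docs = [Document(page_content="...full text of a blog post...")]

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # larger chunks keep more surrounding context together
    chunk_overlap=150, # overlap so a key sentence isn't cut in half at a chunk boundary
)
chunks = splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks")
```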

Out of scope: Or are we asking questions whose answers simply aren't in the context at all?

Alright, looks like we've got some detective work ahead of us! Let's get to the bottom of this together in the next blog. See you soon!
