Assessing your custom AI with RAGAS
Justo Hidalgo
Chief AI Officer at Adigital. Highly interested in Responsible AI and Behavioral Psychology. PhD in Computer Science. Book author, working on my fourth one!
I have written a few times here about how to build applications that make the most of your own data and Large Language Models, so you get a chat that "talks to your data". These are now known as RAG (Retrieval Augmented Generation) apps, because the pieces of data that answer your question are first selected by a "local" retrieval engine, and only then are those chunks sent forward to an existing LLM like GPT, Llama 2 or Mixtral. You can find here an example with code that I built a few months ago with GPT, LangChain and Pinecone.
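Conceptually, the pattern is just "retrieve, then generate". Here is a minimal sketch of it (not the code from the linked post): the retrieve() helper is a hypothetical stand-in for whatever vector store you use, while the OpenAI chat call is the standard client API.

```python
# Minimal RAG sketch: retrieve relevant chunks, then ask the LLM to answer
# using only those chunks. `retrieve` is a hypothetical placeholder for your
# own vector store lookup (Pinecone, FAISS, etc.).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def retrieve(question: str, k: int = 4) -> list[str]:
    """Hypothetical retriever: return the k chunks most similar to the question."""
    raise NotImplementedError("plug in your vector store query here")


def rag_answer(question: str) -> str:
    # 1. Retrieval: select the most relevant pieces of your own data.
    contexts = retrieve(question)
    # 2. Generation: send the question plus those chunks to an existing LLM.
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(contexts)
        + f"\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```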
While the example shown in that link works very well for personal use, production-ready RAGs require much more work, both on the architectural side (which I won't discuss here today) and on the evaluation side (which... I will :)
Because, how do you assess or evaluate the quality of the answers of your RAG app? A few months ago my answer would have been "I have no idea, it just works well enough". But little by little, different techniques have appeared that provide some kind of assessment of the results produced by the RAG-enabled chat. One of them is called RAGAS (Retrieval Augmented Generation Assessment). The abstract of the original paper, from September last year, summarizes it quite well:
Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAS, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations.
While ground truth human annotations are actually required for one of the metrics, the approach minimizes the need to manually write human-approved answers for each question.
I found a great description of the method in this Medium post by Leonie Monigatti that shows how RAGAS works. The code is in the post so I won't repeat it here, but to summarize:
- Once you have the RAG application, you can evaluate it with the RAGAS framework in Python.
- The framework provides four assessment metrics: (1) Context precision checks whether the chunks relevant to the ground-truth answer are ranked at the top of the retrieved results. (2) Context recall measures whether all the information needed to answer the question, according to the ground truth, was actually retrieved. (3) Faithfulness measures whether the claims in the generated answer are supported by the retrieved context. Finally, (4) Answer relevance measures how relevant and complete the generated answer is with respect to the question.
- You will need to prepare ground truths if you want to obtain values for some of these metrics. These are question/answer pairs that should relate to the documents behind your RAG.
- That's it. Now you only need to select the metrics you want to compute and the dataset to evaluate them against; a minimal sketch of the flow follows this list. The result is a Pandas DataFrame that you can visualize directly or export to a file.
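To make the steps concrete, here is a minimal sketch of that flow based on the ragas documentation: every value is a placeholder, and the exact column names (for instance "ground_truths" vs "ground_truth") depend on the ragas version you have installed.

```python
# Minimal RAGAS evaluation sketch; all data values are placeholders.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_data = {
    "question": ["<question 1>", "<question 2>"],
    "answer": ["<answer generated by your RAG app>", "<answer 2>"],
    "contexts": [["<retrieved chunk>", "<retrieved chunk>"], ["<retrieved chunk>"]],
    # Older ragas versions expect "ground_truths" as a list per question;
    # newer ones use a single "ground_truth" string. Adjust to your version.
    "ground_truths": [["<human-approved answer 1>"], ["<human-approved answer 2>"]],
}

dataset = Dataset.from_dict(eval_data)

# Pick the metrics you care about; the evaluation itself calls an LLM under the hood.
result = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

df = result.to_pandas()  # one row per question, one column per metric
df.to_csv("ragas_results.csv", index=False)
```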
Below is the result of the example from the Medium post after I ran it on my laptop. While you cannot see the full answers in the image, the context precision of the first and second questions, or the faithfulness of the second one, clearly indicates that there may be issues worth analyzing there.
I have tested it against my BehPM AI :) with ten ground-truth tuples, and the results were mixed, clearly telling me that I still need to fine-tune my system :)
I then ran a newer test in which I had GPT-4 generate additional questions without ground truths. Because of this, I was missing the context_recall metric (which requires ground truths), but I was able to increase the number of tests. That was really interesting.
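For that ground-truth-free run, the earlier sketch shrinks to the metrics that only need the question, the generated answer and the retrieved contexts; again, the data below is a placeholder for the questions GPT-4 generated and the answers my RAG app returned.

```python
# Evaluation without ground truths: keep only the metrics that don't need them.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = {
    "question": ["<question generated by GPT-4>"],
    "answer": ["<answer returned by the RAG app>"],
    "contexts": [["<retrieved chunk>", "<retrieved chunk>"]],
}

# context_recall is dropped because it requires ground truths; depending on
# your ragas version, context_precision may also require them.
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result.to_pandas())
```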
It seems there is still a lot of work to do before LLM and RAG-based applications behave consistently. That is why Responsible AI in general, and tools like RAGAS specifically, become more and more important as the state of the art advances. However, companies that want to take advantage of where we stand now and gain a strategic position must realize that we are on quicksand here: almost every day or week brings theoretical, technical and product advances.