Assessing your custom AI with RAGAS
Justo Hidalgo
Chief AI Officer at Adigital. Highly interested in Responsible AI and Behavioral Psychology. PhD in Computer Science. Book author, working on my fourth one!
I have written a few times here about how to build applications that make the most of your own data and Large Language Models, so you get a chat that "talks to your data". These are now known as RAG (Retrieval Augmented Generation) apps, because the pieces of data that answer your question are first selected by a "local" retrieval engine, and only then are those chunks sent forward to an existing LLM like GPT, Llama 2 or Mixtral. You can find here an example with code that I built a few months ago with GPT, LangChain and Pinecone.
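Conceptually, the pattern is just "retrieve, then generate". Here is a minimal sketch of it (not the code from the linked post): the retrieve() helper is a hypothetical stand-in for whatever vector store you use, while the OpenAI chat call is the standard client API.

```python
# Minimal RAG sketch: retrieve relevant chunks, then ask the LLM to answer
# using only those chunks. `retrieve` is a hypothetical placeholder for your
# own vector store lookup (Pinecone, FAISS, etc.).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def retrieve(question: str, k: int = 4) -> list[str]:
    """Hypothetical retriever: return the k chunks most similar to the question."""
    raise NotImplementedError("plug in your vector store query here")


def rag_answer(question: str) -> str:
    # 1. Retrieval: select the most relevant pieces of your own data.
    contexts = retrieve(question)
    # 2. Generation: send the question plus those chunks to an existing LLM.
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(contexts)
        + f"\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```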
While the example shown in that link works very well for personal use, production-ready RAGs require much more work, both on the architectural side (which I won't discuss here today) and on the evaluation side (which... I will :)
Because, how do you assess or evaluate the quality of the answers of your RAG app? A few months ago my answer would have been "I have no idea, it just works well enough". But little by little, different techniques have appeared that provide some kind of assessment of the results produced by the RAG-enabled chat. One of them is called RAGAS (Retrieval Augmented Generation Assessment). The abstract of the original paper, from September last year, summarizes it quite well:
Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAS, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations.
While ground truth human annotations are actually required for one of the metrics, the approach minimizes the need to manually write human-approved answers for each question.
I found a great description of the method in this Medium post by Leonie Monigatti that shows how RAGAS works. The code is in the post so I won't repeat it here, but to summarize:
- Once you have the RAG application, you can evaluate it with the RAGAS framework in Python.
- The framework provides four assessment metrics: (1) Context precision checks whether the chunks relevant to the ground-truth answer are ranked at the top of the retrieved results. (2) Context recall measures whether all the information needed to answer the question, according to the ground truth, was actually retrieved. (3) Faithfulness measures whether the claims in the generated answer are supported by the retrieved context. Finally, (4) Answer relevance measures how relevant and complete the generated answer is with respect to the question.
- You will need to prepare ground truths if you want to obtain values for some of these metrics. These are question/answer pairs that should relate to the documents behind your RAG.
- That's it. Now you only need to select the metrics you want to compute and the dataset to evaluate them against; a minimal sketch of the flow follows this list. The result is a Pandas DataFrame that you can visualize directly or export to a file.
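To make the steps concrete, here is a minimal sketch of that flow based on the ragas documentation: every value is a placeholder, and the exact column names (for instance "ground_truths" vs "ground_truth") depend on the ragas version you have installed.

```python
# Minimal RAGAS evaluation sketch; all data values are placeholders.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_data = {
    "question": ["<question 1>", "<question 2>"],
    "answer": ["<answer generated by your RAG app>", "<answer 2>"],
    "contexts": [["<retrieved chunk>", "<retrieved chunk>"], ["<retrieved chunk>"]],
    # Older ragas versions expect "ground_truths" as a list per question;
    # newer ones use a single "ground_truth" string. Adjust to your version.
    "ground_truths": [["<human-approved answer 1>"], ["<human-approved answer 2>"]],
}

dataset = Dataset.from_dict(eval_data)

# Pick the metrics you care about; the evaluation itself calls an LLM under the hood.
result = evaluate(
    dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)

df = result.to_pandas()  # one row per question, one column per metric
df.to_csv("ragas_results.csv", index=False)
```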
Below is the result of the example from the Medium post after I ran it on my laptop. While you cannot see the full answers in the image, the context precision of the first and second questions, or the faithfulness of the second one, clearly indicates that there may be issues worth analyzing there.
I have tested it against my BehPM AI :) with ten ground-truth tuples, and the results were mixed, clearly telling me that I still need to fine-tune my system :)
I then ran a newer test in which I had GPT-4 generate additional questions without ground truths. Because of this, I was missing the context_recall metric (which requires ground truths), but I was able to increase the number of tests. That was really interesting.
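For that ground-truth-free run, the earlier sketch shrinks to the metrics that only need the question, the generated answer and the retrieved contexts; again, the data below is a placeholder for the questions GPT-4 generated and the answers my RAG app returned.

```python
# Evaluation without ground truths: keep only the metrics that don't need them.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = {
    "question": ["<question generated by GPT-4>"],
    "answer": ["<answer returned by the RAG app>"],
    "contexts": [["<retrieved chunk>", "<retrieved chunk>"]],
}

# context_recall is dropped because it requires ground truths; depending on
# your ragas version, context_precision may also require them.
result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result.to_pandas())
```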
It seems there is still a lot of work to do before LLM and RAG-based applications behave consistently. That is why Responsible AI in general, and tools like RAGAS specifically, become more and more important as the state of the art advances. However, companies that want to take advantage of where we stand now and gain a strategic position must realize that we are on quicksand here: almost every day or week brings theoretical, technical and product advances.