On Fact-checking RAG Outputs
Overview
In this article, we explore the challenges and methods of fact-checking RAG (Retrieval-Augmented Generation) output.
We begin with a philosophical rant about what exactly a fact is, and why I think knowledge is better represented in an “index” (encyclopedia) than in a “knowledge graph”.
Following that, we go into practical experiments: extracting entities and facts from text with an LLM, detecting contradictions with DeBERTa, and wiring both into a deliberately hallucinating RAG pipeline.
The code accompanying this article can be found here on GitHub.
The interestingly lame question of what a fact is
I'm not a cynic (maybe a little), sure, but things do happen for a fact. It's just that our narrative of them is entirely relative. We cannot check whether something really happened or not. What we can do is check one narrative against another, and hope that the narrative we take as the premise is a fact.
Sure, simple sentences are easy enough to comprehend. “Ahmed went to the doctor” is an easy one.
But let's consider “Ahmed went eagerly to the doctor”. Now that's very difficult. What does eagerly mean? Does it match the other sentence, “Ahmed wanted to go to the doctor”?
It's tricky!
Let's consider another sentence describing a scientific fact: "Plants start photosynthesis early in the morning". Now does that mean they start at 8:00? At 6:00?
How would this sentence be matched against “Plants start photosynthesis in the morning”? One of them is more “correct” than the other.
Knowledge as a graph
Taking the simple sentence “Plants do photosynthesis”, we can represent it as a graph: “Plants” is one node, “Photosynthesis” is another, and they are connected by an edge labeled “do” (or the whole sentence can be the label).
This constructs a knowledge graph.
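As a toy illustration (not the article's code), the whole sentence collapses into a single labeled edge:

```python
# A toy triple store: "Plants do photosynthesis" as one labeled edge.
triples = [
    ("Plants", "do", "Photosynthesis"),  # (subject node, edge label, object node)
]

# Everything we "know" about an entity is whatever edges touch it.
facts_about_plants = [t for t in triples if "Plants" in (t[0], t[2])]
print(facts_about_plants)  # [('Plants', 'do', 'Photosynthesis')]
```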
But when sentences get complex, the graphs get complex, and putting knowledge into the graph also gets complex. Again, what about "Plants do photosynthesis early in the morning"?
Encyclopedias (Indexes) are simpler
What has worked well so far to describe knowledge is simple text in encyclopedias: plain old written knowledge in sentences, in any language. We take the entities from each piece of text and put them in an index for easier lookup.
Extracting facts from text
LLMs can do a relatively good job of extracting the mentioned entities and their references in a piece of text. They can basically build an index. If we consider this text to be a true “Premise”, then we can treat the index of entities and the sentences that mention them as an index of entities and facts about those entities.
Now let's build a simple fact extractor using an LLM.
We define an abstract Python class called Extractor.
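A minimal sketch of what that abstract class could look like (the names here are my assumptions; the exact code is in the repo):

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Turns a piece of text into an index of entities and the facts about them."""

    @abstractmethod
    def extract(self, text: str) -> "ExtractedFacts":
        # ExtractedFacts is the return type shown a little further below.
        ...
```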
And we implement it so that it can use either llama3.2 or gpt-4o-mini. I observed that for this task, llama3.2 failed miserably.
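A hedged sketch of the gpt-4o-mini-backed implementation; the prompt wording and JSON handling are my assumptions, not the repo's exact code:

```python
import json

from openai import OpenAI


class OpenAIExtractor(Extractor):
    """Asks the model to index every entity with the sentences stating facts about it."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def extract(self, text: str) -> "ExtractedFacts":
        prompt = (
            "List every entity mentioned in the text below and return a JSON object "
            "mapping each entity to the sentences that state facts about it.\n\n"
            f"Text: {text}"
        )
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return ExtractedFacts(entity_facts=json.loads(response.choices[0].message.content))
```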
And the return type of this extract method is this special type.
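Sketched here as a small Pydantic model holding the entity-to-facts index (the field name entity_facts is my choice, not necessarily the repo's):

```python
from pydantic import BaseModel


class ExtractedFacts(BaseModel):
    """An index: each entity maps to the sentences that state facts about it."""

    entity_facts: dict[str, list[str]]
```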
The above fact extractor can extract entities and facts from a paragraph.
For example, if we give it the input "Go supports generics", it will give us something like this:
Go ---> ["Go supports generics"]
generics ---> ["Go supports generics"]
Detecting contradiction
Now, when it comes to fact-checking, there is no general, absolute fact check. We can only compare a hypothesis against a premise.
Consider this:
Hypothesis: "Go supports generics".
Premise: "Go supports generics as of version go1.18".
They do not contradict each other, but they don't match 100% either. Therefore, this fact checking should not return a boolean; it should give us a score and a classification. That is a good job for encoder-only transformer models, which are well suited to understanding and classifying text.
I'll choose DeBERTa for detecting contradiction.
How does it differ from an LLM?
In short, LLMs excel in text generation and general-purpose tasks, while BERT and DeBERTa specialize in text comprehension and specific NLP applications.
An observation with BERT and DeBERTa
I observed through experimenting that passing two sentences to DeBERTa is far more accurate in detecting contradiction than passing a paragraph as the premise and a sentence or another paragraph as the hypothesis.
Now let's build a fact-checker with DeBERTa.
Our fact-checker class returns a score and a classification of either contradiction, neutral, or entailment.
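Here is a sketch built on Hugging Face transformers with an MNLI-finetuned DeBERTa checkpoint; the model name and class layout are my choices, not necessarily the repo's:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class FactChecker:
    """Scores a hypothesis against a premise with an NLI-finetuned DeBERTa model."""

    def __init__(self, model_name: str = "microsoft/deberta-large-mnli"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def check(self, premise: str, hypothesis: str) -> tuple[str, float]:
        # Pass the two sentences as a pair; per the observation above, this is
        # more reliable than feeding whole paragraphs.
        inputs = self.tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = logits.softmax(dim=-1)[0]
        label_id = int(probs.argmax())
        # MNLI checkpoints classify into contradiction / neutral / entailment.
        return self.model.config.id2label[label_id], float(probs[label_id])
```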
Now let's plug these two components into a RAG system to fact-check the LLM output against the retrieved context and decrease the likelihood of hallucination.
Building a test hallucinating RAG
We use LangGraph, as we did in previous articles, to spin up a simple RAG system. But we hijack the LLM answer with something false and see how our fact-checker does.
Here we initialize the LLM, the fact-checker, the fact-extractor, and the RAG state. We also hard-code one blog post to use in our retriever.
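A sketch of the setup; the state fields and the stand-in blog post content are hypothetical:

```python
from typing import TypedDict

from langchain_core.documents import Document
from langchain_openai import ChatOpenAI


class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str
    fact_checks: list[tuple[str, str, float]]  # (answer fact, label, score)


llm = ChatOpenAI(model="gpt-4o-mini")
fact_extractor = OpenAIExtractor()
fact_checker = FactChecker()

# One hard-coded blog post standing in for a real document store.
blog_post = Document(page_content="Go supports generics as of version go1.18. ...")
```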
We define a fake retriever that uses only this blog post as context. We also hijack the generate function to return something incorrect.
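A sketch of the two hijacked nodes; the false answer is deliberately wrong so the checker has something to catch:

```python
def retrieve(state: RAGState) -> dict:
    # Fake retriever: always return the single hard-coded blog post.
    return {"context": [blog_post]}


def generate(state: RAGState) -> dict:
    # Hijacked: skip the LLM entirely and return a deliberately false answer.
    return {"answer": "Go does not support generics."}
```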
Then we define a fact-checking step to go over the facts in the answer and compare them with the facts found in the context.
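A sketch of that node, reusing the extractor and checker from above:

```python
def fact_check(state: RAGState) -> dict:
    context_text = "\n".join(doc.page_content for doc in state["context"])
    context_facts = fact_extractor.extract(context_text)
    answer_facts = fact_extractor.extract(state["answer"])

    results = []
    for entity, facts in answer_facts.entity_facts.items():
        # Facts about the same entity in the context serve as premises.
        premises = context_facts.entity_facts.get(entity, [])
        for fact in facts:
            for premise in premises:
                label, score = fact_checker.check(premise, fact)
                results.append((fact, label, score))
    return {"fact_checks": results}
```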
It's probably a good idea to run this fact extraction on the documents while we load them into the stores, so we avoid extracting facts from them on every run.
Next, we glue it all together with a main function and an output function to print the results.
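A sketch of the wiring, assuming the node and state names used above:

```python
from langgraph.graph import END, START, StateGraph


def print_output(state: dict) -> None:
    print("Answer:", state["answer"])
    for fact, label, score in state["fact_checks"]:
        print(f"  {fact!r} -> {label} ({score:.2f})")


def main() -> None:
    graph = StateGraph(RAGState)
    graph.add_node("retrieve", retrieve)
    graph.add_node("generate", generate)
    graph.add_node("fact_check", fact_check)
    graph.add_edge(START, "retrieve")
    graph.add_edge("retrieve", "generate")
    graph.add_edge("generate", "fact_check")
    graph.add_edge("fact_check", END)

    app = graph.compile()
    final_state = app.invoke({"question": "Does Go support generics?"})
    print_output(final_state)


if __name__ == "__main__":
    main()
```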
And we run it.
Voilà!