On Fact-checking RAG Outputs

Overview

In this article, we explore the challenges and methods of fact-checking RAG (Retrieval-Augmented Generation) output.

We begin with a philosophical rant about what exactly a fact is, and why I think knowledge is better represented in an “index” (encyclopedia) than in a “knowledge graph”.

Following that, we go into practical experiments, including:

  • Implementing a fact extractor using "GPT-4o-mini".
  • Utilizing "DeBERTa" to detect contradictions between two sentences.
  • Building a fact-checker pipeline.
  • Testing these tools in a controlled environment: a single-document Retrieval-Augmented Generation (RAG) setup, where we check whether the RAG system has given us false information.

The code accompanying this article can be found on GitHub.

The interestingly lame question of what a fact is

I'm not a cynic (well, maybe a little). Things do happen, as a matter of fact. It's just that our narrative of them is entirely relative. We cannot check whether something really happened or not. What we can do is check one narrative against another, and hope that the narrative we treat as the premise is itself a fact.

Simple sentences are easy enough to comprehend. “Ahmed went to the doctor” is an easy one.

But let's consider “Ahmed went eagerly to the doctor”. Now that's much harder. What does “eagerly” mean? Does it match the sentence “Ahmed wanted to go to the doctor”?

It's tricky!

Let's consider another sentence describing a scientific fact: "Plants start photosynthesis early in the morning". Does that mean they start at 8:00? At 6:00?

How would this sentence be matched against “Plants start photosynthesis in the morning”? One of them is more “correct” than the other.

Knowledge as a graph

Taking the simple sentence "Plants do photosynthesis”, we can represent it as a graph: “Plants” as one node, “Photosynthesis” as another, connected by an edge labeled “do” (or the whole sentence can be the label).

This constructs a knowledge graph.

Very simple knowledge graph

But when sentences get complex, the graphs get complex, and putting knowledge into the graph gets complex too. Again, what about "Plants do photosynthesis early in the morning"?
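As a sketch of the idea (the names here are illustrative, not from the article's code), the simple sentence can be stored as one labeled edge, and the complex sentence immediately forces a choice: cram the extra detail into the label, or invent more nodes.

```python
# Minimal sketch: a knowledge graph as a list of labeled edges.
from typing import NamedTuple


class Edge(NamedTuple):
    source: str   # subject node, e.g. "Plants"
    label: str    # relation label, e.g. "do"
    target: str   # object node, e.g. "Photosynthesis"


graph = [Edge("Plants", "do", "Photosynthesis")]

# The complex sentence: either the label absorbs the detail, or the
# graph needs extra nodes for "early" and "morning".
graph.append(Edge("Plants", "do early in the morning", "Photosynthesis"))

for edge in graph:
    print(f"({edge.source}) --[{edge.label}]--> ({edge.target})")
```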

Encyclopedias (Indexes) are simpler

What has worked well so far to describe knowledge is simple text in encyclopedias. Plain old written knowledge in sentences, in any language. And we take the entities from each piece of text and put them in an index for easier lookup.

Extracting facts from text

LLMs can do a relatively good job of extracting the entities mentioned in a piece of text, along with the sentences that reference them. They can basically build an index. If we consider this text to be a true “premise”, then we can treat the resulting index of entities and their associated sentences as an index of entities and facts about those entities.

Now let's build a simple fact extractor using an LLM.

We define an abstract Python class called Extractor

Extractor class is a base class to extract info from a text using an LLM
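The article shows this class as an image; a hedged sketch of what such a base class might look like (the method name and signature are my assumptions):

```python
# Sketch of an abstract base class for LLM-based extraction.
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Base class for extracting information from a text using an LLM."""

    @abstractmethod
    def extract(self, text: str):
        """Return structured information extracted from `text`."""
        ...
```

Concrete subclasses then implement `extract` with a specific model behind it.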


And we implement it so that it can use either llama3.2 or gpt-4o-mini. I observed that, for this task, llama3.2 failed miserably.

Fact extractor class

The return type of this extract method is a special type:


Facts type class
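The actual type is shown as an image; a hedged sketch of its likely shape, assuming it maps each entity to the sentences that mention it (field and method names are my guesses):

```python
# Sketch of a Facts type: entity -> sentences (facts) mentioning it.
from dataclasses import dataclass, field


@dataclass
class Facts:
    # entity name -> list of sentences in which the entity appears
    entities: dict[str, list[str]] = field(default_factory=dict)

    def add(self, entity: str, sentence: str) -> None:
        self.entities.setdefault(entity, []).append(sentence)


facts = Facts()
facts.add("Go", "Go supports generics")
facts.add("generics", "Go supports generics")
print(facts.entities)
# {'Go': ['Go supports generics'], 'generics': ['Go supports generics']}
```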

The above fact extractor can extract entities and facts from a paragraph.

For example if we give it an input of "Go supports generics", it will give us something like this:

Go       ---> ["Go supports generics"]
generics ---> ["Go supports generics"]

Detecting contradiction

Now when it comes to fact-checking, there is no general, absolute fact check. We can only compare a hypothesis against a premise.

Consider this:

Hypothesis: "Go supports generics".

Premise: "Go supports generics as of version go1.18".

They do not contradict each other, but they don't match 100% either. Therefore, fact checking should not return a boolean; it should give us a score and a classification. That's a good job for an encoder-only transformer model, which excels at understanding and classifying text.

I'll choose DeBERTa for detecting contradiction.

How does it differ from an LLM?

In short, LLMs excel in text generation and general-purpose tasks, while BERT and DeBERTa specialize in text comprehension and specific NLP applications.


An observation with BERT and DeBERTa

I observed through experimenting that passing two sentences to DeBERTa is far more accurate at detecting contradiction than passing a paragraph as the premise and a sentence or another paragraph as the hypothesis.


Now let's build a fact-checker with DeBERTa

Fact checker with DeBERTa

Our fact checker class returns a score and one of three classifications:

  • Contradiction.
  • Entailment.
  • Neutral.
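The checker itself is shown as an image; a hedged sketch of its logic, with the NLI scoring function injected so the class is independent of the model. In practice it would wrap an NLI-fine-tuned DeBERTa checkpoint from Hugging Face transformers (the exact model name used in the article is not shown, so that part is an assumption):

```python
# Sketch of a fact checker. The NLI scorer is injected; a real one would
# wrap an NLI-fine-tuned DeBERTa model (e.g. via the transformers library).
from typing import Callable

LABELS = ("contradiction", "entailment", "neutral")

# nli_fn(premise, hypothesis) -> {"contradiction": p, "entailment": p, "neutral": p}
NliFn = Callable[[str, str], dict[str, float]]


class FactChecker:
    def __init__(self, nli_fn: NliFn):
        self.nli_fn = nli_fn

    def check(self, premise: str, hypothesis: str) -> tuple[str, float]:
        """Return (label, score) for the hypothesis against the premise."""
        scores = self.nli_fn(premise, hypothesis)
        label = max(LABELS, key=lambda l: scores.get(l, 0.0))
        return label, scores[label]
```

With a real model plugged in as `nli_fn`, `check("Go supports generics as of go1.18", "Go supports generics")` should come back as entailment with a high score.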

Now let's plug those two components into a RAG system to fact-check the LLM's output against the context, to decrease the likelihood of hallucination.

Building a test hallucinating RAG

We use a LangChain graph, as we did in previous articles, to spin up a simple RAG system. But we hijack the LLM answer with something false and see how our fact-checker does.

Here we initialize the LLM, the fact checker, the fact extractor, and the RAG state. We also hard-code one blog post to use in our retriever.

Rag initialization

We define a fake retriever that uses only this blog post as context. Also, we hijack the generate function to return something incorrect.


Hijacking the retrieve and generate functions
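The hijacked steps are shown as an image; a hedged sketch of the same idea, written as plain functions over a state dict (the article uses a LangChain graph, and the state keys and document text here are my assumptions):

```python
# Sketch: a fake retriever and a hijacked generate step over a state dict.
BLOG_POST = "Go supports generics as of version go1.18."


def retrieve(state: dict) -> dict:
    # Fake retriever: always return the single hard-coded blog post.
    return {**state, "context": [BLOG_POST]}


def generate(state: dict) -> dict:
    # Hijacked generation: ignore the LLM and return something false,
    # so the fact checker has a contradiction to catch.
    return {**state, "answer": "Go has never supported generics."}


state = generate(retrieve({"question": "Does Go support generics?"}))
print(state["answer"])
# Go has never supported generics.
```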

Then we define a fact-checking step that goes over the facts in the answer and compares them with the facts found in the context.

It's probably a good idea to do this fact extraction on the documents while we load them into the store, so we avoid extracting facts from them on every run.


Fact checking logic
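The logic is shown as an image; a simplified, self-contained sketch of that step, where every fact in the answer is compared against every fact in the context and contradictions are collected. Both the fact extractor and the NLI classifier are stubbed out here (the real step would call the LLM extractor and the DeBERTa checker):

```python
# Sketch of the fact-checking step with stubbed extractor and NLI model.

def split_facts(text: str) -> list[str]:
    # Stand-in for the LLM fact extractor: one fact per sentence.
    return [s.strip() for s in text.split(".") if s.strip()]


def nli(premise: str, hypothesis: str) -> str:
    # Toy NLI stub: flag a contradiction when the sentences share a topic
    # but disagree on negation. A real checker would use DeBERTa.
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    same_topic = len(p & h) >= 2
    negated = ("never" in h) != ("never" in p)
    return "contradiction" if same_topic and negated else "neutral"


def fact_check(answer: str, context: str) -> list[tuple[str, str]]:
    """Return (context_fact, answer_fact) pairs that contradict."""
    issues = []
    for answer_fact in split_facts(answer):
        for context_fact in split_facts(context):
            if nli(context_fact, answer_fact) == "contradiction":
                issues.append((context_fact, answer_fact))
    return issues


issues = fact_check(
    answer="Go has never supported generics.",
    context="Go supports generics. Go is garbage collected.",
)
print(issues)
# [('Go supports generics', 'Go has never supported generics')]
```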

Next, we glue it all together with a main function and an output function to print the results.


Main function gluing all of it together

And we run it


Output of fact-checking false RAG output


Voilà!
