On Fact-checking RAG Outputs
Overview
In this article, we explore the challenges and methods of fact-checking RAG (Retrieval-Augmented Generation) output.
We begin with a philosophical rant about what exactly a fact is, and why I think knowledge is better represented in an “index” (encyclopedia) than in a “knowledge graph”.
Following that, we go into practical experiments: extracting entities and facts from text with an LLM, detecting contradictions with DeBERTa, and wiring both into a deliberately hallucinating RAG pipeline.
The code accompanying this article can be found here on GitHub.
The interestingly lame question of what a fact is
I'm not a cynic (maybe a little), sure, but things do happen for a fact. It's just that our narrative of them is entirely relative. We cannot check whether something really happened or not. What we can do is check one narrative against another, and hope that the narrative we take as the premise is a fact.
Sure, simple sentences are easy enough to comprehend. “Ahmed went to the doctor” is an easy one.
But let's consider “Ahmed went eagerly to the doctor”. Now that's very difficult. What does eagerly mean? Does it match the other sentence, “Ahmed wanted to go to the doctor”?
It's tricky!
Let's consider another sentence describing a scientific fact: "Plants start photosynthesis early in the morning". Now does that mean they start at 8:00? At 6:00?
How would this sentence be matched against “Plants start photosynthesis in the morning”? One of them is more “correct” than the other.
Knowledge as a graph
Taking the simple sentence “Plants do photosynthesis”, we can represent it as a graph: “Plants” is one node, “Photosynthesis” is another, and they are connected by an edge labeled “do” (or the whole sentence can be the label).
This constructs a knowledge graph.
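As a toy illustration (not the article's code), the whole sentence collapses into a single labeled edge:

```python
# A toy triple store: "Plants do photosynthesis" as one labeled edge.
triples = [
    ("Plants", "do", "Photosynthesis"),  # (subject node, edge label, object node)
]

# Everything we "know" about an entity is whatever edges touch it.
facts_about_plants = [t for t in triples if "Plants" in (t[0], t[2])]
print(facts_about_plants)  # [('Plants', 'do', 'Photosynthesis')]
```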
But when sentences get complex, the graphs get complex, and putting knowledge into the graph also gets complex. Again, what about "Plants do photosynthesis early in the morning"?
Encyclopedias (Indexes) are simpler
What has worked well so far to describe knowledge is simple text in encyclopedias: plain old written knowledge in sentences, in any language. We take the entities from each piece of text and put them in an index for easier lookup.
Extracting facts from text
LLMs can do a relatively good job of extracting the mentioned entities and their references in a piece of text. They can basically build an index. If we consider this text to be a true “Premise”, then we can treat the index of entities and the sentences that mention them as an index of entities and facts about those entities.
Now let's build a simple fact extractor using an LLM.
We define an abstract Python class called Extractor.
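A minimal sketch of what that abstract class could look like (the names here are my assumptions; the exact code is in the repo):

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Turns a piece of text into an index of entities and the facts about them."""

    @abstractmethod
    def extract(self, text: str) -> "ExtractedFacts":
        # ExtractedFacts is the return type shown a little further below.
        ...
```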
And we implement it so that it can use either llama3.2 or gpt-4o-mini. I observed that for this task, llama3.2 failed miserably.
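A hedged sketch of the gpt-4o-mini-backed implementation; the prompt wording and JSON handling are my assumptions, not the repo's exact code:

```python
import json

from openai import OpenAI


class OpenAIExtractor(Extractor):
    """Asks the model to index every entity with the sentences stating facts about it."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def extract(self, text: str) -> "ExtractedFacts":
        prompt = (
            "List every entity mentioned in the text below and return a JSON object "
            "mapping each entity to the sentences that state facts about it.\n\n"
            f"Text: {text}"
        )
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return ExtractedFacts(entity_facts=json.loads(response.choices[0].message.content))
```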
And the return type of this extract method is this special type.
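Sketched here as a small Pydantic model holding the entity-to-facts index (the field name entity_facts is my choice, not necessarily the repo's):

```python
from pydantic import BaseModel


class ExtractedFacts(BaseModel):
    """An index: each entity maps to the sentences that state facts about it."""

    entity_facts: dict[str, list[str]]
```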
The above fact extractor can extract entities and facts from a paragraph.
For example, if we give it the input "Go supports generics", it will give us something like this:
Go ---> ["Go supports generics"]
generics ---> ["Go supports generics"]
Detecting contradiction
Now, when it comes to fact-checking, there is no general, absolute fact check. We can only compare a hypothesis against a premise.
Consider this:
Hypothesis: "Go supports generics".
Premise: "Go supports generics as of version go1.18".
They do not contradict each other, but they don't match 100% either. Therefore, this fact checking should not return a boolean; it should give us a score and a classification. That is a good job for encoder-only transformer models, which are well suited to understanding and classifying text.
I'll choose DeBERTa for detecting contradiction.
How does it differ from an LLM?
In short, LLMs excel in text generation and general-purpose tasks, while BERT and DeBERTa specialize in text comprehension and specific NLP applications.
An observation with BERT and DeBERTa
I observed through experimenting that passing two sentences to DeBERTa is far more accurate in detecting contradiction than passing a paragraph as the premise and a sentence or another paragraph as the hypothesis.
Now let's build a fact-checker with DeBERTa.
Our fact-checker class returns a score and a classification of either contradiction, neutral, or entailment.
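Here is a sketch built on Hugging Face transformers with an MNLI-finetuned DeBERTa checkpoint; the model name and class layout are my choices, not necessarily the repo's:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class FactChecker:
    """Scores a hypothesis against a premise with an NLI-finetuned DeBERTa model."""

    def __init__(self, model_name: str = "microsoft/deberta-large-mnli"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def check(self, premise: str, hypothesis: str) -> tuple[str, float]:
        # Pass the two sentences as a pair; per the observation above, this is
        # more reliable than feeding whole paragraphs.
        inputs = self.tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        probs = logits.softmax(dim=-1)[0]
        label_id = int(probs.argmax())
        # MNLI checkpoints classify into contradiction / neutral / entailment.
        return self.model.config.id2label[label_id], float(probs[label_id])
```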
Now let's plug these two components into a RAG system to fact-check the LLM output against the retrieved context and decrease the likelihood of hallucination.
Building a test hallucinating RAG
We use LangGraph, as we did in previous articles, to spin up a simple RAG system. But we hijack the LLM answer with something false and see how our fact-checker does.
Here we initialize the LLM, the fact-checker, the fact-extractor, and the RAG state. We also hard-code one blog post to use in our retriever.
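A sketch of the setup; the state fields and the stand-in blog post content are hypothetical:

```python
from typing import TypedDict

from langchain_core.documents import Document
from langchain_openai import ChatOpenAI


class RAGState(TypedDict):
    question: str
    context: list[Document]
    answer: str
    fact_checks: list[tuple[str, str, float]]  # (answer fact, label, score)


llm = ChatOpenAI(model="gpt-4o-mini")
fact_extractor = OpenAIExtractor()
fact_checker = FactChecker()

# One hard-coded blog post standing in for a real document store.
blog_post = Document(page_content="Go supports generics as of version go1.18. ...")
```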
We define a fake retriever that uses only this blog post as context. We also hijack the generate function to return something incorrect.
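A sketch of the two hijacked nodes; the false answer is deliberately wrong so the checker has something to catch:

```python
def retrieve(state: RAGState) -> dict:
    # Fake retriever: always return the single hard-coded blog post.
    return {"context": [blog_post]}


def generate(state: RAGState) -> dict:
    # Hijacked: skip the LLM entirely and return a deliberately false answer.
    return {"answer": "Go does not support generics."}
```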
Then we define a fact-checking step to go over the facts in the answer and compare them with the facts found in the context.
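A sketch of that node, reusing the extractor and checker from above:

```python
def fact_check(state: RAGState) -> dict:
    context_text = "\n".join(doc.page_content for doc in state["context"])
    context_facts = fact_extractor.extract(context_text)
    answer_facts = fact_extractor.extract(state["answer"])

    results = []
    for entity, facts in answer_facts.entity_facts.items():
        # Facts about the same entity in the context serve as premises.
        premises = context_facts.entity_facts.get(entity, [])
        for fact in facts:
            for premise in premises:
                label, score = fact_checker.check(premise, fact)
                results.append((fact, label, score))
    return {"fact_checks": results}
```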
It's probably a good idea to run this fact extraction on the documents while we load them into the stores, so we avoid extracting facts from them on every run.
Next, we glue it all together with a main function and an output function to print the results.
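A sketch of the wiring, assuming the node and state names used above:

```python
from langgraph.graph import END, START, StateGraph


def print_output(state: dict) -> None:
    print("Answer:", state["answer"])
    for fact, label, score in state["fact_checks"]:
        print(f"  {fact!r} -> {label} ({score:.2f})")


def main() -> None:
    graph = StateGraph(RAGState)
    graph.add_node("retrieve", retrieve)
    graph.add_node("generate", generate)
    graph.add_node("fact_check", fact_check)
    graph.add_edge(START, "retrieve")
    graph.add_edge("retrieve", "generate")
    graph.add_edge("generate", "fact_check")
    graph.add_edge("fact_check", END)

    app = graph.compile()
    final_state = app.invoke({"question": "Does Go support generics?"})
    print_output(final_state)


if __name__ == "__main__":
    main()
```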
And we run it.
Voilà!