Eliminating Hallucinations Lesson 1: Named Entity Filtering (NEF)
I am going to show how to apply the Noun-Phrase Dominance Model to eliminate LLM hallucinations. This new series provides the collection of steps needed to produce 100% accurate responses. In fact, this lesson includes the key to all the other lessons: the key to dispensing with hallucinations once and for all.
This tutorial presumes that you have already watched the video introduction to the method. The video is available at the link in the prior paragraph.
Introducing Named Entity Filtering
As discussed in the video, ChatGPT 4 and other LLMs often hallucinate when multiple names are semantically similar. The example used throughout the video was “Alfonso” and “Afonso.”
The Alfonso Debacle: If you ask a question about "Alfonso" but send information regarding "Afonso," the LLM will likely apply the "Afonso" information to "Alfonso," even though they are two different names, as different as Chuck and Bartholomew.
The objective of this series is to provide you with the full collection of steps that you need in order to apply the Noun-Phrase Dominance Model and fully eliminate hallucinations. Welcome to step one!
The Issue
I once queried ChatGPT 4 for information regarding the challenges that Cruise LLC was facing. (Cruise develops self-driving cars.) However, ChatGPT 4’s response interspersed challenges faced by Cruise LLC with challenges faced by the cruise industry (as in tourist boating).
Here’s something rarely discussed. Notice that ChatGPT did not intersperse challenges faced by the taco industry, or any other industry for that matter. On the contrary, the conflation was systematic. There’s a systematic reason why LLMs conflate specific topics and words.
We shall later explore why RAG does not inherently prevent this type of conflation, and how Named Entity Filtering fully resolves it. For now, let's focus on the Alfonso Debacle to understand why ChatGPT 4 and other LLMs hallucinate, and thereby understand how Named Entity Filtering eliminates the issue.
Alfonso Debacle
The Alfonso Debacle is discussed throughout the video. However, I recently provided new information regarding the Alfonso Debacle in “Why is ChatGPT Getting Worse?” To recap, a company called Vellum posted a ChatGPT-4 hallucination for the query: “Who was the mother of Afonso II, the third king of Portugal?”
ChatGPT 4 originally gave the wrong answer: Urraca of Castile. (You can verify this using gpt-4-0125-preview.)
OpenAI later fine-tuned ChatGPT 4 to provide the correct answer: Dulce of Aragon. I've already written about how this fine-tuning increases hallucinations (the reason GPT-4's performance steadily decreases).
However, not only does the fine-tuning increase hallucinations elsewhere, it only "fixes" the original query verbatim. For example, here is a query that I submitted to ChatGPT 4 today (September 2, 2024):
Notice that GPT 4 hallucinated on multiple levels.
Today's hallucination was triggered by changing "Afonso II" in the original query to "Alfonso II." This demonstrates that OpenAI's fine-tuning did not overcome the original issue. In other words, ChatGPT 4 still treats "Alfonso" and "Afonso" as being the same (except where it is fine-tuned to behave otherwise).
As demonstrated in the video, ChatGPT treats “Chuck” and “Bartholomew” as references to different people; yet treats “Alfonso” and “Afonso” as references to the same person. (See screenshots below.)
As humans, we know that two different names refer to two different people. As humans, we know that the company “Cruise” is different from the regular noun “cruise.”
But as shown throughout the video, LLM next-token selection is based on Noun-Phrase Routes that are largely driven by semantic similarity. This is only natural, given that vector embeddings are the only signals the LLM has to operate on.
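You can observe this closeness directly. Here is a minimal sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 embedding model (both my own choices for illustration, not anything prescribed by the method), that compares how near each pair of names sits in embedding space:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Any general-purpose embedding model illustrates the point;
# all-MiniLM-L6-v2 is simply a small, widely used choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

for a, b in [("Alfonso", "Afonso"), ("Chuck", "Bartholomew")]:
    emb_a, emb_b = model.encode([a, b])
    print(f"cosine({a}, {b}) = {util.cos_sim(emb_a, emb_b).item():.3f}")

Exact scores vary by model, but you should find that "Alfonso" and "Afonso" land far closer together than "Chuck" and "Bartholomew," and closeness is all the model sees.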
What’s RAG Got To Do With It?
Imagine you are creating a chatbot for Apple. Would the LLM conflate “iPhone 8” with “iPhone 11”? Very likely yes. Therefore, if your knowledge base contains information regarding multiple iPhone models, you are heading for the land of hallucinations.
In fact, this is a universal problem when building chatbots for companies that have multiple models and releases.
This should shed light on why RAG sometimes works like magic and other times fails miserably. If your knowledge base contains information solely on Cruise LLC, then the LLM cannot conflate the company with the cruise industry. RAG works like magic.
However, the moment that you add Noun-Phrase Collisions (noun phrases that are semantically similar yet refer to different entities) RAG immediately falls apart.
Consider this all-too-common experience reported by a Reddit user:
I’ve been experimenting with an [ChatGPT] Assistant designed to act as a technical support assistant for a business. I loaded it up with a tremendous amount of technical and product information, and have been getting really mixed results. Some questions answered wonderfully, others completely wrong. So I decided to break it down to the basics. I uploaded a product brochure, and then I simply ask “how many widgets come in a box for SKU #X.”
It correctly answered this question repeatedly. Great. However, whenever I add various other content, results start going haywire.
(https://www.reddit.com/r/ChatGPTPro/comments/17xsch4/unexpected_behavior_of_gpt_assistant_with/, last visited March 4, 2024, brackets added.)
But wouldn’t RAG’s semantic similarity search solve this issue during vector database retrieval? Absolutely not. Allow me to explain. In fact, the following section is not only the key to Lesson 1, but very much the key to it all.
Secret Behind the Alfonso Debacle
Please study this section carefully. It will guide you to 100% accurate chatbots once you fully internalize it.
All-important insights come from asking all-important questions. Here’s the all-important question regarding the Alfonso Debacle: Why did ChatGPT 4 consistently choose Alfonso VII’s mother over Afonso II’s mother even though Afonso II was referenced in the query?
Prior to OpenAI fine-tuning the answer, ChatGPT 4 would routinely give Alfonso VII's mother instead of Afonso II's. Why?
Take a moment and think about this. In fact, the Noun-Phrase Dominance Model came from asking this same question on literally hundreds of queries. By searching for the answer on each query, a pattern emerged. In fact, the same pattern emerged every single time.
Let me give you a hint. Consider this webpage statement regarding Alfonso VII: “Alfonso’s Mother was Urraca (1079–1126) called the Reckless was Queen of Castile…”
Now, think about that statement with the original query in mind: “Who was the mother of Afonso II, the third king of Portugal?” Why would the above statement be such an attractive route? Remember, focus on the noun phrases. After all, they determine the route.
Hopefully you took time to study the noun phrases in the query along with the noun phrases in the statement. You will always find your answer here.
Did you notice that "mother" is in both the query and the Alfonso webpage statement? Did you notice that "mother" is right next to "Alfonso" in the statement? From the LLM's perspective, "Alfonso" is basically the same as "Afonso," and "mother" is a direct match. Therefore, if there is no "mother" close to "Afonso," then the LLM will choose the Alfonso/mother combination. (And that's exactly what ChatGPT 4 did until it was specifically fine-tuned to behave otherwise.)
Take time to compare the location of the word “mother” for Alfonso VII to the location of “mother” for Afonso II. Notice that the word “mother” is extremely disconnected from “Afonso” in the latter link. That’s why the Alfonso route wins over Afonso for this query. It’s also the key to it all.
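If you want to quantify that disconnect, here is a rough sketch, entirely my own construction, that uses spaCy tokenization to measure how many tokens separate a name from the nearest occurrence of "mother" in a passage:

# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def anchor_distance(text: str, name: str, anchor: str = "mother"):
    """Token distance between a name and the nearest occurrence of the anchor word."""
    doc = nlp(text)
    name_idx = [t.i for t in doc if t.text.lower() == name.lower()]
    anchor_idx = [t.i for t in doc if t.text.lower() == anchor.lower()]
    if not name_idx or not anchor_idx:
        return None  # one of the words never appears in the text
    return min(abs(n - a) for n in name_idx for a in anchor_idx)

stmt = "Alfonso's Mother was Urraca (1079-1126) called the Reckless was Queen of Castile"
print(anchor_distance(stmt, "Alfonso"))  # a tiny distance: tight coupling

Token distance is only a crude proxy for the coupling the model responds to, but it makes the asymmetry easy to see: run the same measurement on a typical Afonso II page and the distance comes out much larger.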
Seemingly Complex Issue
So what does this have to do with our Apple iPhone example? Consider someone asking about a specific feature of the iPhone 11. If that feature is more prominently coupled with the iPhone 8, then chances are the iPhone 8 feature description will be given even though the iPhone 11 was specified (just as Alfonso VII's information was given even though Afonso II was specified).
Notice how this problem goes all the way back to chunk retrieval. Chunk retrieval often relies on semantic similarity searches (such as cosine similarity search over a vector database). In the above example, the iPhone 8 chunk can score higher semantic similarity to the query than the iPhone 11 chunk, creating a problem at both retrieval and reranking.
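To make the retrieval side concrete, here is a minimal sketch, again assuming sentence-transformers and two invented chunks, showing that a similarity-based retriever ranks purely on closeness in embedding space, with no notion that "iPhone 8" and "iPhone 11" are different products:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What camera features does the iPhone 11 have?"
chunks = [
    "The iPhone 8 camera features a 12MP sensor with optical image stabilization.",
    "The iPhone 11 is available in six colors and ships worldwide.",
]

# Rank chunks by cosine similarity to the query, exactly as a naive retriever would.
scores = util.cos_sim(model.encode(query), model.encode(chunks))[0]
for score, chunk in sorted(zip(scores.tolist(), chunks), reverse=True):
    print(f"{score:.3f}  {chunk}")

The chunks are invented and actual scores depend on the model, but the iPhone 8 chunk talks about exactly the feature being asked about, so it may well outrank the on-topic iPhone 11 chunk.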
Hence, RAG not only doesn’t solve the issue, it’s even part of the problem.
NEF: Simple Yet Effective Solution
Here's the great part of the Noun-Phrase Dominance Model: once you know the problem, fixing it is easy. The only reason eliminating hallucinations seemed impossible is that people were looking in the wrong place. Allow me to demonstrate.
Consider Alfonso and Afonso. If the query is about Afonso, then you need to filter out references to any other names that are semantically similar. Once the Alfonso VII chunks are removed, the LLM cannot conflate the two. In other words, it cannot hallucinate.
Consider the iPhone 8 and iPhone 11 example. If the query is regarding iPhone 11 then you need to remove all chunks regarding iPhone 8. Now the LLM cannot conflate the two. In other words, it cannot hallucinate.
Consider Cruise LLC and the cruise industry. If the query is regarding Cruise LLC then all other semantically similar noun phrases must be filtered out (including the regular noun “cruise” which is a 100% match).
Implementing NEF
First, you need a good library for Named Entity Recognition (NER). Python's spaCy library is very good. In a future article I'll write about how to combine spaCy with an LLM to achieve near 100% Named Entity Recognition. For now, you can start with spaCy alone.
Second, you need to identify all noun phrases in the retrieved chunks (to account for conflicts such as Cruise and cruise). Again, spaCy does this out of the box.
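Here is a minimal sketch of both extraction steps (the en_core_web_sm model is my choice for illustration; any spaCy pipeline with NER and a parser will do):

# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Cruise LLC faced new regulations while the cruise industry rebounded.")

# Named entities, with their types
print([(ent.text, ent.label_) for ent in doc.ents])

# Noun phrases, which also catch common nouns like "the cruise industry"
print([np.text for np in doc.noun_chunks])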
For simplicity, I’m going to refer to both named entity extraction and noun-phrase extraction as being “Noun-Phrase Identification (NPI).”
You can do NPI either at query time or at storage time. Doing it at query time is cheaper, but it significantly slows down the time to a response. If you instead store the NPI results as metadata when chunks are written, filtering becomes both easy and efficient at query time. The latter approach is therefore strongly recommended.
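Putting the pieces together, here is a minimal end-to-end sketch of the recommended storage-time approach. Everything here is my own illustrative scaffolding: the in-memory record list, the rapidfuzz string-similarity matcher, and the 85-point threshold are assumptions, not a prescribed API.

# pip install spacy rapidfuzz && python -m spacy download en_core_web_sm
import spacy
from rapidfuzz import fuzz

nlp = spacy.load("en_core_web_sm")

def npi(text: str) -> set[str]:
    """Noun-Phrase Identification: named entities plus noun chunks."""
    doc = nlp(text)
    return {ent.text.lower() for ent in doc.ents} | \
           {np.text.lower() for np in doc.noun_chunks}

# Storage time: attach NPI results to each chunk as metadata.
chunks = [
    "Afonso II, the third king of Portugal, was the son of Dulce of Aragon.",
    "Alfonso VII's mother was Urraca, Queen of Castile.",
]
records = [{"text": c, "phrases": npi(c)} for c in chunks]

def collides(query_entity: str, phrase: str, threshold: float = 85.0) -> bool:
    """A phrase collides if it nearly matches the queried entity without being it."""
    q, p = query_entity.lower(), phrase.lower()
    return q != p and fuzz.ratio(q, p) >= threshold

def nef_filter(query_entity: str, records: list[dict]) -> list[str]:
    """Named Entity Filtering: drop chunks that mention look-alike entities."""
    return [r["text"] for r in records
            if not any(collides(query_entity, p) for p in r["phrases"])]

print(nef_filter("Afonso II", records))  # the Alfonso VII chunk is filtered out

String similarity is only a stand-in for the matcher; embedding similarity (as in the earlier sketches) or an LLM check could serve instead. The point is the filter itself: once the Alfonso VII chunk never reaches the prompt, the conflation cannot happen.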
Solving RAG-Based Hallucinations
On the surface, the causes of hallucinations seem disparate. For example, references to time often cause hallucinations, as do citations. Yet, as the lessons shall show, Noun-Phrase Route Collisions not only explain each issue, but fortunately also provide the simple answer to addressing them.
Therefore, each lesson shall follow a common recipe (now I understand why people say I write like a chatbot).
Looking Ahead
Named Entity Filtering is great if the query asks about a single named entity (such as iPhone 8). But what if the query itself contains two distinct named entities that are semantically similar? For example, what if the query asks about both a Roth IRA and a Roth 401k? As shown in the video, "Roth IRA" and "Roth 401k" are semantically similar, yet they refer to two different things.
We shall discuss the solution to this situation in Lesson 2.
Acurai
You have now learned the first step in Acurai: 100% accurate AI. With Acurai, you can finally build production-ready, hallucination-free chatbots.