Why do LLMs Hallucinate?

Because response diversity cannot exist without it

Large language models (LLMs) generate text by repeatedly predicting likely next words from a given prompt. While LLMs are powerful, they sometimes produce responses that are factually incorrect or unrelated to the prompt. These mistakes are commonly known as ‘hallucinations’.

Hallucinations happen when the model creates information that sounds plausible but isn’t grounded in real-world facts. Hallucinations derive from the same mechanisms that allow LLMs to generate creative and flexible responses. This article will explore why hallucinations occur, the trade-off between hallucinations and creativity, and how we can reduce hallucinations without stifling the diversity in a model’s outputs.

How do LLMs Create a Response?

Let’s examine a high-level view of what happens; refer to Figure 1. This approach is commonly called Top-p (or nucleus) sampling, and it is widely used to create LLM responses.


Figure 1: High-Level Process for creating an LLM Response


Step 1: Submit a Prompt

The first step is straightforward: we submit a prompt to our LLM of choice.

Step 2: Assign Probabilities to Words

Based on previous training, the model assigns probabilities to words in its vocabulary. I use ‘words’ here, although a vocabulary can include other items such as partial words and punctuation. A vocabulary can contain 50,000 items or more. As a comparison, a typical native English speaker will have learned between 15,000 and 35,000 words by the time they reach adulthood.


Figure 2: Assigning probabilities to words in an LLM’s vocabulary


Figure 2 shows bars representing words and their assigned probabilities. The chart shows only the first nine items of a model’s vocabulary.
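
To make Step 2 concrete, here is a minimal sketch, assuming the model has already produced a raw score (a logit) for each vocabulary item. The nine-word toy vocabulary and the logit values are invented purely for illustration.

```python
import numpy as np

# Toy vocabulary and logits (raw scores); both invented for illustration.
vocab = ["time", "day", "moment", "dream", "while", "night", "story", "year", "land"]
logits = np.array([3.2, 2.1, 1.7, 0.9, 0.5, 0.3, 0.1, -0.2, -0.5])

# The softmax function turns raw scores into probabilities that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"{word:>8}: {p:.3f}")
```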

Step 3: Sort the Vocabulary by Probability

The vocabulary is sorted by probability, with the most likely words at the beginning (Figure 3).


Figure 3: The LLM vocabulary after sorting by probability


Step 4: Determine the Shortlist Threshold

Now that we have a sorted vocabulary, it’s time to create a shortlist. To do this, we use a threshold: a value between 0 and 1. A threshold of 0.9 is typical, although we can change it if necessary. We use this threshold to decide how many candidate words make the shortlist.

Step 5: Create a Shortlist

The shortlist contains just enough words that their cumulative probability is greater than or equal to the threshold. Beginning on the left-hand side of our sorted vocabulary, we keep adding words until this condition is satisfied. If the first probability, P1, already equals or exceeds the threshold, our shortlist will contain only one item.


Figure 4: Creating a shortlist of candidate words

Figure 4 shows an example where five words are required to reach the threshold, such that:

P1 + P2 + P3 + P4 < threshold ≤ P1 + P2 + P3 + P4 + P5

In other words:

  • The sum of the first four probabilities is less than the threshold value.
  • The sum of the first five probabilities is greater than or equal to the threshold value.

So, we now have a shortlist consisting of the first five words in the sorted vocabulary list.
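
Steps 3 to 5 are easy to express in code. The sketch below uses made-up probabilities, chosen so that five words are needed, and cuts the sorted vocabulary at the cumulative threshold:

```python
import numpy as np

# Made-up probabilities for a six-word vocabulary; a real model supplies these.
vocab = np.array(["time", "day", "moment", "dream", "while", "night"])
probs = np.array([0.35, 0.20, 0.15, 0.12, 0.10, 0.08])
threshold = 0.9  # the Top-p value

# Step 3: sort with the most probable words first.
order = np.argsort(probs)[::-1]
sorted_vocab, sorted_probs = vocab[order], probs[order]

# Steps 4-5: keep adding words until the cumulative probability reaches the threshold.
cumulative = np.cumsum(sorted_probs)
cutoff = int(np.searchsorted(cumulative, threshold)) + 1
shortlist = sorted_vocab[:cutoff]
print(shortlist)  # five words: their probabilities sum to 0.92 >= 0.9
```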

Step 6: Randomly Choose the Next Word

In this step, the next word is chosen at random from the shortlist. The process then loops, one word at a time, until the response is complete. A model has several ways to determine when to stop; we won’t go into the details of stopping here.

Every word in the shortlist has a genuine chance of selection. In standard Top-p sampling, the shortlisted probabilities are renormalised and the next word is drawn in proportion to them, so low-probability candidates are picked less often but are never ruled out.

Once the shortlist is fixed, every word outside it is discarded, and chance decides among the words that remain. This random selection is where the potential for hallucination exists. The mechanism that creates diversity in LLM responses is the same mechanism that causes hallucinations.
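
Putting the steps together, here is a minimal sketch of the whole loop, assuming standard Top-p behaviour in which the shortlisted probabilities are renormalised before the draw. The vocabulary, the fixed probabilities, and the <stop> token are all invented; a real model recomputes its probabilities after every word.

```python
import numpy as np

rng = np.random.default_rng(42)

def top_p_choice(vocab, probs, threshold=0.9):
    """Pick the next word using Top-p (nucleus) sampling."""
    order = np.argsort(probs)[::-1]                      # sort, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, threshold)) + 1
    shortlist = order[:cutoff]                           # indices of shortlisted words
    renormalised = probs[shortlist] / probs[shortlist].sum()
    return vocab[rng.choice(shortlist, p=renormalised)]  # draw in proportion to probability

# Invented vocabulary and fixed probabilities; a real model updates these per step.
vocab = np.array(["time", "day", "moment", "dream", "<stop>"])
probs = np.array([0.35, 0.25, 0.15, 0.13, 0.12])

response = []
for _ in range(20):                  # cap the length of this toy example
    word = top_p_choice(vocab, probs)
    if word == "<stop>":             # a stop token is one common stopping mechanism
        break
    response.append(word)
print(" ".join(response))
```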

Shortlist Examples to Highlight Hallucinations

Including low-probability words in the shortlist can lead to hallucinations. Figure 5 shows three possible shortlists.


Figure 5: Three Shortlist Scenarios


Scenario 1: A single candidate

The probability of the first word equals or exceeds the threshold, so this single word is all we need. Single-word shortlists are more likely to occur for factual statements and everyday phrases such as:

  • The capital of England is … London
  • Once upon a … time

Lowering the threshold increases the chances of a single-candidate shortlist. With lower thresholds, responses have less diversity and are less likely to hallucinate.
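
A quick sketch, using made-up probabilities, shows the shortlist collapsing to a single candidate as the threshold drops:

```python
import numpy as np

# One dominant word, e.g. "London" after "The capital of England is".
probs = np.array([0.92, 0.04, 0.02, 0.01, 0.01])  # invented values

for threshold in (0.95, 0.90, 0.50):
    cutoff = int(np.searchsorted(np.cumsum(probs), threshold)) + 1
    print(f"threshold={threshold}: shortlist holds {cutoff} word(s)")
```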

Scenario 2: Candidates with similar probabilities

For this scenario, all candidates in the shortlist have similar probabilities. This scenario is beneficial for creative responses such as storytelling or ideation. Hallucinations can occur if the shortlist contains candidates not strongly related to the context of the prompt.

Scenario 3: Candidates with low probabilities

As we have seen, even when only one word has a high probability, Top-p sampling can still select any word that made the shortlist.

So the model may well pick one of the lower-probability words, particularly when several of them make the cut.

A low-probability word may be acceptable if you use an LLM to help write a story. For fact-based prompts, this scenario may not yield credible answers.
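
To put a rough number on this, here is a small simulation with invented probabilities: one strong candidate and four weak ones that all clear a 0.9 threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# One strong word plus four weak ones; all five clear a 0.9 Top-p threshold.
probs = np.array([0.60, 0.12, 0.10, 0.10, 0.08])  # invented values

draws = rng.choice(len(probs), size=10_000, p=probs)
print(f"A lower-probability word is chosen {(draws != 0).mean():.0%} of the time")
```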

Can we Reduce Hallucinations?

We can do several things to reduce hallucinations, and this is an area of active research. The effort each approach requires varies considerably; examples include:

  • Reduce the Top-p threshold or the model’s response temperature.
  • Always choose the most probable word, at the risk of removing diversity.
  • Use Top-k sampling, where only the k most probable words are shortlisted, regardless of how many words the Top-p rule would have kept (see the sketch after this list).
  • Integrate external information using RAG (Retrieval-Augmented Generation).
  • Apply some form of verification after the LLM generates a response.
  • Design prompts carefully; COSTAR is a framework often used in prompt engineering.
  • Set a Min-p threshold so the model only considers words with a probability greater than that floor; combining Min-p and Top-p is a common strategy (also sketched below).
  • Train the model on highly curated and validated data.
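
To make the sampling-related items concrete, the sketch below applies Top-k and Min-p filters to the same made-up distribution. The Min-p version here uses a simple absolute floor, matching the description above; some implementations instead scale the floor by the top word’s probability.

```python
import numpy as np

# Made-up, already-sorted probabilities; a real model supplies these per step.
probs = np.array([0.50, 0.20, 0.12, 0.08, 0.06, 0.04])

# Top-k: keep exactly the k most probable words, whatever their cumulative mass.
k = 3
top_k_keep = np.arange(len(probs))[:k]

# Min-p: keep only words whose probability clears an absolute floor.
min_p = 0.07
min_p_keep = np.flatnonzero(probs >= min_p)

print("Top-k keeps indices:", top_k_keep)  # [0 1 2]
print("Min-p keeps indices:", min_p_keep)  # [0 1 2 3]
```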

Final Thoughts

There is currently no method, or combination of techniques, that completely eradicates hallucinations. After all, models work by predicting outputs rather than retrieving them. The steps you take to reduce hallucinations will depend on your use case: compare a model that provides medical advice with one that writes children’s stories.

Always consider the trade-off between diversity/creativity and the potential for hallucinations. Because, at least for now, you can’t have one without the other.

