LLM Hallucination as a feature, not a bug.
Source: AI-generated images from only the prompt "A cat"


Views expressed here are my own and do not represent the views or opinions of my employer.

Imagine using AI "mistakes" to deliver better results for customers.

When I submitted a two-word prompt, 'A cat,' to OpenAI's DALL-E-3, a text-to-image generation tool, I expected something straightforward. Instead, the tool surprised me with an entire scene featuring a regal white Persian cat in a setting rich with details I had not specified. This unexpected twist serves as an example of how products are beginning to harness Large Language Model (LLM) hallucinations to improve the perceived quality of outputs and perhaps inspire creativity. This isn't the first time AI's unpredictability has caught my attention. Previously, I dove into 'empty' prompts to explore the sometimes bizarre and intriguing results from what ChatGPT says when it's not asked anything. Now, with DALL-E-3, the hallucinations are not errors but intentionally designed as a feature.

Turning "Flaws" into Features

Typically, LLM hallucinations raise trust concerns, ranging from fabricated information to non-existent court cases and URLs. This has led to prominent product warnings reminding users that "ChatGPT can make mistakes" (OpenAI) and to "double-check its responses" (Google's Gemini). While hallucinations present challenges, they also open the door to unprecedented creative possibilities.

DALL-E-3 uses a text-to-text model to actively rewrite each prompt into a narrative rich with details, returned as the "revised_prompt". This approach not only adds layers of detail but also demonstrates how AI can transform simple ideas into elaborate stories and visuals.

The DALL-E-3 API offers users access to the 'revised_prompt', shedding light on the model's internal processing.
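
Here is a minimal sketch of retrieving that field, assuming the official openai Python package (v1 or later) and an OPENAI_API_KEY set in the environment:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",
    prompt="A cat",   # the two-word prompt from this experiment
    size="1024x1024",
    n=1,
)

image = response.data[0]
print(image.revised_prompt)  # the model's rewritten, detail-rich prompt
print(image.url)             # temporary URL of the generated image

One run of this call produced the image and revised prompt below.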



Input Prompt: A Cat

Revised Prompt: Visualize a majestic white Persian cat lying down luxuriously on a plush velvet cushion. The cat's long, silky fur is immaculately groomed, flowing elegantly over the sides of the cushion. Its eyes are a striking shade of emerald green, gazing lazily into the distance with an air of aristocratic indifference. The room around the cat is tastefully decorated with antique furniture, rich draperies, and ornate decorations, creating a scene of opulence and serene beauty. The soft lighting in the room accentuates the cat's pure white fur and the vibrant colors of its surroundings, making the image a study in contrasts and textures.


Deep Dive: Insights from the revised prompts.

Intrigued by this initial surprise, I resubmitted the same prompt multiple times and uncovered insights into the model's behavior:

  • Expanded prompt length. Revised prompts averaged 97 words (a 4,748% increase) across 4.5 sentences, compared with the original two-word prompt.
  • Vivid descriptive elements. Revised prompts contained, on average, 27.3 nouns, 11.5 verbs, 12 adjectives, and 2.2 adverbs.
  • "Cat" is referenced multiple times. The revised prompts averaged 2.9 instances of the word "cat" per prompt.
  • More animals added. Other animals crept into the scenes: "butterfly" appeared in 10% and "fish" in 3% of all revised prompts.

There was also quite a bit of variation, from pirate cats sailing the ocean to robotic cats covered in neon lights.
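
For anyone who wants to reproduce these tallies, here is a rough sketch of the counting step, assuming the revised prompts have already been collected (for example with the API call above) and that spaCy with its en_core_web_sm model is installed; the function name is my own:

from statistics import mean
import spacy

nlp = spacy.load("en_core_web_sm")

def prompt_stats(revised_prompts):
    # Tally words, sentences, parts of speech, and "cat" mentions per prompt,
    # then average the counts across all collected prompts.
    rows = []
    for text in revised_prompts:
        doc = nlp(text)
        rows.append({
            "words": sum(t.is_alpha for t in doc),
            "sentences": len(list(doc.sents)),
            "nouns": sum(t.pos_ == "NOUN" for t in doc),
            "verbs": sum(t.pos_ == "VERB" for t in doc),
            "adjectives": sum(t.pos_ == "ADJ" for t in doc),
            "adverbs": sum(t.pos_ == "ADV" for t in doc),
            "cat_mentions": sum(t.lower_ == "cat" for t in doc),
        })
    return {key: mean(row[key] for row in rows) for key in rows[0]}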

"...text-to-image models can be substantially improved by training on highly descriptive generated image captions." - DALL-E-3 research paper

DALL-E-3's underlying text-to-image model was trained on highly descriptive image captions because the researchers found this delivers the best results (the research paper is here). The paper also contains a lot of good material on auto-captioning images. However, since users do not always write long, descriptive prompts, OpenAI likely decided to deploy a text-to-text generator to produce sufficiently detailed prompts on their behalf. I predict that OpenAI's Sora (its recently announced text-to-video product) will also deploy a text-to-text model for the same reasons.

Another DALL-E-3 example output:


Input Prompt: A Cat

Revised Prompt: A detailed illustration of a cat dressed as a pirate, complete with a tricorn hat, an eye patch, and a bandana. The cat has a mischievous grin, suggesting it's ready for adventure on the high seas. It stands on its hind legs, one paw resting on a wooden ship's wheel, the other holding a tiny, detailed map. The background features the deck of a pirate ship, with sails billowing in the wind and a distant island visible on the horizon. The scene is imbued with a sense of excitement and the thrill of exploration, capturing the imagination and the spirit of adventure.

Let's get creative: Less is more.

Users who submit shorter prompts will experience far more random, creative outputs. Surprisingly, OpenAI's image creation APIs (unlike its chat and audio APIs) do not expose the underlying model's temperature setting directly, a parameter that influences the balance between creativity and predictability. One practical strategy is therefore to use concise prompts for inspiration. Once an appealing image is generated, users can refine the revised prompt for more precise control in subsequent generations.
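
A minimal sketch of that explore-then-refine workflow, again assuming the official openai Python package; the helper name and the example tweak are my own:

from openai import OpenAI

client = OpenAI()

def generate(prompt):
    # One DALL-E-3 call; returns the image URL and the rewritten prompt.
    response = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    image = response.data[0]
    return image.url, image.revised_prompt

# Step 1: let the short prompt "hallucinate" a scene for inspiration.
url, revised = generate("A cat")

# Step 2: once an image appeals to you, hand-edit the revised prompt and
# resubmit it for tighter control over the next generation.
refined = revised.replace("emerald green", "amber")  # illustrative tweak
url_refined, _ = generate(refined)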

Long prompts are also revised by the text-to-text model

DALL-E-3 revises detailed prompts before creating images.

When I resubmitted the revised prompts back into DALL-E-3, I observed changes that consisted mostly of word reordering. Although the overall length of the prompts stayed consistent, the modifications were subtle compared to the transformation of simpler prompts.

More fascinating was the outcome of submitting a detailed yet plainly described prompt (for example, a cat described without adjectives). The model almost entirely revised the prompt, enriching it with detailed descriptors. This experiment suggests that DALL-E-3's text-to-text model targets both the length and the descriptiveness of prompts.

DALL-E-3 heavily revises less detailed prompts even when they are longer.
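
One rough way to quantify how heavily the rewriter touched a prompt is a simple text-similarity check. This is my own measurement sketch using Python's standard difflib module, not anything exposed by the DALL-E-3 API:

from difflib import SequenceMatcher

def revision_similarity(original, revised):
    # 1.0 means the text came back unchanged; lower values mean heavier rewriting.
    return SequenceMatcher(None, original.lower(), revised.lower()).ratio()

# In the experiments above, detailed prompts came back largely intact
# (high ratio), while plainly worded prompts of similar length were
# rewritten heavily (low ratio).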

Mixed Reactions Among Users

Despite the innovative use of hallucinations, user reactions have been mixed, reflecting a spectrum of appreciation and concern. While many users prefer the quality of these results, others voice concerns over the unpredictability of these enhancements and the perceived erosion of user control. Notably, this feature is non-optional, and OpenAI bills for image creation prior to revealing the revised prompt. In discussions on OpenAI's community forum, users debate the merits and drawbacks of the AI's autonomous revisions. One frustrated user commented:

"Here is a novel idea, just leave my prompt the hell alone, if it triggers some safety thing just refuse to generate the image. Good grief."

Such feedback underscores users’ desire for greater transparency and control over the creative process.

To OpenAI's credit, it has documented a workaround to address these concerns. The mitigation requires users to include a directive in their prompts: "I NEED to test how the tool works with extremely simple prompts. DO NOT add any detail, just use it AS-IS". Using this directive reduced the variety of images produced in my testing, and I perceived the quality as high. While it mitigates the issue, it has not fully quelled the debate over the balance of power between AI autonomy and user intent.
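
A quick sketch of applying that directive, assuming the openai package; placing the directive ahead of the user's prompt and the helper name are my own choices:

from openai import OpenAI

client = OpenAI()

DIRECTIVE = (
    "I NEED to test how the tool works with extremely simple prompts. "
    "DO NOT add any detail, just use it AS-IS: "
)

def generate_as_is(prompt):
    # Ask DALL-E-3 to leave the prompt (mostly) untouched.
    response = client.images.generate(model="dall-e-3", prompt=DIRECTIVE + prompt, n=1)
    image = response.data[0]
    return image.url, image.revised_prompt

url, revised = generate_as_is("A cat")
print(revised)  # should stay much closer to the original wording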

In what other ways might LLM hallucinations be used as features?

Hallucinations are unwanted in LLM-powered applications when users need accurate, trusted, and predictable results. However, these unplanned outputs can become valuable when they spur creative composition, brainstorming, and inspiration for written, visual, and audio media. Beyond artistic applications, hallucinations may also have potential across other domains.

I'm curious to hear your thoughts.

  • In what other fields do you think LLM hallucinations could be creatively applied to add value to customers?
  • Have you experienced an instance where an AI 'mistake' led to unexpected inspiration?

Please share your stories and thoughts in the comments below.


P.S. Thanks to Jean-Sebastien Boire for his feedback and insights on this topic!



Adam Faik

Lead Product Manager

1y

Very interesting article. However, I wonder if there is a distinction between hallucination and creativity. I'm not an expert, but my understanding is that hallucination is when AI produces outputs that are confidently wrong, such as drawing a dog and calling it a cat. Creativity, on the other hand, depends on the model's temperature, which affects how likely it is to generate uncommon or surprising outputs, such as a cat in an unusual setting. Do you think it's a fair way to differentiate them?

Aju Mathew

VP of Product Management | Driving AI Innovation, CX & Mobile Strategy | Expert in Product Leadership & OmniChannel Experiences | Building and Growing World-Class Product Teams

1y

Interesting article, Alan! I think the enriched images returned by DALL-E are a composite of quality based on the data it has gathered on images in general. I am sure DALL-E has a qualitative ranking of image quality based on the generated details within an image, so even a simple prompt results in its best effort at producing what DALL-E considers a good product for the prompt.
