LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Credit: https://arxiv.org/pdf/2410.02707


Today's paper explores the internal representations of large language models (LLMs) to better understand and detect their errors, often called "hallucinations". It reveals that LLMs encode more information about the truthfulness of their outputs than previously recognized, but this information is concentrated in specific tokens and doesn't generalize well across different tasks. The paper introduces new methods for analyzing and potentially mitigating LLM errors.

Method Overview

The goal is to predict whether an LLM's generated response to a prompt is correct or incorrect, using only the LLM's internal states (a white-box setting), without external resources such as search engines or other LLMs. Starting from a dataset of questions paired with their correct answers, the model generates a response to each question; the response is then compared against the gold answer and labeled as either correct or incorrect. This yields an error-detection dataset containing each question, the model's answer, and its correctness label. Instances where the model refuses to answer are marked as incorrect and excluded.
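Below is a minimal sketch of this dataset-construction step, assuming a generic generation wrapper, a heuristic refusal filter, and a simple containment-based correctness check; all three are placeholders rather than the paper's exact per-benchmark procedure.

```python
# Minimal sketch of building the error-detection dataset (assumptions noted inline).

REFUSAL_MARKERS = ("i don't know", "i cannot", "i'm not sure")  # heuristic stand-in only


def generate_answer(model, question: str) -> str:
    """Hypothetical wrapper around the LLM's generation call."""
    raise NotImplementedError


def is_correct(generated: str, gold: str) -> bool:
    # Simple containment check as a stand-in for the paper's correctness judgment.
    return gold.lower() in generated.lower()


def build_error_detection_dataset(model, qa_pairs):
    """qa_pairs: iterable of (question, gold_answer) pairs."""
    dataset = []
    for question, gold in qa_pairs:
        answer = generate_answer(model, question)
        if any(m in answer.lower() for m in REFUSAL_MARKERS):
            continue  # refusals are excluded from the error-detection dataset
        dataset.append({
            "question": question,
            "answer": answer,
            "correct": int(is_correct(answer, gold)),
        })
    return dataset
```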

The method probes the internal representations of LLMs at different layers and tokens to detect errors in their outputs. It introduces the concept of "exact answer tokens": the most meaningful parts of a generated response, whose modification would alter the answer's correctness. Probing classifiers (small classifiers trained on a model's intermediate activations to predict properties of the processed text) are then trained on these exact answer tokens to detect errors, yielding significant improvements over existing methods. The paper then explores how well these error-detection methods generalize across different tasks and datasets.
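The sketch below shows one way such a probe over exact-answer-token activations could be trained, assuming hidden states are already extracted (e.g., via output_hidden_states=True in Hugging Face transformers); the layer index and the exact-answer-token locator are assumptions, since the paper selects these per model and dataset.

```python
# Minimal sketch of a probing classifier on exact-answer-token hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression


def exact_answer_token_index(answer_token_ids, full_token_ids) -> int:
    """Hypothetical helper: index of the exact answer token within the full sequence."""
    raise NotImplementedError


def collect_probe_features(hidden_states, answer_indices, layer: int):
    # hidden_states: one array per example with shape [num_layers, seq_len, hidden_dim]
    return np.stack([hs[layer, idx] for hs, idx in zip(hidden_states, answer_indices)])


def train_error_probe(hidden_states, answer_indices, labels, layer: int = 15):
    X = collect_probe_features(hidden_states, answer_indices, layer)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, labels)  # labels: 1 = correct answer, 0 = error
    return probe
```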

Further, the method involves categorizing errors into different types based on the model's behavior across multiple samples of the same question. This taxonomy includes categories like "consistently correct", "consistently incorrect", and "many answers".
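A rough illustration of how such a taxonomy might be assigned from repeated samples is shown below; the thresholds and the fallback category are assumptions, not the paper's exact definitions.

```python
# Minimal sketch of assigning an error-type category from repeated samples.
from collections import Counter


def categorize(sampled_answers, gold: str, correct_fn) -> str:
    n = len(sampled_answers)
    num_correct = sum(correct_fn(a, gold) for a in sampled_answers)
    distinct = len(Counter(a.strip().lower() for a in sampled_answers))
    if num_correct == n:
        return "consistently correct"
    if num_correct == 0:
        return "consistently incorrect"
    if distinct > n // 2:  # many different answers across samples (assumed threshold)
        return "many answers"
    return "mixed"  # fallback bucket (assumption)
```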

Finally, the method compares the model's internal representations with its external behavior by using the trained probe to select the best answer from multiple generated responses.
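A minimal sketch of this probe-guided selection, reusing the probe and feature extraction assumed in the earlier sketches, could look like this:

```python
# Minimal sketch of selecting the best answer among candidates using the trained probe.
import numpy as np


def select_best_answer(candidates, candidate_features, probe):
    """candidates: list of answer strings; candidate_features: [n, hidden_dim] array."""
    scores = probe.predict_proba(np.asarray(candidate_features))[:, 1]
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])
```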

Results

The study found that truthfulness information in LLMs is concentrated in specific tokens, particularly the exact answer tokens; leveraging these tokens significantly improves error-detection performance.

It also revealed that error detection methods don't generalize well across different tasks, suggesting that LLMs encode multiple, distinct notions of truth rather than a universal truthfulness mechanism.


Lastly, it uncovered a significant discrepancy between the model's internal states and external behavior, where models sometimes encode the correct answer internally but consistently generate an incorrect one.

Conclusion

This paper provides deep insights into how LLMs encode and process truthfulness during text generation. It demonstrates that LLMs' internal representations contain valuable information about their errors, which could be leveraged to improve their performance. For more information, please consult the full paper.

Congrats to the authors for their work!

Orgad, Hadas, et al. "LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations." arXiv preprint arXiv:2410.02707 (2024).

