LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Credit: https://arxiv.org/pdf/2410.02707


Today's paper explores the internal representations of large language models (LLMs) to better understand and detect their errors, often called "hallucinations". It reveals that LLMs encode more information about the truthfulness of their outputs than previously recognized, but this information is concentrated in specific tokens and doesn't generalize well across different tasks. The paper introduces new methods for analyzing and potentially mitigating LLM errors.

Method Overview

The goal is to predict whether an LLM's generated response to a prompt is correct or incorrect, using only the LLM's internal states (a white-box setting), without external resources such as search engines or other LLMs. Starting from a dataset of questions paired with their correct answers, the model generates a response to each question; the response is then compared against the gold answer and labeled as either correct or incorrect. This yields an error-detection dataset containing each question, the model's answer, and its correctness label. Instances where the model refuses to answer are marked as incorrect and excluded.
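Below is a minimal sketch of this dataset-construction step, assuming a generic generation wrapper, a heuristic refusal filter, and a simple containment-based correctness check; all three are placeholders rather than the paper's exact per-benchmark procedure.

```python
# Minimal sketch of building the error-detection dataset (assumptions noted inline).

REFUSAL_MARKERS = ("i don't know", "i cannot", "i'm not sure")  # heuristic stand-in only


def generate_answer(model, question: str) -> str:
    """Hypothetical wrapper around the LLM's generation call."""
    raise NotImplementedError


def is_correct(generated: str, gold: str) -> bool:
    # Simple containment check as a stand-in for the paper's correctness judgment.
    return gold.lower() in generated.lower()


def build_error_detection_dataset(model, qa_pairs):
    """qa_pairs: iterable of (question, gold_answer) pairs."""
    dataset = []
    for question, gold in qa_pairs:
        answer = generate_answer(model, question)
        if any(m in answer.lower() for m in REFUSAL_MARKERS):
            continue  # refusals are excluded from the error-detection dataset
        dataset.append({
            "question": question,
            "answer": answer,
            "correct": int(is_correct(answer, gold)),
        })
    return dataset
```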

The method probes the internal representations of LLMs at different layers and tokens to detect errors in their outputs. It introduces the concept of "exact answer tokens": the most meaningful parts of a generated response, whose modification would alter the answer's correctness. Probing classifiers (small classifiers trained on a model's intermediate activations to predict properties of the processed text) are then trained on these exact answer tokens to detect errors, yielding significant improvements over existing methods. The paper then explores how well these error-detection methods generalize across different tasks and datasets.
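The sketch below shows one way such a probe over exact-answer-token activations could be trained, assuming hidden states are already extracted (e.g., via output_hidden_states=True in Hugging Face transformers); the layer index and the exact-answer-token locator are assumptions, since the paper selects these per model and dataset.

```python
# Minimal sketch of a probing classifier on exact-answer-token hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression


def exact_answer_token_index(answer_token_ids, full_token_ids) -> int:
    """Hypothetical helper: index of the exact answer token within the full sequence."""
    raise NotImplementedError


def collect_probe_features(hidden_states, answer_indices, layer: int):
    # hidden_states: one array per example with shape [num_layers, seq_len, hidden_dim]
    return np.stack([hs[layer, idx] for hs, idx in zip(hidden_states, answer_indices)])


def train_error_probe(hidden_states, answer_indices, labels, layer: int = 15):
    X = collect_probe_features(hidden_states, answer_indices, layer)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, labels)  # labels: 1 = correct answer, 0 = error
    return probe
```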

Further, the method involves categorizing errors into different types based on the model's behavior across multiple samples of the same question. This taxonomy includes categories like "consistently correct", "consistently incorrect", and "many answers".
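A rough illustration of how such a taxonomy might be assigned from repeated samples is shown below; the thresholds and the fallback category are assumptions, not the paper's exact definitions.

```python
# Minimal sketch of assigning an error-type category from repeated samples.
from collections import Counter


def categorize(sampled_answers, gold: str, correct_fn) -> str:
    n = len(sampled_answers)
    num_correct = sum(correct_fn(a, gold) for a in sampled_answers)
    distinct = len(Counter(a.strip().lower() for a in sampled_answers))
    if num_correct == n:
        return "consistently correct"
    if num_correct == 0:
        return "consistently incorrect"
    if distinct > n // 2:  # many different answers across samples (assumed threshold)
        return "many answers"
    return "mixed"  # fallback bucket (assumption)
```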

Finally, the method compares the model's internal representations with its external behavior by using the trained probe to select the best answer from multiple generated responses.
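A minimal sketch of this probe-guided selection, reusing the probe and feature extraction assumed in the earlier sketches, could look like this:

```python
# Minimal sketch of selecting the best answer among candidates using the trained probe.
import numpy as np


def select_best_answer(candidates, candidate_features, probe):
    """candidates: list of answer strings; candidate_features: [n, hidden_dim] array."""
    scores = probe.predict_proba(np.asarray(candidate_features))[:, 1]
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])
```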

Results

The study found that truthfulness information in LLMs is concentrated in specific tokens, particularly the exact answer tokens; leveraging these tokens significantly improves error-detection performance.

It also revealed that error detection methods don't generalize well across different tasks, suggesting that LLMs encode multiple, distinct notions of truth rather than a universal truthfulness mechanism.


Lastly, it uncovered a significant discrepancy between the model's internal states and external behavior, where models sometimes encode the correct answer internally but consistently generate an incorrect one.

Conclusion

This paper provides deep insights into how LLMs encode and process truthfulness during text generation. It demonstrates that LLMs' internal representations contain valuable information about their errors, which could be leveraged to improve their performance. For more information, please consult the full paper.

Congrats to the authors for their work!

Orgad, Hadas, et al. "LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations." arXiv preprint arXiv:2410.02707 (2024).

