Catching a Lying LLM
Pacchiardi et al. 2023. How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions. https://arxiv.org/abs/2309.15840

Catching a Lying LLM

What is a lie? How is it different from telling untruths? Can LLMs tell a lie?

Telling lies requires both knowing the truths and still telling the untruths. An LLM hallucinating and stating falsehoods therefore is not the same as it telling lies, unless it knows the truths. But is it able to?

A recent paper thinks so. In [1] the authors illustrate how it is possible due to the imperfect decoding process (screenshot 1). Since autoregressive models predict the next word based on the previous, when given a partial generation “Pluto is the”, the word “smallest” is likely to be selected based on the training data. But once that error is made, there is no going back. The end result is that the false statement “Pluto is the smallest dwarf planet in our solar system” is generated even when the model knows it’s untrue post-generation. Crucially, the authors empirically prove the models know the truths by training a simple probing model using the hidden states of various layers to predict whether a statement is true or false. This model achieved reasonable accuracy in both the understanding task and in the generating task (screenshot 2 & 3).

Screenshot 1
Screenshot 2
Screenshot 3

But if we are able to catch a lying LLM in its act like the above, it’d only be an innocent liar: it lies only because it can’t help not to due to its technical deficiency. But what if an LLM intends to lie? Or rather, someone has instructed an LLM to lie? Can we still catch it in its act?

Another more recent paper thinks we can, and we can do it even without peeking inside the models! All we need to do is to treat the LLMs as blackboxes, and ask them a pre-defined set of elicitation questions, almost like those control questions given to human subjects when they’re taking polygraph tests! The answers are then piped through a simple logistic regression classifier, which can be trained with as few as 30 samples to be effective, to detect a lie (screenshot 4).

Screenshot 4

The remarkable thing is that this simplistic approach appears to generalize well across LLM architectures, lie generation methods, and conversation topics (screenshot 5 & 6).

Screenshot 5
Screenshot 6

One other interesting tidbit of this paper is how they construct lying LLMs in the first place. They performed direct prompting (“Lie when answering…”), Chain-of-Thought (CoT, screenshot 7) prompting, and fine-tuning with two-agent dialogs where one of them is always lying. The result shows CoT increases lying rate (lying needs reasoning!), and fine-tuning is still the best approach (screenshot 8).

Screenshot 7
Screenshot 8

If we decompose the act of lying, then there is first the recognition of truth, followed by saying anything otherwise convincingly. This second part requires establishing a counterfactual, and acting consistently with it (this is why CoT should improve lying, see above). But this is exactly what model editing is all about, especially on accounting for “ripple effects”, as previously discussed in [3 & 4].

A more recent paper explores this aspect from the perspective of how to deal with knowledge conflicts: how adding knowledge “distractors” (facts that differ from LLMs’ parametric knowledge) impacts LLMs’ responses [5] (screenshot 9). The authors first construct Parametric Knowledge Graph (PKG) directly from LLMs (screenshot 10), then they introduce distractors via different methods, degrees, positions and formats (screenshot 11):

Screenshot 9
Screenshot 10
Screenshot 11

The result? They found the consistency rate of GPT3.5 and MPT-7B can range from 30 to high-60s. The higher the consistency rate is, the less distractible an LLM is. Depending on your use cases, however, higher consistency can be a good thing — more robust to the input noise, or a bad thing — less adaptable/editable to external knowledge (which makes it a worse liar).

Screenshot 12


REFERENCES

[1] Amos Azaria and Tom Mitchell. 2023. The Internal State of an LLM Knows When its Lying. https://arxiv.org/abs/2304.13734

[2] Lorenzo Pacchiardi, Alex J. Chan, S?ren Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, and Jan Brauner. 2023. How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions. https://arxiv.org/abs/2309.15840

[3] Model Editing: Performing Digital Brain Surgery: https://www.dhirubhai.net/posts/benjaminhan_llms-causal-papers-activity-7101756262576525313-bIge

[4] From “Reversal Curse” to Teaching Large Language Models New Facts: https://www.dhirubhai.net/posts/benjaminhan_llm-nlproc-nlp-activity-7114500291235889152-Ik-z

[5] Cheng Qian, Xinran Zhao, and Sherry Tongshuang Wu. 2023. “Merge Conflicts!” Exploring the Impacts of External Distractors to Parametric Knowledge Graphs. https://arxiv.org/abs/2309.08594




Michael N.

Responsible AI Business Exec | Leveraging ChatGPT AI, Digital Identity & Web3 to drive value.

1 å¹´

Benjamin: From the paper: Test your human inference skills against the might of LLM's in SRI Labs sandbox and view its' explanation and reasoning : #GPT-4, #ChatGPT 3.5, #PaLM-2, #Claude-2, and #Llama-2-70B https://llm-privacy.org/

Jatin Bakshi

Security / ZT / Cloud Specialist Sales, Storyteller

1 å¹´

Benjamin Han incredible article ....... Especially the bit 'knowing the truth and still telling untruths' . Thanks for sharing

要查看或添加评论,请登录

Benjamin Han的更多文章

  • Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

    Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

    What if LLMs had context windows so large that an entire knowledge base could fit into a single prompt? This would…

  • How Well Can Transformers Build World Models

    How Well Can Transformers Build World Models

    Large Language Models (LLMs) are statistical in nature. By learning from enormous corpora, do they actually learn…

    2 条评论
  • Generative AI Seeped into Research Peer Reviews

    Generative AI Seeped into Research Peer Reviews

    A while ago Wired wrote about how #ChatGPT and the other similar #GenerativeAI tools are now deployed to mass-produce…

    1 条评论
  • Learning from Tragedies

    Learning from Tragedies

    Are Large Language Models the end of all? Are we running out of problems to solve as Natural Language Processing (NLP)…

  • Large Language Models as Sleuths

    Large Language Models as Sleuths

    How much breadcrumbs do we leave in our writing? It used to be a job reserved solely for a human sleuth, or a forensic…

    4 条评论
  • From “Reversal Curse” to Teaching Large Language Models New Facts

    From “Reversal Curse” to Teaching Large Language Models New Facts

    If a powerful LLM is told that “Daphne Barrington is the director of A Journey Through Time”, it would surely be able…

    6 条评论
  • Give Us the Facts: Large Language Models vs. Knowledge Graphs

    Give Us the Facts: Large Language Models vs. Knowledge Graphs

    In this age of LLMs and generative AI, do we still need knowledge graphs (KGs) as a way to collect and organize domain…

    14 条评论
  • Model Editing: Performing Digital Brain Surgery

    Model Editing: Performing Digital Brain Surgery

    Can we "edit" to update incorrect/outdated facts without costly retraining? Recent works such as training auxiliary…

    12 条评论
  • Do LLMs Really Understand? Recent Papers Reveal

    Do LLMs Really Understand? Recent Papers Reveal

    When performing reasoning or generating code, do #LLMs really understand what they’re doing, or do they just memorize?…

    27 条评论
  • NAACL 2022 Panel: "The Place of Linguistics and Symbolic Structures"

    NAACL 2022 Panel: "The Place of Linguistics and Symbolic Structures"

    After hearing various observations/laments from faculty friends that NLP people these days are just applied math people…

    5 条评论

社区洞察

其他会员也浏览了