ç™»å½•æŸ¥çœ‹æ›´å¤šå†…å®¹

Catching a Lying LLM

Benjamin Han

Data + Knowledge + AI @ ?

å‘å¸ƒæ—¥æœŸ: 2023å¹´10æœˆ8æ—¥

What is a lie? How is it different from telling untruths? Can LLMs tell a lie?

Telling lies requires both knowing the truths and still telling the untruths. An LLM hallucinating and stating falsehoods therefore is not the same as it telling lies, unless it knows the truths. But is it able to?

A recent paper thinks so. In [1] the authors illustrate how it is possible due to the imperfect decoding process (screenshot 1). Since autoregressive models predict the next word based on the previous, when given a partial generation â€œPluto is theâ€, the word â€œsmallestâ€ is likely to be selected based on the training data. But once that error is made, there is no going back. The end result is that the false statement â€œPluto is the smallest dwarf planet in our solar systemâ€ is generated even when the model knows itâ€™s untrue post-generation. Crucially, the authors empirically prove the models know the truths by training a simple probing model using the hidden states of various layers to predict whether a statement is true or false. This model achieved reasonable accuracy in both the understanding task and in the generating task (screenshot 2 & 3).

But if we are able to catch a lying LLM in its act like the above, itâ€™d only be an innocent liar: it lies only because it canâ€™t help not to due to its technical deficiency. But what if an LLM intends to lie? Or rather, someone has instructed an LLM to lie? Can we still catch it in its act?

Another more recent paper thinks we can, and we can do it even without peeking inside the models! All we need to do is to treat the LLMs as blackboxes, and ask them a pre-defined set of elicitation questions, almost like those control questions given to human subjects when theyâ€™re taking polygraph tests! The answers are then piped through a simple logistic regression classifier, which can be trained with as few as 30 samples to be effective, to detect a lie (screenshot 4).

The remarkable thing is that this simplistic approach appears to generalize well across LLM architectures, lie generation methods, and conversation topics (screenshot 5 & 6).

One other interesting tidbit of this paper is how they construct lying LLMs in the first place. They performed direct prompting (â€œLie when answeringâ€¦â€), Chain-of-Thought (CoT, screenshot 7) prompting, and fine-tuning with two-agent dialogs where one of them is always lying. The result shows CoT increases lying rate (lying needs reasoning!), and fine-tuning is still the best approach (screenshot 8).

If we decompose the act of lying, then there is first the recognition of truth, followed by saying anything otherwise convincingly. This second part requires establishing a counterfactual, and acting consistently with it (this is why CoT should improve lying, see above). But this is exactly what model editing is all about, especially on accounting for â€œripple effectsâ€, as previously discussed in [3 & 4].

é¢†è‹±æŽ¨è

LLM Paper Reading Notes - October 2024

Jean David Ruvini 5 ä¸ªæœˆå‰

The Challenges of Making Fair Machines

Quanta Magazine 3 å‘¨å‰

??Top ML Papers of the Week

DAIR.AI 11 ä¸ªæœˆå‰

A more recent paper explores this aspect from the perspective of how to deal with knowledge conflicts: how adding knowledge â€œdistractorsâ€ (facts that differ from LLMsâ€™ parametric knowledge) impacts LLMsâ€™ responses [5] (screenshot 9). The authors first construct Parametric Knowledge Graph (PKG) directly from LLMs (screenshot 10), then they introduce distractors via different methods, degrees, positions and formats (screenshot 11):

The result? They found the consistency rate of GPT3.5 and MPT-7B can range from 30 to high-60s. The higher the consistency rate is, the less distractible an LLM is. Depending on your use cases, however, higher consistency can be a good thing â€” more robust to the input noise, or a bad thing â€” less adaptable/editable to external knowledge (which makes it a worse liar).

REFERENCES

[1] Amos Azaria and Tom Mitchell. 2023. The Internal State of an LLM Knows When its Lying. https://arxiv.org/abs/2304.13734

[2] Lorenzo Pacchiardi, Alex J. Chan, S?ren Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, and Jan Brauner. 2023. How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions. https://arxiv.org/abs/2309.15840

[3] Model Editing: Performing Digital Brain Surgery: https://www.dhirubhai.net/posts/benjaminhan_llms-causal-papers-activity-7101756262576525313-bIge

[4] From â€œReversal Curseâ€ to Teaching Large Language Models New Facts: https://www.dhirubhai.net/posts/benjaminhan_llm-nlproc-nlp-activity-7114500291235889152-Ik-z

[5] Cheng Qian, Xinran Zhao, and Sherry Tongshuang Wu. 2023. â€œMerge Conflicts!â€ Exploring the Impacts of External Distractors to Parametric Knowledge Graphs. https://arxiv.org/abs/2309.08594

Michael N.

Responsible AI Business Exec | Leveraging ChatGPT AI, Digital Identity & Web3 to drive value.

1 å¹´

Benjamin: From the paper: Test your human inference skills against the might of LLM's in SRI Labs sandbox and view its' explanation and reasoning : #GPT-4, #ChatGPT 3.5, #PaLM-2, #Claude-2, and #Llama-2-70B https://llm-privacy.org/

èµž

å›žå¤

1 æ¬¡å›žåº”

Jatin Bakshi

Security / ZT / Cloud Specialist Sales, Storyteller

1 å¹´

Benjamin Han incredible article ....... Especially the bit 'knowing the truth and still telling untruths' . Thanks for sharing

èµž

å›žå¤

1 æ¬¡å›žåº”

æŸ¥çœ‹æ›´å¤šè¯„è®º

è¦æŸ¥çœ‹æˆ–æ·»åŠ è¯„è®ºï¼Œè¯·ç™»å½•

Benjamin Hançš„æ›´å¤šæ–‡ç«

Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

2025å¹´1æœˆ15æ—¥

Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

What if LLMs had context windows so large that an entire knowledge base could fit into a single prompt? This wouldâ€¦
How Well Can Transformers Build World Models

2024å¹´11æœˆ8æ—¥

How Well Can Transformers Build World Models

Large Language Models (LLMs) are statistical in nature. By learning from enormous corpora, do they actually learnâ€¦

2 æ¡è¯„è®º
Generative AI Seeped into Research Peer Reviews

2024å¹´3æœˆ27æ—¥

Generative AI Seeped into Research Peer Reviews

A while ago Wired wrote about how #ChatGPT and the other similar #GenerativeAI tools are now deployed to mass-produceâ€¦

1 æ¡è¯„è®º
Learning from Tragedies

2023å¹´11æœˆ11æ—¥

Learning from Tragedies

Are Large Language Models the end of all? Are we running out of problems to solve as Natural Language Processing (NLP)â€¦
Large Language Models as Sleuths

2023å¹´10æœˆ24æ—¥

Large Language Models as Sleuths

How much breadcrumbs do we leave in our writing? It used to be a job reserved solely for a human sleuth, or a forensicâ€¦

4 æ¡è¯„è®º
From â€œReversal Curseâ€ to Teaching Large Language Models New Facts

2023å¹´10æœˆ2æ—¥

From â€œReversal Curseâ€ to Teaching Large Language Models New Facts

If a powerful LLM is told that â€œDaphne Barrington is the director of A Journey Through Timeâ€, it would surely be ableâ€¦

6 æ¡è¯„è®º
Give Us the Facts: Large Language Models vs. Knowledge Graphs

2023å¹´9æœˆ2æ—¥

Give Us the Facts: Large Language Models vs. Knowledge Graphs

In this age of LLMs and generative AI, do we still need knowledge graphs (KGs) as a way to collect and organize domainâ€¦

14 æ¡è¯„è®º
Model Editing: Performing Digital Brain Surgery

2023å¹´8æœˆ28æ—¥

Model Editing: Performing Digital Brain Surgery

Can we "edit" to update incorrect/outdated facts without costly retraining? Recent works such as training auxiliaryâ€¦

12 æ¡è¯„è®º
Do LLMs Really Understand? Recent Papers Reveal

2023å¹´7æœˆ10æ—¥

Do LLMs Really Understand? Recent Papers Reveal

When performing reasoning or generating code, do #LLMs really understand what theyâ€™re doing, or do they just memorize?â€¦

27 æ¡è¯„è®º
NAACL 2022 Panel: "The Place of Linguistics and Symbolic Structures"

2022å¹´7æœˆ12æ—¥

NAACL 2022 Panel: "The Place of Linguistics and Symbolic Structures"

After hearing various observations/laments from faculty friends that NLP people these days are just applied math peopleâ€¦

5 æ¡è¯„è®º

See all articles

Catching a Lying LLM

Benjamin Han

Data + Knowledge + AI @ ?

é¢†è‹±æŽ¨è

REFERENCES

Benjamin Hançš„æ›´å¤šæ–‡ç«

ç¤¾åŒºæ´žå¯Ÿ

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†

LLM part 4

New Book: The Computer Book

From Dot Matrix to Digital Age: Republishing My 1984 A.I. Manuscript (part ii of iii)

What are Hash Functions?

So What's A Finite Field and Why Is It So Important to your Privacy?

Understanding the Properties of Numbers: Exploring Prime, Composite, Even, Odd, Rational, and Irrational Numbers

Emerging Research Trends in LLM's

Cipher Cracking: When Computer Bugs Were Real and Bits Were Cogs

Debugging the Dark Side: What If Serial Killers Applied the Scientific Method?

Dive into zk-SNARKs: No Cryptographic Jargon Required!

é¢†è‹±æŽ¨è

REFERENCES

Benjamin Hançš„æ›´å¤šæ–‡ç«

Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

How Well Can Transformers Build World Models

Generative AI Seeped into Research Peer Reviews

Learning from Tragedies

Large Language Models as Sleuths

From â€œReversal Curseâ€ to Teaching Large Language Models New Facts

Give Us the Facts: Large Language Models vs. Knowledge Graphs

Model Editing: Performing Digital Brain Surgery

Do LLMs Really Understand? Recent Papers Reveal

NAACL 2022 Panel: "The Place of Linguistics and Symbolic Structures"

ç¤¾åŒºæ´žå¯Ÿ

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†

LLM part 4

New Book: The Computer Book

From Dot Matrix to Digital Age: Republishing My 1984 A.I. Manuscript (part ii of iii)

What are Hash Functions?

So What's A Finite Field and Why Is It So Important to your Privacy?

Understanding the Properties of Numbers: Exploring Prime, Composite, Even, Odd, Rational, and Irrational Numbers

Emerging Research Trends in LLM's

Cipher Cracking: When Computer Bugs Were Real and Bits Were Cogs

Debugging the Dark Side: What If Serial Killers Applied the Scientific Method?

Dive into zk-SNARKs: No Cryptographic Jargon Required!

é¢†è‹±æŽ¨è

From â€œReversal Curseâ€ to Teaching Large Language Models New Facts

å…¶ä»–ä¼šå‘˜ä¹Ÿæµè§ˆäº†