Do LLMs Really Understand? Recent Papers Reveal

When performing reasoning or generating code, do #LLMs really understand what they’re doing, or do they just memorize? Several new results seem to have painted a not-so-rosy picture.

The authors in [1] test LLMs on “semantic” vs. “symbolic” reasoning: the former is reasoning over language-like input, the latter reasoning over abstract symbols. They use a symbolic dataset and a semantic dataset to probe models’ memorization and reasoning abilities (Fig. 1). For each dataset they create a counterpart in the other modality; e.g., they replace the natural-language labels of relations and entities with abstract symbols to obtain a symbolic version of a semantic dataset (Fig. 2). The end result? LLMs perform much worse on *symbolic* reasoning (Fig. 3), suggesting they lean heavily on the semantics of the words involved rather than truly understanding and following the reasoning patterns.
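
To make the manipulation concrete, here is a minimal sketch of the kind of semantic-to-symbolic conversion described in [1] (my own toy code, not the authors’ pipeline): every entity and relation name is mapped to an opaque symbol, so only the reasoning structure is left for the model to exploit.

```python
# Toy sketch (assumed setup, not the dataset code from [1]): replace every
# entity and relation name in a small fact base with abstract symbols.

facts = [
    ("Alice", "mother_of", "Bob"),
    ("Bob",   "father_of", "Carol"),
]

def make_mapper(prefix):
    """Return a function mapping each distinct name to prefix1, prefix2, ..."""
    table = {}
    def to_symbol(name):
        if name not in table:
            table[name] = f"{prefix}{len(table) + 1}"
        return table[name]
    return to_symbol

ent, rel = make_mapper("e"), make_mapper("r")
symbolic_facts = [(ent(h), rel(r), ent(t)) for h, r, t in facts]

print(symbolic_facts)
# [('e1', 'r1', 'e2'), ('e2', 'r2', 'e3')]
# The inference pattern (e1 r1 e2) & (e2 r2 e3) => (e1 r3 e3) is logically the
# same as the natural-language version, but word semantics can no longer help.
```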

Fig. 1
Fig. 2
Fig. 3

The same tendency is borne out by another paper that tests code-generating LLMs when function names in the input are *swapped* [2] (Fig. 4). Not only did almost all models fail completely, but most of them also exhibit an “inverse scaling” effect: the larger the model, the worse it gets (Fig. 5). The semantic priors learned from these function names completely dominate, and the models don’t really understand what they are doing.
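
To see what such a swap looks like, here is a toy Python example (my own construction, not the actual benchmark prompts from [2]) in which two builtins are rebound to each other, so the names no longer match their usual semantics:

```python
# Toy identifier swap (illustrative only): after the swap, the *name* `len`
# sorts and the *name* `sorted` counts.
len, sorted = sorted, len

def median(xs):
    ys = len(xs)      # despite the name, this sorts the list
    n = sorted(ys)    # ...and this returns its length
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2

print(median([3, 1, 2]))  # 2
```

A model completing code under this swap has to keep using `len` wherever sorting is needed; [2] finds that models instead fall back on the usual meanings of the names, and the larger the model, the stronger the pull of those priors.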

Fig. 4
Fig. 5

How about LLMs on #causalReasoning? There have been reports of extremely impressive performance from #GPT 3.5 and 4, but these models are also inconsistent and may even have cheated by memorizing the tests [3], as discussed in a previous post [4]. In a more recent work [5], the authors test LLMs on *pure* #causalInference tasks, where all variables are symbolic (Fig. 6). They construct a dataset systematically: picking a set of variables, generating all possible #causalGraphs over them, and finally mapping out all the statistical #correlations each graph implies. They then “verbalize” these graphs into problems that ask whether a given causal hypothesis follows (Fig. 7). The results? Both #GPT4 and #Alpaca perform worse than a BART model fine-tuned on MNLI, and not much better than the uniform random baseline (Fig. 8).
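
Here is a minimal sketch of what one such verbalized problem could look like (my own toy example under assumptions, not the authors’ dataset-generation code), using the classic chain graph A -> B -> C:

```python
# Toy example in the spirit of [5]: one symbolic causal graph (A -> B -> C),
# the correlations it implies, and a verbalized premise/hypothesis pair.

# Correlational signature of the chain A -> B -> C: all pairs are marginally
# correlated, but A and C become independent once we condition on B.
premise = (
    "Suppose there are three variables, A, B and C. "
    "A correlates with B, B correlates with C, and A correlates with C. "
    "However, A and C are independent given B."
)
hypothesis = "A directly causes C."

prompt = (
    f"Premise: {premise}\n"
    f"Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes or no."
)
print(prompt)
# Ground truth is "no": a chain A -> B -> C produces exactly this correlation
# pattern without any direct edge from A to C.
```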

Fig. 6
Fig. 7
Fig. 8


(On #Mastodon : https://sigmoid.social/@BenjaminHan/110687186074579773)

#Paper #NLP #NLProc #CodeGeneration #Causation #CausalReasoning #reasoning #research

References

[1] Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, and Muhan Zhang. 2023. Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners. https://arxiv.org/abs/2305.14825

[2] Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, and Shay B. Cohen. 2023. The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python. https://arxiv.org/abs/2305.15507

[3] Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal Reasoning and Large Language Models: Opening a New Frontier for Causality. https://arxiv.org/abs/2305.00050

[4] https://www.dhirubhai.net/posts/benjaminhan_reasoning-gpt-gpt4-activity-7060428182910373888-JnGQ

[5] Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. 2023. Can Large Language Models Infer Causation from Correlation? https://arxiv.org/abs/2306.05836

Kyrtin Atreides

COO | Cognitive Architecture & Cognitive Bias Researcher | Co-founder

Actual AI experts have understood that LLMs have neither reasoning nor understanding in any meaningful sense all along. The number of vocal people without a shred of understanding has mostly just exploded over the past year, skewing heuristic availability. A fundamental understanding of what any technology is and is not capable of is a thing that no one talking about that technology can afford to lose sight of. A few like the researchers at Stanford have made a good showing this year, debunking many of the fraudulent claims, like putting a stake through "emergent abilities": https://arxiv.org/abs/2304.15004

I don’t know how safe it is for us to be denigrating SkyNet in the open

Debela Tesfaye

Senior Data Scientists at ContactEngine

Thanks for the review!

Johnmark Obiefuna

Learning Manager @ Andela

"Understanding" is a function of sentience, which only humans have. "Pattern identification" however, is how machines draw inferences between apparent cause and effect - right or wrong, which is in turn riddled by pre-programmed human biases.
