How Well Can Transformers Build World Models

Large Language Models (LLMs) are statistical in nature. By learning from enormous corpora, do they actually acquire the “world models” latent behind the text? Can next-token prediction really learn the principles governing the environment it operates in?

One recent work from colleagues at Apple gives negative results on math reasoning [1]. They show that on the GSM8K benchmark, a test set of grade-school-level math questions, SOTA transformer models perform significantly worse when the names, the numbers, or both in the problems are simply changed (picture 1). This clearly demonstrates that these models don’t really learn the math principles behind the problems.

Picture 1
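
To make the perturbation concrete, here is a hypothetical sketch of the GSM-Symbolic idea: turn a GSM8K-style question into a template and resample the names and numbers, so that a model which has truly learned the underlying arithmetic should solve every variant. The template, names, and value ranges below are invented for illustration; they are not taken from the paper.

```python
# Hypothetical GSM-Symbolic-style perturbation: same arithmetic, different surface form.
import random

TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have in total?")
NAMES = ["Sophie", "Liam", "Noor", "Mateo"]  # illustrative placeholders

def sample_variant(seed=None):
    """Resample the names and numbers; the ground-truth answer changes accordingly."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 30), rng.randint(2, 30)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
    return question, a + b

if __name__ == "__main__":
    for seed in range(3):
        q, answer = sample_variant(seed)
        print(q, "->", answer)  # a robust reasoner should get every variant right
```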

Another interesting work tests whether transformers really build world models by reducing problems to Deterministic Finite Automata (DFAs), graphs depicting the possible states and the transitions among them as input is consumed, and checking whether these models actually learn them [2]. The authors introduce two new metrics: compression precision and distinction precision (picture 2). Compression precision measures how well a model concludes that the same state, no matter which input sequences are used to reach it, should lead to the same accepted continuations. Distinction precision, on the other hand, measures how well a model can recognize that two input sequences lead to different states.

Picture 2
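
As a rough illustration of the two metrics (a minimal sketch, not the authors’ code), the snippet below builds a toy DFA, uses the true automaton as a stand-in “model” that proposes valid continuations for a prefix, and scores pairs of prefixes: compression precision over pairs that reach the same state, distinction precision over pairs that reach different states. The alphabet, transition table, and continuation length k are arbitrary choices.

```python
# Simplified compression/distinction precision on a toy DFA (illustrative only).
from itertools import product

ALPHABET = "ab"
TRANSITIONS = {          # (state, symbol) -> next state; missing entries are invalid moves
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 0,         # from state 2, only "a" is a valid continuation
}

def run_dfa(seq, start=0):
    """Return the state reached after consuming seq, or None if seq is invalid."""
    state = start
    for ch in seq:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return None
    return state

def model_continuations(prefix, k=2):
    """Stand-in for a trained model: the set of continuations of length <= k it deems
    valid after `prefix`. Here we read the true DFA, so this 'model' is perfect by
    construction; a real evaluation would query a transformer instead."""
    state = run_dfa(prefix)
    return {"".join(t)
            for n in range(1, k + 1)
            for t in product(ALPHABET, repeat=n)
            if run_dfa("".join(t), start=state) is not None}

def precision(pairs, same_state):
    """Fraction of prefix pairs, filtered by whether they reach the same state,
    on which the model's continuation sets agree (same state) or differ (different states)."""
    selected = [(x, y) for x, y in pairs if (run_dfa(x) == run_dfa(y)) == same_state]
    if not selected:
        return float("nan")
    ok = sum((model_continuations(x) == model_continuations(y)) == same_state
             for x, y in selected)
    return ok / len(selected)

if __name__ == "__main__":
    prefixes = ["".join(t) for n in range(1, 4) for t in product(ALPHABET, repeat=n)]
    prefixes = [p for p in prefixes if run_dfa(p) is not None]   # keep only valid prefixes
    pairs = [(x, y) for i, x in enumerate(prefixes) for y in prefixes[i + 1:]]
    print("compression precision:", precision(pairs, same_state=True))
    print("distinction precision:", precision(pairs, same_state=False))
```

Because this stand-in model reads the true DFA, both precisions come out at 1.0; the paper’s point is that real transformers trained on traversal sequences often fall far short, even when their next-token accuracy looks excellent.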

The authors then tested two instances of DFAs: navigating streets in New York City and playing the board game Othello. They found big gaps between the conventional metrics (next-token test and current-state probe) and the new metrics (pictures 3 and 4), as well as big drops in performance, as measured by the new metrics, when the “world” changes, e.g., when the cost of certain roads in the city is randomly increased (“noisy shortest paths” in picture 3). Another interesting finding is that training on random walks actually yields far better performance, because the models get to explore all corners of the world, albeit at a much higher cost.

Picture 3
Picture 4
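
The coverage intuition behind the random-walk result can be illustrated with a small, hypothetical experiment (a sketch of the idea, not the paper’s setup): on a toy street grid, compare how many distinct road segments appear in training sequences drawn from shortest paths versus random walks, and at what cost in total steps.

```python
# Hypothetical comparison of edge coverage: shortest paths vs. random walks on a toy grid.
import random
from collections import deque

def grid_graph(n):
    """4-connected n x n grid; nodes are (row, col) tuples."""
    nbrs = {}
    for r in range(n):
        for c in range(n):
            nbrs[(r, c)] = [(r + dr, c + dc)
                            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                            if 0 <= r + dr < n and 0 <= c + dc < n]
    return nbrs

def shortest_path(nbrs, src, dst):
    """Plain BFS shortest path between two nodes."""
    prev, frontier = {src: None}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            break
        for v in nbrs[u]:
            if v not in prev:
                prev[v] = u
                frontier.append(v)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

def random_walk(nbrs, src, steps, rng):
    """Uniform random walk of a fixed number of steps."""
    path = [src]
    for _ in range(steps):
        path.append(rng.choice(nbrs[path[-1]]))
    return path

def edge_set(path):
    """Undirected edges traversed by a path."""
    return {frozenset(e) for e in zip(path, path[1:])}

if __name__ == "__main__":
    rng = random.Random(0)
    nbrs = grid_graph(10)
    nodes = list(nbrs)
    sp_edges, rw_edges, sp_steps, rw_steps = set(), set(), 0, 0
    for _ in range(200):
        src, dst = rng.sample(nodes, 2)
        p = shortest_path(nbrs, src, dst)
        sp_edges |= edge_set(p)
        sp_steps += len(p) - 1
        w = random_walk(nbrs, src, steps=50, rng=rng)
        rw_edges |= edge_set(w)
        rw_steps += len(w) - 1
    print(f"shortest paths: {len(sp_edges)} distinct edges in {sp_steps} steps")
    print(f"random walks:   {len(rw_edges)} distinct edges in {rw_steps} steps")
```

Random walks touch far more of the graph per trajectory set, which matches the observation that they recover the world model better, but they do so with much longer, costlier traversals.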

There are many interesting questions to explore from here: are these new metrics sufficient to capture a model’s ability to learn world models? What about problems that can’t be reduced to DFAs? How can we map next-token prediction onto learning a deeper and more holistic representation of the world, or should we look for new approaches?

REFERENCES

[1] Iman Mirzadeh, Keivan Alizadeh Vahid, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. https://arxiv.org/abs/2410.05229

[2] Keyon Vafa, Justin Y. Chen, Jon Kleinberg, Sendhil Mullainathan, and Ashesh Rambachan. 2024. Evaluating the world model implicit in a generative model. https://arxiv.org/abs/2406.03689
