How Well Can Transformers Build World Models

Large Language Models (LLMs) are statistical in nature. By learning from enormous corpora, do they actually acquire the “world models” latent behind the text? Can next-token prediction really learn the principles governing the environment it operates in?

One recent work from colleagues at Apple gives negative results on math reasoning [1]. They show that on the GSM8K benchmark, a test set of grade-school-level math questions, SOTA transformer models perform significantly worse when the names, the numbers, or both in the problems are simply changed (picture 1). This clearly demonstrates that these models don’t really learn the math principles behind the problems.

Picture 1
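
To make the perturbation concrete, here is a hypothetical sketch of the GSM-Symbolic idea: turn a GSM8K-style question into a template and resample the names and numbers, so that a model which has truly learned the underlying arithmetic should solve every variant. The template, names, and value ranges below are invented for illustration; they are not taken from the paper.

```python
# Hypothetical GSM-Symbolic-style perturbation: same arithmetic, different surface form.
import random

TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "How many apples does {name} have in total?")
NAMES = ["Sophie", "Liam", "Noor", "Mateo"]  # illustrative placeholders

def sample_variant(seed=None):
    """Resample the names and numbers; the ground-truth answer changes accordingly."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 30), rng.randint(2, 30)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
    return question, a + b

if __name__ == "__main__":
    for seed in range(3):
        q, answer = sample_variant(seed)
        print(q, "->", answer)  # a robust reasoner should get every variant right
```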

Another interesting work tests whether transformers really build world models by reducing problems to Deterministic Finite Automata (DFAs), graphs depicting the possible states and the transitions among them as input is consumed, and checking whether these models actually learn them [2]. The authors introduce two new metrics: compression precision and distinction precision (picture 2). Compression precision measures how well a model concludes that the same state, no matter which input sequences are used to reach it, should lead to the same accepted continuations. Distinction precision, on the other hand, measures how well a model can recognize that two input sequences lead to different states.

Picture 2
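
As a rough illustration of the two metrics (a minimal sketch, not the authors’ code), the snippet below builds a toy DFA, uses the true automaton as a stand-in “model” that proposes valid continuations for a prefix, and scores pairs of prefixes: compression precision over pairs that reach the same state, distinction precision over pairs that reach different states. The alphabet, transition table, and continuation length k are arbitrary choices.

```python
# Simplified compression/distinction precision on a toy DFA (illustrative only).
from itertools import product

ALPHABET = "ab"
TRANSITIONS = {          # (state, symbol) -> next state; missing entries are invalid moves
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 0,         # from state 2, only "a" is a valid continuation
}

def run_dfa(seq, start=0):
    """Return the state reached after consuming seq, or None if seq is invalid."""
    state = start
    for ch in seq:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return None
    return state

def model_continuations(prefix, k=2):
    """Stand-in for a trained model: the set of continuations of length <= k it deems
    valid after `prefix`. Here we read the true DFA, so this 'model' is perfect by
    construction; a real evaluation would query a transformer instead."""
    state = run_dfa(prefix)
    return {"".join(t)
            for n in range(1, k + 1)
            for t in product(ALPHABET, repeat=n)
            if run_dfa("".join(t), start=state) is not None}

def precision(pairs, same_state):
    """Fraction of prefix pairs, filtered by whether they reach the same state,
    on which the model's continuation sets agree (same state) or differ (different states)."""
    selected = [(x, y) for x, y in pairs if (run_dfa(x) == run_dfa(y)) == same_state]
    if not selected:
        return float("nan")
    ok = sum((model_continuations(x) == model_continuations(y)) == same_state
             for x, y in selected)
    return ok / len(selected)

if __name__ == "__main__":
    prefixes = ["".join(t) for n in range(1, 4) for t in product(ALPHABET, repeat=n)]
    prefixes = [p for p in prefixes if run_dfa(p) is not None]   # keep only valid prefixes
    pairs = [(x, y) for i, x in enumerate(prefixes) for y in prefixes[i + 1:]]
    print("compression precision:", precision(pairs, same_state=True))
    print("distinction precision:", precision(pairs, same_state=False))
```

Because this stand-in model reads the true DFA, both precisions come out at 1.0; the paper’s point is that real transformers trained on traversal sequences often fall far short, even when their next-token accuracy looks excellent.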

The authors then tested two instances of DFAs: navigating streets in New York City and playing the board game Othello. They found big gaps between the conventional metrics (next-token test and current-state probe) and the new metrics (pictures 3 and 4), as well as big drops in performance, as measured by the new metrics, when the “world” changes, e.g., when the cost of certain roads in the city is randomly increased (“noisy shortest paths” in picture 3). Another interesting finding is that training on random walks actually yields far better performance, because the models get to explore all corners of the world, albeit at a much higher cost.

Picture 3
Picture 4
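
The coverage intuition behind the random-walk result can be illustrated with a small, hypothetical experiment (a sketch of the idea, not the paper’s setup): on a toy street grid, compare how many distinct road segments appear in training sequences drawn from shortest paths versus random walks, and at what cost in total steps.

```python
# Hypothetical comparison of edge coverage: shortest paths vs. random walks on a toy grid.
import random
from collections import deque

def grid_graph(n):
    """4-connected n x n grid; nodes are (row, col) tuples."""
    nbrs = {}
    for r in range(n):
        for c in range(n):
            nbrs[(r, c)] = [(r + dr, c + dc)
                            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                            if 0 <= r + dr < n and 0 <= c + dc < n]
    return nbrs

def shortest_path(nbrs, src, dst):
    """Plain BFS shortest path between two nodes."""
    prev, frontier = {src: None}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            break
        for v in nbrs[u]:
            if v not in prev:
                prev[v] = u
                frontier.append(v)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]

def random_walk(nbrs, src, steps, rng):
    """Uniform random walk of a fixed number of steps."""
    path = [src]
    for _ in range(steps):
        path.append(rng.choice(nbrs[path[-1]]))
    return path

def edge_set(path):
    """Undirected edges traversed by a path."""
    return {frozenset(e) for e in zip(path, path[1:])}

if __name__ == "__main__":
    rng = random.Random(0)
    nbrs = grid_graph(10)
    nodes = list(nbrs)
    sp_edges, rw_edges, sp_steps, rw_steps = set(), set(), 0, 0
    for _ in range(200):
        src, dst = rng.sample(nodes, 2)
        p = shortest_path(nbrs, src, dst)
        sp_edges |= edge_set(p)
        sp_steps += len(p) - 1
        w = random_walk(nbrs, src, steps=50, rng=rng)
        rw_edges |= edge_set(w)
        rw_steps += len(w) - 1
    print(f"shortest paths: {len(sp_edges)} distinct edges in {sp_steps} steps")
    print(f"random walks:   {len(rw_edges)} distinct edges in {rw_steps} steps")
```

Random walks touch far more of the graph per trajectory set, which matches the observation that they recover the world model better, but they do so with much longer, costlier traversals.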

There are many interesting questions to explore from here: are these new metrics sufficient to capture a model’s ability to learn world models? What about problems that can’t be reduced to DFAs? How can we map next-token prediction onto learning a deeper and more holistic representation of the world, or should we look for new approaches?

REFERENCES

[1] Iman Mirzadeh, Keivan Alizadeh Vahid, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. https://arxiv.org/abs/2410.05229

[2] Keyon Vafa, Justin Y. Chen, Jon Kleinberg, Sendhil Mullainathan, and Ashesh Rambachan. 2024. Evaluating the world model implicit in a generative model. https://arxiv.org/abs/2406.03689
