Do Large Language Models Lack Reasoning?

Problem Statement

It might seem as if large language models (LLMs) lack reasoning. But what does this actually mean?

Let’s break it down:

  • Let x₁ represent the first token the LLM generates.
  • Let x₂ represent the second token.
  • Let x₃ represent the third token.
  • ...
  • Let xₙ represent the nth token.

When we give a prompt to an LLM, here’s what happens:

  1. It generates x₁ such that p(x₁) is maximized.
  2. Then, given x₁, it generates x₂ such that p(x₂ | x₁) is maximized.
  3. Next, it generates x₃ such that p(x₃ | x₁, x₂) is maximized.

And so on, step by step.
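To make this concrete, here is a minimal sketch of that greedy, one-token-at-a-time loop. It assumes the Hugging Face transformers library with GPT-2 as a stand-in model; the prompt and the 20-token budget are arbitrary choices for illustration.

```python
# A minimal sketch of greedy decoding: pick the single most likely next token,
# append it, and repeat. GPT-2 and the prompt are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The Pythagorean theorem states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                    # generate 20 tokens, one at a time
        logits = model(input_ids).logits   # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Production systems often swap the plain argmax for sampling or beam search, but the one-token-at-a-time structure stays the same.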

At first glance, this doesn’t feel like reasoning. It feels like a greedy algorithm, optimizing one step at a time without considering the full picture. This makes us wonder:

  • What if we generated 50 tokens first and realized that a different set of tokens would have created a more logical answer?
  • What if this greedy approach prevents the model from revising earlier decisions?

It feels as if, when an LLM starts generating, it has no idea what it’s going to say 50 or 100 steps ahead. After all, if a model can reason, shouldn’t it already have thought through the entire response?


What Does a Logical Answer Even Mean?

Let’s put LLMs aside for a moment and think about what a logical answer even is.

Imagine we’ve built an Artificial General Intelligence (AGI) system. Given any prompt, this AGI is intelligent enough to produce the most logical and sound answer possible, based on everything it knows.

But what does a "logical and sound" answer mean mathematically? It means AGI generates a sequence of tokens (x?, x?, ..., x?) such that the joint probability p(x?, x?, ..., x?) is maximized.

If the joint probability isn’t maximized, how can we call the answer the most sound? The AGI would ensure that the joint probability of all the tokens in the sequence is the highest possible.
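Written in the article's notation (and conditioning on the prompt), one way to state this idealized objective is:

```latex
x_1^{*}, \ldots, x_n^{*} = \arg\max_{x_1, \ldots, x_n} \, p(x_1, x_2, \ldots, x_n \mid \mathrm{prompt})
```

That is, out of all possible token sequences, the AGI would pick the one whose joint probability, given the prompt, is highest.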


How LLMs Work vs. How AGI Might Work

So far, we’ve discussed two things:

  1. How LLMs generate tokens (sequentially).
  2. How AGI would ideally generate tokens (maximizing joint probability).

Here’s the key insight: LLMs also maximize the joint probability of all tokens, but they do so sequentially.

If an LLM maximizes p(x₁), p(x₂ | x₁), p(x₃ | x₁, x₂), and so on, each separately, what does this mean? It means their product is also maximized! Mathematically, this product is:

p(x₁) × p(x₂ | x₁) × p(x₃ | x₁, x₂) × ... × p(xₙ | x₁, ..., xₙ₋₁)

By the chain rule of probability, this product is exactly the joint probability p(x₁, x₂, ..., xₙ). In other words, LLMs maximize the joint probability of all the generated tokens!
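To see the chain rule at work in code, here is a small sketch that scores a sentence by summing per-token conditional log-probabilities; the sum of the logs is the log of the product, which is the log joint probability. It again assumes GPT-2 via the transformers library, and the example sentence is arbitrary.

```python
# A minimal sketch of the chain rule in log space: the log-probability of a
# sequence equals the sum of its per-token conditional log-probabilities.
# GPT-2 and the example sentence are arbitrary stand-ins.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The cat sat on the mat.", return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                       # (1, seq_len, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    # Position t's distribution predicts token t+1, so align logits[:-1] with ids[1:].
    token_log_probs = log_probs[0, :-1, :].gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)

# Sum of logs = log of the product of conditionals = log p(x_2, ..., x_n | x_1).
print(token_log_probs.sum().item())
```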


Why Don’t LLMs Generate All Tokens at Once?

This ties back to how they’re trained:

During training, we provide the model with the input sequence and have it predict the target sequence one token at a time, with a causal mask hiding the future tokens so each prediction can only depend on the tokens that come before it.
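As a rough sketch of what a single training step looks like under these assumptions (teacher forcing with shifted targets; GPT-2, the learning rate, and the example sentence are placeholders, not the article's own setup):

```python
# A minimal sketch of one next-token-prediction training step (teacher forcing).
# The model's built-in causal attention mask hides future tokens at every position.
# GPT-2, the learning rate, and the example sentence are illustrative only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tokenizer("Paris is the capital of France.", return_tensors="pt").input_ids

logits = model(batch).logits                          # (1, seq_len, vocab_size)
# Shift by one: the logits at position t are trained to predict token t+1.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, logits.size(-1)),
    batch[:, 1:].reshape(-1),
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Every position in the sequence contributes to the loss in parallel, but each prediction only gets to look at the tokens before it.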

Why? Because if the model could see the entire target sequence upfront, it wouldn’t learn to "reason." Instead, it would simply memorize the sequence.

Think of it like teaching students. A good teacher doesn’t dump the full answer on them and say, "Figure it out." Instead, they guide students step by step, correcting them along the way. This approach helps students develop reasoning skills.

Similarly, LLMs learn to predict sequences incrementally, allowing them to generalize better.

Could we show the entire target sequence at once? Sure, but this would likely lead to overfitting and poor learning. Imagine trying to teach the Pythagorean theorem by dumping the proof on a student and saying, "Learn this." It might work sometimes, but most likely, the student would end up memorizing without understanding.

Alternatively, if we walk the student through the proof step by step, they’ll develop reasoning patterns they can apply to other problems. The more they see step-by-step reasoning, the better they learn.

This is exactly how LLMs work. By no means does this mean the LLM isn’t aware of the full picture. It very much is!

Because of this sequential training process, the decoder in an LLM generates tokens sequentially. But this doesn’t mean the LLM isn’t thinking ahead!

In fact, while generating the first token, the LLM is already reasoning about the entire sequence. It just can’t output everything at once because of how it’s designed.

Whenever you’re tempted to think that LLMs lack reasoning, remember this:

  • LLMs don’t simply predict the next token in isolation.
  • They predict tokens based on patterns they’ve learned, ensuring that the target sequence as a whole makes sense for a given input sequence.

So yes, while the process of generating tokens is sequential, the reasoning spans the entire sequence!


Why Do People Think LLMs Lack Reasoning?

The confusion comes from a misunderstanding of how LLMs are built. Many people assume that sequential generation means the model isn’t reasoning about the whole sequence.

But here’s the truth:

  • LLMs have already learned relationships between tokens during training.
  • They understand which sequences are most likely to follow others.

While LLMs generate tokens one at a time, the conditional probability framework ensures that the joint probability of the entire sequence is maximized.


Conclusion

Do LLMs lack reasoning? Absolutely not. The reasoning is there. It’s just embedded in how they optimize for the sequence step by step. LLMs are much smarter than they’re given credit for!
