Can LLMs Actually Solve Math Problems?

At the beginning of the year, I predicted that reasoning would become a hot topic, but I didn’t expect it to blow up so quickly.

Just like my 2023 prediction about MoE, by the end of 2024, reasoning is almost certainly going to be the AI buzzword of the year. I might as well be Diavolo’s King Crimson.

But to be honest, there hasn’t been much worth writing about lately, and I’ve received some complaints from my friends.

So today, I decided to cover two papers at once.

Not just to fill up space, but because these two papers are actually related!

First Paper

What does this paper discuss?

I’ll start with a conclusion: LLMs don’t actually solve math problems; they just rely on brute-force exposure to massive amounts of problems.

Let me explain how this paper validates that claim.

It modifies and intervenes in original problems — not just randomly, but in a way that makes the new problems appear similar while requiring completely different solutions.

These interventions are divided into simple and hard categories.

  • Simple interventions: Minor modifications, like changing a denominator from x + 1 to x + 2.
  • Hard interventions: Drastic changes, like changing the denominator from x + 1 to just x.

You might think changing x + 1 to x + 2, or to just x, isn’t a big deal.

But it’s actually a fundamental shift.

Take the first problem as an example. Changing x + 1 to x + 2 still allows the problem to be solved using factoring.

But if you change it to x, can you still factor it?

No — you have to use the Cauchy-Schwarz Inequality instead.
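To make the contrast concrete, here is a toy analogue I made up (not the paper’s actual problem) of how a one-symbol change in the denominator switches the required technique from factoring to an inequality argument:

```latex
% Toy analogue (not the paper's actual problem).
% Original-style version: the numerator factors, so the denominator cancels.
%   For x > -1:
\frac{x^2 + 2x + 1}{x + 1} = \frac{(x + 1)^2}{x + 1} = x + 1
% Hard perturbation: the denominator x + 1 becomes x, and nothing cancels.
%   For x > 0:
\frac{x^2 + 2x + 1}{x} = x + 2 + \frac{1}{x} \ge 4
% The bound now needs Cauchy-Schwarz: (x + 1/x)(1/x + x) >= (1 + 1)^2,
% hence x + 1/x >= 2, with equality at x = 1.
```

The surface form barely changes, but the solution path changes completely; that mismatch is exactly what the perturbed benchmark is designed to expose.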

For an LLM that has been pretrained on large datasets:

  1. If it has learned CSI (Cauchy-Schwarz Inequality) well or has encountered similar problems frequently:

  • It acquires the knowledge.
  • It internalizes the reasoning and applies it.
  • The model can solve the problem correctly.

  2. But if it hasn’t seen such problems often (since datasets like MATH500 might be dominated by factoring problems), will it just give up?

Not at all! Instead, it stubbornly sticks to factoring and tries to force a Chain-of-Thought (CoT) reasoning process that resembles the original approach.

And of course, it gets the problem wrong.

Summary

  • LLMs can definitely learn problem-solving techniques from their training sets — this is what Chain-of-Thought (CoT) provides.
  • But CoT also has fixed patterns, or reasoning frameworks, and the model blindly applies them even when they’re not appropriate.
  • When a similar-looking problem is modified, the model doesn’t check whether its usual CoT approach is still valid — it just applies it anyway.

Let’s look at another example:

Left: Original Problem

  • Question: Find all integer values of n.
  • Solution:
  • If n is even, simplification leads to 0 = 0, which has no solution.
  • If n is odd, setting n = 2m + 1 and solving gives the unique solution n = 5.

Right: Hard Perturbation (MATH-P-Hard)

  • Modified Problem: The problem is changed and now asks for the smallest integer value of n instead.
  • LLM’s Reasoning:
  • The correct answer is n = 10, but the model outputs both n = 10 and n = 13.
  • Instead of identifying just the smallest n, it memorizes and outputs all possible solutions.

So, sometimes I disagree with the idea that “compression equals intelligence” — of course, I’m not pointing fingers at anyone here.

Conclusion: LLMs solve math problems using probabilities, not true understanding.

Second Paper

Now, someone might ask: “If LLMs rely on probability, why do models like OpenAI’s o-series and DeepSeek-R1 show significant improvements in math reasoning?”

The answer: They’ve learned patterns.

Or rather, they’ve learned different CoT patterns.

With better training methods, models can acquire more diverse reasoning frameworks.

Why does this work?

Let’s look at the next paper.

What This Paper Does

The core idea is simple:

It clusters different CoT reasoning approaches into around 500 distinct “patterns”.

Then, these CoT patterns are explicitly trained into the model.

Now, the model knows which CoT pattern to apply for each type of problem.

That’s all there is to it.

But this is actually a genius idea.

What does this mean?

It turns continuous action spaces into discrete ones.

Once you make reasoning a discrete, finite space, training becomes much easier.
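As a rough sketch of what that discretization could look like in practice (the embedding model, the clustering method, and the data format here are stand-ins of mine, not the paper’s actual pipeline):

```python
# Illustrative sketch: compress free-form CoT traces into a finite set of
# reusable reasoning patterns. sentence-transformers + k-means are stand-ins;
# the paper's actual embedding model and clustering procedure may differ.
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def build_cot_patterns(cot_traces: list[str], n_patterns: int = 500) -> dict[int, list[str]]:
    """Group solution traces into n_patterns discrete reasoning 'actions'."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(cot_traces)            # continuous representation
    labels = KMeans(n_clusters=n_patterns, random_state=0).fit_predict(embeddings)

    patterns: dict[int, list[str]] = defaultdict(list)
    for trace, label in zip(cot_traces, labels):
        patterns[int(label)].append(trace)             # discrete pattern id -> example traces
    return patterns
```

Each cluster id can then be attached to training examples as an explicit label, so the model learns a problem-to-pattern-to-solution mapping instead of searching an unbounded space of token sequences.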

As for whether the paper uses MCTS, best-of-N (BoN), or its own training method, it doesn’t really matter.

  • Any method that combines MCTS + a reward model can probably achieve similar results.
  • The key point is internalizing CoT patterns into the model.

Inference Process in the New Model

  1. First, read the problem.
  2. Identify which CoT pattern applies.
  3. Use that pattern to solve the problem.

That’s it.
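Sketched in code (the two-stage prompting and the generate callable are hypothetical; the actual model presumably internalizes the pattern choice rather than exposing it as a separate call):

```python
# Hypothetical two-stage inference: pick a reasoning pattern, then solve under it.
# `generate` stands in for any LLM completion call; it is not an API from the paper.
from typing import Callable


def solve(problem: str,
          pattern_descriptions: dict[int, str],
          generate: Callable[[str], str]) -> str:
    # 1. Read the problem and identify which CoT pattern applies.
    menu = "\n".join(f"{pid}: {desc}" for pid, desc in pattern_descriptions.items())
    pattern_id = int(generate(
        f"Problem:\n{problem}\n\nPick the best reasoning pattern id from:\n{menu}\n"
        "Reply with the id only."
    ).strip())

    # 2. Use that pattern as the scaffold for the actual solution.
    return generate(
        f"Problem:\n{problem}\n\nSolve it step by step, following this reasoning "
        f"pattern:\n{pattern_descriptions[pattern_id]}"
    )
```

The point of the sketch is only the shape of the loop: the pattern is selected first, and the solution is generated conditioned on it.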

Does This Work?

The paper includes ablation studies, showing improvement across all model sizes.

Although the tests only covered math (the model wasn’t trained on other domains), I believe that embedding these reasoning patterns should generalize to other scenarios as well.

Math problems already have a lot of structure, and 500 reasoning patterns is actually quite a lot.

Where Did These 500 CoT Patterns Come From?

  • Whether they were extracted from Gemini’s data
  • Or generated synthetically

It doesn’t matter.

What matters is that they exist and they work.

Final Thoughts

These are two of the most interesting papers I’ve read recently.

They provide strong insights into future training methodologies.

And personally, I only value papers that convince me through fundamental principles.

Another perspective?

These papers actually made me feel more optimistic.

I’ve always worried that LLMs might eventually replace humans.

But I never had solid evidence to confirm or disprove that fear.

These two papers gave me some confidence.

Can LLMs replace humans?

At least not in their current form.

Can 500 CoT pattern templates really solve all reasoning problems?

What a joke.

Only when LLMs can autonomously generate reasoning and planning pathways will they be ready to compete with humans.

Until then — I’ve got a few more years to play around.
