LLMs can pass math tests but can't do math

People who are marketing LLMs and want to impress others spend a lot of time trying to get their LLM to perform well on benchmarks, which can include the ability to pass high-level math exams. This is what people like Sam Altman mean by statements about next-generation LLMs having "PhD level intelligence".

This is a farce, and people like Altman know it.

It's easy to demonstrate LLMs like ChatGPT doing well on test-type math problems but then failing on similar but much simpler math problems. This is an indicator that it is not "doing" math at all. It is merely looking up the answers in its gigantic memory. When the problem isn't close enough to anything in its memory, it must fall back on its ability to reason logically, which it cannot do. Here is an example.


This isn't a PhD-level problem. This is more like a high school AP statistics problem. Most people who don't have that much math training (or have forgotten what they learned years ago) are going to struggle with this problem. ChatGPT impressively solves it correctly. There are 24 permutations of the four letters, and so a 23/24 chance the zoinker will die on a randomly ordered dinner of the four pests; only 1 of the 24 is ordered alphabetically. If it is attempted N times, the chance of failure every time is (23/24)^N. We want to know the N where this becomes less than 0.1, which is N = 55. You can solve for N using logarithms: N = log(0.1)/log(23/24) = 54.1, then round up to the nearest whole number.
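As a sanity check on that arithmetic, here is a minimal Python sketch (the 1/24 success chance and the 0.1 threshold are taken from the problem as described above) that finds the smallest N both from the logarithm formula and by brute force:

```python
import math

p_fail = 23 / 24    # chance a single randomly ordered attempt fails (not alphabetical)
threshold = 0.1     # we want the chance of failing every attempt to drop below this

# Closed form: smallest N with (23/24)^N < 0.1
n_formula = math.ceil(math.log(threshold) / math.log(p_fail))

# Brute force: multiply out (23/24)^N until it falls below the threshold
n = 0
prob_all_fail = 1.0
while prob_all_fail >= threshold:
    n += 1
    prob_all_fail *= p_fail

print(n_formula, n)  # both print 55
```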

The reason ChatGPT gets this right isn't that it has a deep understanding of statistical concepts and reasoning, but that this is a common enough problem in the data it trained on. It looks enough like the statistics problems it has trained on that it can correctly guess the sequence of tokens that presents the correct argument and produces the right answer.

I want to be clear that this is indeed an incredibly impressive thing and also a very useful thing. But it is not a demonstration of an underlying intelligence capable of grasping mathematical ideas. It is instead an extremely effective information retrieval tool that can reduce myriad word problems down to a common representation where the answer can be looked up and then unraveled back into the specific word problem context to give a sensible answer. This can be done as long as the problem in its reduced representation is there to be found. You can dress up the same underlying problem in millions of variant ways but it is still essentially the same problem. When given enough examples, it can learn to transform the right way. This might seem like it is therefore capable of learning math but really it is just cataloging families of problems and looking up the answer when asked.

If the algorithm were really learning math you would find that it is worse at harder problems; the ones that involve more complex ideas. For example, maybe it can do algebra but can't do multivariate calculus. If, instead, it were just a glorified lookup table, its performance would have nothing to do with problem complexity and everything to do with similarity to problems in its training set. This latter scenario is what we see with LLMs. It does about the same with elementary math problems and quantum physics. This is because it is just dealing with correlations among meaningless tokens, just as its well-known design tells you to expect. It's not doing math or quantum physics. It is doing token correlation math.

We also know that no one can do calculus if they don't know algebra. Calculus depends on algebra, which is why we teach algebra first. But we would have little trouble training an LLM to do calculus without any exposure to algebra. This is because it is not using algebra, or anything else, to reason. It is simply not reasoning at all.

Here is a demonstration of it failing at a similar-sounding but far simpler problem.


The problem is similar to the one above but actually much easier. It doesn't require understanding of statistical concepts, probabilities, logarithms or algebra. It merely requires you to see that you need to come up with an ordering that is not alphabetical in either direction. So if the symbols are (A, B, C), you just need to ensure that B is not in the middle. BAC works, as do BCA, ACB and CAB.
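A brute-force enumeration makes the same point. This is a minimal sketch using the (A, B, C) symbols from the text; it illustrates the counting argument and is not something the LLMs were actually run against:

```python
from itertools import permutations

symbols = ("A", "B", "C")
# The only losing orderings are alphabetical and reverse-alphabetical
forbidden = {symbols, tuple(reversed(symbols))}

# The safe orderings are exactly the ones with B not in the middle
safe = ["".join(p) for p in permutations(symbols) if p not in forbidden]
print(safe)  # ['ACB', 'BAC', 'BCA', 'CAB']
```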

People will generally find this pretty easy. We might say that it is common sense. For the LLM, this is quite challenging simply because it hasn't seen the problem before. Who is crazy enough to write down such a problem? More precisely, when it has seen similar problems, they required statistical analysis. The terminology is actually confusing for it and shoves it in the wrong direction. It has seen problems whose associated tokens involved permutations, probabilities, logarithms and so on, and it goes on to produce a babble of seemingly relevant text that never actually captures the right concept. ChatGPT considers only the ABC and CBA strategies for some reason. Google Gemini treats it like a game theory problem and talks about "simultaneous presentation" and "strategic waiting" strategies. It sees the problem as similar to commonly presented problems such as the prisoner's dilemma.

LLMs are information retrieval tools. When we treat them like people and give them tests made for people, they behave like cheating students who wrote down all their homework problems on their shoes the morning before a test. In this context the tests are not valuable for what they were designed to measure. Passing them doesn't demonstrate fully portable knowledge of deep concepts. It just tests the ability to interpolate answers to commonly presented test problems. There might be value in that, but it's not the kind of generally useful intelligence we are hoping to find. It is shallow learning in comparison, and not as useful in the real world where problems are not just variants of standard test problems.

LLMs are really useful things. We are just better off getting past the incorrect idea of them being intelligent. They are incredible information retrieval tools that function at a deep enough level to be applicable to millions of situations where basic (key, value) lookup strategies are useless. They are nonetheless fuzzy key-value lookup tools with the ability to express the result in human written language. Let's celebrate that advance and put them to use. And let's stop it with the lie that LLMs can or are about to learn mathematics and other difficult subjects and have PhD level intelligence.

Michał Nowotnik

Software Consultant - mnowotnik.com

2 months

The o1-preview does solve the simplified zoinker problem correctly.

Mukunda Johnson

Software Engineer

3 months

I've received private Upwork offers to train LLMs. They're basically cheating to improve benchmarks and sell their product. One thing I do really like about LLMs, though, is that you can potentially train one on a type of problem so that it generates code to solve many variants of it. While it can't reason, it can still map inputs to outputs in a very unique way.

Vincent Granville

AI/LLM Disruptive Leader | GenAI Tech Lab

3 months

I used Wolfram for many years to solve difficult integrals and other math problems that sometimes humans can't solve. LLMs can call Wolfram when facing math problems in a prompt, to solve them.

Michael Foord

Python Trainer, Consultant and Contractor at Agile Abstractions

3 months

But isn't that oh so human...?

