LLMs can pass math tests but can't do math
People marketing LLMs and wanting to impress others are spending a lot of time getting their models to perform well on a number of benchmarks, which can include the ability to pass high-level math exams. This is what people like Sam Altman mean when they talk about next-generation LLMs having "PhD level intelligence".
This is a farce, and people like Altman know it.
It's easy to demonstrate LLMs like ChatGPT doing well on test-style math problems but then failing on similar but much simpler ones. This is an indicator that the model is not "doing" math at all. It is merely looking up the answers in its gigantic memory. When the problem isn't close enough to anything found in its memory, it must fall back on its ability to reason logically, which it cannot do. Here is an example.
This isn't a PhD-level problem. It is more like a high school AP statistics problem. Most people who don't have that much math training (or have forgotten what they learned years ago) are going to struggle with it. ChatGPT impressively solves it correctly. There are 24 permutations of the four letters, and only 1 of the 24 is ordered alphabetically, so there is a 23/24 chance the zoinker will die on a randomly ordered dinner of the four pests. If it is attempted N times, the chance of failure every time is (23/24)^N. We want to know the N where this becomes less than 0.1, which is N = 55. You can solve for N using logarithms: N = log(0.1)/log(23/24) = 54.1, then round up to the nearest whole number.
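As a quick sanity check on that arithmetic (this snippet is my own, not ChatGPT's output), the calculation can be done directly:

```python
import math

# Probability the zoinker dies on one randomly ordered dinner:
# only 1 of the 4! = 24 orderings is alphabetical.
p_fail = 23 / 24

# Smallest N such that the chance of failing every attempt drops below 0.1.
n = math.ceil(math.log(0.1) / math.log(p_fail))
print(n)  # 55, since log(0.1)/log(23/24) ≈ 54.1
```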
The reason ChatGPT gets this right isn't that it has a deep understanding of statistical concepts and reasoning, but that the problem is common enough in the data it trained on. It looks enough like the statistics problems it has trained on that it can correctly guess the sequence of tokens that presents the correct argument and produces the right answer.
I want to be clear that this is indeed an incredibly impressive thing and also a very useful thing. But it is not a demonstration of an underlying intelligence capable of grasping mathematical ideas. It is instead an extremely effective information retrieval tool that can reduce myriad word problems down to a common representation where the answer can be looked up, then unravel that answer back into the specific word-problem context to give a sensible response. This works as long as the problem, in its reduced representation, is there to be found. You can dress up the same underlying problem in millions of variant ways, but it is still essentially the same problem, and given enough examples the model learns to perform the right transformation. This might seem like it is therefore capable of learning math, but really it is just cataloging families of problems and looking up the answer when asked.
If the algorithm were really learning math, you would find that it is worse at harder problems, the ones that involve more complex ideas. For example, maybe it can do algebra but can't do multivariate calculus. If, instead, it were just a glorified lookup table, its performance would have nothing to do with problem complexity and everything to do with similarity to problems in its training set. This latter scenario is what we see with LLMs. It does about the same with elementary math problems as with quantum physics. This is because it is just dealing with correlations among meaningless tokens, exactly as the well-known design tells you to expect. It's not doing math or quantum physics. It is doing token correlation math.
We also know that no one can do calculus without knowing algebra. Calculus depends on algebra, which is why we teach algebra first. But we would have little trouble training an LLM to do calculus without any exposure to algebra. That is because it is not using algebra to reason, or anything else to reason. It is simply not reasoning at all.
Here is a demonstration of failing at a similar-sounding but far simpler problem.
The problem sounds similar to the one above but is actually much easier. It doesn't require an understanding of statistical concepts, probabilities, logarithms, or algebra. It merely requires you to see that you need an ordering that is not alphabetical in either direction. So if the symbols are (A, B, C), you just need to ensure that B is not in the middle: BAC works, as do BCA, ACB, and CAB.
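A brute-force check (again my own sketch, not part of the original exchange) confirms that these four orderings are exactly the ones that work:

```python
from itertools import permutations

symbols = ("A", "B", "C")
# The two forbidden orderings: alphabetical in either direction.
forbidden = {symbols, tuple(reversed(symbols))}

# Every other ordering solves the simpler problem.
valid = [p for p in permutations(symbols) if p not in forbidden]
print(valid)  # [('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B')]
```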
People will generally find this pretty easy. We might say it is common sense. For the LLM, it is quite challenging, simply because it hasn't seen the problem before. Who would be crazy enough to write down such a problem? More precisely, when it has seen similar problems, they required statistical analysis. The terminology actually confuses it and pushes it in the wrong direction. It has seen problems whose associated tokens involved permutations, probabilities, logarithms, and so on, and it goes on to produce a babble of seemingly relevant text that never actually captures the right concept. ChatGPT considers only the ABC and CBA strategies for some reason. Google Gemini treats it like a game theory problem and talks about "simultaneous presentation" and "strategic waiting" strategies. It sees the problem as similar to commonly presented problems such as the prisoner's dilemma.
LLMs are information retrieval tools. When we treat them like people and give them tests made for people, they behave like cheating students who wrote all their homework problems on their shoes the morning before a test. In this context, the tests are not valuable for what they were designed to measure. Passing them doesn't demonstrate fully portable knowledge of deep concepts; it just demonstrates the ability to interpolate answers to commonly presented test problems. There might be value in that, but it's not the kind of generally useful intelligence we are hoping to find. It is shallow learning by comparison, and not as useful in the real world, where problems are not just variants of standard test problems.
LLMs are really useful things. We are just better off getting past the incorrect idea that they are intelligent. They are incredible information retrieval tools that operate at a deep enough level to be applicable to millions of situations where basic (key, value) lookup strategies are useless. They are nonetheless fuzzy key-value lookup tools with the ability to express the result in written human language. Let's celebrate that advance and put them to use. And let's stop it with the lie that LLMs can, or are about to, learn mathematics and other difficult subjects and have PhD-level intelligence.
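To make the "fuzzy key-value lookup" framing concrete, here is a deliberately crude caricature of the idea, using made-up embedding vectors and a toy answer store of my own invention. Real LLMs are vastly more sophisticated than this, but the retrieval flavor the article describes looks roughly like matching a new question to the nearest known problem family:

```python
import numpy as np

# Toy "fuzzy key-value" store: keys are embedding vectors of known problem
# families, values are their cached worked answers. All vectors are invented
# purely for illustration.
keys = {
    "alphabetical-order survival puzzle": np.array([0.9, 0.1, 0.2]),
    "coin-flip streak probability":       np.array([0.2, 0.8, 0.3]),
}
values = {
    "alphabetical-order survival puzzle": "N = ceil(log(0.1) / log(23/24)) = 55",
    "coin-flip streak probability":       "use (1/2)^k for a streak of k flips",
}

def cosine(a, b):
    # Similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vec):
    # Return the cached answer whose key embedding is most similar to the query.
    best = max(keys, key=lambda name: cosine(keys[name], query_vec))
    return values[best]

# A reworded version of the zoinker problem lands near the first key,
# so the stored answer is retrieved even though the wording differs.
print(lookup(np.array([0.85, 0.15, 0.25])))
```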
Software Consultant - mnowotnik.com (2 months ago): The o1-preview does solve the simplified zoinker problem correctly.

Software Engineer (3 months ago): I've received private Upwork offers to train LLMs. They're basically cheating to improve benchmarks and sell their product. One thing I do really like about LLMs, though, is that you can potentially train one on a type of problem so that it generates code to solve many variants of it. While it can't reason, it can still map inputs to outputs in a very unique way.

AI/LLM Disruptive Leader | GenAI Tech Lab (3 months ago): I used Wolfram for many years to solve difficult integrals and other math problems that sometimes humans can't solve. LLMs can call Wolfram when facing math problems in a prompt, to solve them.

Python Trainer, Consultant and Contractor at Agile Abstractions (3 months ago): But isn't that oh so human... ??