LLMs can pass math tests but can't do math
People who are marketing LLMs and want to impress others spend a lot of time getting their LLM to perform well on benchmarks, which can include the ability to pass high-level math exams.
This is a farce, and people like Altman know it.
It's easy to demonstrate LLMs like ChatGPT doing well on test-type math problems but then failing on similar but much simpler ones. This is an indicator that they are not "doing" math at all. The model is merely looking up answers in its gigantic memory. When the problem isn't close enough to anything in its memory, it must fall back on its ability to reason logically, which it cannot do. Here is an example.
This isn't a PhD-level problem. It is more like a high school AP Statistics problem. Most people without much math training (or who have forgotten what they learned years ago) are going to struggle with it. ChatGPT impressively solves it correctly. There are 24 permutations of the four letters, and only 1 of the 24 is alphabetical, so there is a 23/24 chance the zoinker will die on a randomly ordered dinner of the four pests. If it is attempted N times, the chance of failure every time is (23/24)^N. We want the N where this drops below 0.1, which is N = 55. You can solve for N using logarithms: N = log(0.1)/log(23/24) = 54.1, then round up to the nearest whole number.
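To double-check the arithmetic, here is a small Python sketch of the same calculation; it simply reproduces the logarithm step described above:

```python
import math

# One dinner: only 1 of the 4! = 24 orderings is alphabetical,
# so the chance of failure (death) on a random ordering is 23/24.
p_fail = 23 / 24

# Smallest N with (23/24)^N < 0.1, via N > log(0.1) / log(23/24).
n = math.ceil(math.log(0.1) / math.log(p_fail))
print(n)  # 55
```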
The reason ChatGPT gets this right isn't that it has a deep understanding of statistical concepts and reasoning; it is that it has seen problems very much like it before.
I want to be clear that this is indeed an incredibly impressive thing and also a very useful thing. But it is not a demonstration of an underlying intelligence capable of grasping mathematical ideas. It is instead an extremely effective information retrieval tool that can reduce myriad word problems to a common representation.
If the algorithm were really learning math, you would find that it is worse at harder problems, the ones that involve more complex ideas. For example, maybe it can do algebra but can't do multivariate calculus. If, instead, it were just a glorified lookup table, its performance would have nothing to do with problem complexity.
We also know that no one can do calculus without knowing algebra. Calculus depends on algebra, which is why we teach algebra first. But we would have little trouble training an LLM to do calculus without any exposure to algebra. That is because it is not using algebra, or anything else, to reason. It is simply not reasoning at all.
Here is a demonstration of it failing at a similar-sounding but far simpler problem.
The problem is similar to the one above but actually much easier. It doesn't require an understanding of statistical concepts, probabilities, logarithms, or algebra. It merely requires you to see that you need an ordering that is not alphabetical in either direction. So if the symbols are (A, B, C), you just need to ensure that B is not in the middle: BAC works, as do BCA, ACB, and CAB.
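The brute-force check is tiny; a Python sketch like this enumerates every safe ordering for the three symbols:

```python
from itertools import permutations

symbols = ("A", "B", "C")
forbidden = {symbols, symbols[::-1]}  # ABC and CBA

# Safe orderings: everything not alphabetical in either direction.
safe = ["".join(p) for p in permutations(symbols) if p not in forbidden]
print(safe)  # ['ACB', 'BAC', 'BCA', 'CAB']
```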
People will generally find this pretty easy; we might say it is common sense. For the LLM, it is quite challenging simply because it hasn't seen the problem before. Who is crazy enough to write down such a problem? More precisely, when it has seen similar problems, they required statistical analysis. The terminology actually confuses it and pushes it in the wrong direction. It looks at problems whose associated tokens involved permutations, probabilities, logarithms, and so on, and goes on to produce a babble of seemingly relevant text that never actually captures the right concept. ChatGPT considers only the ABC and CBA strategies for some reason. Google Gemini treats it like a game theory problem and talks about "simultaneous presentation" and "strategic waiting" strategies. It sees the problem as similar to commonly presented problems such as the prisoner's dilemma.
LLMs are information retrieval tools
LLMs are really useful things. We are just better off getting past the incorrect idea that they are intelligent. They are incredible information retrieval tools that function at a deep enough level to be applicable to millions of situations where basic (key, value) lookup strategies are useless. They are nonetheless fuzzy key-value lookup tools with the ability to express the result in human written language. Let's celebrate that advance and put them to use. And let's stop it with the lie that LLMs can, or are about to, learn mathematics and other difficult subjects and have PhD-level intelligence.
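To make the "fuzzy key-value lookup" framing concrete, here is a toy sketch. The stored keys and three-dimensional vectors are invented for illustration (a real system would use learned embeddings with far more dimensions), but the idea is the same: an exact dict lookup misses a paraphrase, while a nearest-neighbor search over vectors still returns the closest stored entry.

```python
import math

# Invented toy "memory": key phrases mapped to made-up embedding vectors.
store = {
    "probability every attempt fails": [0.9, 0.1, 0.0],
    "shortest path in a graph":        [0.0, 0.8, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# A paraphrased query, embedded (vector invented for the example).
# An exact store[query] lookup would raise a KeyError; the fuzzy
# lookup finds the nearest stored key instead.
query = [0.85, 0.15, 0.05]
best = max(store, key=lambda k: cosine(store[k], query))
print(best)  # probability every attempt fails
```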
Software Consultant - mnowotnik.com
6 months ago: The o1-preview does solve the simplified zoinker problem correctly.
Software Engineer
6 months ago: I've received private Upwork offers to train LLMs. They're basically cheating to improve benchmarks and sell their product. One thing I do really like about LLMs, though, is that you can potentially train one on a type of problem, where it generates code to solve many variants of it. While it can't reason, it can still map inputs to outputs in a very unique way.
Co-Founder, BondingAI.io
7 months ago: I used Wolfram for many years to solve difficult integrals and other math problems that humans sometimes can't solve. LLMs can call Wolfram when facing math problems in a prompt, to solve them.
Python Trainer, Consultant and Contractor at Agile Abstractions
7 months ago: But isn't that oh so human...?