LLMs can pass math tests but can't do math
People who are marketing LLMs and want to impress others spend a lot of time getting their LLM to perform well on benchmarks, which can include the ability to pass high-level math exams.
This is a farce, and people like Altman know it.
It's easy to demonstrate LLMs like ChatGPT doing well on test-type math problems but then failing on similar but much simpler ones. This is an indicator that they are not "doing" math at all. The model is merely looking up answers in its gigantic memory. When the problem isn't close enough to anything in its memory, it must fall back on its ability to reason logically, which it cannot do. Here is an example.
This isn't a PhD-level problem. It is more like a high school AP Statistics problem. Most people without much math training (or who have forgotten what they learned years ago) are going to struggle with it. ChatGPT impressively solves it correctly. There are 24 permutations of the four letters, and only 1 of the 24 is alphabetical, so there is a 23/24 chance the zoinker will die on a randomly ordered dinner of the four pests. If it is attempted N times, the chance of failure every time is (23/24)^N. We want the N where this drops below 0.1, which is N = 55. You can solve for N using logarithms: N = log(0.1)/log(23/24) = 54.1, then round up to the nearest whole number.
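To double-check the arithmetic, here is a small Python sketch of the same calculation; it simply reproduces the logarithm step described above:

```python
import math

# One dinner: only 1 of the 4! = 24 orderings is alphabetical,
# so the chance of failure (death) on a random ordering is 23/24.
p_fail = 23 / 24

# Smallest N with (23/24)^N < 0.1, via N > log(0.1) / log(23/24).
n = math.ceil(math.log(0.1) / math.log(p_fail))
print(n)  # 55
```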
The reason ChatGPT gets this right isn't that it has a deep understanding of statistical concepts and reasoning; it is that it has seen problems very much like it before.
I want to be clear that this is indeed an incredibly impressive thing and also a very useful thing. But it is not a demonstration of an underlying intelligence capable of grasping mathematical ideas. It is instead an extremely effective information retrieval tool that can reduce myriad word problems to a common representation.
If the algorithm were really learning math, you would find that it is worse at harder problems, the ones that involve more complex ideas. For example, maybe it can do algebra but can't do multivariate calculus. If, instead, it were just a glorified lookup table, its performance would have nothing to do with problem complexity.
We also know that no one can do calculus without knowing algebra. Calculus depends on algebra, which is why we teach algebra first. But we would have little trouble training an LLM to do calculus without any exposure to algebra. That is because it is not using algebra, or anything else, to reason. It is simply not reasoning at all.
Here is a demonstration of it failing at a similar-sounding but far simpler problem.
The problem is similar to the one above but actually much easier. It doesn't require an understanding of statistical concepts, probabilities, logarithms, or algebra. It merely requires you to see that you need an ordering that is not alphabetical in either direction. So if the symbols are (A, B, C), you just need to ensure that B is not in the middle: BAC works, as do BCA, ACB, and CAB.
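The brute-force check is tiny; a Python sketch like this enumerates every safe ordering for the three symbols:

```python
from itertools import permutations

symbols = ("A", "B", "C")
forbidden = {symbols, symbols[::-1]}  # ABC and CBA

# Safe orderings: everything not alphabetical in either direction.
safe = ["".join(p) for p in permutations(symbols) if p not in forbidden]
print(safe)  # ['ACB', 'BAC', 'BCA', 'CAB']
```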
People will generally find this pretty easy; we might say it is common sense. For the LLM, it is quite challenging simply because it hasn't seen the problem before. Who is crazy enough to write down such a problem? More precisely, when it has seen similar problems, they required statistical analysis. The terminology actually confuses it and pushes it in the wrong direction. It looks at problems whose associated tokens involved permutations, probabilities, logarithms, and so on, and goes on to produce a babble of seemingly relevant text that never actually captures the right concept. ChatGPT considers only the ABC and CBA strategies for some reason. Google Gemini treats it like a game theory problem and talks about "simultaneous presentation" and "strategic waiting" strategies. It sees the problem as similar to commonly presented problems such as the prisoner's dilemma.
LLMs are information retrieval tools
LLMs are really useful things. We are just better off getting past the incorrect idea that they are intelligent. They are incredible information retrieval tools that function at a deep enough level to be applicable to millions of situations where basic (key, value) lookup strategies are useless. They are nonetheless fuzzy key-value lookup tools with the ability to express the result in human written language. Let's celebrate that advance and put them to use. And let's stop it with the lie that LLMs can, or are about to, learn mathematics and other difficult subjects and have PhD-level intelligence.
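To make the "fuzzy key-value lookup" framing concrete, here is a toy sketch. The stored keys and three-dimensional vectors are invented for illustration (a real system would use learned embeddings with far more dimensions), but the idea is the same: an exact dict lookup misses a paraphrase, while a nearest-neighbor search over vectors still returns the closest stored entry.

```python
import math

# Invented toy "memory": key phrases mapped to made-up embedding vectors.
store = {
    "probability every attempt fails": [0.9, 0.1, 0.0],
    "shortest path in a graph":        [0.0, 0.8, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# A paraphrased query, embedded (vector invented for the example).
# An exact store[query] lookup would raise a KeyError; the fuzzy
# lookup finds the nearest stored key instead.
query = [0.85, 0.15, 0.05]
best = max(store, key=lambda k: cosine(store[k], query))
print(best)  # probability every attempt fails
```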
Software Consultant - mnowotnik.com
6 months ago: The o1-preview does solve the simplified zoinker problem correctly.
Software Engineer
6 months ago: I've received private Upwork offers to train LLMs. They're basically cheating to improve benchmarks and sell their product. One thing I do really like about LLMs, though, is that you can potentially train one on a type of problem, where it generates code to solve many variants of it. While it can't reason, it can still map inputs to outputs in a very unique way.
Co-Founder, BondingAI.io
7 months ago: I used Wolfram for many years to solve difficult integrals and other math problems that humans sometimes can't solve. LLMs can call Wolfram when facing math problems in a prompt, to solve them.
Python Trainer, Consultant and Contractor at Agile Abstractions
7 months ago: But isn't that oh so human...?