Teaching an old monkey new tricks

There is a correction to this post, including new information about the FrontierMath benchmark. (edited 2025-01-27)


The moment OpenAI released ChatGPT, quickly followed by competing products from Meta/Facebook, Google and everybody else, the question became: Who has the bigger Schwartz, aka which Large Language Model (LLM) is better? A slew of benchmark tests came out and the race began.


Soon we learned that AI could pass bar exams, and OpenAI posted how GPT-4 was acing the Uniform Bar Exam, the LSAT and many others.


Here is an overview of the top 10 LLMs and their capabilities. This is how they are described: they 'excel at handling complex math problems', 'from basic arithmetic to advanced calculus', or can 'handle complex, multi-step calculations while prioritizing logical accuracy and transparency'.


There is not much left for mathematicians to do. Just keep asking questions and AI will do the rest.


Except. There is a wrinkle.


On November 8th, Epoch AI released FrontierMath, a math benchmark designed to test the limits of AI. And the news is no longer that great for any of the LLMs. Actually, the news is really bad for all of them: every model solved less than 2% of the problems in the benchmark. While the famous LLMs hit near-perfect scores on other benchmarks like GSM-8K or MATH, here they barely register.


Why is that?


This is the fundamental problem with training algorithms. When companies train their LLMs, they throw every possible piece of content at them. Remember my post, Oil and data. There will be blood, where I noted that researchers argued, and the media happily repeated the argument (wrongly), that soon we will run out of data to train AI.


The assumption is that the more we feed AI, the smarter it will become. Another assumption was that if the models become bigger and bigger, they will be able to solve bigger and more complex questions.


It appears that is not the case. The team behind the FrontierMath benchmark created hundreds of original, aka never-before-seen, math problems. In their own words: 'These problems span major branches of modern mathematics, from computational number theory to abstract algebraic geometry, and typically require hours or days for expert mathematicians to solve.'


These problems can also be quickly validated against a single number, one which is difficult to guess or to stumble on by random calculation. To illustrate, here are two answers: 3677073 and 1876572071974094803391179. You either know or you don't.
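To see why this design makes grading trivial, here is a minimal sketch in Python. This is an illustration of the idea, not Epoch AI's actual evaluation harness: when every problem has a single exact integer answer, scoring reduces to an exact comparison, with no room for partial credit or lucky fuzzy matches.

```python
def grade(model_answer: str, expected: int) -> bool:
    """Return True only if the model's answer parses to the exact expected integer."""
    try:
        return int(model_answer.strip()) == expected
    except ValueError:
        # Anything that is not a clean integer (prose, formulas, refusals) scores zero.
        return False

# The two example answers quoted above:
assert grade("3677073", 3677073)
assert not grade("3677072", 3677073)  # off by one scores zero
assert grade("1876572071974094803391179", 1876572071974094803391179)

# A benchmark solve rate is then just the mean of these booleans
# over all problems (hypothetical toy data below):
answers = [("3677073", 3677073), ("42", 7)]
solve_rate = sum(grade(a, e) for a, e in answers) / len(answers)
```

Note that Python integers are arbitrary-precision, so even the 25-digit answer compares exactly; there is no floating-point rounding to hide behind.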


This is the gap between how AI is portrayed and its actual capabilities. Feed it all the legal documents, then ask it legal questions, and you get the impression that you don't need a lawyer. See it answer Grade 12 math questions and be amazed. However, consider that the model has been trained on all the math tests going back years, and it is suddenly not that amazing. Give it something it has not been trained on and it becomes an expensive piece of hardware.


It is still a well-trained circus monkey and regardless of how you dress it, it is still a monkey.


Once we re-discover the old pattern of evaluating technology for what it is and what can be done with it, we will move forward much faster. This is a pattern which keeps recurring.

Vaclav Vincalek

Technology entrepreneur, CTO and technology advisor for startups and fast-growing companies. Creating Strategic options with Technology.

Feite Kraay

Author | Speaker | Ecosystem and Channel Sales Leader | IBM Champion | Quantum Enthusiast

4 months ago

The more I look at it the more I think Frontier might be on to something here. Turing Test 2.0 maybe?

Fortunato Vega

Confidant to individuals, families, and business owners. Specialty: Planning and developing a strategy for whatever is next while amassing relevant resources and expertise.

4 months ago

Thanks for sharing Vaclav Vincalek. Always happy to hear your thought-provoking insights. For those who have not had this opportunity, sign up here: https://www.dhirubhai.net/newsletters/recurrent-patterns-6914330464384163841/


Vaclav Vincalek The AI hysteria we've seen recently is analogous to drug addicts who have lost contact with their pusher. In the old days, companies could create competitive advantage with new technology. But now, the days of technology differentiation are over. Almost anyone can replicate a tech-focused product in a short amount of time. And AI was supposed to take a company back to the good old days of "better technology." But it's not the type of drug the addict was desperate to find.
