ARC-AGI: The Ultimate Test of Machine Intelligence

The benchmark AI can’t pass



Hello Everyone,

Our understanding of what AGI might be seems to evolve on a monthly basis, even in 2024.

François Chollet is a legend. He’s a French software engineer and artificial intelligence researcher currently working at Google. Chollet is the creator of the Keras deep-learning library, released in 2015.

Introduced in 2019 by François Chollet, the Abstraction and Reasoning Corpus (ARC) is a unique benchmark designed to measure AI skill acquisition and track progress toward human-level AI.

This is a guest post by Jurgen Gravestein, the author of Teaching Computers how to Talk.

Jurgen is an exceptionally lucid and skilled guide to generative AI, chatbots, agents, and the world of accomplishing workplace and professional tasks with this new array of AI tools. His short, highly readable essays are must-reads for macro insights into the space, and you will keep coming back to them.


Jurgen Gravestein works for the professional services branch of Conversation Design Institute and has trained more than 100 conversational AI teams globally. He has been teaching computers how to talk since 2018. His eponymous newsletter is read by folks in tech, academia, and journalism. Subscribe for free here.

  • To read this post in full, go here.


ARC-AGI: The AI Benchmark With a $1,000,000 Prize Pool

Artificial general intelligence (AGI) progress has stalled. New ideas are needed. That’s the premise of ARC-AGI, an AI benchmark that has garnered worldwide attention after Mike Knoop, François Chollet, and Lab42 announced a $1,000,000 prize pool.

ARC-AGI stands for “Abstraction and Reasoning Corpus for Artificial General Intelligence” and aims to measure the efficiency of AI skill acquisition on unknown tasks. François Chollet, the creator of ARC-AGI, is a deep learning veteran. He’s the creator of Keras, an open-source deep learning library adopted by over 2.5M developers worldwide, and works as an AI researcher at Google.

The ARC-AGI benchmark isn’t new. It has actually been around for a while, five years to be exact. And here’s the crazy part: since its introduction in 2019, no AI has been able to solve it.

What makes ARC so hard for AI to solve?

Now, I know what you’re thinking: if AI can’t pass the test, this ARC thing must be pretty hard. Turns out, it isn’t. Most of its puzzles can be solved by a five-year-old.

The benchmark was explicitly designed to compare artificial intelligence with human intelligence. It doesn’t rely on acquired or cultural knowledge. Instead, the puzzles (for lack of a better word) require something that Chollet refers to as ‘core knowledge’. These are things that we as humans naturally understand about the world from a very young age.

Here are a few examples:

  1. Objectness: Objects persist and cannot appear or disappear without reason. Objects can interact or not, depending on the circumstances.
  2. Goal-directedness: Objects can be animate or inanimate. Some objects are “agents” — they have intentions and they pursue goals.
  3. Numbers & counting: Objects can be counted or sorted by their shape, appearance, or movement using basic mathematics like addition, subtraction, and comparison.
  4. Basic geometry & topology: Objects can be shapes like rectangles, triangles, and circles, which can be mirrored, rotated, translated, deformed, combined, repeated, etc. Differences in distance can be detected.

As children, we learn experimentally. We learn by interacting with the world, often through play, and that which we come to understand intuitively, we apply to novel situations.
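
To make this concrete: in the public ARC dataset, each task is a small JSON record holding a few “train” demonstration pairs and a “test” input, where every grid is a 2-D array of integers 0-9, each integer a color. Here is a minimal Python sketch of what a solver faces; the toy task and the single hard-coded transformation are purely illustrative, not a real ARC puzzle or a competitive approach.

    # A toy task in the public ARC format: "train" holds demonstration
    # input/output pairs; "test" holds inputs whose outputs must be predicted.
    # Grids are 2-D lists of ints 0-9, each int representing a color.
    task = {
        "train": [
            {"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]},
            {"input": [[5, 0, 6]], "output": [[6, 0, 5]]},
        ],
        "test": [{"input": [[7, 8, 9]]}],
    }

    def mirror(grid):
        # Reverse every row: a "basic geometry" prior (horizontal reflection).
        return [list(reversed(row)) for row in grid]

    # A toy solver: if one hypothesized transformation explains every
    # demonstration pair, apply it to the test input.
    if all(mirror(pair["input"]) == pair["output"] for pair in task["train"]):
        print(mirror(task["test"][0]["input"]))  # prints [[9, 8, 7]]

The hard part, of course, is that a real solver can’t hard-code mirror: it has to infer the right transformation from the demonstrations alone, for a task it has never seen before.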

But wait, didn’t ChatGPT pass the bar exam?

Now, you might be under the impression that AI is pretty smart already. With every test it passes — whether it is a medical, law, or business school exam — it strengthens the idea that these systems are intellectually outclassing us.

If you believe the benchmarks, AI is well on its way to outperforming humans on a wide range of tasks. Surely it can solve this ARC-test, no?

To answer that question, we should take a closer look at how AI manages to pass these tests.

Large language models (LLMs) have the ability to store a lot of information in their parameters, so they tend to perform well when they can rely on stored knowledge rather than reasoning. They are so good at storing knowledge that sometimes they even regurgitate training data verbatim, as evidenced by the court case brought against OpenAI by the New York Times.

So when it was reported that GPT-4 passed the bar exam and the US medical licensing exam, the question we should ask ourselves is: could it have simply memorized the answers? We can’t check if that is the case, because we don’t know what is in the training data, since very few AI companies disclose this kind of information.

This is commonly referred to as the contamination problem. And it is for this reason that Arvind Narayanan and Sayash Kapoor have called evaluating LLMs a minefield.

ARC does things differently. The test itself doesn’t rely on knowledge stored in the model. Instead, the benchmark consists exclusively of visual reasoning puzzles that are pretty obvious to solve (for humans, at least).

To tackle the problem of contamination, ARC uses a private evaluation set. This ensures that the test itself doesn’t become part of the data the AI is trained on. To be eligible for the prize money, you also need to open-source your solution and publish a paper outlining what you’ve done to solve it.

This rule does two things:

  1. It forces transparency, making it harder to cheat.
  2. It promotes open research. Strong market incentives have pushed companies to go closed source, but it wasn’t always like that. ARC was created in the spirit of the days when AI research was still done in the open.

Are we getting closer to AGI?

ARC’s prize money is awarded to the team, or teams, that score at least 85% on the private evaluation during an annual competition period. This year’s competition runs until November 10, 2024, and if no one claims the grand prize, it will continue during the next annual competition. Thus far no AI has been up to the task.
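
The grading itself is mechanical and all-or-nothing: a prediction counts only if it reproduces the expected output grid exactly, cell for cell. Below is a minimal Python sketch of that scoring rule; it is simplified (the real competition permits a small fixed number of attempts per task), and the function names are illustrative.

    def exact_match(predicted, expected):
        # ARC gives no partial credit: every cell of the grid must agree.
        return predicted == expected

    def score(solver, hidden_tasks):
        # Fraction of private-evaluation test inputs solved exactly.
        solved = sum(
            exact_match(solver(t["test"][0]["input"]), t["test"][0]["output"])
            for t in hidden_tasks
        )
        return solved / len(hidden_tasks)

    # The grand prize requires score(...) >= 0.85 on the private set.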

According to Chollet, progress toward AGI has stalled. While LLMs are trained on unimaginably vast amounts of data, they remain brittle reasoners and are unable to adapt to simple problems they haven’t been trained on. Despite that, research attention and capital keep pouring in, in the hope these capabilities will somehow emerge from scaling our current approach. Chollet, and others with him, have argued this is unlikely.

To promote the launch of the ARC-AGI Prize 2024, François Chollet and Mike Knoop were interviewed by Dwarkesh Patel. I recommend you watch it in full here.

During that interview, Chollet said: “Intelligence is what you use when you don’t know what to do.” It's a quote that belongs to Jean Piaget, a famous Swiss psychologist who has written a lot about cognitive development in children.

The simple nature of the ARC puzzles is what makes it so powerful. Most AI benchmarks measure skill. But skill is not intelligence. General intelligence is the ability to efficiently acquire new skills. And the fact that ARC remains unbeaten speaks to its resilience. New ideas are needed.

Oh, and to those who think that solving ARC equals solving AGI…

Looking to test your own intelligence on the ARC benchmark? You can play here.



TOP ARTICLES:

Read more by Jurgen Gravestein:

The Intelligence Paradox

https://jurgengravestein.substack.com/p/the-intelligence-paradox

Why Your AI Assistant Is Probably Sweet Talking You

https://jurgengravestein.substack.com/p/why-your-ai-assistant-is-probably

What AI Engineers Can Learn From Wittgenstein

https://jurgengravestein.substack.com/p/what-ai-engineers-can-learn-from


Thanks for posting. Yes, hard to measure machine vs. human intelligence when we can’t yet define intelligence clearly and don’t know enough about how humans think. We do know that humans commonly apply heuristics to address challenges that are currently uncomputable in reasonable timeframes.

Travis Libre

SWE @ Northrop Grumman

2 months ago

Thoughts on this article, given today's events? Michael Spencer

Andrei Gogiu

Cybersecurity Consultant at Dell Technologies | AI Security | Writer

7 months ago

This sounds like a Raven IQ test designed for AI. Pretty clever.

PURNENDU RANJAN RAJA - Solution Architect - Presales Expert

20 years of IT experience and 8 years of experience as a Solution Architect (Drupal, PHP, Python, React, Angular, Java, AWS) & Presales Expert at Sonata Software Hyderabad

7 months ago

I agree!


Thanks for posting…thought-provoking. In our surveys of global AI experts, most think that AGI is at least 25 years away. Guess we’ll see. The arguments around AGI bring up the ancient mind-body debate (Descartes’ thinking vs extended matter), such as whether ‘mind’ or consciousness is inherent in all matter or something separate that must be acquired.
