ARC-AGI: The Ultimate Test of Machine Intelligence
Michael Spencer
AI writer, researcher, and curator; full-time newsletter publication manager.
The benchmark AI can’t pass
Hello Everyone,
Our understanding of what AGI might be seems to evolve on a monthly basis, even in 2024.
François Chollet is a legend. He’s a French software engineer and artificial intelligence researcher currently working at Google. Chollet is the creator of the Keras deep-learning library, released in 2015.
Introduced in 2019 by François Chollet, the Abstraction and Reasoning Corpus (ARC) is a unique benchmark designed to measure AI skill acquisition and track progress toward human-level AI.
This is a guest post by Jurgen Gravestein who is the author behind Teaching Computers how to Talk.
Jurgen is an exceptionally lucid and skilled guide to generative AI, chatbots, agents, and the practical business of getting workplace and professional tasks done with this new array of AI tools. His short, highly readable essays are must-reads for macro insights into the space, and ones you will keep coming back to.
In partnership with Prolific
Create datasets to fine-tune AI, with Prolific
Trusted by over 3000 world-renowned organizations.
Jurgen Gravestein works for the professional services branch of Conversation Design Institute and has trained more than 100 conversational AI teams globally. He has been teaching computers how to talk since 2018. His eponymous newsletter is read by folks in tech, academia, and journalism. Subscribe for free here.
ARC-AGI: The AI Benchmark With a $1,000,000 Prize Pool
Artificial general intelligence (AGI) progress has stalled. New ideas are needed. That’s the premise of ARC-AGI, an AI benchmark that has garnered worldwide attention after Mike Knoop, François Chollet, and Lab42 announced a $1,000,000 prize pool.
ARC-AGI stands for “Abstraction and Reasoning Corpus for Artificial General Intelligence” and aims to measure the efficiency of AI skill acquisition on unknown tasks. François Chollet, the creator of ARC-AGI, is a deep learning veteran. He’s the creator of Keras, an open-source deep learning library adopted by over 2.5M developers worldwide, and works as an AI researcher at Google.
The ARC-AGI benchmark isn’t new. It has actually been around for a while, five years to be exact. And here’s the crazy part: since its introduction in 2019, no AI has been able to solve it.
What makes ARC so hard for AI to solve?
Now I know what you’re thinking: if AI can’t pass the test, this ARC thing must be pretty hard. Turns out, it isn’t. Most of its puzzles can be solved by a five-year-old.
The benchmark was explicitly designed to compare artificial intelligence with human intelligence. It doesn’t rely on acquired or cultural knowledge. Instead, the puzzles (for lack of a better word) require something that Chollet refers to as ‘core knowledge’. These are things that we as humans naturally understand about the world from a very young age.
Here are a few examples: objectness (objects cohere, persist, and interact), goal-directedness, counting and small quantities, and basic geometry and topology.
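To make that concrete: the publicly released ARC tasks are distributed as small JSON files, each holding a few “train” demonstration pairs and one or more “test” inputs, where every grid is a list of rows of integers 0–9 standing for colors. Below is a minimal sketch, in Python, of loading and printing one task. The file path is illustrative and the helper functions are my own, not part of any official ARC tooling.

```python
import json

def load_arc_task(path):
    """Load a single ARC task file.

    Each task is a JSON object with two keys:
      "train": a list of {"input": grid, "output": grid} demonstration pairs
      "test":  a list of pairs whose "input" grid the solver must map to an output
    where a grid is a list of rows of integers 0-9 (each integer is a color).
    """
    with open(path) as f:
        return json.load(f)

def show_grid(grid):
    """Print a grid row by row as a block of digits."""
    for row in grid:
        print("".join(str(cell) for cell in row))

# Illustrative path; the public training tasks live in the ARC-AGI GitHub repo.
task = load_arc_task("data/training/0a1b2c3d.json")

for i, pair in enumerate(task["train"]):
    print(f"--- demonstration {i} ---")
    show_grid(pair["input"])
    print("maps to:")
    show_grid(pair["output"])
```

The whole challenge is inferring, from those few demonstrations alone, the transformation that maps each test input to its output.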
As children, we learn experimentally. We learn by interacting with the world, often through play, and that which we come to understand intuitively, we apply to novel situations.
But wait, didn’t ChatGPT pass the bar exam?
Now, you might be under the impression that AI is pretty smart already. With every test it passes — whether it is a medical, law, or business school exam — it strengthens the idea that these systems are intellectually outclassing us.
If you believe the benchmarks, AI is well on its way to outperforming humans on a wide range of tasks. Surely it can solve this ARC test, no?
To answer that question, we should take a closer look at how AI manages to pass these tests.
Large language models (LLMs) can store a lot of information in their parameters, so they tend to perform well when they can rely on stored knowledge rather than reasoning. They are so good at storing knowledge that they sometimes regurgitate training data verbatim, as evidenced by the court case brought against OpenAI by the New York Times.
So when it was reported that GPT-4 passed the bar exam and the US medical licensing exam, the question we should ask ourselves is: could it have simply memorized the answers? We can’t check, because very few AI companies disclose what is in their training data.
This is commonly referred to as the contamination problem. And it is for this reason that Arvind Narayanan and Sayash Kapoor have called evaluating LLMs a minefield.
ARC does things differently. The test itself doesn’t rely on knowledge stored in the model. Instead, the benchmark consists exclusively of visual reasoning puzzles that are pretty obvious to solve (for humans, at least).
To tackle the problem of contamination, ARC uses a private evaluation set. This is done to ensure that the test itself doesn’t become part of the data that the AI is trained on. You also need to open source the solution and publish a paper outlining what you’ve done to solve it in order to be eligible for the prize money.
This rule does two things: it keeps winning approaches from staying locked up as proprietary secrets, and it ensures that any real progress on the benchmark becomes knowledge the whole field can build on.
Are we getting closer to AGI?
ARC’s prize money is awarded to the team, or teams, that score at least 85% on the private evaluation during an annual competition period. This year’s competition runs until November 10, 2024, and if no one claims the grand prize, it will continue during the next annual competition. Thus far no AI has been up to the task.
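To make that 85% threshold concrete, here is a rough sketch of how a solver might be scored: a task counts as solved only if the predicted output grid matches the expected grid exactly, and the final score is the fraction of tasks solved. This is a simplification of my own, not the official harness, which among other things allows a small number of attempts per test input and runs against the private evaluation set.

```python
def score_solver(solver, tasks):
    """Return the fraction of tasks the solver gets exactly right.

    `solver` takes the full task dict plus one test input grid and returns
    a predicted output grid (a list of rows of ints). A task counts as
    solved only if every test prediction matches its expected grid exactly.
    """
    solved = 0
    for task in tasks:
        predictions = [solver(task, pair["input"]) for pair in task["test"]]
        expected = [pair["output"] for pair in task["test"]]
        if predictions == expected:
            solved += 1
    return solved / len(tasks)

# A deliberately naive baseline: guess that the output equals the input.
identity_solver = lambda task, grid: grid

# tasks = [load_arc_task(p) for p in task_paths]   # task_paths is hypothetical
# print(f"Score: {score_solver(identity_solver, tasks):.1%}")  # needs >= 85% to win
```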
According to Chollet, progress toward AGI has stalled. While LLMs are trained on unimaginably vast amounts of data, they remain brittle reasoners, unable to adapt to simple problems they haven’t been trained on. Despite that, research attention and capital keep pouring in, in the hope that these capabilities will somehow emerge from scaling the current approach. Chollet and others have argued this is unlikely.
To promote the launch of the ARC-AGI Prize 2024, François Chollet and Mike Knoop were interviewed by Dwarkesh Patel. I recommend you watch it in full here.
During that interview, Chollet said: “Intelligence is what you use when you don’t know what to do.” It’s a quote attributed to Jean Piaget, the famous Swiss psychologist who wrote extensively about cognitive development in children.
The simple nature of the ARC puzzles is what makes it so powerful. Most AI benchmarks measure skill. But skill is not intelligence. General intelligence is the ability to efficiently acquire new skills. And the fact that ARC remains unbeaten speaks to its resilience. New ideas are needed.
Oh, and to those who think that solving ARC equals solving AGI…
Looking to test your own intelligence on the ARC benchmark? You can play here.
Read more by Jurgen Gravestein:
The Intelligence Paradox
Why Your AI Assistant Is Probably Sweet Talking You
What AI Engineers Can Learn From Wittgenstein