AI Benchmarking and German Cars
Carlo De Marchis
Advisor. 35+ years in sports & media tech. "A guy with a scarf" Public speaker. C-suite, strategy, product, innovation, OTT, digital, B2B/D2C marketing, AI/ML.
I just watched Sam Altman present episode 12 of OpenAI's "shipmas", announcing the new o3 model and leaning heavily on benchmarks (what else?).
Even before AI, I always had a thing for trying to understand, beyond the benchmarks, whether innovations in speed or power really changed my (or our) real experience with software and hardware. Even with AI, my real daily heavy use of a model sometimes tells a different story from what the benchmarks say. That's OK, my experience is one case in a billion.
But... this also immediately brought to mind the now famous German car emissions scandal (I've seen great documentaries on it), in which cars were essentially programmed to do well in tests and much worse on the streets.
Also, the fact that OpenAI is "collaborating" with the testing bodies does not sit well with me, even if I understand that in truly hyper-advanced fields it is tough to find independent experts at the same level, and who the f"£$% would pay them to do it?
Hence I thought about expanding on this in the article below.
The AI Benchmarking Race: Learning from History's Testing Failures
In the increasingly competitive landscape of artificial intelligence development, benchmarks have become the gold standard for measuring progress. However, recent developments in the industry, particularly OpenAI's latest announcements during their "shipmas" event, raise important questions about the transparency and reliability of AI benchmarking practices. The situation bears striking parallels to the 2015 Volkswagen emissions scandal, offering crucial lessons about the risks of benchmark-focused development.
The Race to AGI: Benchmarks as the New Battlefield
OpenAI's recent announcement of their o3 model family marks a significant milestone in the company's pursuit of Artificial General Intelligence (AGI). With claimed scores of 87.5% on the ARC-AGI benchmark under high-compute settings and remarkable performances across other tests, including a 96.7% score on the 2024 American Invitational Mathematics Exam, the company suggests they're approaching their definition of AGI – "highly autonomous systems that outperform humans at most economically valuable work."
However, these impressive numbers deserve closer scrutiny. The announcement reveals that OpenAI is partnering with the foundation behind ARC-AGI to help build the next generation of the benchmark. This collaboration between a major AI company and a benchmark provider raises questions about potential conflicts of interest, reminiscent of how automotive companies once maintained close relationships with emissions testing bodies.
The Volkswagen Parallel: When Tests Don't Reflect Reality
The parallels with the 2015 Volkswagen emissions scandal are particularly instructive. In that case, Volkswagen had engineered its diesel engines to recognize when they were being tested, activating additional emissions controls only during test conditions. In real-world driving, these vehicles emitted up to 40 times more nitrogen oxide pollutants than shown in test results.
Similar risks exist in AI benchmarking. Models could potentially be optimized specifically for known benchmark tasks without developing genuine, generalizable capabilities. The high-compute setting used for o3's best performance on ARC-AGI, costing thousands of dollars per challenge according to co-creator François Chollet, raises questions about real-world applicability and whether such results reflect practical usage scenarios.
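To make that "teaching to the test" risk concrete, here is a minimal sketch, assuming a hypothetical model callable and toy task data, of how an outside auditor might probe for benchmark overfitting: score the model on the published items and on lightly reworded variants of the same items, and flag a large gap. This is illustrative only, not OpenAI's or ARC-AGI's actual evaluation code.

```python
# Minimal sketch: probe for benchmark overfitting by comparing accuracy on
# published tasks vs. lightly reworded variants of the same tasks.
# The `model` callable and the task data are hypothetical placeholders.

from typing import Callable, List, Tuple

def accuracy(model: Callable[[str], str], tasks: List[Tuple[str, str]]) -> float:
    """Fraction of tasks where the model's answer matches the expected one."""
    correct = sum(1 for prompt, expected in tasks if model(prompt).strip() == expected)
    return correct / len(tasks)

def overfitting_gap(model: Callable[[str], str],
                    published: List[Tuple[str, str]],
                    perturbed: List[Tuple[str, str]]) -> float:
    """A large positive gap suggests the model is tuned to the published items
    rather than to the underlying skill the benchmark is meant to measure."""
    return accuracy(model, published) - accuracy(model, perturbed)

if __name__ == "__main__":
    # Toy stand-in model that has "memorised" only the published phrasing.
    answers = {"What is 2 + 2?": "4"}
    model = lambda prompt: answers.get(prompt, "unknown")

    published = [("What is 2 + 2?", "4")]
    perturbed = [("Compute the sum of two and two.", "4")]  # same skill, new wording

    print(f"gap: {overfitting_gap(model, published, perturbed):+.2f}")  # gap: +1.00
```

A gap near zero does not prove genuine capability, but a large one is a strong hint that the published phrasing, rather than the underlying skill, is being rewarded.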
The Transparency Challenge
OpenAI's benchmarking claims come primarily from internal evaluations, a fact that warrants careful consideration. While impressive – showing o3 outperforming its predecessor o1 by 22.8 percentage points on SWE-Bench Verified and achieving a Codeforces rating of 2727 – these results await independent verification.
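One reason independent verification matters is simply statistical: a pass-rate measured on a few hundred tasks carries a margin of a few percentage points from sampling noise alone. Here is a back-of-the-envelope check using an illustrative 70% score and assuming a benchmark of roughly 500 tasks (about the size of SWE-Bench Verified); the numbers are assumptions, not OpenAI's reported figures.

```python
# Rough sanity check on how much a benchmark score can move by chance alone.
# The 70% score is an illustrative assumption; SWE-Bench Verified is on the
# order of 500 tasks.

import math

def binomial_ci(score: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a pass-rate on n tasks."""
    se = math.sqrt(score * (1 - score) / n_tasks)
    return score - z * se, score + z * se

lo, hi = binomial_ci(score=0.70, n_tasks=500)   # hypothetical 70% pass-rate
print(f"95% CI: {lo:.1%} to {hi:.1%}")          # roughly 66.0% to 74.0%
```

A 22.8-point jump sits far outside that band, but smaller vendor-reported deltas can easily fall inside it, which is one more reason replication on an independently held test set matters.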
Chollet himself has pointed out that o3 fails on "very easy tasks" in ARC-AGI, suggesting fundamental differences from human intelligence. He further notes that the upcoming successor benchmark might reduce o3's score to under 30%, even at high compute, while humans could still achieve over 95% without training.
The Risk of Benchmark-Driven Development
The current race in AI development creates strong incentives for companies to optimize their models specifically for benchmark performance. This approach risks producing systems that shine on leaderboards while underperforming on the messier, open-ended problems users actually face, and headline numbers achieved with compute budgets no real deployment would bear.
The Need for Independent Verification
To avoid repeating the mistakes of the automotive industry, the AI field needs genuinely independent verification: benchmark stewards without commercial ties to the labs they evaluate, third-party reproduction of headline results, and full disclosure of the settings and compute budgets behind every reported score.
Looking Forward: Beyond Benchmarks
While benchmarks remain crucial for measuring progress in AI development, the industry must develop more comprehensive evaluation methods: held-out test sets that are refreshed regularly, evaluations on realistic end-to-end tasks, and cost-adjusted reporting alongside raw scores. A minimal sketch of what a blind, auditable evaluation could look like follows below.
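As one concrete direction, here is a minimal sketch, under assumed names such as run_model and a toy task list, of a blind evaluation protocol: a neutral auditor holds the private test set, queries the model only through its public interface, and publishes the aggregate score together with a hash of the dataset so the run can be audited later. This is a hypothetical illustration, not any existing benchmark's procedure.

```python
# Minimal sketch of a "blind" evaluation protocol, assuming a neutral auditor
# holds a private test set and the model is reachable only through an API.
# All names (run_model, the toy tasks) are hypothetical placeholders.

import hashlib
import json
from typing import Callable, Dict, List

def dataset_fingerprint(tasks: List[Dict[str, str]]) -> str:
    """Hash of the private test set, published so the run can later be audited
    without revealing the tasks themselves in advance."""
    blob = json.dumps(tasks, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def blind_eval(run_model: Callable[[str], str], tasks: List[Dict[str, str]]) -> Dict[str, object]:
    correct = sum(1 for t in tasks if run_model(t["prompt"]).strip() == t["answer"])
    return {
        "score": correct / len(tasks),
        "n_tasks": len(tasks),
        "dataset_sha256": dataset_fingerprint(tasks),  # published alongside the score
    }

if __name__ == "__main__":
    tasks = [{"prompt": "What is 2 + 2?", "answer": "4"},
             {"prompt": "Name the chemical formula for water.", "answer": "H2O"}]
    toy_model = lambda prompt: "4" if "2 + 2" in prompt else "H2O"
    print(blind_eval(toy_model, tasks))
```

Publishing the fingerprint rather than the tasks keeps the set out of future training data while still making the evaluation reproducible for anyone the auditor later shares it with.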
The Role of Regulatory Oversight
The parallels with the automotive industry suggest that some form of regulatory oversight might become necessary. Just as emissions testing eventually required government supervision, AI benchmarking might need similar oversight to ensure accuracy and prevent manipulation.
This becomes particularly relevant given OpenAI's CEO Sam Altman's recent statements about preferring a federal testing framework before releasing new reasoning models. Such oversight could help prevent the kind of benchmark manipulation that led to the Volkswagen scandal.
Conclusion
The race to develop advanced AI models, exemplified by OpenAI's o3 announcement, highlights both the importance and limitations of current benchmarking practices. While benchmarks provide valuable metrics for progress, the industry must learn from historical lessons like the Volkswagen emissions scandal to ensure these measurements truly reflect real-world capabilities.
The solution lies not in abandoning benchmarks but in developing more robust, transparent, and independent testing frameworks. As AI capabilities continue to advance, the integrity of performance measurement becomes increasingly crucial for maintaining public trust and ensuring genuine progress toward beneficial AI development.
As the field moves forward, the focus should shift from achieving impressive benchmark scores to demonstrating consistent, reliable, and verifiable performance across a wide range of real-world applications. Only then can we be confident that claimed advances in AI capabilities represent genuine progress rather than merely optimized test performance.
Follow Untangling AI for more
This is episode No. 199 of my LinkedIn newsletter, A guy with a scarf.
Subscribe here: https://lnkd.in/ddmvMF-Q
I help tech companies hire tech talent
Intriguing points, Carlo. How's Cleeng adapting?
Empowering Strategic Content Diversification | Increase Media Company Profits by 10% | Saas: All in One viewer engagement toolbox & White-Label Social Broadcast platform | Frictionless Collaboration Media & Creators
Fair point Carlo, good that you bring this up. But fair not to make it a German thing, it was quite international. Why does this matter? Similarly, this may be expected not to be an OpenAI-only thing. We have seen a few strange moves by Mr. Musk and Google in the last year. A good standard is a good start to make this transparent across all companies in this market. It would be good if there were good criteria to test against and full disclosure of all test data. Not sure, though, what would be the logical organisation to own this. ISO? UN? WTO? ... The infamous emissions scandal, often referred to as "Dieselgate," initially involved Volkswagen (VW), a German automaker. However, the scandal's impact extended beyond German brands, with investigations uncovering irregularities among other automakers globally. Here's a breakdown:
German brands: Volkswagen, Audi, Porsche, BMW and Mercedes-Benz
Italian brands: Fiat Chrysler Automobiles (FCA, Jeep, Ram)
American brands: General Motors, Ford
Asian brands: Mitsubishi, Hyundai and Kia
French brands: Renault and PSA Group (Peugeot-Citroën)
Advisor. 35+ years in sports & media tech. "A guy with a scarf" Public speaker. C-suite, strategy, product, innovation, OTT, digital, B2B/D2C marketing, AI/ML.
Subscribe to my newsletter A guy with a scarf: https://www.dhirubhai.net/newsletters/a-guy-with-a-scarf-6998145822441775104/