AI Benchmarking and German Cars

I just watched Sam Altman present episode 12 of OpenAI's "shipmas", introducing the new o3 model and focusing heavily on benchmarks (what else?).

Even before AI, I always had a thing for trying to understand, beyond the benchmarks, whether innovations in speed or power really improved my (or our) actual experience with software and hardware. Even with AI, my real daily heavy use of a model sometimes tells a different story from what the benchmarks say. That's ok. My experience is one case in a billion.

But... this also immediately brought to mind the now famous German car emissions scandal (I've seen great documentaries on it), where cars were basically programmed to perform well in tests and much worse on the streets.

Also, the fact that OpenAI is "collaborating" with the testing agencies does not sit well with me, even if I understand that in truly hyper-advanced fields it is tough to find independent experts at the same level, and who the f"£$% would pay them to do it?

Hence I thought about expanding on this in the article below.


The AI Benchmarking Race: Learning from History's Testing Failures

In the increasingly competitive landscape of artificial intelligence development, benchmarks have become the gold standard for measuring progress. However, recent developments in the industry, particularly OpenAI's latest announcements during their "shipmas" event, raise important questions about the transparency and reliability of AI benchmarking practices. The situation bears striking parallels to the 2015 Volkswagen emissions scandal, offering crucial lessons about the risks of benchmark-focused development.

The Race to AGI: Benchmarks as the New Battlefield

OpenAI's recent announcement of their o3 model family marks a significant milestone in the company's pursuit of Artificial General Intelligence (AGI). With claimed scores of 87.5% on the ARC-AGI benchmark under high-compute settings and remarkable performances across other tests, including a 96.7% score on the 2024 American Invitational Mathematics Exam, the company suggests they're approaching their definition of AGI – "highly autonomous systems that outperform humans at most economically valuable work."

However, these impressive numbers deserve closer scrutiny. The announcement reveals that OpenAI is partnering with the foundation behind ARC-AGI to help build the next generation of the benchmark. This collaboration between a major AI company and a benchmark provider raises questions about potential conflicts of interest, reminiscent of how automotive companies once maintained close relationships with emissions testing bodies.

The Volkswagen Parallel: When Tests Don't Reflect Reality

The parallels with the 2015 Volkswagen emissions scandal are particularly instructive. In that case, Volkswagen had engineered its diesel engines to recognize when they were being tested, activating additional emissions controls only during test conditions. In real-world driving, these vehicles emitted up to 40 times more nitrogen oxide pollutants than shown in test results.

Similar risks exist in AI benchmarking. Models could potentially be optimized specifically for known benchmark tasks without developing genuine, generalizable capabilities. The high-compute setting used for o3's best performance on ARC-AGI, costing thousands of dollars per challenge according to co-creator François Chollet, raises questions about real-world applicability and whether such results reflect practical usage scenarios.
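
To make the "teaching to the test" worry concrete: one simple signal evaluators sometimes look for is textual overlap between benchmark items and a model's training corpus. The sketch below is a minimal, hypothetical Python illustration of that idea; the toy data, the 8-gram criterion, and the function names are assumptions, not any lab's actual contamination audit.

```python
# Minimal, hypothetical sketch of a benchmark-contamination check: flag
# benchmark items whose text shares a long word n-gram with a training corpus.
# The 8-gram criterion and the toy data are assumptions; real audits
# (deduplication, fuzzy matching, partial-overlap scoring) are far more involved.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of lowercased word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(benchmark_items: list, training_docs: list, n: int = 8) -> list:
    """Indices of benchmark items that share at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & corpus_grams]

if __name__ == "__main__":
    training_docs = [
        "the quick brown fox jumps over the lazy dog near the old river bank",
    ]
    benchmark_items = [
        "the quick brown fox jumps over the lazy dog near the old river bank",
        "rotate the grid ninety degrees and recolor the largest connected shape",
    ]
    print(flag_contaminated(benchmark_items, training_docs))  # -> [0]
```

Even this toy check shows why training-data transparency matters: without access to the corpus, outsiders simply cannot run it.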

The Transparency Challenge

OpenAI's benchmarking claims come primarily from internal evaluations, a fact that warrants careful consideration. While impressive – showing o3 outperforming its predecessor o1 by 22.8 percentage points on SWE-Bench Verified and achieving a Codeforces rating of 2727 – these results await independent verification.

Chollet himself has pointed out that o3 fails on "very easy tasks" in ARC-AGI, suggesting fundamental differences from human intelligence. He further notes that the upcoming successor benchmark might reduce o3's score to under 30%, even at high compute, while humans could still achieve over 95% without training.

The Risk of Benchmark-Driven Development

The current race in AI development creates strong incentives for companies to optimize their models specifically for benchmark performance. This approach could lead to:

  1. Narrow Optimization: Models that excel at specific benchmarks while lacking robust, general-purpose capabilities
  2. Resource Misallocation: Development efforts focused on benchmark performance rather than practical utility
  3. Misleading Progress Indicators: Benchmark scores that don't accurately reflect real-world performance or capabilities

The Need for Independent Verification

To avoid repeating the mistakes of the automotive industry, the AI field needs:

  • Truly independent benchmark development and testing organizations (a toy verification sketch follows this list)
  • Clear separation between model developers and benchmark creators
  • Transparent reporting of test conditions and limitations
  • Regular updates to benchmarks to prevent optimization exploitation
  • Real-world performance metrics alongside controlled test results
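
As a toy illustration of that first bullet: an independent verifier could rerun a model on a private, held-out split and check whether the vendor's self-reported score sits outside the statistical range of the independent measurement. The numbers, the 400-item split size, and the normal-approximation interval below are illustrative assumptions only, not a reference to any actual result.

```python
import math

# Minimal sketch: does a self-reported benchmark score sit outside the
# uncertainty range of an independent rerun on a private, held-out split?
# All numbers below are made up for illustration.

def accuracy_interval(score_pct: float, n_items: int, z: float = 1.96) -> tuple:
    """Rough 95% interval (normal approximation) for an accuracy measured on n_items."""
    p = score_pct / 100.0
    half = z * math.sqrt(p * (1.0 - p) / n_items) * 100.0
    return score_pct - half, score_pct + half

def gap_is_suspicious(reported_pct: float, independent_pct: float, n_items: int) -> bool:
    """True if the reported score lies above the interval of the independent run."""
    _, high = accuracy_interval(independent_pct, n_items)
    return reported_pct > high

if __name__ == "__main__":
    # Hypothetical: a vendor reports 87.5% on the public set; an independent
    # lab measures 71.0% on a 400-item private split.
    print(accuracy_interval(71.0, 400))        # roughly (66.6, 75.4)
    print(gap_is_suspicious(87.5, 71.0, 400))  # -> True: worth a closer look
```

A gap like that does not prove gaming, but it is exactly the kind of discrepancy that independent rerunning, rather than trusting self-reported numbers, can surface.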

Looking Forward: Beyond Benchmarks

While benchmarks remain crucial for measuring progress in AI development, the industry must develop more comprehensive evaluation methods. This could include:

  • Long-term performance monitoring in real-world applications
  • Adversarial testing by independent researchers
  • Standardized reporting of resource requirements and limitations
  • Open-source benchmarks with community oversight
  • Regular rotation of test cases to prevent overoptimization (a minimal rotation sketch follows this list)
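
Rotation, in particular, need not be complicated: each evaluation round can draw a fresh but reproducible subset from a larger private pool of tasks, so tuning against last round's items buys little. A minimal sketch, assuming a generic pool of task IDs; the pool, round labels, and sample size are placeholders.

```python
import hashlib
import random

# Minimal sketch of test-case rotation: every evaluation round samples a
# different, reproducible subset of a larger private item pool.
# Pool contents, round labels, and the sample size are placeholders.

def rotated_subset(pool: list, round_label: str, k: int) -> list:
    """Deterministically sample k items for a given evaluation round."""
    seed = int(hashlib.sha256(round_label.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    return sorted(rng.sample(pool, k))

if __name__ == "__main__":
    pool = [f"task-{i:04d}" for i in range(1000)]   # private item pool
    print(rotated_subset(pool, "2025-Q1", 5))
    print(rotated_subset(pool, "2025-Q2", 5))       # different round, different items
```

Because the seed is derived from the round label, any auditor can reproduce exactly which items were used, while model developers cannot predict the next round's subset.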

The Role of Regulatory Oversight

The parallels with the automotive industry suggest that some form of regulatory oversight might become necessary. Just as emissions testing eventually required government supervision, AI benchmarking might need similar oversight to ensure accuracy and prevent manipulation.

This becomes particularly relevant given OpenAI's CEO Sam Altman's recent statements about preferring a federal testing framework before releasing new reasoning models. Such oversight could help prevent the kind of benchmark manipulation that led to the Volkswagen scandal.

Conclusion

The race to develop advanced AI models, exemplified by OpenAI's o3 announcement, highlights both the importance and limitations of current benchmarking practices. While benchmarks provide valuable metrics for progress, the industry must learn from historical lessons like the Volkswagen emissions scandal to ensure these measurements truly reflect real-world capabilities.

The solution lies not in abandoning benchmarks but in developing more robust, transparent, and independent testing frameworks. As AI capabilities continue to advance, the integrity of performance measurement becomes increasingly crucial for maintaining public trust and ensuring genuine progress toward beneficial AI development.

As the field moves forward, the focus should shift from achieving impressive benchmark scores to demonstrating consistent, reliable, and verifiable performance across a wide range of real-world applications. Only then can we be confident that claimed advances in AI capabilities represent genuine progress rather than merely optimized test performance.


Follow Untangling AI for more


This is episode No. 199 of my LinkedIn newsletter, A guy with a scarf.

Subscribe here: https://lnkd.in/ddmvMF-Q


A guy with a scarf

LinkedIn newsletter: subscribe here

YouTube channel: watch here

Podcast: listen here


Yuriy Demedyuk

I help tech companies hire tech talent

2 months ago

Intriguing points, Carlo. How's Cleeng adapting?

Frank van Oirschot

Empowering Strategic Content Diversification | Increase Media Company Profits by 10% | Saas: All in One viewer engagement toolbox & White-Label Social Broadcast platform | Frictionless Collaboration Media & Creators

2 months ago

Fair point, Carlo, good that you bring this up. But to be fair, it wasn't only a German thing; it was quite international. Why does this matter? Similarly, this may be expected not to be an OpenAI-only thing. We have seen a few strange moves by Mr. Musk and Google in the last year. A good standard is a good start to make this transparent across all companies in this market. It would be good if there were clear criteria to test against and full disclosure of all test data. Not sure though which organisation should logically own this. ISO? UN? WTO? ... The infamous emissions scandal, often referred to as "Dieselgate," initially involved Volkswagen (VW), a German automaker. However, its impact extended beyond German brands, with investigations uncovering irregularities among other automakers globally. Here's a breakdown: German brands: Volkswagen, Audi, Porsche, BMW and Mercedes-Benz; Italian brands: Fiat Chrysler Automobiles (FCA, Jeep, Ram); American brands: General Motors, Ford; Asian brands: Mitsubishi, Hyundai and Kia; French brands: Renault and PSA Group (Peugeot-Citroën).

