AI’s ongoing mismeasure
Hi, I’m Azeem Azhar. In this week’s edition, we explore the problem of mismeasuring AI and what to do about it.
Thanks to our sponsor: Sana, your new AI assistant for work.
New York Times columnist Kevin Roose makes an interesting case that we don’t actually know how smart AI is because AI developers aren’t required to submit their products for testing before release. They simply pick and choose which information they make public.
We analysed the issue of measuring AI back in March 2023, when my colleague Nathan Warren wrote:
existing benchmarks and evaluation techniques for AI contain numerous flaws that have been exacerbated with the rise of LLMs [...] Large language models are rapidly advancing in proficiency across various tasks, leading to the accelerated achievement of near-peak performance (often around 90%) on established benchmarks. The short lives of these benchmarks make it difficult for us to know exactly where we stand in the field of AI. It’s hard to know exactly where we are heading when the goalposts keep shifting.
There are many factors at play. Most importantly, we lack a strong definition of intelligence, let alone a precise objective function for it. Even if we had one, we lack the tools to measure how well different AI models perform: benchmarks become obsolete very quickly and don’t properly capture AI capabilities anyway, and there is no outside authority that systematically tests all of these models. This gap in our knowledge is a problem, particularly for governance; we need ways of making AI’s capabilities and failings legible and accountable.
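To make the benchmark-saturation point concrete, here is a minimal sketch in Python. The benchmark names and scores are invented for illustration: once every model sits above the roughly 90% ceiling, the spread between them is too small to tell us much.

```python
# Minimal sketch: why near-ceiling benchmarks stop discriminating between models.
# All benchmark names and scores are hypothetical illustrations, not real results.

SATURATION_CEILING = 0.90  # the "near-peak performance" level mentioned above

scores = {
    "hypothetical-qa-benchmark": {"model_a": 0.91, "model_b": 0.93},
    "hypothetical-reasoning-benchmark": {"model_a": 0.55, "model_b": 0.71},
}

for benchmark, results in scores.items():
    spread = max(results.values()) - min(results.values())
    saturated = min(results.values()) >= SATURATION_CEILING
    verdict = "saturated: small spread, little signal" if saturated else "still informative"
    print(f"{benchmark}: spread={spread:.2f} -> {verdict}")
```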
That’s not to say no one has tried. Governments have attempted to make AI capabilities legible by developing compute thresholds. For example, the November 2023 US Executive Order requires disclosures for models trained on more than 10^26 FLOP. The threshold acts as a proxy for a model’s ability to do harm. However, as Dean Ball points out, compute limits are not particularly future-proof: newer LLMs can already match GPT-4 performance with an order of magnitude less compute, and new architectures could deliver very powerful capabilities even more parsimoniously.
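As a rough illustration of how such a threshold works in practice, here is a sketch using the common approximation that training compute is about 6 × parameters × training tokens. The model sizes and token counts below are hypothetical assumptions, chosen only to show that a capable, efficiently trained model can fall well below the reporting line.

```python
# Sketch of the compute-threshold logic described above, using the common heuristic
# that dense-transformer training compute ~ 6 * parameters * training tokens.
# The model sizes and token counts are illustrative assumptions, not disclosed figures.

EO_THRESHOLD_FLOP = 1e26  # reporting threshold in the November 2023 US Executive Order

def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOP for a dense transformer."""
    return 6 * n_params * n_tokens

hypothetical_models = {
    "large-dense-model": (1.0e12, 2.0e13),      # 1T parameters, 20T tokens
    "efficient-small-model": (7.0e10, 2.0e12),  # 70B parameters, 2T tokens
}

for name, (params, tokens) in hypothetical_models.items():
    flop = estimate_training_flop(params, tokens)
    status = "covered by" if flop >= EO_THRESHOLD_FLOP else "below"
    print(f"{name}: ~{flop:.2e} FLOP -> {status} the reporting threshold")
```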
Words matter for how AI is made legible, too. Matthijs M. Maas argues that the metaphors we choose when talking about AI carry “regulatory narratives”. For example, framing AI as a “field of science” emphasises the need for transparency, knowledge-sharing and scientific rigour, while framing it as an “IT technology” implies business as usual and conventional IT-sector regulation.
So where do we go from here? First, we need a standardised method of measuring AI that accurately represents model capability. This isn’t a quantification of intelligence (or a single objective function), but rather a set of capabilities we can measure. Second, we need to encourage a thoughtful approach to the vocabulary used to describe these systems, and to the politics those words carry.
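As a sketch of what reporting a set of capabilities, rather than a single number, might look like in code. The capability names, scores and version labels are invented for illustration.

```python
from dataclasses import dataclass

# Sketch of a capability scorecard: capabilities are reported individually and tied to a
# versioned evaluation, rather than collapsed into one "intelligence" score.

@dataclass
class CapabilityResult:
    capability: str    # e.g. "long-document summarisation"
    score: float       # 0-1 on a fixed, versioned evaluation
    eval_version: str  # keeps results comparable as benchmarks evolve

scorecard = [
    CapabilityResult("long-document summarisation", 0.82, "v1.3"),
    CapabilityResult("multi-step arithmetic", 0.61, "v2.0"),
    CapabilityResult("harmful-content refusal", 0.95, "v1.1"),
]

for result in scorecard:
    print(f"{result.capability} ({result.eval_version}): {result.score:.2f}")
```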
Today’s edition is supported by Sana.
Sana AI is a knowledge assistant that helps you work faster and smarter.
You can use it for everything from analysing documents and drafting reports to finding information and automating repetitive tasks.
Integrated with your apps, capable of understanding meetings and completing tasks in other tools, Sana AI won’t just change the way you access knowledge. It’ll change the way you work.