AI’s ongoing mismeasure
Exponential View via DALL-E 3


Hi, I’m Azeem Azhar. In this week’s edition, we explore the problem of mismeasuring AI and what to do about it.


Thanks to our sponsor: Sana, your new AI assistant for work.


New York Times columnist Kevin Roose makes an interesting case that we don’t actually know how smart AI is because AI developers aren’t required to submit their products for testing before release. They simply pick and choose which information they make public.

We analysed the issue of measuring AI back in March 2023 when my colleague Nathan Warren wrote:

existing benchmarks and evaluation techniques for AI contain numerous flaws that have been exacerbated with the rise of LLMs [...] Large language models are rapidly advancing in proficiency across various tasks, leading to the accelerated achievement of near-peak performance (often around 90%) on established benchmarks. The short lives of these benchmarks make it difficult for us to know exactly where we stand in the field of AI. It’s hard to know exactly where we are heading when the goalposts keep shifting.

There are many factors at play. Most importantly, we lack a strong definition of intelligence, let alone a precise objective function for it. Even if we had one, we lack the tools to measure how well different AI models perform: benchmarks become obsolete very fast and don’t properly capture AI capabilities anyway, and there’s no outside authority to systematically test all of these models. Our poor understanding of AI is problematic, particularly when it comes to governance. We need to find ways of making AI’s capabilities and failings legible and accountable.

That’s not to say governments haven’t tried. They have attempted to make AI capabilities legible by setting compute thresholds. For example, the November 2023 US Executive Order requires disclosures for models trained on more than 10^26 FLOP. The threshold acts as a proxy for a model’s ability to do harm. However, as Dean Ball points out, compute limits are not particularly future-proof. Newer versions of LLMs can already achieve GPT-4 performance on an order of magnitude less compute, and new architectures could deliver very powerful capabilities even more parsimoniously.
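To make the threshold concrete, here is a minimal sketch of how a model’s training compute might be compared against the 10^26 FLOP line. It assumes the common rule of thumb that training a dense transformer takes roughly 6 × parameters × training tokens in FLOP; that approximation, the function names and the two example models are illustrative assumptions, not part of the Executive Order or of any developer’s disclosure.

```python
# Rough training-compute estimate: FLOP ~ 6 * parameters * training tokens.
# The 6ND rule of thumb is an assumption used for illustration only.
EO_THRESHOLD_FLOP = 1e26  # disclosure threshold cited in the text


def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute for a dense transformer."""
    return 6.0 * n_params * n_tokens


def must_disclose(n_params: float, n_tokens: float) -> bool:
    """True if the estimated training compute exceeds the threshold."""
    return estimate_training_flop(n_params, n_tokens) > EO_THRESHOLD_FLOP


# Hypothetical models, chosen only to show the arithmetic.
small_flop = estimate_training_flop(7e10, 2e12)   # 70B params, 2T tokens
large_flop = estimate_training_flop(1e12, 2e13)   # 1T params, 20T tokens

print(f"small: {small_flop:.1e} FLOP, disclose: {must_disclose(7e10, 2e12)}")
print(f"large: {large_flop:.1e} FLOP, disclose: {must_disclose(1e12, 2e13)}")
```

Under these assumptions the smaller model sits more than two orders of magnitude below the threshold. That gap is why efficiency gains matter: a model can become far more capable without ever crossing the line.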

The way AI is made legible also matters when it comes to words. Matthijs M. Maas argues that the metaphors we choose when talking about AI carry “regulatory narratives”. For example, AI as a “field of science” will emphasise the need for transparency, knowledge-sharing and scientific rigour. AI as an “IT technology” implies business as usual and conventional IT sector regulation.

So where do we go from here? First, we need a standardised method of measuring AI to accurately represent model capability. This isn’t a quantification of intelligence (or a single objective function), but rather a set of capabilities we can measure. Second, we need to encourage a thoughtful approach towards the vocabulary used to describe these systems, and the politics those words carry.


Today’s edition is supported by Sana.

Sana AI is a knowledge assistant that helps you work faster and smarter.

You can use it for everything from analysing documents and drafting reports to finding information and automating repetitive tasks.

Integrated with your apps, capable of understanding meetings and completing tasks in other tools, Sana AI won’t just change the way you access knowledge. It’ll change the way you work.


