The bragging about foggy AI

How do we resolve the cognitive dissonance around AI? We know it is awesome and can solve all of humanity's problems. But we also know it doesn't work and is far from production-ready.


Here is an example. I came across two articles this week which I hope will demonstrate my point.


The first was an article, 'This Startup Is Trying to Test How Well AI Models Actually Work'. The second was a blog post, 'The Art of Product Management in the Fog of AI' by Tomasz Tunguz.


The first article talks about a startup, vals.ai, which 'is working to build a third-party review system for vetting the performance of AI in areas like accounting, law and finance.'


(As an aside, I note that this company was started by two guys who dropped out of a master's program at Stanford University.

I always wonder why the story has to include this piece of information. Is it the fact that they dropped out of school? Do you have to drop out of Stanford as a prerequisite to starting a company? I am sure there is enough material there for a separate post.)


Back to the topic.


These two gentlemen identified a need for a transparent benchmark to evaluate Large Language Models (LLMs) against standard criteria within a particular context. To begin, they chose three areas - Legal, Tax and Finance. To test, they chose 15 models, ranging from famous proprietary ones to open-source and lesser-known ones. Then they started sending the questions and ranking the answers.


Demonstrating their (unfinished) engineering background, they included in the stats things like the speed of the answer and the cost associated with the answer. Completely - at this stage - irrelevant metrics. The only number which matters is accuracy. For the legal reasoning tasks, the highest accuracy was 77%. For legal contract-related questions it was 74%. For tax-related questions it was 55%. And for corporate finance, 65%. And to make it clear again - these were the best results.
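To make the point concrete, here is a minimal sketch of what such a benchmark reduces to. The model names and numbers below are illustrative placeholders, not vals.ai's actual data or methodology: for each model you tally graded answers, and the only column that decides production-readiness is the fraction answered correctly - every wrong answer is one a human expert still has to catch.

```python
# Hypothetical benchmark tallies: (model, correct answers, total questions).
# Figures are illustrative only; they echo the accuracy range quoted in the article.
results = [
    ("model-a", 77, 100),
    ("model-b", 74, 100),
    ("model-c", 55, 100),
]

def accuracy(correct: int, total: int) -> float:
    """Fraction of benchmark questions the model answered correctly."""
    return correct / total

for model, correct, total in results:
    # Speed and cost per answer could be recorded too, but they tell you
    # nothing about how many answers a lawyer must still re-check.
    print(f"{model}: {accuracy(correct, total):.0%} accurate, "
          f"{total - correct} answers still need expert review")
```

Even the best hypothetical model above leaves roughly a quarter of its answers for a human expert to verify, which is the article's underlying complaint.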


From Tomasz Tunguz's blog post, I will quote this: 'How does one design a product experience in the fog of AI? The answer lies in embracing the unpredictable nature of AI'.


Does the above raise a question in your mind? Something like 'Is this AI thing just another hype cycle which will end in disaster?' Sadly, the whole AI industry is trying to achieve just that. And statements like 'AI Intelligence Will Be Smarter Than Some Humans Within a Year' from Mr. Musk are not helping.


The aforementioned startup is trying to achieve an impossible task. How can you measure or benchmark something where the number of possible questions is very large and, in order to evaluate the accuracy of the answers, you need a human expert (we call them lawyers) to approve them? And even then, I am sure there will be another human expert (another lawyer) who will argue that the answer is not that accurate. If you suggest this can't happen, you have obviously never been part of a contract dispute negotiation. From my experience, I can also add that what is not in the contract is just as important as what is in it. Try asking AI what's missing ...


True, there will be arguments that the models will improve: get better, faster, more accurate. And I am sure you have already seen articles showing that AI can pass the bar exam. That is not a drinking game but the test aspiring lawyers must pass to practice law in their respective jurisdictions. Sounds amazing, until you read that the AI got just 76% of the answers correct.


And this is the problem. We expect 100% accuracy from technology. For a straight question, a straight answer. For the same question, the same answer. Bragging about 76% accuracy is just that - bragging. Trying to use the technology in production, where quality and accuracy are paramount, while being told that we have to embrace 'the unpredictable nature of AI', will only result in questioning everything that comes out.


To make AI a helpful, recurring pattern, we should be more explicit about what it is and what it is not. So far, it is much less than what we are being told.
