How GPT-4 Fails to Measure Up in 2023

How GPT-4 Fails to Measure Up in 2023

Hey Everyone,

Recently I’ve been on a spree of writing about OpenAI. Today we’re going to continue that trend with a shorter guest contribution.



?? From our sponsor: ??

Start Your AI Journey With an All-In-One Reading and Writing Assistant

Wordtune is an AI writing assistant that helps users create compelling content. With advanced capabilities, Wordtune understands the context of your writing, providing intelligent suggestions for improvement. Whether it's enhancing sentence structure or expanding your vocabulary, Wordtune offers real-time assistance, saving you time and effort.




The Turing test, originally called the “imitation game” by Alan Turing in 1950 (nearly 74 years ago), is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human being.

So how do you suppose OpenAI’s GPT-4 measures up?


Jurgen Gravestein is one of those writers I follow for his extremely well articled and clear takes on Generative A.I. trends combined with excellent topic choice. His Newsletter name also easily resonates. I was already highly impressed by the topic choice of this piece.


A window to the world of conversational AI. Join 1,700+ consultants, creatives, and founders who also subscribed to this newsletter.

“Teaching Computers How to Talk”


Subscribe to TChtT

When will GPT-5 Arrive?

It’s estimated GPT-5 has a 62% chance of being released in 2024. This means we somewhat unlikely to see it at the start of next year. OpenAI may require more funding before it is released.

But is the world ready for it?

If you want to help me continue to write, you can also check out my paid subscription here.


By Jurgen Gravestein - December, 2023


GPT-4 Can’t Pass the Turing Test

The Turing test, originally called the imitation game, was a thought experiment proposed by Alan Turing in 1950 to explore the concept of machine intelligence. In a recent study, two UC San Diego researchers tested GPT-4, GPT-3.5, and 1960’s ELIZA in a Turing test of their own, to see which program best mimics human conversation. Participants had to guess if they were interacting with a human or an AI.

The study compared the performance of several AI models: OpenAI's GPT-4, its predecessor GPT-3.5, and the historical ELIZA program (see image below). The experiment involved 652 participants who completed a total of 1,810 sessions in which they had to guess whether they had interacted with a machine or with another human being.

The ELIZA program, despite its rudimentary rule-based system, scored higher (27%) compared to GPT-3.5 (14%). GPT-4 outperformed ELIZA (41%), but effectively scored worse than a coin flip, meaning people were more often able to tell the difference than not.

Maybe the most surprising result of all was that humans were correctly identified as humans in 63% of all interactions. I suppose that can be considered a win for the computers, showing that it certainly has become harder to tell the difference due to the impressive linguistic display of modern day programs.

Let’s take a closer look

The study evaluated the AIs through a public platform. Participants, recruited online, were randomly matched with a human or an AI. The conversation interface was similar to any regular messaging app, with a 5-minute time limit and a 300-character limit per message. After the conversation, the participants had to decide whether they just interacted with a human or AI, rate their confidence, and optionally provide reasoning.?

Besides the Eliza system, both GPT-3.5 and GPT-4 were put to the test, each with different settings and system prompts. Here’s an example of one of the prompts used:

System prompts can be used to make the AI behave more human-like, follow instructions and/or adopt a certain style. The prompts used by the researchers varied in several aspects, including whether the AI should make spelling, grammar, and capitalization errors; the personality it should adopt; and whether it should act as a human or as an AI pretending to be human. Some prompts turned out to be more effective (or deceptive) than others.?

Here’s an overview of the success rate of the top performing prompts:

What stands out is that the best-performing GPT-4 prompt only achieves a success rate of 41%. This is higher than the performances of GPT-3.5 and ELIZA, but falls short of the human baseline of 63%.?

With GPT-4 scoring below the 50% threshold, it performed worse than flipping a coin. It’s safe to say that people weren’t merely guessing, but successfully able to tell the difference more often than not. The researchers conclude that based on their findings GPT-4 does not pass the Turing Test.?

Language ≠ intelligence

A common misconception is that the Turing test is a test of intelligence — it’s not. Any machine that can engage in coherent and seemingly thoughtful dialogue can fool us into believing it is smart. This phenomenon is better known as the ELIZA-effect, and the fact that the rudimentary ELIZA chatbot from 1960 outperformed OpenAI’s 3.5 model in this 21st century study serves as a testament to that.

All the Turing test does is question if a machine can mimic human conversation to the extent that it becomes impossible for us to tell the difference. As a matter of fact, the study confirms this assumption, because when asked about it the participants claim they based their judgments mainly on the style of responses, not perceived intelligence.

While the GPT-4’s ability to imitate human conversation can be seen as a major technical achievement, it does not imply understanding or consciousness on the part of the AI. The Turing test, therefore, must be viewed not as a barometer of AI intelligence but of linguistic competence.

Or, as I explained in a previous article:

“These systems don’t learn from first principles and experience, like us, but by crunching as much human-generated content as possible. (…) What we end up with is not human-level intelligence, but a form of machine intelligence that appears human-like.”

GPT-5 will pass the Turing test

Either way, machines will pass the Turing test sooner rather than later. GPT 5, without a shadow of a doubt, will be on par with humans in linguistic fluency — and honestly, I’m not looking forward to that moment.?

As the line between human and machine blurs, so does our ability to navigate the digital world with certainty. If every digital interaction can be faked, we’ll be forced to engage in a guessing game of sorts, a never-ending Turing test, in which we have to ask ourselves on a daily basis whether we are talking to a machine or not: every email, chat conversation, phone call, social media post or news article could be generated with AI without us knowing.?

Many will deem these linguistically fluent machines intelligent. Skeptics will argue that fluency doesn’t equate to intelligence and they are right. The essence of intelligence, human or artificial, lies in the depth of the comprehension and the ability to contextualize, not in beating the imitation game. Champions of the technology will claim an early victory anyway.

Society, in the meantime, will have to adapt to the new status quo. I don’t know how yet, but my hope is that we will, because that’s what we do. Humans are, after all, the single most adaptive species on Earth.

A shorter version of this article was previously published here.



“GPT 5, without a shadow of a doubt, will be on par with humans in linguistic fluency — and honestly, I’m not looking forward to that moment.”?- Jurgen Gravestein


?? Further Reading

He already wrote for our Newsletter back in August, 2023 on the topic of artificial general intelligence.

AGI

Why You Should be Skeptical of AGI

Michael Spencer and Jurgen Gravestein

·

Aug 31

This is third installment in our series on AGI. Today I invite Jurgen Gravestein, who is a writer, consultant, and conversation designer. He was employee no. 1 at Conversation Design Institute and now works for the strategy and delivery branch CDI Services helping companies drive more business value with conversational AI. His newsletter

Read full story


Alex Carey

AI Speaker & Consultant | Helping Organizations Navigate the AI Revolution | Generated $50M+ Revenue | Talks about #AI #ChatGPT #B2B #Marketing #Outbound

11 个月

Interesting study! It highlights the challenges and limitations of AI models in achieving human-like conversation.

Udo Kiel

????Vom Arbeitswissenschaftler zum Wissenschaftskommunikator: Gemeinsam für eine sichtbarere Forschungswelt

11 个月

This study raises important questions about the advancement of AI and the challenges it still faces. ??

Ariel Zwolinski

IT Enthusiast ?? Maker of Things ??Business ??Technology ??Leading with Trust??Behavioral Economics ?? Organizational psychology??Software development ??Artificial Intelligence ?? Electric Vehicles

11 个月

I’ve seen recently that some tool or model did beat humans at captcha accuracy, I’ll post it here if I find what it was

Mohammadamin ForghaniAllahabadi

Humanitarian Specialist / Social Activitist / Non-Profit volunteer / Charity worker / Environment Lover

11 个月

Thanks for sharing

Michael Spencer

A.I. Writer, researcher and curator - full-time Newsletter publication manager.

11 个月

Subscribe to the author's Newsletter: Teaching Computers How to Talk: https://jurgengravestein.substack.com/subscribe

要查看或添加评论,请登录

Michael Spencer的更多文章

  • Guide to NotebookLM

    Guide to NotebookLM

    Google's AI tools are starting to get interesting. What is Google Learn about? Google's new AI tool, Learn About, is…

    3 条评论
  • The Genius of China's Open-Source Models

    The Genius of China's Open-Source Models

    Why would an obscure Open-weight LLM out of China be worth watching? Just wait to see what happens in 2025. ?? In…

    7 条评论
  • First Citizen of the AI State: Elon Musk

    First Citizen of the AI State: Elon Musk

    Thank to our Sponsor of today's article. ?? In partnership with Encord ?? Manage, curate and annotate multimodal AI…

    14 条评论
  • The Future of Search Upended - ChatGPT Search

    The Future of Search Upended - ChatGPT Search

    Hey Everyone, I’ve been waiting for this moment for many many months. Upgrade to Premium (?—??For a limited time get a…

    8 条评论
  • Can India become a Leader in AI?

    Can India become a Leader in AI?

    Hey Everyone, As some of you may know, readers of Newsletters continue to have more and more readers from South Asia…

    8 条评论
  • NotebookLM gets a Meta Llama Clone

    NotebookLM gets a Meta Llama Clone

    “When everyone digs for gold, sell shovels”. - Jensen Huang Apple Intelligence is late and other phone makers are…

    7 条评论
  • Top Semiconductor Infographics and Newsletters

    Top Semiconductor Infographics and Newsletters

    TSMC is expanding globally and driving new levels of efficiency. Image from the LinkedIn post here by Claus Aasholm.

    2 条评论
  • Anthropic Unveils Computer Use but where will it lead?

    Anthropic Unveils Computer Use but where will it lead?

    Hey Everyone, This could be an important announcement, whereas the last two years (2022-2024) LLMs have showed us an…

    10 条评论
  • Why Tesla is not an AI Company

    Why Tesla is not an AI Company

    Hello Everyone, We have enough data now to surmise that Tesla won't be a robotaxi or robot winner. Elon Musk has helped…

    11 条评论
  • The State of Robotics 2024

    The State of Robotics 2024

    This is a guest post by Diana Wolf Torres - please subscribe to her Deep Learning Daily Newsletter on LinkedIn if you…

    4 条评论

社区洞察

其他会员也浏览了