How GPT-4 Fails to Measure Up in 2023
Michael Spencer
A.I. Writer, researcher and curator - full-time Newsletter publication manager.
Hey Everyone,
Recently I’ve been on a spree of writing about OpenAI. Today we’re going to continue that trend with a shorter guest contribution.
?? From our sponsor: ??
Start Your AI Journey With an All-In-One Reading and Writing Assistant
Wordtune is an AI writing assistant that helps users create compelling content. With advanced capabilities, Wordtune understands the context of your writing, providing intelligent suggestions for improvement. Whether it's enhancing sentence structure or expanding your vocabulary, Wordtune offers real-time assistance, saving you time and effort.
The Turing test, originally called the “imitation game” by Alan Turing in 1950 (nearly 74 years ago), is a test of a machine's ability to exhibit intelligent behaviour equivalent to, or indistinguishable from, that of a human being.
So how do you suppose OpenAI’s GPT-4 measures up?
Jurgen Gravestein is one of those writers I follow for his extremely well articled and clear takes on Generative A.I. trends combined with excellent topic choice. His Newsletter name also easily resonates. I was already highly impressed by the topic choice of this piece.
A window to the world of conversational AI. Join 1,700+ consultants, creatives, and founders who also subscribed to this newsletter.
“Teaching Computers How to Talk”
When will GPT-5 Arrive?
It’s estimated GPT-5 has a 62% chance of being released in 2024. This means we somewhat unlikely to see it at the start of next year. OpenAI may require more funding before it is released.
But is the world ready for it?
If you want to help me continue to write, you can also check out my paid subscription here.
By Jurgen Gravestein - December, 2023
GPT-4 Can’t Pass the Turing Test
The Turing test, originally called the imitation game, was a thought experiment proposed by Alan Turing in 1950 to explore the concept of machine intelligence. In a recent study, two UC San Diego researchers tested GPT-4, GPT-3.5, and 1960’s ELIZA in a Turing test of their own, to see which program best mimics human conversation. Participants had to guess if they were interacting with a human or an AI.
The study compared the performance of several AI models: OpenAI's GPT-4, its predecessor GPT-3.5, and the historical ELIZA program (see image below). The experiment involved 652 participants who completed a total of 1,810 sessions in which they had to guess whether they had interacted with a machine or with another human being.
The ELIZA program, despite its rudimentary rule-based system, scored higher (27%) compared to GPT-3.5 (14%). GPT-4 outperformed ELIZA (41%), but effectively scored worse than a coin flip, meaning people were more often able to tell the difference than not.
Maybe the most surprising result of all was that humans were correctly identified as humans in 63% of all interactions. I suppose that can be considered a win for the computers, showing that it certainly has become harder to tell the difference due to the impressive linguistic display of modern day programs.
领英推荐
Let’s take a closer look
The study evaluated the AIs through a public platform. Participants, recruited online, were randomly matched with a human or an AI. The conversation interface was similar to any regular messaging app, with a 5-minute time limit and a 300-character limit per message. After the conversation, the participants had to decide whether they just interacted with a human or AI, rate their confidence, and optionally provide reasoning.?
Besides the Eliza system, both GPT-3.5 and GPT-4 were put to the test, each with different settings and system prompts. Here’s an example of one of the prompts used:
System prompts can be used to make the AI behave more human-like, follow instructions and/or adopt a certain style. The prompts used by the researchers varied in several aspects, including whether the AI should make spelling, grammar, and capitalization errors; the personality it should adopt; and whether it should act as a human or as an AI pretending to be human. Some prompts turned out to be more effective (or deceptive) than others.?
Here’s an overview of the success rate of the top performing prompts:
What stands out is that the best-performing GPT-4 prompt only achieves a success rate of 41%. This is higher than the performances of GPT-3.5 and ELIZA, but falls short of the human baseline of 63%.?
With GPT-4 scoring below the 50% threshold, it performed worse than flipping a coin. It’s safe to say that people weren’t merely guessing, but successfully able to tell the difference more often than not. The researchers conclude that based on their findings GPT-4 does not pass the Turing Test.?
Language ≠ intelligence
A common misconception is that the Turing test is a test of intelligence — it’s not. Any machine that can engage in coherent and seemingly thoughtful dialogue can fool us into believing it is smart. This phenomenon is better known as the ELIZA-effect, and the fact that the rudimentary ELIZA chatbot from 1960 outperformed OpenAI’s 3.5 model in this 21st century study serves as a testament to that.
All the Turing test does is question if a machine can mimic human conversation to the extent that it becomes impossible for us to tell the difference. As a matter of fact, the study confirms this assumption, because when asked about it the participants claim they based their judgments mainly on the style of responses, not perceived intelligence.
While the GPT-4’s ability to imitate human conversation can be seen as a major technical achievement, it does not imply understanding or consciousness on the part of the AI. The Turing test, therefore, must be viewed not as a barometer of AI intelligence but of linguistic competence.
Or, as I explained in a previous article:
“These systems don’t learn from first principles and experience, like us, but by crunching as much human-generated content as possible. (…) What we end up with is not human-level intelligence, but a form of machine intelligence that appears human-like.”
GPT-5 will pass the Turing test
Either way, machines will pass the Turing test sooner rather than later. GPT 5, without a shadow of a doubt, will be on par with humans in linguistic fluency — and honestly, I’m not looking forward to that moment.?
As the line between human and machine blurs, so does our ability to navigate the digital world with certainty. If every digital interaction can be faked, we’ll be forced to engage in a guessing game of sorts, a never-ending Turing test, in which we have to ask ourselves on a daily basis whether we are talking to a machine or not: every email, chat conversation, phone call, social media post or news article could be generated with AI without us knowing.?
Many will deem these linguistically fluent machines intelligent. Skeptics will argue that fluency doesn’t equate to intelligence and they are right. The essence of intelligence, human or artificial, lies in the depth of the comprehension and the ability to contextualize, not in beating the imitation game. Champions of the technology will claim an early victory anyway.
Society, in the meantime, will have to adapt to the new status quo. I don’t know how yet, but my hope is that we will, because that’s what we do. Humans are, after all, the single most adaptive species on Earth.
A shorter version of this article was previously published here.
“GPT 5, without a shadow of a doubt, will be on par with humans in linguistic fluency — and honestly, I’m not looking forward to that moment.”?- Jurgen Gravestein
?? Further Reading
He already wrote for our Newsletter back in August, 2023 on the topic of artificial general intelligence.
Why You Should be Skeptical of AGI
·
Aug 31
This is third installment in our series on AGI. Today I invite Jurgen Gravestein, who is a writer, consultant, and conversation designer. He was employee no. 1 at Conversation Design Institute and now works for the strategy and delivery branch CDI Services helping companies drive more business value with conversational AI. His newsletter
AI Speaker & Consultant | Helping Organizations Navigate the AI Revolution | Generated $50M+ Revenue | Talks about #AI #ChatGPT #B2B #Marketing #Outbound
11 个月Interesting study! It highlights the challenges and limitations of AI models in achieving human-like conversation.
????Vom Arbeitswissenschaftler zum Wissenschaftskommunikator: Gemeinsam für eine sichtbarere Forschungswelt
11 个月This study raises important questions about the advancement of AI and the challenges it still faces. ??
IT Enthusiast ?? Maker of Things ??Business ??Technology ??Leading with Trust??Behavioral Economics ?? Organizational psychology??Software development ??Artificial Intelligence ?? Electric Vehicles
11 个月I’ve seen recently that some tool or model did beat humans at captcha accuracy, I’ll post it here if I find what it was
Humanitarian Specialist / Social Activitist / Non-Profit volunteer / Charity worker / Environment Lover
11 个月Thanks for sharing
A.I. Writer, researcher and curator - full-time Newsletter publication manager.
11 个月Subscribe to the author's Newsletter: Teaching Computers How to Talk: https://jurgengravestein.substack.com/subscribe