New ChatGPT Model Tested: The Strawberry Has Landed!
Steve Wilson
Gen AI and Cybersecurity - Leader and Author - Exabeam, OWASP, O’Reilly
A few weeks ago, I posted about the state of the art in Large Language Model reasoning and discussed why even the mighty ChatGPT 4o failed at a basic letter-counting task. Well, as of today, OpenAI has released a preview of the new “o1” models, so I had to try it out. In this article, I’ll share some of my results.
The new o1 model (formerly codenamed Strawberry or Q*) uses step-by-step reasoning rather than relying primarily on simple “next token” logic. As a result, OpenAI claims it scores dramatically better on specific benchmarks than any other model. I’m not going to give it a detailed evaluation, but I did take it for a tour and found the results fascinating.
Let’s get this out of the way….
So far, so good!
Next, I asked it some riddles, just to see how it worked. Note that the overall results (number correct and wrong) were no different from GPT-4o, but watching the thought process is fascinating and instructive about the possibilities of this new model.
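For comparison, here is a minimal Python sketch of the deterministic version of a letter-counting check. It assumes the test is the familiar "how many r's are in strawberry" question, which this post does not spell out; the function name `count_letter` is hypothetical and chosen only for illustration.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


# Assumed test case: the classic "strawberry" question.
print(count_letter("strawberry", "r"))  # prints 3
```

A plain string operation gets this right every time, which is why the task is a useful probe: a model that works on tokens rather than characters has to reason its way to the same answer.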
Riddle: What has 18 legs and catches flies?
ChatGPT-o1 answer: The answer is a baseball team. A baseball team has nine players on the field (totaling 18 legs), and they catch "flies", which are fly balls hit by the opposing team.
But what’s really fascinating is that there is now an icon you can click to see how it thinks! Here’s the exact text it shows. The following thought process took six seconds of thinking time.
Addressing the inquiry
The user poses a question about an entity with 18 legs and a penchant for catching flies. I’m curious about the fitting response.
Decoding the riddle
Hm, that's interesting. I'm thinking through the combination of 18 legs and fly ball-catching, leading to the suggestion that 9 players may fit the description.
Weighing baseball teams
I’m pulling together various options like "A baseball team" and "A baseball game" by analyzing fly ball catching. Progressing through potential cricket and golf themes, I’m honing in on the best fit.
What's the big deal? Explainability!
Explainability has always been a huge topic, and a major challenge, for applications based on machine learning. This feels like a huge step forward!
Here’s a more challenging question:
Riddle: Four jolly men sat down to play and played all night, ’til the break of day. They played for cash and not for fun, with a separate score for everyone. When it came time to square accounts, they’d all had made quite fair amounts. Not one had lost and all had gained, so tell me how. Can you explain?
ChatGPT-o1 Answer: The four jolly men are musicians playing a gig.
OK, easy! But the payoff is in watching the thought process. This process took 10 seconds:
Analyzing the riddle
I’m working through the riddle, noting the players' intentions and outcomes. Each gained for each other to maintain fairness.
Checking the rules
I’m mapping out the policy guidelines, noting the prohibition of disallowed content and the requirement for factual accuracy.
Weighing possibilities
I'm considering musicians playing different types of music events. The separate scores and cash prompt me to link "score" to musical notation, suggesting they played distinct pieces for payments.
Weighing the options
I’m thinking they might be musicians, playing for cash gigs. "Score" likely refers to their musical achievements, and "four jolly men" suggests camaraderie among band members.
Weighing possibilities for "playing"
Considering how "playing" could mean engaging with a stock exchange, performing music, or engaging in poker, each scenario gives a distinct perspective on the term's possible interpretations.
Pulling together the clues
OK, let me see. Musicians here highlight formal payments for their performance, each had distinct music sheets, and they collectively earned fair wages.
Music through the night
The four musicians jammed overnight, created distinct scores for each, and shared a fair cash payout, ensuring everyone made money without any losses.
Conclusion:
For these simple asks, the “score” was no better than the 4o model we’ve all been using. But seeing the multi-step thought process in action is fascinating, and it suggests that a major jump forward in reasoning capability is on the horizon.
Want to learn more about how to build reliable, production apps with Large Language Models? You might want to check out my new book.
Chair, CEO @ Kwaai nonprofit AI Lab | RealNetworks Fellow
1 month ago: Thanks, Steve. It reminds me of the Star Trek computer responding “Working …”. Explainability is vital for humans to know what the machine considered in its thought process. As Douglas Adams predicted, simply being told by the machine that the answer is 42 will not be sufficient.
There is no actual reasoning involved. This is Chain of Thought (CoT) combined with agentic capabilities. If o1 is truly reasoning, it should be confident in its answer and not change its mind when challenged.
SaaS Technical Writer | AI Tools Advocate | Simplifying Complex Concepts.
1 month ago: Another great read, Steve! It's fascinating to see how even simple tasks can challenge these models, and your breakdown makes it clear why explainability is so crucial in AI development.
CEO @ Vesal Security | CISSP, Founder, ex-Starbucks, Shujinko, CardFree
1 month ago: I guess this is what happens when a former auditor catches the latest Chat-GPT Model 'hallucinating.' I'm particularly fond of this line: "I’m figuring out how to apologize for the inconsistency, explain it, and provide useful information without including disallowed content or meta mentions. Hm, let me see how to best address this". And then a network error occurred.
Nice! I'm looking forward to our chat on The Redefining CyberSecurity Podcast next week, Steve!