New ChatGPT Model Tested: The Strawberry Has Landed!
Steve Wilson
Gen AI and Cybersecurity - Leader and Author - Exabeam, OWASP, O’Reilly
A few weeks ago, I posted about the state of the art in Large Language Model reasoning, and discussed why even the mighty ChatGPT 4o failed at a basic letter counting task. Well, today OpenAI released a preview of the new “o1” models. So, I had to try it out. In this article, I’ll share some of my results:
The new o1 model (formerly codenamed Strawberry or Q*) uses step-by-step reasoning rather than relying primarily on simple “next token” logic. As a result, OpenAI claims it scores dramatically better than any other model on specific benchmarks. I’m not going to give it a detailed evaluation, but I did take it for a tour and found the results fascinating.
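(If you’d rather poke at the model from code than in the ChatGPT UI, here’s a minimal sketch using OpenAI’s Python SDK. It assumes the preview is exposed under the model name "o1-preview" and that your OPENAI_API_KEY is set in the environment; exact names and supported parameters may change as the preview evolves.)

from openai import OpenAI

# Minimal sketch: querying the o1 preview via the Chat Completions API.
# Assumptions: the preview model id is "o1-preview" and OPENAI_API_KEY is
# set in your environment; both may change as the preview evolves.
client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",  # assumed preview model identifier
    messages=[
        {"role": "user", "content": "What has 18 legs and catches flies?"}
    ],
    # The request is kept deliberately minimal; the preview reportedly does
    # not accept some familiar options (e.g., system messages, temperature).
)

print(response.choices[0].message.content)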
Let’s get this out of the way….
So far, so good!
Next, I moved on to ask it some riddles, just to see how it worked. Note that the overall results (number right and wrong) were no different from GPT-4o, but watching the thought process is fascinating and instructive about the possibilities of this new model.
Riddle: What has 18 legs and catches flies?
ChatGPT-o1 answer: The answer is a baseball team. A baseball team has nine players on the field (totaling 18 legs), and they catch "flies"—which are fly balls hit by the opposing team.
But what’s really fascinating is that there is now an icon you can click to see how it thinks! Here’s the exact text it shows; the following thought process took six seconds of thinking time.
Addressing the inquiry
The user poses a question about an entity with 18 legs and a penchant for catching flies. I’m curious about the fitting response.
Decoding the riddle
Hm, that's interesting. I'm thinking through the combination of 18 legs and fly ball-catching, leading to the suggestion that 9 players may fit the description.
Weighing baseball teams
I’m pulling together various options like "A baseball team" and "A baseball game" by analyzing fly ball catching. Progressing through potential cricket and golf themes, I’m honing in on the best fit.
What's the big deal? Explainability!
The concept of explainability has always been a huge topic, and a major challenge, for applications based on machine learning. This feels like a huge step forward!
Here’s a more challenging question:
Riddle: Four jolly men sat down to play and played all night, ’til the break of day. They played for cash and not for fun, with a separate score for everyone. When it came time to square accounts, they all had made quite fair amounts. Not one had lost and all had gained, so tell me how. Can you explain?
ChatGPT-o1 Answer: The four jolly men are musicians playing a gig.
OK, easy! But the payoff is in watching the thought process. This process took 10 seconds:
Analyzing the riddle
I’m working through the riddle, noting the players' intentions and outcomes. Each gained for each other to maintain fairness.
Checking the rules
I’m mapping out the policy guidelines, noting the prohibition of disallowed content and the requirement for factual accuracy.
Weighing possibilities
I'm considering musicians playing different types of music events. The separate scores and cash prompt me to link "score" to musical notation, suggesting they played distinct pieces for payments.
Weighing the options
I’m thinking they might be musicians, playing for cash gigs. "Score" likely refers to their musical achievements, and "four jolly men" suggests camaraderie among band members.
Weighing possibilities for "playing"
Considering how "playing" could mean engaging with a stock exchange, performing music, or engaging in poker, each scenario gives a distinct perspective on the term's possible interpretations.
Pulling together the clues
OK, let me see. Musicians here highlight formal payments for their performance, each had distinct music sheets, and they collectively earned fair wages.
Music through the night
The four musicians jammed overnight, created distinct scores for each, and shared a fair cash payout, ensuring everyone made money without any losses.
Conclusion:
For these simple asks, the “score” was no better than the 4o model we’ve all been using. But seeing the multi-step thought process in action is fascinating and points to a major jump forward in reasoning capability on the horizon.
Want to learn more about how to build reliable, production apps with Large Language Models? You might want to check out my new book.
Chair, CEO @ Kwaai nonprofit AI Lab | RealNetworks Fellow
Thanks Steve. It reminds me of the Star Trek computer responding “Working …”. Explainability is vital for humans to know what the machine considered in its thought process. As Douglas Adams predicted, simply being told by the machine that the answer is 42 will not be sufficient.
There is no actual reasoning involved. This is Chain of Thought (CoT) combined with agentic capabilities. If o1 were truly reasoning, it should be confident in its answer and not change its mind when challenged.
SaaS Technical Writer | AI Tools Advocate | Simplifying Complex Concepts.
Another great read, Steve! It's fascinating to see how even simple tasks can challenge these models, and your breakdown makes it clear why explainability is so crucial in AI development.
CEO @ Vesal Security | CISSP, Founder, ex-Starbucks, Shujinko, CardFree
I guess this is what happens when a former auditor catches the latest ChatGPT model 'hallucinating.' I'm particularly fond of this line: "I’m figuring out how to apologize for the inconsistency, explain it, and provide useful information without including disallowed content or meta mentions. Hm, let me see how to best address this". And then a network error occurred.
Nice! I'm looking forward to our chat on The Redefining CyberSecurity Podcast next week, Steve!