Reasoning LLMs (O1) & the Power of Test-Time Compute


Large language models (LLMs) are undergoing rapid advancements, increasingly focusing on improving their reasoning abilities. Traditional LLMs, such as GPT-4, embed reasoning within pre-trained parameters, operating on static knowledge. In contrast, newer models like OpenAI’s O1 introduce test-time compute, a revolutionary approach that allows for dynamic refinement of answers during inference, bridging gaps in understanding and enhancing adaptability. In this article, I’ll first outline each model’s key strengths and limitations, then explore suitable and unsuitable use cases, and finally discuss important caveats to help you make informed decisions.

System 1 and System 2 Thinking

Psychologist Daniel Kahneman’s framework of System 1 and System 2 thinking provides a useful analogy for reasoning in LLMs, though it is worth noting that this framework is speculative and may not be rooted in physiological evidence. System 1 represents fast, automatic thought, like recognizing a face in a crowd, while System 2 involves slow, deliberate reasoning, such as solving a math problem. Traditional LLMs resemble System 1, excelling at quick responses but struggling with complex reasoning. Reasoning LLMs like O1 act like System 2, tuned to work on problems for longer, generating more tokens and detailed chains of thought.


Reasoning LLMs (e.g., O1, DeepSeek R1, Ask AI)

OpenAI’s O1 model (dubbed the smartest model in the world) introduces test-time compute, which enables iterative refinement of reasoning tasks: the model generates detailed chains of thought and processes complex queries adaptively, bridging the limitations of pre-trained reasoning. More details on test-time compute can be found in the paper Reasoning LLMs with Test-Time Compute.

Unlike traditional LLMs, Reasoning LLMs like O1 leverage computation at the inference stage (a minimal sketch follows the list) to:

  • Iteratively refine answers by evaluating and adjusting intermediate steps.
  • Solve complex, multi-step reasoning problems in real time.
  • Adapt to incomplete or evolving data, making it highly flexible.
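
To make the idea concrete, here is a minimal sketch of one common test-time compute pattern: best-of-N sampling with self-evaluation. The model call is a stub, and O1’s actual mechanism is not public, so this illustrates the general pattern only.

```python
import random

def sample_chain_of_thought(prompt: str) -> tuple[str, float]:
    """Stub for an LLM call: returns (answer, verifier score).

    A real system would sample a full reasoning chain from a model and
    score it with a verifier or reward model; random values stand in here.
    """
    answer = f"candidate answer #{random.randint(0, 999)}"
    return answer, random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Spend extra inference compute: sample n chains, keep the best-scored one."""
    candidates = [sample_chain_of_thought(prompt) for _ in range(n)]
    best_answer, _ = max(candidates, key=lambda c: c[1])
    return best_answer

print(best_of_n("A hard multi-step physics question..."))
```

The key point is the compute knob: raising n buys more attempts at the same problem during inference, which is exactly the trade-off test-time compute exploits.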


Breaking Records: O1's Dominance in PhD-Level Benchmarks

The GPQA Diamond eval is a PhD-level benchmark of 198 questions that even human experts struggle with. O1 tops the charts with impressive performance, beating the human experts. There's even a YouTube video of O1 replicating a year's worth of PhD research code in less than an hour. But how good is O1 really? Let's find out.


What I Learned Playing with O1

I had some time to play with O1, specifically O1 with pro mode. This is the most premium model offered by OpenAI as part of its ChatGPT Pro subscription.

O1 is Very Smart

It tackled difficult, college-level physics problems one after another, identified bugs in the code, and performed detailed code reviews.

This is a game-changer for high school students and undergraduates who are constantly seeking help with challenging assignments or need step-by-step explanations of complex problems.

The explanations were very detailed, and the formulas were accurate. Overall, O1's capabilities make it an invaluable resource for learners looking to strengthen their understanding and problem-solving skills.


O1 is Not Funny

Sticking to the PhD stereotype, O1 is not funny. GPT-4o mini, on the other hand, made me chuckle in less than a second. O1 tries too hard for 29 seconds, while the less capable GPT-4o mini dishes it out without blinking. I suppose there are creative tasks where reasoning alone just doesn't cut it.

O1 is Overconfident & Adamant

This part was surprising. When O1 was correct, it nailed the approach and provided very detailed responses.

However, when O1 was wrong, it was neither apologetic nor did it learn from its mistakes. Instead, it stood its ground like a formidable foe, ready to argue until the end of time.

O1 used all its might to convince you it was right. This was in contrast to typical LLMs, which tend to apologize and adapt to your liking. O1 has a character, and a pretty strong one at that. Below is a chat with O1 pro mode when asked about the direction of friction on a rolling ball going up and down an inclined plane. The correct answers, given by Llama 405B and Gemini 2.0, were used as responses.
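
For reference, the textbook answer can be checked symbolically. This is my own worked check with sympy, not the chat transcript: for a uniform ball rolling without slipping on an incline, the static friction force solves out positive, i.e. pointing up the incline, and the same equations hold whether the ball is rolling up or down.

```python
import sympy as sp

m, g, r, theta = sp.symbols("m g r theta", positive=True)
a, f = sp.symbols("a f")  # up-slope acceleration; friction (positive = up the incline)

# Rolling without slipping: v = omega*r, so angular acceleration alpha = a/r.
# Translation along the slope (up-slope positive):
newton = sp.Eq(m * a, -m * g * sp.sin(theta) + f)
# Rotation about the center, I = (2/5) m r^2; up-slope friction at the
# contact point produces a torque that opposes the spin, hence the sign:
euler = sp.Eq(sp.Rational(2, 5) * m * r**2 * (a / r), -f * r)

sol = sp.solve([newton, euler], [a, f])
print(sol[f])  # 2*g*m*sin(theta)/7 -> positive: friction acts up the incline
print(sol[a])  # -5*g*sin(theta)/7
```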

The funniest thing was when I took O1’s incorrect response and presented it back to Gemini 1.5/2.0 and Llama. They retracted their own reasoning, apologized, and ended up agreeing with O1’s incorrect answer. Below is a response from Gemini, retracting its previously correct answer based on O1’s adamant, incorrect assertion. This situation is somewhat concerning for both O1, which refuses to accept the correct answer, and traditional LLMs, which fail to analytically justify why they are correct.

Overall, +1 to O1 for its power of persuasion and strong character, though I worry this could be used for psychological manipulation or cyber attacks. As you can see from the O1 system card, its persuasion level is rated medium.

O1 May Not Be Reasoning in the True Sense of the Word

Let's consider a simple problem:

There are 10 doors. Behind one is a car, and behind the other nine are goats. You pick a door without knowing what’s behind it. The host then immediately asks if you want to switch to one of the remaining closed doors or stay with your choice. What should you do?

This sounds similar to the Monty Hall problem, but it's actually just a basic probability question. The host's question is as simple as someone asking, “Are you sure of your answer?” (just confirming your choice). Unlike in Monty Hall, no door is opened, so absolutely no additional information is provided when the host asks whether you want to switch. Even a fifth grader knows this (I actually asked one, and they got it right). No new information has been provided, the probabilities haven't changed, and it doesn't matter whether you stay or switch. The probability of winning the car is still 1/10.
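
Since no door is opened, a quick simulation backs this up (my own illustration): staying and switching each win about 1/10 of the time. In the true Monty Hall setup, where the host first opens the other eight goat doors, switching would indeed win 9/10, which is likely the pattern the models latch onto.

```python
import random

def trial(switch: bool) -> bool:
    """One round of the 10-door puzzle where the host opens NO doors."""
    car = random.randrange(10)
    pick = random.randrange(10)
    if switch:
        # Switch uniformly at random to one of the other 9 closed doors.
        pick = random.choice([d for d in range(10) if d != pick])
    return pick == car

n = 100_000
stay = sum(trial(False) for _ in range(n)) / n
swap = sum(trial(True) for _ in range(n)) / n
print(f"P(win | stay)   ~ {stay:.3f}")  # ~0.100
print(f"P(win | switch) ~ {swap:.3f}")  # ~0.100 -- switching doesn't help
```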

However, look at how O1 gets confused and becomes adamant that you should switch, claiming that switching raises your probability to 9/10. Gemini (1.5/2.0) gets confounded too. Deep down, all these models are next-token-prediction LLMs, victims of their own success and of training datasets that sometimes conflate concepts. To call them reasoning machines would be highly misleading.

This isn’t to disparage these models; what they do is absolutely astounding, and they have solved dozens of problems I couldn’t. However, they are not truly reasoning. They are generating the next tokens in a statistical sense, backed by enormous amounts of data from the pre-training and RLHF phases, and heavily influenced by the data they were trained on.

With an LLM, you are never truly certain of being correct.

This sort of reasoning is still next-token prediction. It does not follow classical reasoning approaches like SAT solvers or automated theorem provers, which perform valid deductive reasoning as practiced in mathematics and formal logic.
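
For contrast, here is what deductive reasoning looks like in a solver. This tiny example uses the Z3 SMT solver (my choice of tool for illustration; the article mentions SAT solvers generally): modus ponens is shown valid because asserting its premises together with the negated conclusion is unsatisfiable. The answer comes with a guarantee, not a probability.

```python
# pip install z3-solver
from z3 import Bool, Implies, Not, Solver, unsat

p, q = Bool("p"), Bool("q")

s = Solver()
s.add(p)              # premise: p
s.add(Implies(p, q))  # premise: p -> q
s.add(Not(q))         # negated conclusion: not q
# If no assignment satisfies all three, the inference p, p->q |- q is valid.
print("modus ponens valid:", s.check() == unsat)  # True
```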


Closing Thoughts

  • Groundbreaking Progress: Test-time compute models like O1 have shown remarkable progress on benchmarks, demonstrating state-of-the-art performance in advanced reasoning tasks.
  • Potential for Large Impact: These models could revolutionize education by offering personalized tutoring for complex topics, such as advanced math or physics.
  • Caution: While promising, these models' benchmark results often overstate their real-world capabilities. They can be overconfident and highly persuasive even when wrong, which can lead to misleading conclusions.
  • May Not Suit Creative Tasks: Reasoning models may not be well-suited for creative tasks. Their computational costs and complexity mean they should be applied to well-defined, high-value problems.

In summary, test-time compute has shown remarkable progress on benchmarks and remains a promising approach. This innovation could unlock access to tutors for millions of students in schools. While it's clear that AI won't be taking over all human tasks any time soon, reasoning LLMs are a powerful addition to an AI toolbox that had almost seemed to be approaching a plateau. Handled with the right safeguards and caution, they can significantly enhance education, problem-solving, and accessibility, making a positive impact on countless lives.

