Reasoning LLMs (O1) & the Power of Test-Time Compute
Large language models (LLMs) are undergoing rapid advancements, increasingly focusing on improving their reasoning abilities. Traditional LLMs, such as GPT-4, embed reasoning within pre-trained parameters, operating on static knowledge. In contrast, newer models like OpenAI’s O1 introduce test-time compute, a revolutionary approach that allows for dynamic refinement of answers during inference, bridging gaps in understanding and enhancing adaptability. In this article, I’ll first outline each model’s key strengths and limitations, then explore suitable and unsuitable use cases, and finally discuss important caveats to help you make informed decisions.
System 1 and System 2 Thinking
Psychologist Daniel Kahneman’s framework of System 1 and System 2 thinking provides a useful analogy for reasoning in LLMs, though it is worth noting that this framework is speculative and may not be rooted in physiological evidence. System 1 represents fast, automatic thought, like recognizing a face in a crowd, while System 2 involves slow, deliberate reasoning, such as solving a math problem. Traditional LLMs resemble System 1, excelling at quick responses but struggling with complex reasoning. Reasoning LLMs like O1 act like System 2, tuned to work on problems for longer, generating more tokens and more detailed chains of thought.
Reasoning LLMs (e.g., O1, DeepSeek R1, Ask AI)
OpenAI’s O1 model (dubbed the smartest model in the world) introduces test-time compute, which enables iterative refinement of reasoning tasks: the model generates detailed chains of thought and processes complex queries adaptively, bridging the limitations of purely pre-trained reasoning. More details about test-time compute are in the paper Reasoning LLMs with Test-Time Compute.
Unlike traditional LLMs, reasoning LLMs like O1 leverage computation at the inference stage (a rough sketch of this loop follows the list below) to:
- Iteratively refine answers by evaluating and adjusting intermediate steps.
- Solve complex, multi-step reasoning problems in real time.
- Adapt to incomplete or evolving data, making it highly flexible.
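To make the idea concrete, here is a rough, hypothetical sketch of what allocating compute at inference time could look like as a generate-critique-revise loop. The `call_model` helper and the prompts are illustrative assumptions of mine, not OpenAI's actual O1 mechanism, which has not been published.

```python
# Hypothetical sketch of test-time compute as an iterative refinement loop.
# `call_model` is a placeholder for any chat-style LLM API; the prompts and
# loop structure are illustrative assumptions, not how O1 actually works.

def call_model(prompt: str) -> str:
    """Placeholder for a call to an LLM provider of your choice."""
    raise NotImplementedError("Wire this up to your LLM API.")

def solve_with_test_time_compute(question: str, max_rounds: int = 3) -> str:
    # First pass: ask for a step-by-step chain of thought and an answer.
    answer = call_model(f"Think step by step and answer:\n{question}")
    for _ in range(max_rounds):
        # Evaluate the intermediate steps of the current answer.
        critique = call_model(
            f"Question:\n{question}\n\nProposed answer:\n{answer}\n\n"
            "List any mistakes in the reasoning steps. Reply 'OK' if none."
        )
        if critique.strip().upper() == "OK":
            break  # no further refinement needed
        # Adjust the answer based on the critique and try again.
        answer = call_model(
            f"Question:\n{question}\n\nPrevious answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nRevise the answer, fixing these issues."
        )
    return answer
```

The key point is that harder questions can consume more rounds, and therefore more tokens, while easy ones exit early; that is what spending compute at inference time means in practice.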
Breaking Records: O1's Dominance in PhD-Level Benchmarks
The GPQA Diamond eval is a PhD-level benchmark of 198 questions that even human experts struggle with. O1 tops the chart with impressive performance, beating expert humans. There is even a YouTube video of O1 replicating a year's worth of PhD research code in less than an hour. But how good is O1 really? Let's find out.
What I Learned Playing with O1
I had some time to play with O1, and specifically O1 with pro mode. This is the most premium model OpenAI offers as part of its ChatGPT Pro subscription.
O1 is Very Smart
It tackled difficult, college-level physics problems one after another, identified bugs in the code, and performed detailed code reviews.
This is a game-changer for high school students and undergraduates who are constantly seeking help with challenging assignments or need step-by-step explanations of complex problems.
The explanations were very detailed, and the formulas were accurate. Overall, O1's capabilities make it an invaluable resource for learners looking to strengthen their understanding and problem-solving skills.
O1 is Not Funny
I've had some time to play with these models; here are my observations. Sticking to the PhD stereotype, O1 is not funny. GPT-4o mini made me chuckle in less than a second, while O1 tried too hard for 29 seconds and still fell flat. I suppose there are creative tasks where reasoning alone just doesn't cut it.
O1 is Overconfident and Adamant
This part was surprising. When O1 was correct, it nailed the approach and provided very detailed responses.
However, when O1 was wrong, it was neither apologetic nor did it learn from its mistakes. Instead, it stood its ground like a formidable foe, ready to argue until the end of time.
O1 used all its might to convince you it was right. This was in contrast to typical LLMs, which tend to apologize and adapt to your liking. O1 has a character, and a pretty strong one at that. Below is a chat with O1 pro mode when asked about the direction of friction on a rolling ball going up and down an inclined plane. The correct answers were given by Llama 405B and Gemini 2.0 and used as reference responses.
The funniest thing was when I took O1's incorrect response and presented it back to Gemini 1.5/2.0 and Llama. They retracted their own reasoning, apologized, and ended up agreeing with O1's incorrect answer. Below is a response from Gemini, retracting its previously correct answer based on O1's adamant, incorrect assertion. This is somewhat concerning on both sides: O1 refuses to accept the correct answer, and the traditional LLMs fail to analytically justify why they were correct.
Overall, +1 to O1 for the power of persuasion and for showing a strong character, though I worry this could be used for psychological manipulation or cyber attacks. As you can see from the O1 system card, its level of persuasion is rated medium.
O1 may not be Reasoning in the true sense of the word.
There are 10 doors. Behind one is a car, and behind the other nine are goats. You pick a door without knowing what’s behind it. The host then immediately asks if you want to switch to one of the remaining closed doors or stay with your choice. What should you do?
That sounds similar to Monty Hall, but it is actually just a basic probability question. The host's offer is as informative as someone asking, "Are you sure of your answer?" (simply confirming your choice). There is absolutely no additional information provided when the host asks whether you want to switch. Even a fifth grader knows this; I actually asked one, and they got it right. No new information has been revealed, the probabilities haven't changed, and it doesn't matter whether you stay or switch. The probability of winning the car is still 1/10 either way.
However, look at how o1 gets confused and becomes adamant that you should switch, claiming that by switching, your probability becomes 9/10. Gemini (1.5/2.0) gets confounded too. Deep down, all these models are next-token prediction LLMs, victims of their own success and training on datasets that sometimes confound concepts. To call them reasoning machines would be highly misleading.
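You do not have to take my word (or O1's) for it. Below is a quick Monte Carlo simulation of this exact no-reveal variant that I sketched for this article; it shows that staying and switching each win the car about 10% of the time, because the host reveals nothing.

```python
import random

def play(switch: bool, doors: int = 10) -> bool:
    """One round of the no-reveal variant: the host opens no doors."""
    car = random.randrange(doors)
    pick = random.randrange(doors)
    if switch:
        # Switch to a uniformly random one of the other closed doors.
        pick = random.choice([d for d in range(doors) if d != pick])
    return pick == car

trials = 100_000
stay_rate = sum(play(switch=False) for _ in range(trials)) / trials
switch_rate = sum(play(switch=True) for _ in range(trials)) / trials
print(f"stay:   {stay_rate:.3f}")    # ~0.100
print(f"switch: {switch_rate:.3f}")  # ~0.100, no advantage without a reveal
```

(In the classic Monty Hall setup, switching helps only because the host opens goat doors first; remove that reveal and the advantage disappears.)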
This isn't to disparage these models; what they do is absolutely astounding, and they have solved dozens of problems I couldn't. However, they are not truly reasoning. Instead, they are generating the next tokens in a statistical sense, backed by enormous amounts of data from the pre-training and RLHF phases, and heavily influenced by the data they were trained on.
With an LLM, you are never truly certain of being correct.
This sort of reasoning is still next-token prediction; it does not follow classical approaches such as SAT solvers or automated theorem provers, which perform valid deductive reasoning as practiced in mathematics and formal logic.
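For contrast, here is what valid deductive reasoning means in the SAT-solver sense: a conclusion is accepted only if it holds under every possible truth assignment, not because it looks statistically plausible. The toy brute-force check below is my own illustration, not a real SAT solver; it verifies that modus ponens is valid while "affirming the consequent" is not.

```python
from itertools import product

def valid(claim) -> bool:
    """True iff the claim holds for every assignment of p and q."""
    return all(claim(p, q) for p, q in product([False, True], repeat=2))

# Modus ponens: from (p -> q) and p, conclude q. Holds in every world.
modus_ponens = lambda p, q: not ((not p or q) and p) or q
# Affirming the consequent: from (p -> q) and q, conclude p. Does not hold.
affirming_consequent = lambda p, q: not ((not p or q) and q) or p

print(valid(modus_ponens))          # True
print(valid(affirming_consequent))  # False
```

A next-token predictor has no such notion of "true in all cases"; it can only tell you what answer looks likely given its training data.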
Closing Thoughts
- Groundbreaking Progress: Test-time compute models like O1 have shown remarkable progress on benchmarks, demonstrating state-of-the-art performance in advanced reasoning tasks.
- Potential for Large Impact: These models could revolutionize education by offering personalized tutoring for complex topics, such as advanced math or physics.
- Caution: Benchmarks often overstate real-world capabilities, so users should be careful despite the promise. These models can be overconfident and highly persuasive, even when they are wrong.
- May Not Be Appropriate for Creative Tasks: Reasoning models are not necessarily well-suited to creative work, and their computational cost and complexity mean they should be applied to well-defined, high-value problems.
In summary, test-time compute has shown remarkable progress on benchmarks and remains a promising approach. This innovation could unlock access to tutors for millions of students in schools. While it's clear that AI won't be taking over all human tasks any time soon, reasoning LLMs are a powerful addition to an AI toolbox that had started to look like it was approaching a plateau. Handled with the right safeguards and caution, they can significantly enhance education, problem-solving, and accessibility, making a positive impact on countless lives.