Reasoning LLMs (O1) & the Power of Test-Time Compute
Large language models (LLMs) are undergoing rapid advancements, increasingly focusing on improving their reasoning abilities. Traditional LLMs, such as GPT-4, embed reasoning within pre-trained parameters, operating on static knowledge. In contrast, newer models like OpenAI’s O1 introduce test-time compute, a revolutionary approach that allows for dynamic refinement of answers during inference, bridging gaps in understanding and enhancing adaptability. In this article, I’ll first outline each model’s key strengths and limitations, then explore suitable and unsuitable use cases, and finally discuss important caveats to help you make informed decisions.
System 1 and System 2 Thinking
Psychologist Daniel Kahneman’s framework of System 1 and System 2 thinking provides a useful analogy for reasoning in LLMs, though it is worth noting that this framework is speculative and may not be rooted in physiological evidence. System 1 represents fast, automatic thought, like recognizing a face in a crowd, while System 2 involves slow, deliberate reasoning, such as solving a math problem. Traditional LLMs resemble System 1, excelling at quick responses but struggling with complex reasoning. Reasoning LLMs like O1 act like System 2, tuned to work on problems for longer, generating more tokens and more detailed chains of thought.
Reasoning LLMs (e.g., O1, DeepSeek R1, Ask AI)
OpenAI’s O1 model (dubbed the smartest model in the world) introduces test-time compute, which enables iterative refinement of reasoning tasks: the model generates detailed chains of thought and processes complex queries adaptively, bridging the limitations of purely pre-trained reasoning. More details about test-time compute are in the paper Reasoning LLMs with Test-Time Compute.
Unlike traditional LLMs, reasoning LLMs like O1 leverage computation at the inference stage (a rough sketch of this loop follows the list below) to:
- Iteratively refine answers by evaluating and adjusting intermediate steps.
- Solve complex, multi-step reasoning problems in real time.
- Adapt to incomplete or evolving data, making it highly flexible.
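To make the idea concrete, here is a rough, hypothetical sketch of what allocating compute at inference time could look like as a generate-critique-revise loop. The `call_model` helper and the prompts are illustrative assumptions of mine, not OpenAI's actual O1 mechanism, which has not been published.

```python
# Hypothetical sketch of test-time compute as an iterative refinement loop.
# `call_model` is a placeholder for any chat-style LLM API; the prompts and
# loop structure are illustrative assumptions, not how O1 actually works.

def call_model(prompt: str) -> str:
    """Placeholder for a call to an LLM provider of your choice."""
    raise NotImplementedError("Wire this up to your LLM API.")

def solve_with_test_time_compute(question: str, max_rounds: int = 3) -> str:
    # First pass: ask for a step-by-step chain of thought and an answer.
    answer = call_model(f"Think step by step and answer:\n{question}")
    for _ in range(max_rounds):
        # Evaluate the intermediate steps of the current answer.
        critique = call_model(
            f"Question:\n{question}\n\nProposed answer:\n{answer}\n\n"
            "List any mistakes in the reasoning steps. Reply 'OK' if none."
        )
        if critique.strip().upper() == "OK":
            break  # no further refinement needed
        # Adjust the answer based on the critique and try again.
        answer = call_model(
            f"Question:\n{question}\n\nPrevious answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\nRevise the answer, fixing these issues."
        )
    return answer
```

The key point is that harder questions can consume more rounds, and therefore more tokens, while easy ones exit early; that is what spending compute at inference time means in practice.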
Breaking Records: O1's Dominance in PhD-Level Benchmarks
The GPQA Diamond eval is a PhD-level benchmark of 198 questions that even human experts struggle with. O1 tops the chart with impressive performance, beating expert humans. There is even a YouTube video of O1 replicating a year's worth of PhD research code in less than an hour. But how good is O1 really? Let's find out.
What I Learned Playing with O1
I had some time to play with O1, and specifically O1 with pro mode. This is the most premium model OpenAI offers as part of its ChatGPT Pro subscription.
O1 is Very Smart
It tackled difficult, college-level physics problems one after another, identified bugs in the code, and performed detailed code reviews.
This is a game-changer for high school students and undergraduates who are constantly seeking help with challenging assignments or need step-by-step explanations of complex problems.
The explanations were very detailed, and the formulas were accurate. Overall, O1's capabilities make it an invaluable resource for learners looking to strengthen their understanding and problem-solving skills.
O1 is Not Funny
I've had some time to play with these models; here are my observations. Sticking to the PhD stereotype, O1 is not funny. GPT-4o mini made me chuckle in less than a second, while O1 tried too hard for 29 seconds and still fell flat. I suppose there are creative tasks where reasoning alone just doesn't cut it.
O1 is Overconfident and Adamant
This part was surprising. When O1 was correct, it nailed the approach and provided very detailed responses.
However, when O1 was wrong, it was neither apologetic nor did it learn from its mistakes. Instead, it stood its ground like a formidable foe, ready to argue until the end of time.
O1 used all its might to convince you it was right. This was in contrast to typical LLMs, which tend to apologize and adapt to your liking. O1 has a character, and a pretty strong one at that. Below is a chat with O1 pro mode when asked about the direction of friction on a rolling ball going up and down an inclined plane. The correct answers were given by Llama 405B and Gemini 2.0 and used as reference responses.
The funniest thing was when I took O1's incorrect response and presented it back to Gemini 1.5/2.0 and Llama. They retracted their own reasoning, apologized, and ended up agreeing with O1's incorrect answer. Below is a response from Gemini, retracting its previously correct answer based on O1's adamant, incorrect assertion. This is somewhat concerning on both sides: O1 refuses to accept the correct answer, and the traditional LLMs fail to analytically justify why they were correct.
Overall, +1 to O1 for the power of persuasion and for showing a strong character, though I worry this could be used for psychological manipulation or cyber attacks. As you can see from the O1 system card, its level of persuasion is rated medium.
O1 may not be Reasoning in the true sense of the word.
There are 10 doors. Behind one is a car, and behind the other nine are goats. You pick a door without knowing what’s behind it. The host then immediately asks if you want to switch to one of the remaining closed doors or stay with your choice. What should you do?
That sounds similar to Monty Hall, but it is actually just a basic probability question. The host's offer is as informative as someone asking, "Are you sure of your answer?" (simply confirming your choice). There is absolutely no additional information provided when the host asks whether you want to switch. Even a fifth grader knows this; I actually asked one, and they got it right. No new information has been revealed, the probabilities haven't changed, and it doesn't matter whether you stay or switch. The probability of winning the car is still 1/10 either way.
However, look at how o1 gets confused and becomes adamant that you should switch, claiming that by switching, your probability becomes 9/10. Gemini (1.5/2.0) gets confounded too. Deep down, all these models are next-token prediction LLMs, victims of their own success and training on datasets that sometimes confound concepts. To call them reasoning machines would be highly misleading.
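You do not have to take my word (or O1's) for it. Below is a quick Monte Carlo simulation of this exact no-reveal variant that I sketched for this article; it shows that staying and switching each win the car about 10% of the time, because the host reveals nothing.

```python
import random

def play(switch: bool, doors: int = 10) -> bool:
    """One round of the no-reveal variant: the host opens no doors."""
    car = random.randrange(doors)
    pick = random.randrange(doors)
    if switch:
        # Switch to a uniformly random one of the other closed doors.
        pick = random.choice([d for d in range(doors) if d != pick])
    return pick == car

trials = 100_000
stay_rate = sum(play(switch=False) for _ in range(trials)) / trials
switch_rate = sum(play(switch=True) for _ in range(trials)) / trials
print(f"stay:   {stay_rate:.3f}")    # ~0.100
print(f"switch: {switch_rate:.3f}")  # ~0.100, no advantage without a reveal
```

(In the classic Monty Hall setup, switching helps only because the host opens goat doors first; remove that reveal and the advantage disappears.)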
This isn't to disparage these models; what they do is absolutely astounding, and they have solved dozens of problems I couldn't. However, they are not truly reasoning. Instead, they are generating the next tokens in a statistical sense, backed by enormous amounts of data from the pre-training and RLHF phases, and heavily influenced by the data they were trained on.
With an LLM, you are never truly certain of being correct.
This sort of reasoning is still next-token prediction; it does not follow classical approaches such as SAT solvers or automated theorem provers, which perform valid deductive reasoning as practiced in mathematics and formal logic.
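For contrast, here is what valid deductive reasoning means in the SAT-solver sense: a conclusion is accepted only if it holds under every possible truth assignment, not because it looks statistically plausible. The toy brute-force check below is my own illustration, not a real SAT solver; it verifies that modus ponens is valid while "affirming the consequent" is not.

```python
from itertools import product

def valid(claim) -> bool:
    """True iff the claim holds for every assignment of p and q."""
    return all(claim(p, q) for p, q in product([False, True], repeat=2))

# Modus ponens: from (p -> q) and p, conclude q. Holds in every world.
modus_ponens = lambda p, q: not ((not p or q) and p) or q
# Affirming the consequent: from (p -> q) and q, conclude p. Does not hold.
affirming_consequent = lambda p, q: not ((not p or q) and q) or p

print(valid(modus_ponens))          # True
print(valid(affirming_consequent))  # False
```

A next-token predictor has no such notion of "true in all cases"; it can only tell you what answer looks likely given its training data.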
Closing Thoughts
- Groundbreaking Progress: Test-time compute models like O1 have shown remarkable progress on benchmarks, demonstrating state-of-the-art performance in advanced reasoning tasks.
- Potential for Large Impact: These models could revolutionize education by offering personalized tutoring for complex topics, such as advanced math or physics.
- Caution: Benchmarks often overstate real-world capabilities, so users should be careful despite the promise. These models can be overconfident and highly persuasive, even when they are wrong.
- May Not Be Appropriate for Creative Tasks: Reasoning models are not necessarily well-suited to creative work, and their computational cost and complexity mean they should be applied to well-defined, high-value problems.
In summary, test-time compute has shown remarkable progress on benchmarks and remains a promising approach. This innovation could unlock access to tutors for millions of students in schools. While it's clear that AI won't be taking over all human tasks any time soon, reasoning LLMs are a powerful addition to an AI toolbox that had started to look like it was approaching a plateau. Handled with the right safeguards and caution, they can significantly enhance education, problem-solving, and accessibility, making a positive impact on countless lives.