OpenAI o1 Is Thinking

How Reinforcement Learning helps with complex reasoning by producing a long internal chain of thought

OpenAI is back with a new model, one that thinks before answering. This is not just ordinary processing but complex reasoning that puts the company back on the LLM podium. And the performance is impressive.

“OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).” (Learning to Reason with LLMs by OpenAI)

But wasn’t the previous model supposed to be thinking before answering? Well, it was. But o1 simply does it for longer, and that seems to be working well. In a previous article (How LLMs Think) we introduced the concept of monosemanticity, which generally means that every word or phrase has a single, precise meaning. In LLMs, this translates into every neuron in the model being focused on one topic. In that article, we talked about how Anthropic’s engineers found that, by leveraging autoencoders and the concept of monosemanticity, they can see what the model is thinking about, because the model attributes higher activation scores to specific neurons.

How LLMs Think — Research paper in pills: “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” (towardsdatascience.com)
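As a rough illustration of that idea, here is a toy sparse autoencoder over a model’s hidden activations. The dimensions, sparsity penalty, and random data are placeholders, not Anthropic’s actual setup; the point is only that the sparsity pressure pushes each learned feature toward a single interpretable concept.

```python
import torch
import torch.nn as nn

# Toy sparse autoencoder over hidden activations (illustrative sizes only).
d_model, d_features = 512, 4096

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        # ReLU keeps features non-negative; the L1 penalty below pushes most
        # of them to zero, so each active feature tends to capture one
        # concept (monosemanticity).
        features = torch.relu(self.encoder(x))
        return self.decoder(features), features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

activations = torch.randn(64, d_model)  # stand-in for real LLM activations
for _ in range(100):
    recon, features = sae(activations)
    loss = nn.functional.mse_loss(recon, activations) + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```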

However, o1 follows a different strategy. It heavily relies on a chain of thought. Instead of jumping directly to a conclusion, the model thinks through the problem systematically, much like how a human would tackle a complex question by breaking it down into smaller, more manageable parts.

Chain of Thought

When presented with a complex question, o1 internally dissects it into smaller components. This makes challenging problems more approachable. The model processes each component step by step, ensuring that each part is well-understood before moving on to the next. As it reasons through the steps, o1 can identify inconsistencies or errors in its thinking and adjust accordingly. Then, if the initial approach doesn’t yield the desired result, o1 can rethink its strategy, trying alternative methods to solve the problem.
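As a concrete (and purely illustrative) contrast, here is what a direct prompt versus a chain-of-thought-style prompt might look like. Keep in mind that o1 produces this kind of breakdown internally on its own; the wording below is hypothetical and only meant to show the decomposition.

```python
question = "How many 'r's are there in 'strawberry'?"

# Direct prompt: the model is nudged to answer immediately.
direct_prompt = f"{question} Answer with a single number."

# Chain-of-thought-style prompt: the model is asked to decompose the task,
# check each step, and only then commit to an answer.
cot_prompt = (
    f"{question}\n"
    "Work through this step by step:\n"
    "1. Spell the word letter by letter.\n"
    "2. Mark every letter that is an 'r'.\n"
    "3. Count the marks and state the final number."
)
```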

Let’s now test this model using one of the most crucial questions for LLMs: “How many ‘r’s are there in ‘strawberry’?” First, let’s use 4o:
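If you want to reproduce the test yourself, here is a minimal sketch using the official OpenAI Python SDK. It assumes an API key in the OPENAI_API_KEY environment variable and access to the gpt-4o and o1-preview models.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "How many 'r's are there in 'strawberry'?"

# Ask GPT-4o first, then o1-preview, and compare the answers.
for model in ["gpt-4o", "o1-preview"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    print(model, "->", response.choices[0].message.content)
```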

As expected, it got it wrong, even though the intermediate reasoning is right. The earlier models’ thinking appears more disconnected: the steps are not strung together sequentially, which causes the model to land on the wrong answer.

Let’s now see how o1 performs on this intricate task (one wonders whether this was among the questions on that PhD-level physics benchmark):

Kudos to the OpenAI team: after years and a few billion dollars, we can now count the number of ‘r’s in “strawberry” correctly! Enough with the jokes, let’s get serious and look at the science behind it.

Reinforcement Learning

So, what’s actually happening under the hood that makes o1 so much better at reasoning than its competitors? The key lies in how reinforcement learning is used to train the model to produce a longer and more effective chain of thought.

In reinforcement learning (RL), an “agent” learns to make decisions by performing actions and receiving feedback in the form of rewards or penalties. In the context of o1, RL is used to train the model to think more deeply and systematically before providing an answer.
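As a refresher, here is a toy agent–environment loop in plain Python. The actions, rewards, and update rule are invented purely to show the action → reward → update cycle; it has nothing to do with how o1 itself is trained.

```python
import random

# Toy multi-armed-bandit loop: the "agent" keeps a value estimate per action
# and nudges it toward the rewards it actually receives.
actions = ["a", "b", "c"]
true_reward = {"a": 0.2, "b": 0.8, "c": 0.5}   # hidden from the agent
value = {a: 0.0 for a in actions}              # the agent's estimates
learning_rate, epsilon = 0.1, 0.1

for step in range(1000):
    # Explore occasionally, otherwise exploit the best-known action.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=value.get)
    # Noisy reward from the environment.
    reward = true_reward[action] + random.gauss(0, 0.1)
    # Update the estimate toward the observed reward.
    value[action] += learning_rate * (reward - value[action])

print(value)  # value["b"] should end up highest
```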

Here’s how it works:

The model starts with a base level of understanding from pre-training on vast amounts of text data. This gives it general language comprehension and knowledge about the world.

During RL training, o1 is encouraged to generate detailed reasoning steps (chain of thought) when solving a problem. The model’s reasoning process and final answers are evaluated based on their correctness and effectiveness. Positive feedback (rewards) is given for correct reasoning that leads to accurate answers, while incorrect reasoning receives negative feedback (penalties).

Over many iterations, o1 adjusts its internal parameters to maximize the expected reward. This means it learns to produce chains of thought that are more likely to lead to correct solutions.
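In very rough, toy form, that loop might look like the sketch below: sample one of several candidate chains of thought, reward those whose final answer is correct, and reinforce them. The canned chains and the update rule are purely illustrative; OpenAI has not published its actual training procedure.

```python
import random

# Three canned "chains of thought" for the strawberry question; only two of
# them end in the correct answer. This is a toy stand-in for a real policy.
chains = [
    ("s-t-r-a-w-b-e-r-r-y has r at positions 3, 8, 9 -> 3", 3),
    ("I see two r's in 'berry' -> 2", 2),
    ("Spell it out, mark each r, count the marks -> 3", 3),
]
correct_answer = 3
weights = [1.0, 1.0, 1.0]   # unnormalized policy over chains
learning_rate = 0.5

def sample_chain():
    return random.choices(range(len(chains)), weights=weights)[0]

for _ in range(200):
    i = sample_chain()
    _, answer = chains[i]
    reward = 1.0 if answer == correct_answer else -1.0
    # Reinforce chains that led to the right answer, penalize the others.
    weights[i] = max(0.01, weights[i] + learning_rate * reward)

print(weights)  # the two correct chains end up with much higher weight
```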

In this article, we won’t dive deep into reinforcement learning. However, here’s a list of RL articles that will help you master this branch of AI:

Reinforcement Learning (medium.com)

Coding

Let’s be frank: if you code, you are likely using ChatGPT or your LLM of choice daily. But how many times have you been frustrated by completely wrong answers, or code that doesn’t compile? Well, I have good news for programmers. o1 was tested by simulating participation in the 2024 International Olympiad in Informatics (IOI), one of the most prestigious programming competitions for high school students globally.

The model scored 213 points, placing it in the 49th percentile among human contestants. This means it performed better than nearly half of the human participants. Like the human competitors, it had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.

For each problem, o1 generated multiple candidate submissions. Out of these, 50 submissions were selected based on a test-time selection strategy. This strategy evaluated how well the solutions performed on the official test cases provided by the competition. Additionally, it generated new test cases to assess the robustness of the solutions further and created an internal scoring mechanism that helped prioritize the most promising solutions.
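As a rough sketch of what such a selection strategy could look like (the scoring weights and the mix of official and self-generated tests are my own assumptions based on the description above, not OpenAI’s published mechanism):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    code: str
    public_pass_rate: float     # fraction of official test cases passed
    generated_pass_rate: float  # fraction of self-generated test cases passed

def score(c: Candidate) -> float:
    # Hypothetical internal scoring: weight official tests more heavily,
    # use self-generated tests as a robustness signal.
    return 0.7 * c.public_pass_rate + 0.3 * c.generated_pass_rate

def select_submissions(candidates: list[Candidate], limit: int = 50) -> list[Candidate]:
    # Keep only the highest-scoring candidates, up to the submission limit.
    return sorted(candidates, key=score, reverse=True)[:limit]

# Tiny usage example with made-up candidates.
pool = [
    Candidate("solution_a", 1.0, 0.9),
    Candidate("solution_b", 0.6, 0.8),
    Candidate("solution_c", 1.0, 0.4),
]
for c in select_submissions(pool, limit=2):
    print(c.code, round(score(c), 2))
```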

OpenAI’s team also measured how this would have played out without the selection strategy. If the model had submitted randomly chosen candidates instead, it would have scored an average of 156 points. The strategic selection therefore added nearly 60 extra points, highlighting the importance of intelligent decision-making even within AI processes.

That’s an impressive result, but most users won’t be feeding GPT Olympiad-level problems, and we won’t impose such tight constraints on the thinking either. I would rather wait a bit longer for my answer than get an instant blurb that is wrong. The model was therefore also tested without these constraints, and here o1’s performance soared: when allowed 10,000 submissions per problem, it achieved a score of around 362, surpassing the gold medal threshold.

Safety and Transparency

Another point in o1’s favor is improved safety. The ethical and safety alignments are integrated directly into the chain of thought, which lets the model judge each guideline in context rather than superficially. This enables the model to understand the guidelines at a higher level and to resist jailbreak attempts, where a user purposefully tries to get around the safety guidelines.

As for transparency, the review is mixed. The chain of thought dramatically improves the ability to inspect the model’s reasoning process. For example, developers can intervene and adjust training if the model’s reasoning trends toward unsafe conclusions. However, OpenAI intentionally did not make the raw chain of thought available to the public. According to OpenAI, the primary reason for hiding it is to allow the model to express its thoughts freely and authentically, without external interference.

o1 was subjected to extensive safety testing, conducted in line with OpenAI’s Preparedness Framework, described as “processes to track, evaluate, forecast, and protect against catastrophic risks posed by increasingly powerful models” (Learning to Reason with LLMs by OpenAI).

Conclusion

By thinking through problems step by step?—?much like we do?—?o1 tackles complex tasks in math, coding, and science, often surpassing human experts. Its use of reinforcement learning to deepen its reasoning sets it apart from previous models. While the full o1 model isn’t publicly available yet, we can catch a glimpse of its performance through o1-preview. This is an incredibly exciting moment in AI, as this model’s leap in performance isn’t just due to more computational power, more data, or longer training. Instead, it’s the result of a different way of thinking. AI is increasingly resembling the human approach to problem-solving. It makes you wonder: how far are we from AGI?
