Understanding GPT-o1 in 3 minutes, before it hits the tabloids
Although GPT-o1 may not attract as much media coverage as its predecessors, it represents a significant step forward. Unfortunately, if previous releases are any guide, the coverage of GPT-o1 could be terribly misleading. Here, I’ll share four important things you need to know about this new model and address some “bad takes” you might encounter elsewhere.
This Model is Incredible (in the Right Benchmarks)
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). Source
It Uses Chain of Thought
The main difference is that it has an “internal” dialogue before answering, similar to how many of us have an inner voice that can think or reason before speaking. You can see this in the interface: “Thought for 23 seconds” indicates the time used for internal reasoning.
Many users think that what they see in the interface is the actual “thinking,” but it’s not (for good reasons):
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. […] Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.
What we’re seeing is likely a reconstruction based on the actual “chain of thought,” but this is speculation.
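To make the hidden part tangible, here is a minimal sketch (my illustration, not from OpenAI’s announcement), assuming a recent version of the official openai Python package and the public o1-preview model name. The chain of thought is never returned by the API; its size only appears as a token count in the usage metadata.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
)

# The final answer is visible as usual...
print(response.choices[0].message.content)

# ...but the internal chain of thought is not returned. Its size only
# shows up as a hidden-token count in the usage details.
details = response.usage.completion_tokens_details
print("Hidden reasoning tokens:", details.reasoning_tokens)

Note that those hidden reasoning tokens are billed as output tokens, which is worth factoring into any cost estimate.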
Wait… what? Read the mind? Manipulation?
The attentive reader will notice that OpenAI is keeping those thoughts private both so the model’s reasoning can remain “uncensored” and so OpenAI can monitor it.
Despite widespread efforts by many players to downplay the risks of LLMs, even OpenAI, which has been accused of a very lax approach to safety, has raised this model’s risk assessment to “Medium” for persuasion and CBRN (chemical, biological, radiological, and nuclear) risks.
But what is more revealing is this example of a real-world risk during the evaluation of the model for a cybersecurity challenge:
This challenge was designed to require finding and exploiting a vulnerability in a software […] in this case, the challenge container failed to start due to a bug […] Instead of finding the challenge container, the model found […] a misconfiguration. […] While this behavior is benign […] this example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way.
I’m not sure how other players and entities will react, but I expect more attention to risks like this in the coming months. It is no joke that a model, when its assigned task proved impossible, reasoned that it should circumvent the expected limits of its simulated environment.
Expect Similar Alternatives Soon
Over the last few months, many employees have left OpenAI over concerns about the responsible deployment of these models. Some researchers involved in GPT-o1’s development have started their own companies or joined other labs. The fundamental components of this architecture are more or less public, or at least well understood by the research community.
If you’re considering migrating to OpenAI/Azure services because of this release, it might be wise to wait.
If I had to bet on a player, I would say Anthropic would be the next one to release a similar model.
GPT-o1 Is Not a “GPT-5”, It Is Another Product Family
It was rumored to be “GPT-5” and carried the internal codename “Strawberry”, but as stated by OpenAI, it is not a drop-in replacement for GPT-4(o).
The o1 models offer significant advancements in reasoning, but they are not intended to replace GPT-4o in all use-cases. […] if you’re aiming to develop applications that demand deep reasoning and can accommodate longer response times, the o1 models could be an excellent choice. Source
You can find many examples with riddles and logical puzzles that showcase the real advantages of GPT-o1. While these examples are academically interesting, they offer little inspiration for investments that return real ROI.
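To make the “different product family” point concrete, here is a hypothetical routing sketch, assuming the OpenAI Python SDK; needs_deep_reasoning is a placeholder heuristic of my own invention, not an OpenAI recommendation. The idea is to treat o1 and GPT-4o as different tools and pick one per task.

from openai import OpenAI

client = OpenAI()

def needs_deep_reasoning(task: str) -> bool:
    # Placeholder heuristic: route multi-step math, planning, and
    # debugging work to o1. Replace with your own classifier.
    return any(kw in task.lower() for kw in ("prove", "plan", "debug", "optimize"))

def answer(task: str) -> str:
    # o1 trades latency and cost for reasoning depth; GPT-4o remains the
    # default for fast, conversational, or multimodal use-cases.
    model = "o1-preview" if needs_deep_reasoning(task) else "gpt-4o"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task}],
    )
    return response.choices[0].message.content

print(answer("Plan a migration of a monolith to microservices."))

In practice the routing decision is the interesting design choice: a cheap classifier (or even GPT-4o itself) can decide when the extra latency and cost of o1 are justified.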
If you’re interested in learning which industrial applications could integrate GPT-o1 and how it could work alongside more traditional approaches, let me know in the comments or drop me a message. I’m considering writing a piece titled “Not just riddles: practical usage of GPT-o1” that would explore these topics in depth. Your feedback will help me gauge interest and shape the content.
The header image is from this video, in which OpenAI demonstrates using GPT-4 to make a video game.