OpenAI o1 Is Out: Embracing Inference-Time Scaling and the Future of AI Reasoning

We are witnessing a shift toward inference-time scaling: spending more compute at inference time rather than only at training time.

Introducing OpenAI o1-preview

OpenAI has unveiled the o1 series, a new line of AI models designed to spend more time thinking before they respond. These models excel at reasoning through complex tasks and solving harder problems in science, coding, and math. Available starting September 12, 2024, the o1 series represents a significant advancement in AI capability; OpenAI is resetting the model counter back to 1 to mark a new era in AI development.

Some fast stats

  • Chain-of-Thought (CoT) reasoning is utilized as a test-time computation strategy to enhance results.
  • Reinforcement learning is employed to fine-tune the CoT reasoning approach.
  • Ranks in the 89th percentile on Codeforces competitive programming contests.
  • Exceeds human PhD-level accuracy on the GPQA benchmark of physics, biology, and chemistry problems.
  • Places among the top 500 students in the US in a qualifier for the USA Mathematical Olympiad (AIME).
  • While it outperforms GPT-4o on complex reasoning, its performance on many common business tasks is merely on par with GPT-4o.


1. Reasoning Without Massive Models

You don't need a colossal model to perform effective reasoning. Traditionally, large language models allocate a significant number of parameters to memorize facts to excel in benchmarks like TriviaQA. However, it's possible to separate reasoning from knowledge. By developing a smaller "reasoning core" that adeptly utilizes tools like web browsers and code verifiers, we can reduce the emphasis on pre-training compute while maintaining or even enhancing performance.
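To make that separation concrete, here is a minimal, purely hypothetical sketch of a "reasoning core" loop that delegates computation to an external tool instead of storing everything in its weights. The calculator stands in for a code verifier, and none of these names come from OpenAI's API:

```python
# Hypothetical sketch: a small "reasoning core" that delegates
# knowledge and verification to external tools instead of weights.

def calculator(expr: str) -> str:
    """Toy stand-in for a code-execution tool."""
    return str(eval(expr, {"__builtins__": {}}))  # illustrative only

def reasoning_core(question: str, tools: dict) -> str:
    # A real model would *decide* when to call a tool; here the
    # plan is hard-coded just to show the control flow.
    plan = [("calculator", "17 * 23")]
    observations = []
    for tool_name, args in plan:
        observations.append(tools[tool_name](args))
    return f"{question} -> {observations[-1]}"

print(reasoning_core("What is 17 * 23?", {"calculator": calculator}))
```

The point of the sketch: the core only needs to know *how* to decompose the problem and *when* to call which tool; the facts and the arithmetic live outside the model.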

OpenAI o1-mini

To offer a more efficient solution for developers, OpenAI is also releasing o1-mini, a faster, cheaper reasoning model particularly effective at coding. As a smaller model, o1-mini is 80% cheaper than o1-preview, making it a powerful, cost-effective option for applications that require reasoning but not broad world knowledge.

2. Shifting Compute to Inference

A substantial share of compute is now shifting from pre-training to serving inference. An LLM can act as a text-based simulator: by rolling out many candidate strategies and scenarios inside that simulator, the model can converge on strong solutions. This mirrors well-established techniques like the Monte Carlo Tree Search (MCTS) used in AlphaGo, and underscores the power of search at inference time.
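One of the simplest forms of inference-time search is self-consistency: sample many reasoning rollouts and take a majority vote. A toy sketch, with a weighted random draw standing in for actual LLM rollouts (the numbers are invented for illustration):

```python
import random
from collections import Counter

def sample_answer(rng: random.Random) -> int:
    # Toy stand-in for one stochastic chain-of-thought rollout:
    # the right answer (42) comes up more often than wrong ones.
    return rng.choices([42, 41, 43], weights=[0.8, 0.1, 0.1])[0]

def self_consistency(n_samples: int, seed: int = 0) -> int:
    """Sample many reasoning paths and return the majority answer."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency(1))    # a single rollout can land on a wrong answer
print(self_consistency(101))  # more inference compute stabilizes the answer
```

More sophisticated search (best-of-n with a verifier, or tree search) follows the same pattern: spend more samples, aggregate, converge.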

3. The Inference Scaling Law Unveiled

It appears that OpenAI recognized the potential of inference scaling ahead of the academic curve. Recently, two papers have shed light on this concept:

  • "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" by Brown et al. demonstrates that DeepSeek-Coder's coverage on SWE-bench jumps from 15.9% with a single sample to 56% with 250 samples, surpassing the single-attempt state of the art set by stronger models such as Claude 3.5 Sonnet.
  • "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters" by Snell et al. reveals that PaLM 2-S outperforms a model 14 times larger on MATH tasks when employing test-time search.

These findings suggest that increasing inference compute can be more effective than merely scaling model parameters.
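The repeated-sampling effect can be made concrete with the standard unbiased pass@k estimator (introduced for Codex evaluation and used throughout the repeated-sampling literature): given n total samples of which c are correct, it gives the probability that at least one of k drawn samples solves the task.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n total (of which c are correct) solves the task."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# If only 1 in 10 samples is correct (a 10% single-shot solve rate):
print(round(pass_at_k(1000, 100, 1), 3))    # pass@1 is 0.1
print(round(pass_at_k(1000, 100, 100), 3))  # pass@100 approaches 1.0
```

This is exactly the mechanism behind the numbers above: a model with a modest single-sample solve rate can reach very high coverage if you can afford many samples and can recognize a correct answer.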

4. Challenges in Productionizing o1

Bringing OpenAI o1 into production poses significant challenges beyond academic benchmarks. For real-world reasoning problems, determining when to stop searching, defining reward functions, and setting success criteria are complex tasks. Deciding when to invoke tools like code interpreters adds another layer of complexity, especially when considering the compute cost of these additional processes.
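A hypothetical sketch of one such production trade-off: keep sampling until a verifier accepts a candidate, or stop when a fixed compute budget runs out. All functions here are illustrative stand-ins, not a real API:

```python
import random

def generate_candidate(rng: random.Random) -> int:
    return rng.randint(1, 20)  # toy stand-in for one model rollout

def verify(candidate: int) -> bool:
    return candidate % 5 == 0  # toy stand-in for a code-interpreter check

def search_with_budget(max_samples: int, cost_per_sample: float, seed: int = 0):
    """Stop on the first verified answer or when the budget is exhausted."""
    rng = random.Random(seed)
    for used in range(1, max_samples + 1):
        candidate = generate_candidate(rng)
        if verify(candidate):
            return candidate, used * cost_per_sample
    return None, max_samples * cost_per_sample  # budget spent, no answer

answer, cost = search_with_budget(max_samples=50, cost_per_sample=0.01)
print(answer, cost)
```

The hard production questions live in the stand-ins: what `verify` should be for open-ended tasks, how to price each sample (including tool calls), and what to return when the budget expires without a verified answer.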

Safety and Alignment

OpenAI has developed a new safety training approach that harnesses the reasoning capabilities of o1 models to adhere to safety and alignment guidelines. By being able to reason about safety rules in context, the models can apply them more effectively. On one of the hardest jailbreaking tests, GPT-4o scored 22 (on a scale of 0-100), while the o1-preview model scored 84, indicating significant improvements in safety compliance.

5. A Data Flywheel in Motion

The o1 series has the potential to create a powerful data flywheel. When the model generates correct answers, the entire search trace becomes a mini-dataset of training examples, encompassing both positive and negative rewards. This continuous feedback loop can enhance the reasoning core for future versions of GPT, much like how AlphaGo's value network improved through iterative MCTS-generated data, and just as I mentioned in my opening statement at the Data and Innovation summit earlier this year.
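A hypothetical sketch of how such a flywheel could be harvested: each search run yields (trace, reward) pairs, with verified-correct traces labeled positively and failures negatively. The schema and names are illustrative, not anything OpenAI has published:

```python
from dataclasses import dataclass

@dataclass
class TraceExample:
    problem: str
    reasoning_trace: str
    reward: float  # +1.0 for verified-correct, -1.0 otherwise

def harvest_traces(problem, candidates, verifier):
    """Turn one search run into a mini-dataset of (trace, reward) pairs."""
    return [
        TraceExample(problem, trace, 1.0 if verifier(answer) else -1.0)
        for trace, answer in candidates
    ]

dataset = harvest_traces(
    "2+2?",
    [("compute 2+2 -> 4", 4), ("guess -> 5", 5)],
    verifier=lambda a: a == 4,
)
print([ex.reward for ex in dataset])  # [1.0, -1.0]
```

Feed those labeled traces back into training and the next model starts its search from a stronger prior, which is the flywheel in miniature.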

Real-World Applications

These enhanced reasoning capabilities may be particularly useful for tackling complex problems in science, coding, math, and similar fields. For example:

  • Healthcare Researchers: Annotating cell sequencing data more accurately.
  • Physicists: Generating complicated mathematical formulas needed for quantum optics.
  • Developers: Building and executing multi-step workflows with improved efficiency.

Access and Availability

  • ChatGPT Users: ChatGPT Plus and Team users can access o1 models starting today. Both o1-preview and o1-mini can be selected manually, with weekly rate limits of 30 messages for o1-preview and 50 for o1-mini.
  • Enterprise and Educational Users: ChatGPT Enterprise and Edu users will get access to both models beginning next week.
  • Developers: Those who qualify for API usage tier 5 can start prototyping with both models in the API today.
  • Future Plans: OpenAI plans to bring o1-mini access to all ChatGPT Free users and continue adding features like browsing and file and image uploading to make the models more useful.

What's Next

This is an early preview of the reasoning models in ChatGPT and the API. OpenAI plans to continue developing and releasing models in the GPT series, in addition to the new OpenAI o1 series. Future updates are expected to enhance the models' capabilities and features, making them even more versatile and powerful.

In essence, OpenAI o1 signifies a significant move toward leveraging inference-time scaling in AI. By focusing on search and learning that scale with compute, we can develop more efficient models that excel in reasoning without the need for enormous parameter counts. This approach not only aligns with the insights from "The Bitter Lesson" but also sets the stage for more dynamic and capable AI systems in the future.

Stefan Wendin

This release also led me to revisit this paper --> https://arxiv.org/abs/2401.00448
