From Language to Logic: The Game-Changing Impact of OpenAI's Latest AI Model
Agustin Ramirez
Executive Global IT Leader | Driving Innovation & Profitability | Program Management | Strategist of Tomorrow's IT
The bulk of LLM progress until now has been language-driven. This new model enters the realm of complex reasoning, with implications for physics, coding, and more.
Introduction of OpenAI o1
Last week, OpenAI released a new model called o1 (previously referred to under the code name “Strawberry” and, before that, Q*) that significantly outperforms GPT-4o for complex reasoning tasks.
Focus on Multistep Reasoning
Unlike previous models that are well suited for language tasks like writing and editing, OpenAI o1 is focused on multistep “reasoning,” the type of process required for advanced mathematics, coding, or other STEM-based questions. It uses a “chain of thought” technique, according to OpenAI. This technique allows the model to recognize and correct its mistakes, break down tricky steps into simpler ones, and try different approaches when the current one isn’t working.
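To make this concrete, here is a minimal sketch of what a request to the model might look like through OpenAI's Python SDK. The model identifier "o1-preview" and the sample prompt are assumptions for illustration, not details confirmed in this article; the point is that the chain of thought happens inside the model, so the call itself looks like an ordinary chat completion.

# Minimal sketch, assuming the openai Python package is installed and an API key is configured.
# The model name "o1-preview" is an assumption for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A multistep reasoning task; the model works through intermediate steps internally,
# so no "think step by step" instruction is added to the prompt.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": "A train leaves at 9:40 and travels 210 km at 84 km/h. "
                       "At what time does it arrive? Give only the final answer.",
        }
    ],
)

print(response.choices[0].message.content)

With older chat models, a similar effect is often approximated by explicitly asking the model to reason step by step in the prompt; the difference here is that o1 folds that behavior into the model itself.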
Performance and Accuracy
OpenAI’s tests point to resounding success. The model ranks in the 89th percentile on questions from the competitive coding organization Codeforces and would be among the top 500 high school students in the USA Math Olympiad, which covers geometry, number theory, and other math topics. The model is also trained to answer PhD-level questions in subjects ranging from astrophysics to organic chemistry.
In math olympiad questions, the new model is 83.3% accurate, versus 13.4% for GPT-4o. In the PhD-level questions, it averaged 78% accuracy, compared with 69.7% from human experts and 56.1% from GPT-4o.
Significance of the New Model
The bulk of LLM progress until now has been language-driven, resulting in chatbots or voice assistants that can interpret, analyze, and generate words. However, these LLMs have failed to demonstrate the types of skills required to solve important problems in fields like drug discovery, materials science, coding, or physics. OpenAI’s o1 is one of the first signs that LLMs might soon become genuinely helpful companions to human researchers in these fields.
Expert Opinions
Matt Welsh, an AI researcher and founder of the LLM startup Fixie, highlights the significance of this development. He notes that the reasoning abilities are built directly into the model, rather than requiring separate tools to achieve similar results, and expects this to raise the bar for what people expect AI models to be able to do.
However, it’s best to take OpenAI’s comparisons to “human-level skills” with a grain of salt, says Yves-Alexandre de Montjoye, an associate professor in math and computer science at Imperial College London. It’s very hard to meaningfully compare how LLMs and people go about tasks such as solving math problems from scratch.
Challenges in Measuring Reasoning
AI researchers say that measuring how well a model like o1 can “reason” is harder than it sounds. If it answers a given question correctly, is that because it successfully reasoned its way to the logical answer? Or was it aided by a sufficient starting point of knowledge built into the model? The model “still falls short when it comes to open-ended reasoning,” Google AI researcher François Chollet wrote on X.
Cost and Accessibility
Finally, there’s the price. This reasoning-heavy model doesn’t come cheap. Though access to some versions of the model is included in premium OpenAI subscriptions, developers using o1 through the API will pay three times as much as they pay for GPT-4o—$15 per 1 million input tokens in o1, versus $5 for GPT-4o. The new model also won’t be most users’ first pick for more language-heavy tasks, where GPT-4o continues to be the better option, according to OpenAI’s user surveys.
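As a rough illustration of what that difference means in practice, the sketch below computes input-token cost for a hypothetical workload using only the per-million-token input prices quoted above; the workload size is made up, and output tokens (billed separately) are not included.

# Rough cost comparison using the input-token prices quoted above
# ($15 per 1M input tokens for o1, $5 per 1M for GPT-4o).
# The workload size is a hypothetical example; output-token pricing is excluded.
O1_INPUT_PRICE_PER_M = 15.00
GPT4O_INPUT_PRICE_PER_M = 5.00

input_tokens = 2_000_000  # hypothetical monthly input volume

o1_cost = input_tokens / 1_000_000 * O1_INPUT_PRICE_PER_M        # $30.00
gpt4o_cost = input_tokens / 1_000_000 * GPT4O_INPUT_PRICE_PER_M  # $10.00

print(f"o1 input cost:     ${o1_cost:.2f}")
print(f"GPT-4o input cost: ${gpt4o_cost:.2f}")

For the same input volume, o1 costs three times as much, so the reasoning model makes sense mainly where the harder problems justify the premium.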
Potential and Future Applications
AI systems that can solve complex math could allow us to build more powerful AI tools. What will it unlock? We won’t know until researchers and labs have the access, time, and budget to tinker with the new model and find its limits. But it’s surely a sign that the race for models that can outreason humans has begun.