OpenAI o1 Pro vs. o1: Systematic Testing
I conducted an experiment to see whether longer reasoning times in a language model lead to better outcomes. Specifically, I examined two models: OpenAI's o1 Pro and the standard o1.
Project Overview
I approached this gradually by testing two main tasks: story writing and Python coding.
Story Writing Results
For the story generation task, I used prompts of more than 6,000 words each time. The o1 Pro model consistently took about 12 minutes to produce a response. One observation was that it tended to provide the entire output at once rather than displaying it incrementally as it composed the text. In contrast, o1 usually started writing within 30 to 60 seconds, which meant I could generate around ten o1 responses in the time it took o1 Pro to produce a single one.
Despite the time difference, o1 Pro usually produced higher-quality content. Its outputs were more creative and more accurate on the first try. With o1, I found that I needed about six regenerated outputs to achieve something on par with o1 Pro. In practice, that just meant clicking the "regenerate" button multiple times without introducing any new information to the system. While I cannot confirm whether the system internally learns from previous failed attempts, I have noticed that o1's responses sometimes remain very similar from one generation to the next, so I do not believe there is a specific step to ensure each regeneration differs from the last.
Python Coding Results
When it comes to coding, o1 is more suitable for rapid iteration, but o1 Pro offers more accurate solutions upfront. While regenerating o1 multiple times did not typically produce fundamentally different answers or coding approaches, o1 Pro was able to output significantly longer and more complete code. In many cases, o1 would condense or omit functions without recognizing the omission, making its results less comprehensive.
There were instances where o1 got "stuck" on the same broken solution. To address that, I had to deliberately adjust the prompt, for example by requesting a complete rebuild of the code by a "professional Python scientist," to introduce enough variation for it to overcome its repetitive patterns. It wasn't so much that it couldn't come up with novel approaches; rather, the existing code base itself biased it toward the same response every time.
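To make that workaround concrete, here is a minimal sketch of cycling through progressively stronger reframings of the same request until the model produces a materially different rewrite. The reframing phrases, the model identifier "o1", and the crude similarity check are illustrative assumptions, not the exact prompts or tooling from my experiments.

```python
# Hedged sketch: vary the framing of a coding request to escape repetitive answers.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY in the
# environment; the model name "o1" and the reframings below are assumptions.
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()

REFRAMINGS = [
    "Fix the bug in this code:",
    "Rebuild this code from scratch as a professional Python scientist:",
    "Rewrite this module with a different architecture, ignoring the current structure:",
]

def too_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Rough text-similarity check for near-identical regenerations."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def regenerate_until_different(code: str, stuck_solution: str) -> str:
    """Try stronger reframings until the output diverges from the answer o1 kept repeating."""
    attempt = stuck_solution
    for framing in REFRAMINGS:
        response = client.chat.completions.create(
            model="o1",  # assumed model identifier
            messages=[{"role": "user", "content": f"{framing}\n\n{code}"}],
        )
        attempt = response.choices[0].message.content
        if not too_similar(attempt, stuck_solution):
            return attempt  # materially different from the repeated answer
    return attempt  # fall back to the last attempt if nothing diverged
```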
Looking Ahead
I anticipate that future iterations of long-thinking Llama-based models will make it easier to replicate this extended reasoning approach. My preliminary experiments in coaxing Llama models to think longer suggest that further fine-tuning by Meta, or by similar organizations with large GPU budgets, may be necessary to avoid the overly brief responses that currently limit the models' potential.
For most users, it may be more practical to take a multi-shot approach with o1. You can adjust your prompt and generate multiple responses, then select the best one. If you have an agent-based system in place, you could even automate the evaluation of each output and choose the top performer. This approach might demand less time overall than waiting for a single o1 Pro response.
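As a rough sketch of that multi-shot workflow, the snippet below regenerates the same prompt several times and asks a second model to pick the strongest candidate. The model names ("o1" for generation, "gpt-4o" as the grader), the candidate count, and the grading prompt are assumptions for illustration, not settings from my experiments.

```python
# Hedged sketch of a best-of-n "multi-shot" workflow with automated selection.
# Assumes the OpenAI Python SDK (pip install openai) and that the model names
# "o1" and "gpt-4o" are available on your account; adjust as needed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_candidates(prompt: str, n: int = 6) -> list[str]:
    """Regenerate the same prompt n times, mimicking repeated 'regenerate' clicks."""
    candidates = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="o1",  # assumed model identifier
            messages=[{"role": "user", "content": prompt}],
        )
        candidates.append(response.choices[0].message.content)
    return candidates

def pick_best(prompt: str, candidates: list[str]) -> str:
    """Ask a grader model to choose the strongest candidate (hypothetical rubric)."""
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    grading_prompt = (
        f"Original task:\n{prompt}\n\n{numbered}\n\n"
        "Reply with only the number of the best candidate."
    )
    verdict = client.chat.completions.create(
        model="gpt-4o",  # assumed grader model
        messages=[{"role": "user", "content": grading_prompt}],
    )
    # No validation here; a real pipeline should check the grader's reply format.
    index = int(verdict.choices[0].message.content.strip()) - 1
    return candidates[index]

if __name__ == "__main__":
    task = "Write a short story about a lighthouse keeper."  # placeholder prompt
    print(pick_best(task, generate_candidates(task, n=6)))
```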
It is worth noting that o1 Pro does not rely on repeated attempts for its results; instead, its extended reasoning step seems to account for its more consistent accuracy. Still, o1 can reach a similar level of quality if you are willing to regenerate responses several times and filter out suboptimal output. The main argument for preferring o1 Pro is that it tends to produce a higher-quality answer initially, leaving less uncertainty about whether a better answer might appear after multiple tries.
Comments

Ex-Physician | Health Informatician | Committed to Unlocking Predictive AI’s Potential with FHIR
2 months ago: Jeremy Harper, thank you so much for these insights. This is very helpful. One thing I have tried is to use o1 Pro for the initial query and then move to o1 or even 4o. That lets Pro do the initial heavy lifting and deep analysis, and then o1/4o can complete the rest. Interestingly, you can downgrade from a higher model to a lower one within the same chat, but you cannot go back up to the higher model; that has been my experience. Once you come down to 4o, you can also use web search and canvas to get a thorough analysis.
Executive Director, QED Institute. Catalyzing collaborative production of high-quality knowledge.
2 months ago: Thanks, Jeremy Harper. This answers a helpful practical question that has been on my mind. I appreciate your reporting on your evaluation.