OpenAI o1 Pro vs. o1: Systematic Testing
I conducted an experiment to see whether longer reasoning times in a language model lead to better outcomes. Specifically, I examined two models: OpenAI's o1 Pro and the standard o1.
Project Overview
I approached this gradually by testing two main tasks: story writing and Python coding.
Story Writing Results
For the story generation task, I used prompts of more than 6,000 words each time. The o1 Pro model consistently took about 12 minutes to produce a response. One observation was that it tended to provide the entire output at once rather than displaying it incrementally as it composed the text. In contrast, o1 usually started writing within 30 to 60 seconds, which meant I could generate around ten o1 responses in the time it took o1 Pro to produce a single one.
Despite the time difference, o1 Pro usually produced higher-quality content. Its outputs were more creative and more accurate on the first try. With o1, I found that I needed about six regenerated outputs to achieve something on par with o1 Pro. In practice, that just meant clicking the "regenerate" button multiple times without introducing any new information to the system. While I cannot confirm whether the system internally learns from previous failed attempts, I have noticed that o1's responses sometimes remain very similar from one generation to the next, so I do not believe there is a specific step to ensure each regeneration differs from the last.
Python Coding Results
When it comes to coding, o1 is more suitable for rapid iteration, but o1 Pro offers more accurate solutions upfront. While regenerating o1 multiple times did not typically produce fundamentally different answers or coding approaches, o1 Pro was able to output significantly longer and more complete code. In many cases, o1 would condense or omit functions without recognizing the omission, making its results less comprehensive.
There were instances where o1 got "stuck" on the same broken solution. To address that, I had to deliberately adjust the prompt, for example by requesting a complete rebuild of the code by a "professional Python scientist," to introduce enough variation for it to overcome its repetitive patterns. It wasn't so much that it couldn't come up with novel approaches; rather, the existing code base itself biased it toward the same response every time.
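To make that workaround concrete, here is a minimal sketch of cycling through progressively stronger reframings of the same request until the model produces a materially different rewrite. The reframing phrases, the model identifier "o1", and the crude similarity check are illustrative assumptions, not the exact prompts or tooling from my experiments.

```python
# Hedged sketch: vary the framing of a coding request to escape repetitive answers.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY in the
# environment; the model name "o1" and the reframings below are assumptions.
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()

REFRAMINGS = [
    "Fix the bug in this code:",
    "Rebuild this code from scratch as a professional Python scientist:",
    "Rewrite this module with a different architecture, ignoring the current structure:",
]

def too_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Rough text-similarity check for near-identical regenerations."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def regenerate_until_different(code: str, stuck_solution: str) -> str:
    """Try stronger reframings until the output diverges from the answer o1 kept repeating."""
    attempt = stuck_solution
    for framing in REFRAMINGS:
        response = client.chat.completions.create(
            model="o1",  # assumed model identifier
            messages=[{"role": "user", "content": f"{framing}\n\n{code}"}],
        )
        attempt = response.choices[0].message.content
        if not too_similar(attempt, stuck_solution):
            return attempt  # materially different from the repeated answer
    return attempt  # fall back to the last attempt if nothing diverged
```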
Looking Ahead
I anticipate that future iterations of long-thinking Llama-based models will make it easier to replicate this extended reasoning approach. My preliminary experiments in coaxing Llama models to think longer suggest that further fine-tuning by Meta, or by similar organizations with large GPU budgets, may be necessary to avoid the overly brief responses that currently limit the models' potential.
For most users, it may be more practical to take a multi-shot approach with o1. You can adjust your prompt and generate multiple responses, then select the best one. If you have an agent-based system in place, you could even automate the evaluation of each output and choose the top performer. This approach might demand less time overall than waiting for a single o1 Pro response.
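As a rough sketch of that multi-shot workflow, the snippet below regenerates the same prompt several times and asks a second model to pick the strongest candidate. The model names ("o1" for generation, "gpt-4o" as the grader), the candidate count, and the grading prompt are assumptions for illustration, not settings from my experiments.

```python
# Hedged sketch of a best-of-n "multi-shot" workflow with automated selection.
# Assumes the OpenAI Python SDK (pip install openai) and that the model names
# "o1" and "gpt-4o" are available on your account; adjust as needed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_candidates(prompt: str, n: int = 6) -> list[str]:
    """Regenerate the same prompt n times, mimicking repeated 'regenerate' clicks."""
    candidates = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="o1",  # assumed model identifier
            messages=[{"role": "user", "content": prompt}],
        )
        candidates.append(response.choices[0].message.content)
    return candidates

def pick_best(prompt: str, candidates: list[str]) -> str:
    """Ask a grader model to choose the strongest candidate (hypothetical rubric)."""
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    grading_prompt = (
        f"Original task:\n{prompt}\n\n{numbered}\n\n"
        "Reply with only the number of the best candidate."
    )
    verdict = client.chat.completions.create(
        model="gpt-4o",  # assumed grader model
        messages=[{"role": "user", "content": grading_prompt}],
    )
    # No validation here; a real pipeline should check the grader's reply format.
    index = int(verdict.choices[0].message.content.strip()) - 1
    return candidates[index]

if __name__ == "__main__":
    task = "Write a short story about a lighthouse keeper."  # placeholder prompt
    print(pick_best(task, generate_candidates(task, n=6)))
```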
It is worth noting that o1 Pro does not rely on repeated attempts for its results; instead, its extended reasoning step seems to account for its more consistent accuracy. Still, o1 can reach a similar level of quality if you are willing to regenerate responses several times and filter out suboptimal output. The main argument for preferring o1 Pro is that it tends to produce a higher-quality answer initially, leaving less uncertainty about whether a better answer might appear after multiple tries.
Comments

Ex-Physician | Health Informatician | Committed to Unlocking Predictive AI’s Potential with FHIR
2 months ago: Jeremy Harper, thank you so much for these insights. This is very helpful. One thing I have tried is to use o1 Pro for the initial query and then move to o1 or even 4o. That lets Pro do the initial heavy lifting and deep analysis, and then o1/4o can complete the rest. Interestingly, you can downgrade from a higher model to a lower one within the same chat, but you cannot go back up to the higher model; that has been my experience. Once you come down to 4o, you can also use web search and canvas to get a thorough analysis.
Executive Director, QED Institute. Catalyzing collaborative production of high-quality knowledge.
2 months ago: Thanks, Jeremy Harper. This answers a helpful practical question that has been on my mind. I appreciate your reporting on your evaluation.