Is OpenAI’s o1 The AI Doctor We’ve Always Been Waiting For? (Surprisingly, Yes!)
Dr. Ashish Bamania
I simplify the latest advances in AI, Quantum Computing & Software Engineering for you | Tech Writer With 1M+ views | Software Engineer
OpenAI’s o1 is out, and its performance on STEM tasks is mind-bending!
Quoted from OpenAI’s research article titled ‘Learning to Reason with LLMs’:
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).
The model has been trained using reinforcement learning and uses a long internal Chain-of-Thought approach to think through a problem before generating an output.
Its performance scales incredibly with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).
Whether it's mathematics, competitive programming, or PhD-level questions in Physics, Chemistry, and Biology, it answers them all with a high degree of correctness.
And its performance is substantially higher than that of the previous state-of-the-art model, GPT-4o.
But what about Medicine?
Researchers in a new pre-print on arXiv set out to answer precisely this question.
o1 was evaluated over six tasks using data from 37 medical datasets, including challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine and The Lancet.
The results show that o1 surpasses GPT-4 and all other strong LLMs in accuracy, coming close to dominating most of the evaluations.
In this story, we deep dive into o1's performance in the medical domain, its strengths and weaknesses, and how it can be further enhanced into an early, promising AI-doctor candidate.
Let’s go!
How Was o1 Evaluated?
Aspects
Researchers assessed o1's performance in three essential areas of medicine that align with real-world clinician needs.
Prompting Strategies
To explore these areas, three prompting strategies were used:
Datasets
For the evaluation, 35 existing medicine-related datasets and two additional challenging datasets created from professional medical quizzes from the New England Journal of Medicine and The Lancet were used.
These datasets are grouped into different tasks, each examining a specific capability of a model.
Metrics
Five different metrics were employed, and these are:
Models
Four different models were compared with the o1-preview-2024-09-12 model.
These are as follows:
The complete evaluation pipeline is shown below.
And How Well Did o1 Perform?
Understanding Aspect
o1 beats all other models, including GPT-4 and GPT-3.5, in this aspect.
On five concept recognition datasets, o1 outperforms GPT-4 and GPT-3.5 by an average of 7.6% and 26.6%, respectively (i.e., 72.6% vs 65.0% vs 46.0%), in terms of the F1 Score.
Notably, it shows a 24.5% average improvement on the BC4Chem dataset.
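For context, concept recognition datasets such as BC4Chem are entity-extraction tasks, so the F1 score here compares the set of concepts a model extracts against the gold annotations. Below is a minimal sketch of how such an entity-level F1 could be computed; the example entities and the exact matching rules are illustrative, not the paper's.

```python
# Minimal sketch: entity-level F1 for a concept recognition task.
# The example entities and exact matching rules are illustrative, not the paper's.

def entity_f1(gold: set[str], predicted: set[str]) -> float:
    """F1 over exact-match entity sets."""
    if not gold or not predicted:
        return 0.0
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical BC4Chem-style example: chemicals mentioned in an abstract.
gold = {"aspirin", "ibuprofen", "acetaminophen"}
predicted = {"aspirin", "acetaminophen", "caffeine"}
print(f"F1 = {entity_f1(gold, predicted):.2f}")  # 2 of 3 predictions correct -> F1 = 0.67
```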
In text summarization tasks, o1's ROUGE-1 score is 2.4% higher than GPT-4 and 3.7% higher than GPT-3.5.
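Similarly, ROUGE-1 measures unigram overlap between a generated summary and a reference one. Here is a quick, hedged sketch using the open-source rouge_score package; the paper's implementation and preprocessing may differ.

```python
# Quick sketch of a ROUGE-1 comparison using the rouge_score package
# (pip install rouge-score). The texts are made up.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

reference = "The patient presented with chest pain and shortness of breath."
candidate = "Patient reported chest pain and difficulty breathing."

scores = scorer.score(reference, candidate)
print(scores["rouge1"].fmeasure)  # unigram-overlap F-measure between the two texts
```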
Reasoning Aspect
In terms of medical mathematical reasoning, o1 achieves 9.4% higher accuracy than GPT-4 on MedCalc-Bench.
What about real-world diagnostic situations?
o1 outperforms GPT-4 and GPT-3.5 on the newly constructed QA datasets NEJMQA and LancetQA, with accuracy improvements of 8.9% and 27.1%, respectively.
It also surpasses both GPT-4 and GPT-3.5 (with accuracy gains of 15.5% and 10%) in the AgentClinic benchmark, which evaluates complex reasoning scenarios with multi-turn conversations and medical environment simulations.
It is also noted that o1's answers are more concise and straightforward than GPT-4's, which tends to generate long, hallucinated explanations for its incorrect answers.
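To give a feel for what a multi-turn evaluation like AgentClinic involves, here is a purely hypothetical sketch of a doctor-patient dialogue loop, where one model plays the clinician and another simulates the patient. This is not the actual AgentClinic harness; the model names, prompts, and stopping rule are placeholders.

```python
# Hypothetical illustration of a multi-turn diagnostic dialogue, NOT the actual
# AgentClinic harness. Model names, prompts, and the stopping rule are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(model: str, messages: list[dict]) -> str:
    """Send one chat-completion request and return the reply text."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

# The "doctor" model interviews a simulated "patient" model, then commits to a diagnosis.
doctor_history = [{
    "role": "user",
    "content": ("You are a clinician. Ask the patient one question at a time. "
                "When confident, give a final answer prefixed with 'DIAGNOSIS:'."),
}]
patient_profile = "You are a patient with classic symptoms of appendicitis. Answer briefly."

for _ in range(5):  # cap the number of turns
    doctor_turn = ask("o1-preview", doctor_history)
    doctor_history.append({"role": "assistant", "content": doctor_turn})
    if "DIAGNOSIS:" in doctor_turn:
        print(doctor_turn)
        break
    patient_reply = ask("gpt-4o", [{
        "role": "user",
        "content": f"{patient_profile}\nThe doctor asks: {doctor_turn}",
    }])
    doctor_history.append({"role": "user", "content": patient_reply})
```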
Multilinguality Aspect
o1 beats other models in multilingual question-answering tasks with an average accuracy of 85.2%, compared to GPT-4’s 75.7% and GPT-3.5’s 54.1%.
However, it falls short of GPT-4 by 1.6% (43.4% vs. 45.0%) on the Chinese agent benchmark AI Hospital in medical examination scenarios.
o1 also struggles with mixed language output generation in the medical setting.
It is thought that this could be due to the lack of multilingual CoT data during o1’s training.
Apart from these evaluation aspects, the research has many other interesting findings.
Let’s talk about them next.
There’s Still No Single Best Model
Although o1 generally outperforms other LLMs in most clinical decision tasks, no single model consistently performs best across all medical tasks.
For example, o1 lags behind GPT-4 by 5% in accuracy on the MIMIC4ED-Critical Triage dataset.
Interestingly, Llama 3 outperforms o1 by 20% in the PMC-Patient and PICO-Intervention datasets (96.0% vs. 76.4%).
Chain-of-Thought Prompting Still Improves o1
It is also seen that although o1 takes an internal Chain-of-Thought approach, further Chain-of-Thought (CoT) prompting still improves its performance by 3.18%.
Surprisingly, other prompting strategies like Self-Consistency and Reflex worsen the accuracy of o1 on the LancetQA dataset by up to 24.5% compared to only CoT prompting.
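For reference, CoT prompting here simply means adding an explicit "think step by step" instruction on top of o1's internal reasoning. A rough illustration of a direct prompt versus a CoT-augmented prompt (the question and templates are made up, not the paper's exact ones):

```python
# Rough illustration of a direct prompt vs. a CoT-augmented prompt for a medical
# QA item. The question and templates are made up, not the paper's exact ones.
question = (
    "A 24-year-old woman presents with fever, right lower quadrant pain, "
    "and rebound tenderness. What is the most likely diagnosis?\n"
    "A) Appendicitis  B) Cholecystitis  C) Diverticulitis  D) Pyelonephritis"
)

direct_prompt = f"{question}\nAnswer with the letter of the best option."

cot_prompt = (
    f"{question}\n"
    "Let's think step by step: summarize the key findings, rule out the "
    "alternatives, then answer with the letter of the best option."
)
# Either string would be sent as the user message to the model under evaluation.
```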
o1 Still Has A Hallucination Problem
o1 is susceptible to generating hallucinated responses, especially on text summarization tasks, where there is a 1.3% decrease in AlignScore compared to GPT-4.
And this is a big issue when using o1 as a real-world AI doctor.
Interestingly, Our Judgement Metrics Are Hugely Biased
The researchers note that the metrics used to evaluate LLMs produce inconsistent results across evaluations.
This is seen with:
This raises a question: are our current evaluation metrics reliable enough?
To get around this, an approach such as 'LLM-as-a-Judge' could be implemented using GPT-4, but this is not the best method for evaluating the more advanced o1 model.
Thus, we really need more robust evaluation metrics to assess state-of-the-art LLMs in complex scenarios.
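For readers unfamiliar with the idea, "LLM-as-a-Judge" simply means asking a strong model to grade another model's answer against a reference. A minimal sketch of what such a grader could look like is below; the judge model, prompt wording, and scoring scale are placeholder assumptions, not the paper's setup.

```python
# Minimal sketch of an 'LLM-as-a-Judge' grader. The judge model, prompt wording,
# and 1-5 scale are placeholder assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, answer: str) -> str:
    """Ask a judge model to grade a candidate answer against a reference."""
    prompt = (
        "You are grading a medical QA answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with a score from 1 (wrong) to 5 (fully correct) and one sentence "
        "of justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```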
o1 Is Also Painfully Slow
Although o1 surpasses the accuracy of other LLMs, it takes more than 2× and 9× longer to generate outputs compared to GPT-4 and GPT-3.5 on four medical tasks (13.18s for o1 vs. 6.89s for GPT-4 and 1.41s for GPT-3.5).
This is a weak point for o1 in terms of user experience in time-critical medical environments.
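For what it's worth, per-query latency figures like those above can be gathered with simple wall-clock timing around each API call. A hedged sketch (model name and prompt are placeholders; this is not the paper's measurement setup):

```python
# Hedged sketch of measuring per-query latency with wall-clock timing;
# not the paper's measurement setup. Model name and prompt are placeholders.
import time

from openai import OpenAI

client = OpenAI()

def timed_answer(model: str, prompt: str) -> tuple[str, float]:
    """Return the model's reply and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    return response.choices[0].message.content, elapsed

_, seconds = timed_answer("o1-preview", "List three red-flag symptoms of meningitis.")
print(f"Latency: {seconds:.2f}s")
```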
Coming Back To Our Original Question
Returning to the question that we started with —
“Is OpenAI’s o1 the AI doctor we’ve always been waiting for?”
The answer is — Yes!
The performance of o1 is absolutely remarkable in Medicine!
Although o1 needs further evaluation, especially in the Safety domain, we can pretty confidently say that we are getting closer to a competent and reliable AI doctor faster than we think.
I’m super excited about it!
What are your thoughts on this? Let me know in the comments below.
Further Reading