Is OpenAI’s o1 The AI Doctor We’ve Always Been Waiting For? (Surprisingly, Yes!)

OpenAI’s o1 is out, and its performance on STEM tasks is mind-bending!

Quoted from OpenAI’s research article titled ‘Learning to Reason with LLMs’:

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).

The model has been trained using reinforcement learning and uses a long internal Chain-of-Thought approach to think through the problem before generating an output.

Its performance scales incredibly with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).

OpenAI o1’s performance (American Invitational Mathematics Examination (AIME) accuracy) improves with both train and test-time compute (Image from the article titled ‘Learning to Reason with LLMs’ by OpenAI)

Whether it’s mathematics, competitive programming, or PhD-level questions in Physics, Chemistry, and Biology, it answers them all with a high degree of correctness.

Performance of o1 as compared to o1 preview and GPT-4o on different STEM benchmarks, where solid bars show Pass@1 accuracy and the shaded region represents the performance of majority vote/ consensus approach (Image from the article titled ‘Learning to Reason with LLMs’ by OpenAI)

And, its performance is substantially higher than the previous state-of-the-art GPT-4o.

Performance improvements of o1 over GPT-4o across different benchmarks (Image from the article titled ‘Learning to Reason with LLMs’ by OpenAI)

But what about Medicine?

Researchers answered precisely this question in a new preprint on arXiv.

o1 was evaluated over six tasks using data from 37 medical datasets, including challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine and The Lancet.

The results show that o1 surpasses GPT-4 and all other strong LLMs in accuracy, leading in most, though not all, of the evaluations.

Plot of average accuracy where o1 achieves the highest average accuracy of 74.3% across 19 medical datasets.

In this story, we take a deep dive into o1’s performance in the medical domain, its strengths and weaknesses, and how it can be further enhanced towards an early, promising AI-doctor candidate.

Let’s go!


How Was o1 Evaluated?

Aspects

Researchers assessed o1's performance in three essential areas of medicine that align with real-world clinician needs.

  • Understanding
  • Reasoning
  • Multilinguality

Prompting Strategies

To explore these areas, three prompting strategies were used (a minimal sketch of each follows this list):

  • Direct prompting — where LLMs were asked to solve specific problems directly
  • Chain-of-thought prompting — where LLMs were asked to think step-by-step before answering
  • Few-shot prompting — where LLMs are given several examples of question-and-answer pairs to learn from in the prompt
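
To make these concrete, here is a minimal Python sketch (my own illustration, not code from the paper) of how the three prompt styles might be built for a single medical multiple-choice question. The question text and the worked example are placeholders, not items from the evaluated datasets.

```python
# Minimal sketch of the three prompting strategies for one medical MCQ.
# The question and the few-shot example below are illustrative placeholders.

QUESTION = (
    "A 45-year-old patient presents with fatigue and microcytic anemia. "
    "Which initial test is most appropriate?\n"
    "A) Serum ferritin  B) Bone marrow biopsy  C) Vitamin B12 level  D) Reticulocyte count"
)


def direct_prompt(question: str) -> str:
    """Direct prompting: ask the model to answer outright."""
    return f"Answer the following medical question with a single option letter.\n\n{question}\nAnswer:"


def cot_prompt(question: str) -> str:
    """Chain-of-thought prompting: ask the model to reason step by step first."""
    return (
        "Answer the following medical question. Think step by step, "
        f"then give the final option letter.\n\n{question}\nLet's think step by step:"
    )


def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot prompting: prepend worked question-answer pairs to the prompt."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    demo_examples = [("Which vitamin deficiency causes scurvy?", "Vitamin C")]
    print(direct_prompt(QUESTION))
    print(cot_prompt(QUESTION))
    print(few_shot_prompt(QUESTION, demo_examples))
```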

Datasets

For the evaluation, 35 existing medicine-related datasets and two additional challenging datasets created from professional medical quizzes from the New England Journal of Medicine and The Lancet were used.

These datasets are grouped into different Tasks that examine specific capabilities of a model.

A table showing different Aspects and Tasks along with the Datasets and Metrics used in this research

Metrics

Five different metrics were employed (the two simplest are sketched in code after this list):

  • Accuracy: Measures the percentage of a model’s generated answers that exactly match the ground truth. It is used for multiple-choice questions and for question-answering tasks where the ground-truth answer is a single word or phrase.
  • F1-Score: The harmonic mean of Precision and Recall, used in tasks where a model must select multiple correct answers.
  • BLEU-1 and ROUGE-1: Measure the similarity between a model’s generated answer and the ground truth.
  • AlignScore: Measures the factual consistency (truthfulness) of a model’s generated answer.
  • MAUVE: Measures the gap between the distributions of model-generated and human-written text.
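
For a sense of what the first two metrics compute, here is a rough, self-contained sketch (my own, not the paper’s evaluation code); BLEU-1, ROUGE-1, AlignScore, and MAUVE need dedicated libraries or learned models and are left out.

```python
# Rough sketch of the two simplest metrics. BLEU-1/ROUGE-1, AlignScore and
# MAUVE require dedicated libraries or learned models and are omitted here.

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their ground-truth answer."""
    matches = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return matches / len(references)


def f1_score(predicted: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over a set of selected answers."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(accuracy(["A", "C", "B"], ["A", "B", "B"]))                 # 0.667
    print(f1_score({"aspirin", "heparin"}, {"aspirin", "warfarin"}))  # 0.5
```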

Models

Four other LLMs, including GPT-4, GPT-3.5, and Llama 3, were compared with the o1-preview-2024-09-12 model.

The complete evaluation pipeline is shown below.

The evaluation pipeline describing different aspects and tasks, prompting strategies, language models and evaluation criteria

And How Well Did o1 Perform?

Understanding Aspect

o1 beats all other models, including GPT-4 and GPT-3.5, in this aspect.

o1 covers a larger radius than other LLMs, reflecting stronger performance across 12 medical datasets.

On five concept recognition datasets, o1 outperforms GPT-4 and GPT-3.5 by an average of 7.6% and 26.6%, respectively (i.e., 72.6% vs 65.0% vs 46.0%), in terms of the F1 Score.

Notably, it shows a 24.5% average improvement on the BC4Chem dataset.

Average Accuracy and F1 scores on 4 tasks from the Understanding and Reasoning aspects

In text summarization tasks, o1’s ROUGE-1 score is 2.4% higher than GPT-4’s and 3.7% higher than GPT-3.5’s.

Average BLEU-1 and ROUGE-1 scores on 3 tasks from the Understanding and Reasoning aspects

Reasoning Aspect

For medical mathematical reasoning, o1 achieves 9.4% higher accuracy than GPT-4 on MedCalc-Bench.

What about real-world diagnostic situations?

On the newly constructed QA datasets NEJMQA and LancetQA, o1 outperforms GPT-4 and GPT-3.5 with accuracy improvements of 8.9% and 27.1%, respectively.

It also surpasses both GPT-4 and GPT-3.5 (with accuracy gains of 15.5% and 10%) in the AgentClinic benchmark, which evaluates complex reasoning scenarios with multi-turn conversations and medical environment simulations.

It is also noted that o1’s answers are more concise and straightforward than GPT-4’s, which often include long, hallucinated explanations for incorrect answers.

Example of an answer from o1 and GPT-4 on a question from LancetQA, where o1 provides a more concise and accurate reasoning process than GPT-4

Multilinguality Aspect

o1 beats other models in multilingual question-answering tasks with an average accuracy of 85.2%, compared to GPT-4’s 75.7% and GPT-3.5’s 54.1%.

Accuracy of models on multilingual task XMedBench

However, it falls short of GPT-4 by 1.6% (43.4% vs. 45.0%) on the Chinese agent benchmark AI Hospital in medical examination scenarios.

o1 also struggles with mixed language output generation in the medical setting.

It is thought that this could be due to the lack of multilingual CoT data during o1’s training.

Apart from these evaluation aspects, the research has many other interesting findings.

Let’s talk about them next.


There’s Still No Single Best Model

Although o1 generally outperforms other LLMs in most clinical decision tasks, no single model consistently performs best across all medical tasks.

It is seen that o1 lags behind GPT-4 by 5% in accuracy on the MIMIC4ED-Critical Triage dataset.

Interestingly, Llama 3 outperforms o1 by 20% in the PMC-Patient and PICO-Intervention datasets (96.0% vs. 76.4%).


Chain-of-Thought Prompting Still Improves o1

It is also seen that although o1 already uses an internal Chain-of-Thought approach, explicit Chain-of-Thought (CoT) prompting still improves its performance by 3.18%.

Surprisingly, other prompting strategies such as Self-Consistency and Reflex worsen o1’s accuracy on the LancetQA dataset by up to 24.5% compared to CoT prompting alone.

Accuracy surprisingly decreases with Self-Consistency and Reflex prompting
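
For context, Self-Consistency simply samples several answers for the same prompt and keeps the majority vote. A minimal sketch (assuming a generic `generate` callable that stands in for any model call, not the paper’s implementation) looks like this:

```python
# Minimal sketch of Self-Consistency: sample the model several times on the
# same prompt and return the majority answer. `generate` is a placeholder
# for any LLM call (not the paper's actual implementation).
from collections import Counter
from typing import Callable


def self_consistency(prompt: str, generate: Callable[[str], str], n_samples: int = 5) -> str:
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    import random
    fake_llm = lambda prompt: random.choice(["A", "A", "B"])  # stands in for a real model
    print(self_consistency("Which option is correct?", fake_llm))
```

The finding above suggests that this kind of extra sampling does not necessarily help a model that already reasons internally.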

o1 Still Has A Hallucination Problem

o1 is susceptible to generating hallucinated responses, especially on text summarization tasks, where there is a 1.3% decrease in AlignScore compared to GPT-4.

And this is a big issue when using o1 as a real-world AI doctor.

AlignScore and Mauve metrics for 3 tasks from the Understanding and Reasoning aspects

Interestingly, Our Judgement Metrics Are Hugely Biased

The researchers note that the different metrics used to evaluate LLMs rank the models inconsistently.

This is seen with:

  • o1 outperforming GPT-4 significantly in ROUGE-1 (24.4% vs. 17.2%) but surprisingly underperforming in BLEU-1 (15.3 vs. 16.2) in the clinical suggestion task.
  • o1 outperforming GPT-4 in BLEU-1 and ROUGE-1 in text summarization task but falling short by 2.9 points in Mauve.
  • Llama3 outperforming o1 in accuracy on two concept recognition datasets, but falling behind when evaluated using the F1 score on the same datasets.

This raises a question: are our current evaluation metrics reliable enough?

To get around this, an approach such as 'LLM-as-a-Judge' could be implemented using GPT-4, but this is not the best method for evaluating the more advanced o1 model.
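
As a rough illustration (my own sketch, not the paper’s setup), an LLM-as-a-Judge step with GPT-4 via the OpenAI Python SDK could look something like the following, where the rubric and the 1-to-5 scoring scale are assumptions:

```python
# Hypothetical LLM-as-a-Judge sketch: a judge model scores a candidate answer
# against a reference. The rubric and the 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_RUBRIC = (
    "You are a strict medical evaluator. Given a question, a reference answer, "
    "and a candidate answer, rate the candidate from 1 (wrong or harmful) to 5 "
    "(fully correct and clinically sound). Reply with the number only."
)


def judge(question: str, reference: str, candidate: str, judge_model: str = "gpt-4") -> int:
    """Ask the judge model to score a candidate answer against the reference."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": (
                f"Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}"
            )},
        ],
    )
    return int(response.choices[0].message.content.strip())
```

The catch is circularity: a GPT-4 judge is not an ideal referee for a model that is stronger than GPT-4 itself.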

Thus, we really need more robust evaluation metrics to assess state-of-the-art LLMs in complex scenarios.


o1 Is Also Painfully Slow

Although o1 surpasses the accuracy of other LLMs, it takes more than 2× and 9× longer to generate outputs compared to GPT-4 and GPT-3.5 on four medical tasks (13.18s for o1 vs. 6.89s for GPT-4 and 1.41s for GPT-3.5).

This is a weak point for o1 in terms of user experience in a time-critical medical environment.

Time cost and average number of tokens required for response generation for different LLMs

Coming Back To Our Original Question

Returning to the question that we started with —

“Is OpenAI’s o1 the AI doctor we’ve always been waiting for?”
The answer is — Yes!

The performance of o1 is absolutely remarkable in Medicine!

Although o1 needs further evaluation, especially in the Safety domain, we can pretty confidently say that we are getting closer to a competent and reliable AI doctor faster than we think.

I’m super excited about it!

What are your thoughts on this? Let me know in the comments below.


Further Reading


Subscribe to ‘Into AI’, my weekly newsletter where I help you explore Artificial Intelligence from the ground up by dissecting the original research papers.


