Is OpenAI’s o1 The AI Doctor We’ve Always Been Waiting For? (Surprisingly, Yes!)

OpenAI’s o1 is out, and its performance on STEM tasks is mind-bending!

Quoted from OpenAI’s research article titled ‘Learning to Reason with LLMs’:

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA).

The model has been trained using reinforcement learning and uses a long internal Chain-of-Thought approach to think through the problem before generating an output.

Its performance scales incredibly with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).

OpenAI o1’s performance (American Invitational Mathematics Examination (AIME) accuracy) improves with both train and test-time compute (Image from the article titled ‘Learning to Reason with LLMs’ by OpenAI)

Whether it’s mathematics, competitive programming, or PhD-level questions in Physics, Chemistry, and Biology, it answers them all with a high degree of correctness.

Performance of o1 as compared to o1 preview and GPT-4o on different STEM benchmarks, where solid bars show Pass@1 accuracy and the shaded region represents the performance of majority vote/ consensus approach (Image from the article titled ‘Learning to Reason with LLMs’ by OpenAI)

And, its performance is substantially higher than the previous state-of-the-art GPT-4o.

Performance improvements of o1 over GPT-4o across different benchmarks (Image from the article titled ‘Learning to Reason with LLMs’ by OpenAI)

But what about Medicine?

Researchers answered precisely this question in a new preprint on arXiv.

o1 was evaluated over six tasks using data from 37 medical datasets, including challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine and The Lancet.

The results show that o1 surpasses GPT-4 and all other strong LLMs in accuracy, leading in most, though not all, of the evaluations.

Plot of average accuracy where o1 achieves the highest average accuracy of 74.3% across 19 medical datasets.

In this story, we take a deep dive into o1’s performance in the medical domain, its strengths and weaknesses, and how it can be further enhanced towards an early, promising AI-doctor candidate.

Let’s go!


How Was o1 Evaluated?

Aspects

Researchers assessed o1's performance in three essential areas of medicine that align with real-world clinician needs.

  • Understanding
  • Reasoning
  • Multilinguality

Prompting Strategies

To explore these areas, three prompting strategies were used (a minimal sketch of each follows this list):

  • Direct prompting — where LLMs were asked to solve specific problems directly
  • Chain-of-thought prompting — where LLMs were asked to think step-by-step before answering
  • Few-shot prompting — where LLMs are given several examples of question-and-answer pairs to learn from in the prompt
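
To make these concrete, here is a minimal Python sketch (my own illustration, not code from the paper) of how the three prompt styles might be built for a single medical multiple-choice question. The question text and the worked example are placeholders, not items from the evaluated datasets.

```python
# Minimal sketch of the three prompting strategies for one medical MCQ.
# The question and the few-shot example below are illustrative placeholders.

QUESTION = (
    "A 45-year-old patient presents with fatigue and microcytic anemia. "
    "Which initial test is most appropriate?\n"
    "A) Serum ferritin  B) Bone marrow biopsy  C) Vitamin B12 level  D) Reticulocyte count"
)


def direct_prompt(question: str) -> str:
    """Direct prompting: ask the model to answer outright."""
    return f"Answer the following medical question with a single option letter.\n\n{question}\nAnswer:"


def cot_prompt(question: str) -> str:
    """Chain-of-thought prompting: ask the model to reason step by step first."""
    return (
        "Answer the following medical question. Think step by step, "
        f"then give the final option letter.\n\n{question}\nLet's think step by step:"
    )


def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot prompting: prepend worked question-answer pairs to the prompt."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    demo_examples = [("Which vitamin deficiency causes scurvy?", "Vitamin C")]
    print(direct_prompt(QUESTION))
    print(cot_prompt(QUESTION))
    print(few_shot_prompt(QUESTION, demo_examples))
```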

Datasets

For the evaluation, 35 existing medicine-related datasets and two additional challenging datasets created from professional medical quizzes from the New England Journal of Medicine and The Lancet were used.

These datasets are grouped into different Tasks that examine specific capabilities of a model.

A table showing different Aspects and Tasks along with the Datasets and Metrics used in this research

Metrics

Five different metrics were employed (the two simplest are sketched in code after this list):

  • Accuracy: Measures the percentage of a model’s generated answers that exactly match the ground truth. It is used for multiple-choice questions and for question-answering tasks where the ground-truth answer is a single word or phrase.
  • F1-Score: The harmonic mean of Precision and Recall, used in tasks where a model must select multiple correct answers.
  • BLEU-1 and ROUGE-1: Measure the similarity between a model’s generated answer and the ground truth.
  • AlignScore: Measures the factual consistency (truthfulness) of a model’s generated answer.
  • MAUVE: Measures the gap between the distributions of model-generated and human-written text.
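
For a sense of what the first two metrics compute, here is a rough, self-contained sketch (my own, not the paper’s evaluation code); BLEU-1, ROUGE-1, AlignScore, and MAUVE need dedicated libraries or learned models and are left out.

```python
# Rough sketch of the two simplest metrics. BLEU-1/ROUGE-1, AlignScore and
# MAUVE require dedicated libraries or learned models and are omitted here.

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their ground-truth answer."""
    matches = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return matches / len(references)


def f1_score(predicted: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over a set of selected answers."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(accuracy(["A", "C", "B"], ["A", "B", "B"]))                 # 0.667
    print(f1_score({"aspirin", "heparin"}, {"aspirin", "warfarin"}))  # 0.5
```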

Models

Four other LLMs, including GPT-4, GPT-3.5, and Llama 3, were compared with the o1-preview-2024-09-12 model.

The complete evaluation pipeline is shown below.

The evaluation pipeline describing different aspects and tasks, prompting strategies, language models and evaluation criteria

And How Well Did o1 Perform?

Understanding Aspect

o1 beats all other models, including GPT-4 and GPT-3.5, in this aspect.

o1 covers a larger radius than other LLMs, reflecting stronger performance across 12 medical datasets.

On five concept recognition datasets, o1 outperforms GPT-4 and GPT-3.5 by an average of 7.6% and 26.6%, respectively (i.e., 72.6% vs 65.0% vs 46.0%), in terms of the F1 Score.

Notably, it shows a 24.5% average improvement on the BC4Chem dataset.

Average Accuracy and F1 scores on 4 tasks from the Understanding and Reasoning aspects

In text summarization tasks, o1’s ROUGE-1 score is 2.4% higher than GPT-4’s and 3.7% higher than GPT-3.5’s.

Average BLEU-1 and ROUGE-1 scores on 3 tasks from the Understanding and Reasoning aspects

Reasoning Aspect

For medical mathematical reasoning, o1 achieves 9.4% higher accuracy than GPT-4 on MedCalc-Bench.

What about real-world diagnostic situations?

On the newly constructed QA datasets NEJMQA and LancetQA, o1 outperforms GPT-4 and GPT-3.5 with accuracy improvements of 8.9% and 27.1%, respectively.

It also surpasses both GPT-4 and GPT-3.5 (with accuracy gains of 15.5% and 10%) in the AgentClinic benchmark, which evaluates complex reasoning scenarios with multi-turn conversations and medical environment simulations.

It is also noted that o1’s answers are more concise and straightforward than GPT-4’s, which often include long, hallucinated explanations for incorrect answers.

Example of an answer from o1 and GPT-4 on a question from LancetQA, where o1 provides a more concise and accurate reasoning process than GPT-4

Multilinguality Aspect

o1 beats other models in multilingual question-answering tasks with an average accuracy of 85.2%, compared to GPT-4’s 75.7% and GPT-3.5’s 54.1%.

Accuracy of models on multilingual task XMedBench

However, it falls short of GPT-4 by 1.6% (43.4% vs. 45.0%) on the Chinese agent benchmark AI Hospital in medical examination scenarios.

o1 also struggles with mixed language output generation in the medical setting.

It is thought that this could be due to the lack of multilingual CoT data during o1’s training.

Apart from these evaluation aspects, the research has many other interesting findings.

Let’s talk about them next.


There’s Still No Single Best Model

Although o1 generally outperforms other LLMs in most clinical decision tasks, no single model consistently performs best across all medical tasks.

It is seen that o1 lags behind GPT-4 by 5% in accuracy on the MIMIC4ED-Critical Triage dataset.

Interestingly, Llama 3 outperforms o1 by 20% in the PMC-Patient and PICO-Intervention datasets (96.0% vs. 76.4%).


Chain-of-Thought Prompting Still Improves o1

It is also seen that although o1 already uses an internal Chain-of-Thought approach, explicit Chain-of-Thought (CoT) prompting still improves its performance by 3.18%.

Surprisingly, other prompting strategies such as Self-Consistency and Reflex worsen o1’s accuracy on the LancetQA dataset by up to 24.5% compared to CoT prompting alone.

Accuracy surprisingly decreases with Self-Consistency and Reflex prompting
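
For context, Self-Consistency simply samples several answers for the same prompt and keeps the majority vote. A minimal sketch (assuming a generic `generate` callable that stands in for any model call, not the paper’s implementation) looks like this:

```python
# Minimal sketch of Self-Consistency: sample the model several times on the
# same prompt and return the majority answer. `generate` is a placeholder
# for any LLM call (not the paper's actual implementation).
from collections import Counter
from typing import Callable


def self_consistency(prompt: str, generate: Callable[[str], str], n_samples: int = 5) -> str:
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    import random
    fake_llm = lambda prompt: random.choice(["A", "A", "B"])  # stands in for a real model
    print(self_consistency("Which option is correct?", fake_llm))
```

The finding above suggests that this kind of extra sampling does not necessarily help a model that already reasons internally.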

o1 Still Has A Hallucination Problem

o1 is susceptible to generating hallucinated responses, especially on text summarization tasks, where there is a 1.3% decrease in AlignScore compared to GPT-4.

And this is a big issue when using o1 as a real-world AI doctor.

AlignScore and Mauve metrics for 3 tasks from the Understanding and Reasoning aspects

Interestingly, Our Judgement Metrics Are Hugely Biased

The researchers note that the different metrics used to evaluate LLMs rank the models inconsistently.

This is seen with:

  • o1 outperforming GPT-4 significantly in ROUGE-1 (24.4% vs. 17.2%) but surprisingly underperforming in BLEU-1 (15.3 vs. 16.2) in the clinical suggestion task.
  • o1 outperforming GPT-4 in BLEU-1 and ROUGE-1 in text summarization task but falling short by 2.9 points in Mauve.
  • Llama3 outperforming o1 in accuracy on two concept recognition datasets, but falling behind when evaluated using the F1 score on the same datasets.

This raises a question: are our current evaluation metrics reliable enough?

To get around this, an approach such as 'LLM-as-a-Judge' could be implemented using GPT-4, but this is not the best method for evaluating the more advanced o1 model.
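
As a rough illustration (my own sketch, not the paper’s setup), an LLM-as-a-Judge step with GPT-4 via the OpenAI Python SDK could look something like the following, where the rubric and the 1-to-5 scoring scale are assumptions:

```python
# Hypothetical LLM-as-a-Judge sketch: a judge model scores a candidate answer
# against a reference. The rubric and the 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_RUBRIC = (
    "You are a strict medical evaluator. Given a question, a reference answer, "
    "and a candidate answer, rate the candidate from 1 (wrong or harmful) to 5 "
    "(fully correct and clinically sound). Reply with the number only."
)


def judge(question: str, reference: str, candidate: str, judge_model: str = "gpt-4") -> int:
    """Ask the judge model to score a candidate answer against the reference."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": (
                f"Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}"
            )},
        ],
    )
    return int(response.choices[0].message.content.strip())
```

The catch is circularity: a GPT-4 judge is not an ideal referee for a model that is stronger than GPT-4 itself.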

Thus, we really need more robust evaluation metrics to assess state-of-the-art LLMs in complex scenarios.


o1 Is Also Painfully Slow

Although o1 surpasses the accuracy of other LLMs, it takes more than 2× and 9× longer to generate outputs compared to GPT-4 and GPT-3.5 on four medical tasks (13.18s for o1 vs. 6.89s for GPT-4 and 1.41s for GPT-3.5).

This is a weak point for o1 in terms of user experience in a time-critical medical environment.

Time cost and average number of tokens required for response generation for different LLMs

Coming Back To Our Original Question

Returning to the question that we started with —

“Is OpenAI’s o1 the AI doctor we’ve always been waiting for?”
The answer is — Yes!

The performance of o1 is absolutely remarkable in Medicine!

Although o1 needs further evaluation, especially in the Safety domain, we can pretty confidently say that we are getting closer to a competent and reliable AI doctor faster than we think.

I’m super excited about it!

What are your thoughts on this? Let me know in the comments below.


Further Reading


Subscribe to ‘Into AI’, my weekly newsletter where I help you explore Artificial Intelligence from the ground up by dissecting the original research papers.


