USMLE score [Google AI (PaLM)]: 67% on 1,273 NBME Questions
Image Credit: https://arxiv.org/pdf/2212.13138.pdf


Should all my colleagues get scared or excited? Is this a great #USMLE study assistant or will AI take over medicine soon and put us all out of business? Not so fast!

I recall my wife's harrowing experience with these mind-bending USMLE prep questions. They were the objective benchmark we set to decide when she was ready to sit the main exams (Steps 1 & 2). Today, a model is simply thrown at those same questions. What a shame!

Healthcare's largest language models are here to stay [Med-PaLM from Google, GatorTron by NVIDIA and UF Health, ChatGPT by OpenAI, and PubMedGPT by Stanford University]. Why should we even care?

PaLM scores 67.6% on USMLE, the highest for any medical LM

While #Google AI's latest Med-PaLM model surgically shatters several glass ceilings (84% across six clinical MCQ topics, 57% on the Indian MCQ medical entrance exam, 79% on PubMedQA, plus short-answer and long-form answers and explanations), clinician evaluation reveals key gaps in PaLM's responses. The paper admits several critical improvements are necessary to make these models viable for real-world clinical applications.

The authors acknowledge that medical knowledge is vast in both quantity and quality. LLMs are capable of long, coherent, and complex generations, but they can also generate statements inconsistent with fact. In medical settings in particular, such failure modes need to be carefully vetted; in real-world applications, generations unlikely to be true should be withheld, deferring instead to other information sources or human experts.


The most shocking statement

... given the safety-critical requirements of the medical domain, we believe it is important to move beyond automated measures of long-form answer generation quality using metrics such as BLEU to those involving more nuanced human evaluation frameworks ...

I've been saying this for months! Instead of blurting out F1/BLEU/accuracy scores, the paper admits these automated metrics are flawed and fail to surface the model flaws that could impact patient safety. It introduces 12 thoughtful evaluation axes, such as likelihood (and extent) of harm, evidence of correct or incorrect reasoning, missing content, and agreement with scientific consensus. This is music to my ears!

The paper enlists a panel of 12 clinician reviewers from the US, UK, and India, plus 5 lay-person reviewers, to evaluate multiple models along these axes, demonstrating sobriety and real intention for clinical value, not just another PR stunt or fancy algorithm announcement.
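To make this concrete, the kind of rubric such a review panel produces can be sketched as a simple flag-rate aggregation. Everything below is a hypothetical simplification for illustration: the axis names, the boolean flagging scheme, and the `flag_rates` helper are invented, not the paper's actual instrument or statistics.

```python
from collections import defaultdict

# Hypothetical, problem-oriented stand-ins for a few of the paper's
# 12 evaluation axes (names invented for illustration).
AXES = ["disagrees_with_consensus", "incorrect_reasoning",
        "potential_harm", "missing_content"]

def flag_rates(ratings):
    """Fraction of answers a reviewer flagged on each axis.

    `ratings` maps answer_id -> {axis: bool}; True = problem flagged.
    """
    counts = defaultdict(int)
    for flags in ratings.values():
        for axis, flagged in flags.items():
            counts[axis] += int(flagged)
    n = len(ratings)
    return {axis: counts[axis] / n for axis in AXES}

# Toy example: three model answers reviewed on the four axes above.
ratings = {
    "a1": {"disagrees_with_consensus": False, "incorrect_reasoning": True,
           "potential_harm": False, "missing_content": True},
    "a2": {"disagrees_with_consensus": False, "incorrect_reasoning": False,
           "potential_harm": False, "missing_content": False},
    "a3": {"disagrees_with_consensus": True, "incorrect_reasoning": False,
           "potential_harm": True, "missing_content": True},
}
print(flag_rates(ratings))
```

The point of axes like these is exactly what the quote above argues: a BLEU score cannot tell you that one in three answers carries potential for harm, but a per-axis flag rate can.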

The rigorous clinical evaluation of model outputs suggests the Google health AI team is taking a more disciplined and grounded approach to healthcare disruption, a step in the right direction.


Model Size

Large language models are here to stay

Models like PaLM, ChatGPT, PubMedGPT, and GatorTron are enormous and will be non-trivial for most healthcare institutions to deploy if they ever become clinically useful. For context, at over half a trillion (540B) parameters, PaLM is roughly 5,000 times larger than a BERT-base model, practically impossible to integrate for a hospital that is still dealing with on-prem-to-cloud migration. As the GatorTron paper points out, these models provide only marginal gains on information-extraction tasks; they shine more in text generation and question-answering tasks that require extensive reasoning.
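A back-of-envelope calculation makes the deployment burden concrete. This sketch counts only the memory to hold fp16 weights, ignoring activations, KV caches, optimizer state, and serving overhead, so the real footprint is considerably larger:

```python
# Back-of-envelope: memory needed just to hold PaLM-sized weights.
PALM_PARAMS = 540e9   # PaLM: 540 billion parameters
BERT_PARAMS = 110e6   # BERT-base: ~110 million parameters
A100_GB = 80          # one NVIDIA A100 80GB GPU

def weight_memory_gb(n_params, bytes_per_param=2):
    """GB required to store the weights alone (fp16 = 2 bytes/param)."""
    return n_params * bytes_per_param / 1e9

palm_gb = weight_memory_gb(PALM_PARAMS)   # 1080.0 GB in fp16
gpus = -(-palm_gb // A100_GB)             # ceiling division: 14 GPUs minimum
ratio = PALM_PARAMS / BERT_PARAMS         # ~4900x BERT-base

print(f"{palm_gb:.0f} GB of fp16 weights, >= {gpus:.0f} A100-80GB GPUs, "
      f"{ratio:.0f}x BERT-base")
```

Over a terabyte of GPU memory before a single request is served; that is the gap between a research demo and something an average hospital IT department can run.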

Modeling Approach

The key modeling techniques that drove these new SOTA results include:

  • Scaling compute: Teams with deep pockets can train larger and larger models because of an exponential increase in distributed GPU compute, e.g. 992 A100 80GB GPUs from 124 NVIDIA DGX nodes using the NVIDIA SuperPOD reference cluster architecture
  • Scaling up model size: exponentially more parameters give models a higher capacity to learn, from BERT's ~100 million parameters to PaLM's 540 billion (even the comparatively small PubMedGPT weighs in at 2.7 billion)
  • Scaling data: NVIDIA's GatorTron was trained on over 90 billion words and Google AI's PaLM was trained on 780 billion tokens!
  • Prompting strategies: such as Few-shot prompting, Chain-of-Thought (CoT) prompting, and Instruction prompt tuning. Essentially, an instruction (text or soft prompt) is prepended to the input sequence allowing the model to adapt to various medical tasks.
  • Self-consistency: an improvement on CoT prompting that samples multiple chain-of-thought reasoning paths and picks the most frequent final answer, leading to significant improvements in medical reasoning
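The self-consistency idea from the last bullet fits in a few lines. In this sketch, `sample_cot_answer` is a hypothetical stand-in for a sampled (non-zero temperature) LLM call; the toy cycle of canned paths below simply simulates a model whose sampled reasoning mostly agrees:

```python
import itertools
from collections import Counter

def self_consistency(sample_cot_answer, question, n_paths=5):
    """Self-consistency decoding: sample several chain-of-thought
    completions and majority-vote the final answer.

    `sample_cot_answer` is a hypothetical stand-in for a sampled LLM
    call returning (reasoning_text, final_answer).
    """
    answers = [sample_cot_answer(question)[1] for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in "model": sampled reasoning paths that mostly agree on B.
_paths = itertools.cycle([
    ("reasoning path 1 ...", "B"),
    ("reasoning path 2 ...", "B"),
    ("reasoning path 3 ...", "C"),
])
fake_llm = lambda q: next(_paths)

print(self_consistency(fake_llm, "Which drug is first-line?", n_paths=3))  # -> B
```

The majority vote is what filters out the occasional stray reasoning path: a single bad chain of thought no longer decides the answer.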

References

  1. Med-PaLM paper: https://arxiv.org/pdf/2212.13138.pdf
  2. GatorTron paper: https://www.nature.com/articles/s41746-022-00742-2
  3. PubMedGPT paper: https://crfm.stanford.edu/2022/12/15/pubmedgpt.html

