USMLE score [Google AI (PaLM)]: 67% on 1,273 NBME Questions
Image Credit: https://arxiv.org/pdf/2212.13138.pdf


Should all my colleagues get scared or excited? Is this a great #USMLE study assistant or will AI take over medicine soon and put us all out of business? Not so fast!

I recall my wife's harrowing experience with these mind-bending USMLE prep questions. They were the objective benchmark we set to decide when she was ready to sit the main exams (Steps 1 & 2). Today, a model is simply thrown at those same questions. What a shame!

Healthcare's largest language models are here to stay [Med-PaLM from Google, GatorTron by NVIDIA and UF Health, ChatGPT by OpenAI, and PubMedGPT by Stanford University]. Why should we even care?

PaLM scores 67.6% on USMLE, the highest for any medical LM

While #Google AI's latest Med-PaLM model surgically shatters several glass ceilings (84% across six clinical MCQ topics, 57% on the Indian MCQ medical entrance exam, 79% on PubMedQA, plus short-answer and long-form answers and explanations), clinician evaluation reveals key gaps in PaLM's responses. The paper admits several critical improvements are necessary to make these models viable for real-world clinical applications.

The authors acknowledge that medical knowledge is vast in both quantity and quality. LLMs are capable of long, coherent, and complex generations, but they can also generate statements inconsistent with fact. In medical settings in particular, such failure modes need to be carefully vetted; in real-world applications, generations unlikely to be true should be withheld, deferring instead to other information sources or human experts.


The most shocking statement

... given the safety-critical requirements of the medical domain, we believe it is important to move beyond automated measures of long-form answer generation quality using metrics such as BLEU to those involving more nuanced human evaluation frameworks ...

I've been saying this for months! Instead of blurting out F1/BLEU/accuracy scores, the paper admits these automated metrics are flawed and fail to surface the model flaws that could impact patient safety. It introduces 12 thoughtful evaluation axes, such as likelihood (and extent) of harm, evidence of correct or incorrect reasoning, missing content, and agreement with scientific consensus. This is music to my ears!

The paper enlists a panel of 12 clinician reviewers from the US, UK, and India, plus 5 lay-person reviewers, to evaluate multiple models along these axes, demonstrating sobriety and real intention for clinical value, not just another PR stunt or fancy algorithm announcement.
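To make this concrete, the kind of rubric such a review panel produces can be sketched as a simple flag-rate aggregation. Everything below is a hypothetical simplification for illustration: the axis names, the boolean flagging scheme, and the `flag_rates` helper are invented, not the paper's actual instrument or statistics.

```python
from collections import defaultdict

# Hypothetical, problem-oriented stand-ins for a few of the paper's
# 12 evaluation axes (names invented for illustration).
AXES = ["disagrees_with_consensus", "incorrect_reasoning",
        "potential_harm", "missing_content"]

def flag_rates(ratings):
    """Fraction of answers a reviewer flagged on each axis.

    `ratings` maps answer_id -> {axis: bool}; True = problem flagged.
    """
    counts = defaultdict(int)
    for flags in ratings.values():
        for axis, flagged in flags.items():
            counts[axis] += int(flagged)
    n = len(ratings)
    return {axis: counts[axis] / n for axis in AXES}

# Toy example: three model answers reviewed on the four axes above.
ratings = {
    "a1": {"disagrees_with_consensus": False, "incorrect_reasoning": True,
           "potential_harm": False, "missing_content": True},
    "a2": {"disagrees_with_consensus": False, "incorrect_reasoning": False,
           "potential_harm": False, "missing_content": False},
    "a3": {"disagrees_with_consensus": True, "incorrect_reasoning": False,
           "potential_harm": True, "missing_content": True},
}
print(flag_rates(ratings))
```

The point of axes like these is exactly what the quote above argues: a BLEU score cannot tell you that one in three answers carries potential for harm, but a per-axis flag rate can.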

The rigorous clinical evaluation of model outputs suggests the Google health AI team is taking a more disciplined and grounded approach to healthcare disruption, a step in the right direction.


Model Size

Large language models are here to stay

Models like PaLM, ChatGPT, PubMedGPT, and GatorTron are enormous and will be non-trivial for most healthcare institutions to deploy if they ever become clinically useful. For context, at over half a trillion (540B) parameters, PaLM is roughly 5,000 times larger than a BERT-base model, practically impossible to integrate for a hospital that is still dealing with on-prem-to-cloud migration. As the GatorTron paper points out, these models provide only marginal gains on information-extraction tasks; they shine more in text generation and question-answering tasks that require extensive reasoning.
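A back-of-envelope calculation makes the deployment burden concrete. This sketch counts only the memory to hold fp16 weights, ignoring activations, KV caches, optimizer state, and serving overhead, so the real footprint is considerably larger:

```python
# Back-of-envelope: memory needed just to hold PaLM-sized weights.
PALM_PARAMS = 540e9   # PaLM: 540 billion parameters
BERT_PARAMS = 110e6   # BERT-base: ~110 million parameters
A100_GB = 80          # one NVIDIA A100 80GB GPU

def weight_memory_gb(n_params, bytes_per_param=2):
    """GB required to store the weights alone (fp16 = 2 bytes/param)."""
    return n_params * bytes_per_param / 1e9

palm_gb = weight_memory_gb(PALM_PARAMS)   # 1080.0 GB in fp16
gpus = -(-palm_gb // A100_GB)             # ceiling division: 14 GPUs minimum
ratio = PALM_PARAMS / BERT_PARAMS         # ~4900x BERT-base

print(f"{palm_gb:.0f} GB of fp16 weights, >= {gpus:.0f} A100-80GB GPUs, "
      f"{ratio:.0f}x BERT-base")
```

Over a terabyte of GPU memory before a single request is served; that is the gap between a research demo and something an average hospital IT department can run.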

Modeling Approach

The key modeling techniques that drove these new SOTA results include:

  • Scaling compute: Teams with deep pockets can train larger and larger models because of an exponential increase in distributed GPU compute, e.g. 992 A100 80GB GPUs from 124 NVIDIA DGX nodes using the NVIDIA SuperPOD reference cluster architecture
  • Scaling up model size: exponentially more parameters give models a higher capacity to learn, from BERT's ~100 million parameters to PaLM's 540 billion (even the comparatively small PubMedGPT weighs in at 2.7 billion)
  • Scaling data: NVIDIA's GatorTron was trained on over 90 billion words and Google AI's PaLM was trained on 780 billion tokens!
  • Prompting strategies: such as Few-shot prompting, Chain-of-Thought (CoT) prompting, and Instruction prompt tuning. Essentially, an instruction (text or soft prompt) is prepended to the input sequence allowing the model to adapt to various medical tasks.
  • Self-consistency: an improvement on CoT prompting that samples multiple chain-of-thought reasoning paths and picks the most frequent final answer, leading to significant improvements in medical reasoning
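The self-consistency idea from the last bullet fits in a few lines. In this sketch, `sample_cot_answer` is a hypothetical stand-in for a sampled (non-zero temperature) LLM call; the toy cycle of canned paths below simply simulates a model whose sampled reasoning mostly agrees:

```python
import itertools
from collections import Counter

def self_consistency(sample_cot_answer, question, n_paths=5):
    """Self-consistency decoding: sample several chain-of-thought
    completions and majority-vote the final answer.

    `sample_cot_answer` is a hypothetical stand-in for a sampled LLM
    call returning (reasoning_text, final_answer).
    """
    answers = [sample_cot_answer(question)[1] for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in "model": sampled reasoning paths that mostly agree on B.
_paths = itertools.cycle([
    ("reasoning path 1 ...", "B"),
    ("reasoning path 2 ...", "B"),
    ("reasoning path 3 ...", "C"),
])
fake_llm = lambda q: next(_paths)

print(self_consistency(fake_llm, "Which drug is first-line?", n_paths=3))  # -> B
```

The majority vote is what filters out the occasional stray reasoning path: a single bad chain of thought no longer decides the answer.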

References

  1. Med-PaLM paper: https://arxiv.org/pdf/2212.13138.pdf
  2. GatorTron paper: https://www.nature.com/articles/s41746-022-00742-2
  3. PubMedGPT paper: https://crfm.stanford.edu/2022/12/15/pubmedgpt.html

