USMLE score [Google AI (PaLM)]: 67% on 1,273 NBME Questions
Tobi Olatunji MD
Should all my colleagues get scared or excited? Is this a great #USMLE study assistant or will AI take over medicine soon and put us all out of business? Not so fast!
I recall my wife’s harrowing experience with these mind-bending USMLE prep questions. They were the objective benchmark we set to determine when she was ready to take the main exams (Steps 1 & 2). Today, a model is thrown at those same questions. What a shame!
Healthcare's largest language models are here to stay [Med-PaLM from Google, GatorTron by NVIDIA and UF Health, ChatGPT by OpenAI, and PubMedGPT by Stanford]. Why should we even care?
While #Google AI's latest Med-PaLM model surgically shatters several glass ceilings (84% on six clinical MCQ topics, 57% on the Indian medical entrance exam MCQs, 79% on PubMedQA, plus SAQ/long-form answers and explanations), clinician evaluation reveals key gaps in PaLM's responses. The paper admits several critical improvements are necessary to make these models viable for real-world clinical applications.
The authors acknowledge that medical knowledge is vast in both quantity and quality. LLMs are capable of long, coherent, and complex generations, but they can also generate statements inconsistent with fact. In medical settings in particular, such failure modes need to be carefully vetted, and in real-world applications, generations unlikely to be true should be withheld, deferring instead to other information sources or experts when needed.
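That last point, withholding low-confidence generations and deferring to experts, is worth making concrete. Below is a minimal sketch of such a deferral gate; everything in it (the `answer_with_confidence` stand-in, the threshold value) is my own hypothetical illustration, not code from the paper.

```python
# Minimal sketch of "withhold and defer" for low-confidence answers.
# Hypothetical throughout: `answer_with_confidence` stands in for any
# medical QA model that returns an answer plus a confidence estimate.

CONFIDENCE_THRESHOLD = 0.85  # assumed safety bar; would need clinical tuning

def answer_with_confidence(question: str) -> tuple[str, float]:
    """Placeholder for a QA model returning (answer, confidence)."""
    return "Aspirin is contraindicated in this patient.", 0.62

def safe_answer(question: str) -> str:
    answer, confidence = answer_with_confidence(question)
    if confidence < CONFIDENCE_THRESHOLD:
        # Withhold the generation and defer to an expert or trusted source.
        return "Not confident enough to answer; deferring to a clinician."
    return answer

print(safe_answer("Should this febrile 8-year-old receive aspirin?"))
```

The hard part in practice is the threshold: calibrating a model's stated confidence against the actual likelihood of harm is itself an open research problem.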
The most shocking statement
... given the safety-critical requirements of the medical domain, we believe it is important to move beyond automated measures of long-form answer generation quality using metrics such as BLEU to those involving more nuanced human evaluation frameworks ...
I've been saying this for months! Instead of blurting out F1/BLEU/accuracy scores, the paper admits these automated metrics are flawed and do not reveal the important model failure modes that could impact patient safety. It introduces 12 thoughtful evaluation axes, such as likelihood (and extent) of harm, evidence of correct/incorrect reasoning, missing content, and agreement with scientific consensus. This is music to my ears!
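To see why the paper is right about BLEU, here is a toy illustration of my own (not from the paper): a generated answer that drops a single word and reverses the clinical recommendation still shares almost every n-gram with the reference, so BLEU rewards it.

```python
from nltk.translate.bleu_score import sentence_bleu

# Reference answer vs. a generation that drops one word ("not") and
# thereby reverses the clinical recommendation entirely.
reference = "do not give aspirin to children with a viral illness".split()
hypothesis = "give aspirin to children with a viral illness".split()

score = sentence_bleu([reference], hypothesis)
print(f"BLEU: {score:.2f}")  # high score despite dangerous advice
```

The hypothesis scores roughly 0.78 BLEU (only the brevity penalty docks it) while being actively dangerous advice; a human rater scoring "likelihood of harm" would catch it instantly.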
The paper enlists a panel of 12 clinician reviewers from the US, UK, and India, and 5 lay-person reviewers to evaluate several models along these dimensions, demonstrating sobriety and a real intention to deliver clinical value, not just another PR stunt or fancy algorithm announcement.
The rigorous clinical evaluation of model outputs suggests the Google Health AI team is taking a more disciplined and grounded approach to healthcare disruption, a step in the right direction.
Model Size
Models like PaLM, ChatGPT, PubMedGPT, and GatorTron are insanely huge and will be non-trivial for most healthcare institutions to deploy if they ever become clinically useful. For context, at half a trillion (540B) parameters, PaLM is roughly 5,000 times larger than a BERT-base model, practically impossible to integrate for a hospital that’s still dealing with on-prem-to-cloud migration. As the GatorTron paper points out, these massive models provide only marginal gains on information extraction tasks; they shine more in text generation and question-answering tasks that require extensive reasoning.
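For a rough sense of scale, here is a back-of-the-envelope comparison (my own arithmetic, assuming 2-byte fp16 weights and BERT-base at ~110M parameters):

```python
# Back-of-the-envelope weight-memory comparison (fp16 = 2 bytes/parameter).
BYTES_PER_PARAM = 2  # assumes fp16/bf16 inference weights

models = {
    "BERT-base": 110e6,   # ~110M parameters
    "PaLM":      540e9,   # 540B parameters
}

for name, n_params in models.items():
    gb = n_params * BYTES_PER_PARAM / 1e9
    print(f"{name:>9}: {n_params / 1e9:7.2f}B params -> ~{gb:,.2f} GB of weights")

print(f"Size ratio: {540e9 / 110e6:,.0f}x")
```

That is over a terabyte of raw weights before activations, serving redundancy, or anything else: a multi-accelerator cluster, not something a hospital midway through a cloud migration can stand up.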
Modeling Approach
The key modeling techniques that drove these new SOTA results include:
- Few-shot prompting: conditioning the frozen LLM on a handful of worked examples placed in the prompt.
- Chain-of-thought (CoT) prompting: prompting the model to spell out step-by-step reasoning before committing to an answer.
- Self-consistency: sampling multiple reasoning chains and taking a majority vote over their final answers (a toy sketch follows below).
- Instruction prompt tuning: learning a small set of soft prompt vectors from clinician-curated exemplars while keeping the base model frozen; this is the step that adapts Flan-PaLM into Med-PaLM.
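Here is a minimal sketch of the self-consistency idea. The `sample_answer` function is a hypothetical stand-in for drawing one chain-of-thought completion from the LLM at non-zero temperature; nothing here is Google's actual code.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for one sampled chain-of-thought completion.

    In the real setup, each call would sample a full reasoning chain from
    the LLM at non-zero temperature and return only its final answer.
    """
    return random.choice(["A", "A", "A", "B", "C"])  # fake answer distribution

def self_consistency(question: str, n_samples: int = 11) -> str:
    """Sample several reasoning paths and majority-vote the final answers."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency("A 45-year-old presents with ... Which drug is indicated?"))
```

The intuition is that independently sampled reasoning chains rarely all make the same mistake, so the modal answer tends to be more reliable than a single greedy decode.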