“Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data.” We are proud to share new research in JAMA by multiple CHAI members and our very own Head of Policy, Lucy Orr-Ewing. Their systematic review sheds light on the current state of large language model (LLM) evaluations in healthcare. This research emphasizes the need for certification frameworks that mitigate potential harm from algorithms to marginalized communities. At CHAI, we're dedicated to addressing these gaps and developing robust guidelines for responsible AI implementation in health. Thank you to Suhana Bedi, Yutong Liu, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason Fries, Michael Wornow, Akshay Swaminathan, Lisa Lehmann, Mehr Kashyap, Akash Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, Michael Pfeffer, H. Christy Hong, MD MBA, and Nigam Shah for your contributions.
Our paper got published in JAMA! Earlier this year, Suhana Bedi, Yutong Liu, and I led a paper at Stanford University School of Medicine that highlights critical gaps in evaluating large language models (LLMs) in healthcare. We categorized all 519 relevant studies from 1 Jan 2022 to 19 Feb 2024 by (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) task, (4) dimension of evaluation, and (5) medical specialty. In doing so, we revealed:
- Only 5% of studies used real patient care data in their testing and evaluation.
- Key tasks like prescription writing and clinical summarization are underexplored.
- Accuracy dominates as the evaluation focus, while vital dimensions like fairness, bias, and toxicity remain largely neglected.
- Only 1 study assessed the financial impact of LLMs in healthcare.

Why does this matter?
- Real patient care data captures the complexities of clinical practice, so a thorough evaluation of LLM performance should mirror clinical conditions as closely as possible to truly determine effectiveness.
- Many high-value administrative tasks in health care are labor intensive, require manual input, and contribute to physician burnout, yet they remain chronically understudied.
- Only 15.8% of studies conducted any evaluation of how factors such as race and ethnicity, gender, or age affect bias in the model's output. Future research should place greater emphasis on fairness, bias, and toxicity evaluations if we want to stop LLMs from perpetuating bias.
- Future evaluations must estimate total implementation costs, including model operation, monitoring, maintenance, and infrastructure adjustments, before reallocating resources from other health care initiatives.

The paper calls for standardized evaluation metrics, broader coverage of healthcare applications, and the use of real patient care data to ensure safe and equitable AI integration. This is essential for the responsible adoption of LLMs in healthcare to truly improve patient care. And I am delighted that I get to work on implementing the findings of this research at the Coalition for Health AI (CHAI).

This paper could not have happened without Nigam Shah's constant support, leadership, and guidance, and that of our co-authors Dev Dash, Sanmi Koyejo, Alison Callahan, Jason Fries, Michael Wornow, Akshay Swaminathan, Lisa Lehmann, H. Christy Hong, MD MBA, Mehr Kashyap, Akash Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, and Michael Pfeffer. Thank you also to Nicholas Chedid, MD, MBA, Brian Anderson, MD, and Justin Norden, MD, MBA, MPhil, for your guidance and mentorship. And of course, a huge shout-out to my co-conspirators Yutong Liu and Suhana Bedi: you are the best team. This is the first paper I've ever written, and I'm eternally grateful to you all for showing me how it's done.

Full article here: https://lnkd.in/eimh9BNV