ChatGPT Outperforms All Human Doctors In Stanford Study

In a recent study at Stanford University, ChatGPT outperformed all human doctors when assessing medical case histories. The objective of the study was to assess whether using LLMs could improve diagnostic reasoning performance among physicians in family medicine, internal medicine, or emergency medicine compared with conventional resources. ChatGPT achieved an average score of 90% when diagnosing a medical condition from a case report and explaining its reasoning. The study, titled "Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial," was published in JAMA Network Open on October 28, 2024.

Highlights

  • Using ChatGPT did NOT help doctors improve diagnostic accuracy or reasoning.
  • The AI alone outperformed ALL of the humans.
  • Even doctors who could see the AI's diagnoses and reasoning did not outperform the AI alone.
  • The doctors often didn't listen to the AI when it told them things they didn't agree with.
  • The study revealed that many doctors did not understand how to use ChatGPT and were treating it as a search engine.

“They were treating ChatGPT like a search engine for directed questions: ‘Is cirrhosis a risk factor for cancer? What are possible diagnoses for eye pain?’ Only a fraction of the doctors realized they could literally copy-paste in the entire case history into the chatbot and just ask it to give a comprehensive answer to the entire question. Only a fraction of doctors actually saw the surprisingly smart and comprehensive answers the chatbot was capable of producing.”
Jonathan Chen, study author, physician data scientist, Stanford
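
To make the quote above concrete, here is a minimal sketch (not part of the study, which used the ChatGPT web interface) contrasting a directed, search-engine-style question with pasting a full case history and asking for a comprehensive answer. It assumes the OpenAI Python client, a "gpt-4" model name, and a hypothetical case_history.txt file.

    # Illustrative sketch only: directed question vs. full-case prompt.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt: str) -> str:
        """Send a single user prompt and return the model's reply."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Search-engine style: a narrow, directed question.
    print(ask("Is cirrhosis a risk factor for cancer?"))

    # Full-case style: paste the entire case history and ask for a
    # comprehensive, reasoned differential diagnosis.
    case_history = open("case_history.txt").read()  # hypothetical case file
    print(ask(
        "Here is a complete case history:\n\n"
        f"{case_history}\n\n"
        "List the three most likely diagnoses, the findings that support or "
        "argue against each, and the next diagnostic steps you would order."
    ))

The second prompt is the approach only a fraction of the participating doctors tried; the first reflects the directed, search-engine-style use the study authors observed.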

Study Overview

  • The study was conducted from November 29 to December 29, 2023.
  • The study included 50 doctors from several large American health systems.
  • Participants were given 6 case histories based on real patients.
  • Participants were allocated 60 minutes to review the 6 case histories.
  • They used cases that had never been published, to ensure that neither the participating physicians nor ChatGPT would already be familiar with them.
  • The study was not designed to comprehensively assess a participant’s knowledge, but rather to evaluate their general clinical reasoning across a set of cases.
  • To maximize coverage, the authors deliberately selected cases spanning a broad, clinically relevant cross-section of disciplines and problems.
  • Participants were given access to ChatGPT without explicit training in prompt engineering techniques that could have improved the quality of their interactions with the system. This is consistent with how such tools are currently deployed and was appropriate for this type of evaluation.
  • All of the physicians in the ChatGPT arm tried to use the system, but they were not required to use it in any consistent way. This design was deliberate, to reflect the clinical practice setting.


Image source: JAMA

“The doctors didn't listen to AI when AI told them things they didn’t agree with.”
Adam Rodman, study author, Beth Israel Deaconess Medical Center

Study Results

  • ChatGPT had an average score of 90% when diagnosing a medical condition from a case report and explaining its reasoning.
  • Doctors randomly assigned to use ChatGPT had an average score of 76%.
  • Doctors randomly assigned not to use ChatGPT had an average score of 74%.
  • In this trial, the availability of ChatGPT to physicians as a diagnostic aid did not significantly improve clinical reasoning compared with conventional resources.
  • ChatGPT alone demonstrated higher performance than both physician groups.


Image source: JAMA

"Provocative result we did NOT expect. We fully expected the Doctor + GPT4 arm to do better than Doctor + "conventional" Internet resources. Flies in the face of the Fundamental Theorem of Informatics (Human + Computer is Better than Either Alone)."
Jonathan Chen, study author, physician data scientist, Stanford

Conclusions

  1. The study demonstrated that while some doctors are familiar with AI, most don't understand how to take advantage of AI to solve complex diagnostic problems and offer explanations for their diagnoses.
  2. The authors indicated that we need to train medical students and retrain an entire generation of doctors to realize the potential of using AI in clinical practice.
  3. The authors indicated that training clinicians in best prompting practices may improve physician performance with LLMs.
  4. Alternatively, organizations could invest in predefined prompting for diagnostic decision support integrated into clinical workflows and documentation, enabling synergy between the tools and clinicians; a minimal sketch of such a predefined prompt follows below.
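
As a rough illustration of point 4, a predefined prompt could be embedded in the clinical workflow so that individual clinicians never have to craft prompts themselves. The template and field names below (chief complaint, history, exam, labs) are hypothetical and are not taken from the study or any particular EHR schema.

    # Hypothetical predefined prompt template for diagnostic decision support.
    DIAGNOSTIC_PROMPT = """You are assisting a physician with diagnostic reasoning.

    Chief complaint: {chief_complaint}
    History of present illness: {history}
    Physical exam: {exam}
    Laboratory and imaging results: {labs}

    List the three most likely diagnoses. For each, cite the findings that
    support it, the findings that argue against it, and the next step that
    would best confirm or exclude it."""

    def build_prompt(chief_complaint: str, history: str, exam: str, labs: str) -> str:
        """Fill the predefined template with fields pulled from the chart."""
        return DIAGNOSTIC_PROMPT.format(
            chief_complaint=chief_complaint,
            history=history,
            exam=exam,
            labs=labs,
        )

Because the prompt is fixed, only the case data changes from patient to patient, which is one way to deliver consistent, well-formed prompts without retraining every clinician in prompt engineering.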

References

Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open. Published October 28, 2024. doi:10.1001/jamanetworkopen.2024.40969

Authors: Ethan Goh, Robert Gallo, Jason Hom, Eric Strong, Hannah Kerman, Joséphine Cool, Zahir Kanjee, Andrew S. Parsons, Neera Ahuja, Eric Horvitz, Yingjie Weng, Daniel Yang, Arnold Milstein, Andrew Olson, Adam Rodman, Jonathan H. Chen

Subscribe, Comment, Join Group

I'm interested in your feedback - please leave your comments.

To subscribe to the AI in Healthcare Milestones newsletter click here.

To join the AI in Healthcare Milestones Group click here.

Copyright © 2024 Margaretta Colangelo. All Rights Reserved.

This article was written by Margaretta Colangelo. Margaretta is a leading AI analyst who tracks significant milestones in AI in healthcare. She consults with AI healthcare companies and writes about some of the companies she consults with. Margaretta serves on the advisory board of the AI Precision Health Institute at the University of Hawaiʻi Cancer Center @realmargaretta

Hakan ünal

Senior Frontend Developer

2 months

Yes, absolutely. Using ChatGPT or low-code derivatives effectively boosts performance. You can have it single-handedly do work that software teams of 20-30 people might not manage even after months of effort.

Robin Blackstone, MD

Health 4.0 Architect | AI & Healthcare Policy Leader | Independent Board Director | Board Certified Corporate Executive Surgeon - AI, Obesity & Oncology | Private Family Office | US Army Veteran

3 months

Just received a “survey” from the AMA. As I made my way through it, it felt like it was framed with a bias against AI that would lead to very cautionary conclusions. It felt self-serving.

Dr. Tamanna A.

Co-Founder @ Centre of Bioinformatics Research and Technology | specializing in Bioinformatics

3 months

Impressive update! Shows how AI can transform healthcare when used effectively.

Cristobal Thompson

Coach y Mentor Ejecutivo de Lideres

3 months

Thanks for sharing Margaretta Colangelo

Luca Bogoni

Founder & Principal | HealthTech Executive, Radiology, Medical Imaging & Devices, Strategic Innovations, AI, Regulatory

3 months

Thanks, Margaretta, for sharing this study. Over the past 20 years, in the many studies we conducted, physician + assistive tool was shown to be more effective than physician alone when the physician was trained to use and interpret the information the tool provided. Most of the time, the physician + tool combination showed a significant improvement. The training also covered the types of false positives the device could produce. As noted, the current study, among its various endpoints, succeeded in demonstrating that effectiveness and efficiency require some training in using the tool. The expectation, perhaps, was that given LLM capabilities, users would be able to take advantage of them “out of the box.” It would be great to see follow-up studies in which physicians are trained to use the tool properly and effectively, perhaps exploring different diagnostic paradigms, such as reader-first vs. reader-second, compared with standalone performance.
