The Rise of AI in Medical Diagnosis: Redefining Expertise
Two recent papers demonstrate that new large language models (LLMs) from OpenAI, GPT-4 and o1-preview, are capable of superior clinical reasoning, resulting in diagnostic performance exceeding that of human physicians. These findings, reviewed below, indicate that medical diagnosis is becoming a collaborative effort of medical personnel and AI. This is important because diagnostic errors—failures to establish an accurate and timely explanation of a patient’s health problem or to communicate that explanation to the patient—are a serious problem nationally and globally. Estimates put the rate of error at 4.3 to 15 percent, with roughly half being potentially harmful [13, 14, 15, 16].
Because AI systems could contribute to major improvements in diagnosis and treatment, including a reduction in errors, more timely diagnoses, and the extension of diagnostic resources to far more people, it is essential to proactively recognize these capabilities and plan for their evolution in order to maximize their social benefits.
Goh et al., in "Large Language Model Influence on Diagnostic Reasoning" [1] and Brodeur et al. in "Superhuman performance of a large language model on the reasoning tasks of a physician" [2] evaluate the reasoning abilities of LLMs, not simply the diagnostic results of using them. This reflects the fact that "clinical practice requires real-time complex multi-step reasoning processes, constant adjustments based on new data from multiple sources, iteratively refining differential diagnoses and management plans, and making consequential treatment decisions under uncertainty" [2]. And it anticipates a future in which physicians and other diagnosticians collaborate with LLMs throughout clinical procedures, making it essential that LLMs follow and actively contribute to the development of diagnoses and revise provisional assessments in light of new information.
Evaluating Physician-AI Collaboration: ChatGPT’s Diagnostic Skills Surpass Human Experts
Goh et al.'s study used GPT-4 [6] to assess two key aspects of an LLM's potential role in diagnosis: whether access to the LLM improves physicians' diagnostic reasoning compared with conventional resources alone, and how well the LLM performs when it works through cases on its own.
Physician participants were randomized to either access the LLM in addition to conventional diagnostic resources or to use conventional resources only. They were allocated 60 minutes to review up to 6 clinical vignettes and, for each case, were required to propose differential diagnoses, identify findings supporting and opposing each, rank the diagnoses by likelihood, and recommend next diagnostic steps.
This framework was designed to promote the practice of deliberate reflection, which has been found to improve physicians' diagnostic performance, particularly in complex diagnostic tasks [3].
Consistent with this, the study's primary outcome was performance on differential diagnosis accuracy, appropriateness of supporting and opposing factors, and next diagnostic evaluation steps. Secondary outcomes included time spent per case and final diagnosis accuracy.
ChatGPT (GPT-4) was tested using the same structured framework as the physician participants. Clinical vignettes were provided to the model with structured prompts designed to elicit diagnostic reasoning aligned with the study's framework. The authors were careful to use cases that were not included in ChatGPT's pretraining: "The cases have never been publicly released to protect the validity of the test materials for future use and therefore are excluded from training data of the LLM" [1].
ChatGPT was asked to list differential diagnoses, identify supporting and opposing findings, rank diagnoses by likelihood, and propose next diagnostic steps. Its responses were evaluated using the same scoring rubric applied to the physicians, allowing for direct comparison of performance on diagnostic reasoning and deliberate reflection.
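To make this concrete, a structured elicitation of this kind can be scripted against a chat-completion API. The sketch below is a minimal illustration, assuming the OpenAI Python client and a placeholder vignette; it is not the prompt or pipeline used in the study.

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder vignette; the study's cases have never been publicly released.
vignette = "A 54-year-old presents with fever, weight loss, and a new heart murmur..."

prompt = (
    "You are working through a clinical vignette using deliberate reflection.\n"
    "1. List the top three differential diagnoses.\n"
    "2. For each, identify findings that support it and findings that oppose it.\n"
    "3. Rank the diagnoses by likelihood and name the most likely one.\n"
    "4. Propose the next diagnostic steps.\n\n"
    f"Case:\n{vignette}"
)

response = client.chat.completions.create(
    model="gpt-4",  # the model family evaluated by Goh et al.
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```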
Results
The results were unexpected and remarkable. The median diagnostic reasoning score per case was 92% for ChatGPT working alone, 76% for physicians with access to the LLM, and 74% for physicians using conventional resources only.
ChatGPT scored 18 percentage points higher than did physicians using conventional resources (the control group in this study) and 16 percentage points better than the physician + LLM group. And while the small score difference between the two physician groups was not statistically significant, the difference between the control group and ChatGPT was. As Jonathan Chen, one of the authors, put it in a subsequent interview,
The chatbot by itself did surprisingly better than all of the doctors, including the doctors that accessed the chatbot. That flew in the face of the fundamental theorem of informatics: human plus computer will deliver better results than either would alone [4].
These findings highlight the practical potential of LLMs to support physicians in refining diagnostic reasoning, particularly in complex cases, while underscoring the importance of fostering effective human-AI collaboration to maximize these benefits.
Brodeur et al. extend these findings by evaluating the more advanced o1-preview model. They show not only significant improvements in diagnostic accuracy and reasoning but also the potential for LLMs to independently perform at levels surpassing human physicians in key areas of clinical decision-making.
Continuing Advances in LLM Clinical Reasoning: o1-Preview
The study by Brodeur et al. evaluated the performance of the newer o1-preview model across five dimensions of diagnosis: differential diagnosis, diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning [2]. Remarkably, the study was completed and published only three months after o1-preview's release, underscoring both the urgency of evaluating cutting-edge AI tools and the efficiency of the research team.
Results
Differential diagnosis. The study found that o1-preview included the correct diagnosis in its differential in 78.3% of cases, significantly exceeding the previous LLM result of 72.9% by GPT-4 [10] and far above the human clinician result of 33.6% reported by Google researchers [9]. It should be noted, however, that the Google result was obtained when clinicians had to provide a differential diagnosis "based solely on review of the case presentation without using any reference materials" [9]. When they were permitted to use reference tools the percentage of differentials with the correct diagnosis rose to 44.5%.
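The headline metric here is simply the share of cases whose differential contains the final, adjudicated diagnosis (in the studies the match is judged by physicians, not by string comparison). A minimal sketch of that tally, using hypothetical case data, might look like this:

```python
# Each record pairs a ranked differential with the adjudicated final
# diagnosis; the data are hypothetical.
cases = [
    {"differential": ["infective endocarditis", "lymphoma", "tuberculosis"],
     "final_diagnosis": "infective endocarditis"},
    {"differential": ["sarcoidosis", "lung cancer"],
     "final_diagnosis": "histoplasmosis"},
]

def inclusion_rate(case_list: list[dict]) -> float:
    """Fraction of cases whose differential includes the final diagnosis."""
    hits = sum(c["final_diagnosis"] in c["differential"] for c in case_list)
    return hits / len(case_list)

print(f"{inclusion_rate(cases):.1%}")  # 50.0% for the toy data above
```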
The ability of o1-preview to consistently include the correct diagnosis in its differential demonstrates the potential for LLMs to reduce cognitive load for physicians and improve the accuracy of assessments.
Diagnostic reasoning. Perhaps the most startling result was o1-preview's score on diagnostic reasoning. In 78 of 80 cases, o1-preview achieved a perfect R-IDEA score. This compared favorably to GPT-4 (47/80), attending physicians (28/80), and resident physicians (16/80), as shown in Figure 2. R-IDEA is a recently developed instrument for evaluating diagnostic reasoning on a ten-point scale across four components: interpretive summary, differential diagnosis, explanation of the lead diagnosis, and explanation of alternative diagnoses. This standardized framework supports a holistic assessment of reasoning quality, making it well suited to comparing the diagnostic capabilities of clinicians and AI systems.
In addition, Brodeur et al. replicated Goh's study of diagnostic reasoning, showing that o1-preview's performance was even better than the 92% reported by Goh. However, perhaps because Brodeur only asked o1-preview for one response per case—in contrast to the three obtained by Goh—Brodeur was unable to demonstrate statistical significance.
The exceptional performance of o1-preview in diagnostic reasoning underscores the potential of LLMs to transform clinical decision-making by providing accurate, structured, and comprehensive analyses of complex cases. By augmenting physicians’ cognitive processes, LLMs can reduce diagnostic errors and improve the efficiency of care delivery.
Management reasoning. The study used clinical vignettes based on real cases that were also used in a previous study evaluating GPT-4's performance. The cases were presented to the physicians and to o1-preview, followed by a series of questions regarding next steps in management. The median score for o1-preview was 86%, compared to GPT-4 (42%), physicians with access to GPT-4 (41%), and physicians with conventional resources (34%). This is an extraordinary improvement by o1-preview over its predecessor and far above physician performance.
In another management-related study, o1-preview was asked to select the next test to perform. In 87.5% of cases o1-preview selected the correct test, and in another 11% of cases it selected a helpful test, based on an assessment by two physicians. Only in 1.5% of cases was the selected test considered unhelpful.
These outstanding results in management reasoning contrast with more modest, though still impressive, results from another study by Goh et al., "Large Language Model Influence on Management Reasoning" [12]. In this study, the authors attempted to emulate the inherently fuzzy nature of management reasoning, "which encompasses decision making around treatment, testing, patient preferences, social determinants of health, and cost-conscious care, all while managing risk". Their results showed that physicians using an LLM (GPT-4) performed moderately better than physicians relying on conventional resources across the study's management measures.
Not all of these differences were statistically significant. And there is no way to disentangle the extent to which differences in scores between Brodeur et al. and Goh et al. reflect differences in scoring rubrics, statistical methods, or the underlying abilities of the LLMs.
The substantial improvement in management reasoning by o1-preview highlights its potential to guide clinicians in making more informed decisions about next steps in patient care. By suggesting appropriate diagnostic tests and treatments with high accuracy, LLMs can complement physicians’ expertise, especially in complex or uncertain scenarios, fostering a more integrated approach to decision-making in clinical practice.
Triage differential diagnosis. The model was tested on clinical scenarios where prioritizing "cannot-miss" conditions (e.g., life-threatening illnesses) was essential. It demonstrated strong performance in recognizing and including critical diagnoses in its differentials, reinforcing its potential utility in triage scenarios where identifying urgent conditions is paramount. By helping to ensure that life-threatening conditions are not overlooked, LLMs can not only enhance the speed and precision of clinical workflows but also potentially save lives through timely and accurate prioritization.
Probabilistic reasoning. One sub-study compared o1-preview with GPT-4 and human subjects (the latter two from a previous study) on the estimation of pre- and post-test probabilities, with the "true range" defined based on expert guidelines. The model showed room for improvement in probabilistic reasoning compared to its other capabilities. While o1-preview performed well in identifying patterns, its quantitative estimates of risk and probability were less reliable, suggesting that it may suffer from some of the same issues that affect medical personnel, including overestimation of the accuracy of diagnostic tests, anchoring bias, and a flawed understanding of conditional (Bayesian) probability.
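To make the Bayesian point concrete, post-test probability follows from the pre-test probability and the test's sensitivity and specificity via likelihood ratios. The sketch below is illustrative only; the numbers are hypothetical and are not drawn from either study.

```python
def post_test_probability(pre_test_prob: float, sensitivity: float,
                          specificity: float, positive_result: bool) -> float:
    """Update a pre-test probability with a test result using likelihood ratios."""
    if positive_result:
        lr = sensitivity / (1 - specificity)   # positive likelihood ratio
    else:
        lr = (1 - sensitivity) / specificity   # negative likelihood ratio
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Hypothetical example: a 10% pre-test probability and a test with 90%
# sensitivity and 80% specificity. A positive result raises the probability
# only to about 33%, far from the near-certainty that overestimating test
# accuracy would suggest.
print(round(post_test_probability(0.10, 0.90, 0.80, True), 2))  # 0.33
```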
If LLMs achieve significant advances in probabilistic reasoning, they could revolutionize clinical decision-making by improving risk stratification, enhancing the interpretation of diagnostic tests, and supporting more nuanced, evidence-based patient care.
Limitations
Study design choices by Brodeur et al. raise concerns about some of their findings, as explained below. Although there are reasons to believe that the findings are an accurate reflection of enhanced reasoning by o1-preview, it would be desirable for a follow-up study to correct these shortcomings.
Data Contamination
In some instances, Brodeur et al. failed to adequately control for the possibility that o1-preview's pretraining included data about cases used in the study. Contamination can distort findings because a model may simply recall a published case and its answer rather than reason through it, inflating measured accuracy and overstating how well performance would generalize to genuinely novel cases.
Brodeur et al. did perform a sensitivity analysis on the cases used for the differential diagnosis portion of the study. They compared the model's performance on cases published before and after o1-preview's pretraining cutoff date and found no statistically significant difference in diagnostic reasoning. In addition, the cases used by Goh et al. and re-used by Brodeur et al. to replicate Goh's diagnostic reasoning study were shielded from exposure and thus excluded from LLM pretraining.
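As a rough illustration of this kind of sensitivity check (not the authors' actual analysis), one can compare the proportion of correct differentials on cases published before versus after the pretraining cutoff with a simple two-group test; the counts below are hypothetical.

```python
from scipy.stats import fisher_exact

# Hypothetical counts: correct vs. incorrect differentials for cases
# published before and after the model's pretraining cutoff.
table = [
    [47, 13],  # pre-cutoff: correct, incorrect
    [15, 5],   # post-cutoff: correct, incorrect
]

# Fisher's exact test: a non-significant p-value is consistent with
# performance not depending on whether a case could have been memorized.
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```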
Small Sample Size
The study used only five cases to evaluate management reasoning. While this may have been motivated by the fact that the previous study of GPT-4 also used five cases, it means that the comparison is statistically underpowered: with so few cases, even large apparent differences carry wide uncertainty.
This problem also affected the "Cannot Miss" sub-study, where a relatively small sample size combined with uniformly high performance made it impossible to detect statistically significant differences among groups.
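To see why five cases, or uniformly high scores, make statistical significance hard to reach, a rough power calculation is instructive; the effect size and test choice below are hypothetical simplifications, not the studies' actual analysis.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Even for a large effect (Cohen's d = 1.0), a two-sided t test at alpha = 0.05
# needs roughly 17 cases per group to reach 80% power...
n_required = analysis.solve_power(effect_size=1.0, alpha=0.05, power=0.8)
print(f"cases per group needed for 80% power: {n_required:.1f}")

# ...while with only 5 cases per group the achievable power is far below
# the conventional 0.80 target.
power_with_5 = analysis.power(effect_size=1.0, nobs1=5, alpha=0.05)
print(f"power with 5 cases per group: {power_with_5:.2f}")
```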
Implications & Recommendations
The results of Goh et al. and Brodeur et al. demonstrate the transformative potential of LLMs in clinical diagnostics and raise urgent questions about how AI can best be integrated into clinical workflows to ensure that human-AI collaboration is optimized for reducing diagnostic errors and improving patient outcomes.
There are a number of issues that must be resolved in order for wider adoption of AI to be both effective and accepted. Ullah et al., authors of a recent review of the potential of LLMs for diagnostic medicine in digital pathology, identified "several challenges and barriers associated with the use of LLMs... These included limitations in contextual understanding and interpretability, biases in training data, ethical considerations, impact on healthcare professionals, and regulatory concerns" [18]. While the Goh et al. and Brodeur et al. studies provide evidence of lucid contextual understanding, they were not designed to address the other challenges cited by Ullah et al.
Nevertheless, rapid advances in the capabilities of LLMs and other types of AI are likely to continue or accelerate, along with improvements in the technical, organizational, and sociological aspects of their integration into healthcare systems. It is critical that medical researchers and healthcare institutions adapt. To their credit, the authors of the Goh and Brodeur papers understand the need to both apply and develop new measures of the diagnostic and management capabilities of AI. By doing so, largely successfully, they have shown that LLMs surpass human performance in differential diagnosis, diagnostic reasoning, triage assessment, and some aspects of management reasoning. And they have shown that expectations of human-AI collaboration based on treating AI as a conventional resource are likely to founder.
Unfortunately, skepticism about the evolution and integration of AI may impede adaptation. Ranji, for example, imagines that LLMs will not be able to cope with the "iterative—and complicated" process of diagnosis in a clinical setting:
There are reasons to be skeptical that the performance of LLMs on simulated cases can generalize to the clinical practice setting environment. The [Goh] study’s cases were representative of common general practice diagnoses but are presented in an orderly fashion with the relevant history, physical examination, laboratory, and imaging results necessary to construct a prioritized differential diagnosis. Diagnosis in the clinical setting is an iterative—and complicated—process that takes place amid many competing demands and requires input from the patient, caregivers, and multiple clinicians in addition to objective data. Far from a linear process, diagnosis in the clinical practice setting involves progressively refining diagnoses based on new information, and the distinction between diagnosis and treatment is often blurred as clinicians incorporate treatment response into diagnostic reasoning [17].
Ranji and other skeptics seem not to grasp the implications of AI development trends. The research by Goh et al. and Brodeur et al. shows that LLMs have already become powerful, flexible, and accurate diagnostic reasoners. Ironically, Ranji's skepticism is best understood as describing development goals, goals likely to be achieved in the near future.
In this regard, the composition of the Goh et al. and Brodeur et al. author lists likely reflects a strategic effort to influence the field.
The combination of prominent names, distinguished institutional affiliations, and cross-disciplinary expertise among the authors suggests that the aim is not only to present research findings but also to shift perspectives on the role of AI in clinical practice and accelerate its integration into healthcare systems.
Recommendations
Realizing the potential of LLMs will require proactive strategies to integrate AI into healthcare effectively while addressing limitations and ensuring human oversight. The following recommendations aim to guide clinicians, healthcare organizations, and policymakers in navigating this new landscape.
References
AI Swarm Agent & Automation Expert for the Trades | Co-Founder Trade Automation Pros | Co-Founder Skilled Trades Syndicate | Founder of Service Emperor HVAC | Service Business Mastery podcast | Tri-Star Mechanical
2 个月Exciting advancements Joseph Boland AI's potential to enhance diagnostic accuracy?is really commendable. This is a transformative step for global healthcare