The fallacy of the "human-in-the-loop" as a safety net for Generative-AI applications in healthcare
Jeff Clark
Seasoned technology leader and healthcare analytics expert with 20 years of experience building solutions and leading teams that solve the industry's biggest technology and data challenges. Opinions expressed are my own.
As I sit down to write this, anything other than unbridled enthusiasm for Generative AI is a fairly unpopular position in the world of data & technology. Criticism of, or even reservation about, its application to business problems is a potentially career-limiting move. One need only witness the meteoric rise of the stock symbols representing the major players in this market to see just how many eggs have been placed in this proverbial basket.
Healthcare, the industry in which I work as a data/technology leader, is not immune. The prospect of automating an industry that represents as much as 18% of U.S. GDP and is filled with labor-intensive, manual activities performed by high-cost human resources has giants (like Microsoft) and startups alike locked in a do-or-die scramble to arrive first and capture an incredibly lucrative market.
In all of our enthusiasm, we're forgetting something important: humans.
I'm not referring to the tales of woe from some corners about potentially devastating impacts on the labor market in industries where advances in AI could 'replace' humans. This seems less and less likely as the technology develops. I'm referring to a well-researched phenomenon that occurs when a human interacts with an intermittently unreliable automation technology.
In this article, I'll explain the research, how it applies here, and how it presents a risk that must be resolved before Generative AI (and some other types of AI) is used in a clinical setting. Lastly, I'll make some recommendations about how we should move forward in light of this understanding. Note: For readability, references are relegated to the end of the article.
We'll consider applications of Generative AI (LLMs in particular) to business and clinical care problems in healthcare to be a form of automating manual tasks. As the EHR vendor Epic's marketing material states, they are using GenAI to "enable your team to break free from time-consuming, repetitive tasks".
For context, a few examples of areas where Generative and other forms of AI are currently being piloted or used by vendors and hospitals to automate manual work include:
Background: The rise of the "copilot"
Generative AI is far less than perfectly reliable. This is well-understood by even the most committed fan of the technology. One ongoing study that monitors LLM reliability shows that the best currently available model (when coupled with Retrieval-Augmented Generation to improve accuracy) is factually inaccurate approximately once out of every 40 tasks. The most commonly used models are closer to a 1-in-20 error rate.
As one AI company CEO said recently, "programs like [LLMs] are prone to making things up. This is one reason why you will always need an AI bot to assist rather than replace a healthcare professional."
Research and effort in this area may eventually improve reliability, but the fundamental reason for the shortcoming does not change: an LLM is merely a probabilistic, synthetic human-language generator. There will always be a risk of unreliable output.
Advocates and creators of this technology recognized this reality and concluded early on that Generative AI could not operate on its own in many scenarios. Consequently, they quickly adopted the "copilot", "assistant", or "human-in-the-loop" narrative to reassure business leaders that AI solutions are not roaming unchecked. A human will always be monitoring, we're told.
Microsoft has gone so far as to make Copilot the brand name under which many of its GenAI products reside. In a recent interview with Healthcare Leader about Microsoft's healthcare AI efforts (and its AI development partnership with Epic), Jacob West, managing director of Healthcare & Life Sciences at Microsoft, said, "We always say, ‘Keep a human in the loop’ when you’re thinking about AI technology and healthcare".
A healthcare tech journalist recently described a Microsoft demo she participated in, in which a GenAI Copilot was used in a mock patient exam room to automate the medical documentation by listening in on the interaction between patient and doctor. She described instances where the Copilot incorrectly documented her injury, but cheerfully assures the reader that, while it's not foolproof, "...the doctor can review and fix any errors before approving the note."
Healthcare leaders have echoed this narrative, especially when reassuring the public. In a recent article in Becker's Health IT raising questions about the reliability and safety of AI, one hospital CIO responded, "it's crucial that AI tools serve as assistants in the decision-making process, not stand on their own in providing care." Another large hospital system told Becker's, "we ensure the results from AI tools are correct...AI does not replace human assessment," while another hospital spokesperson said that while they use "evidence-based software to help inform patient acuity...the patient’s needs and acuity are ultimately determined by the assessment of a licensed registered nurse."
I don't doubt that these statements from healthcare leaders are both accurate and well-intentioned. I'm less confident in the vendors creating this narrative. While possibly also well-intentioned, they are doubtless motivated in part to:
An illusion of safety
Here is the problem: Humans are astonishingly bad at monitoring automation. Research shows that, in certain scenarios, the outcome of a "human-in-the-loop" arrangement can be worse than if the task were done entirely by the human, that is, with no automation support at all.
They are equally bad at detecting when an automated system they are monitoring has made a mistake. Human performance here ranges from an 80% detection rate in ideal circumstances to as low as 20% in others. Said another way, in some scenarios humans catch only 1 out of every 5 errors the automation they are monitoring makes. That demands careful consideration in a setting where life-and-death decisions are made and the technology supporting the clinician is unreliable.
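To see how these two numbers compound, here is a minimal back-of-the-envelope sketch in Python. The error rates (roughly 1 in 40 tasks for the best model with RAG, 1 in 20 for commonly used models) and the detection rates (80% in ideal circumstances, 20% in degraded ones) are the figures cited above; the volume of 20 AI-drafted notes reviewed per clinician per day and 240 working days per year are purely illustrative assumptions of mine, not measured figures.

```python
# Back-of-the-envelope estimate of undetected AI errors reaching the chart.
# Error rates and detection rates are the figures cited in the text above;
# NOTES_PER_DAY and WORK_DAYS_PER_YEAR are illustrative assumptions only.

NOTES_PER_DAY = 20        # assumed AI-drafted notes reviewed per clinician per day
WORK_DAYS_PER_YEAR = 240  # assumed clinical working days per year

scenarios = {
    "best model (1/40), vigilant reviewer (80%)":    (1 / 40, 0.80),
    "best model (1/40), fatigued reviewer (20%)":    (1 / 40, 0.20),
    "typical model (1/20), vigilant reviewer (80%)": (1 / 20, 0.80),
    "typical model (1/20), fatigued reviewer (20%)": (1 / 20, 0.20),
}

for name, (error_rate, detection_rate) in scenarios.items():
    missed_per_day = NOTES_PER_DAY * error_rate * (1 - detection_rate)
    missed_per_year = missed_per_day * WORK_DAYS_PER_YEAR
    print(f"{name}: ~{missed_per_year:.0f} undetected errors per clinician per year")
```

Even the most optimistic combination leaves a couple dozen undetected errors per clinician per year under these assumptions, and the gap between the vigilant and fatigued cases is precisely the gap that automation complacency erodes.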
Researchers who study human performance factors use the term "vigilance" to describe a human's ability to detect unpredictable events over a period of time. Vigilance tends to degrade over time (what the literature calls a "vigilance decrement"), and a number of factors contribute to the rate of decline. Some of those factors are intuitive: noise, heat, lack of sleep, and boredom all accelerate the degradation. However, the factor most relevant to this discussion is reliance on automation, known as automation complacency.
We don't need new research to understand this phenomenon. Not only can you probably find examples in your own life (tried taking a road trip without a GPS navigator lately?), it has already been studied extensively in another field, aviation, where safety is paramount and errors can be fatal.
In the past 30 years, much attention has been given to the increasing role of automation support in aircraft and its impact on the pilot's ability to detect and recover from errors, both in aircraft systems and in the automation itself. To summarize, the research shows that increased automation accelerates the decay of vigilance in the short term and erodes the pilot's skills in the long term. Over time, pilots lose the mental and motor skills necessary to perform critical flight functions, along with the skills needed to maintain general situational awareness.
Some airline pilots who, at the end of their careers, leave their highly automated jets behind to fly smaller, simpler private planes as a hobby need to spend time with a flight instructor relearning how to fly!
This is an area where I have personal experience. I've been a pilot longer than I've been a data and analytics professional, and I started flying in the days when everything was analog, paper-based, and manual. A few years ago, however, my wife and I purchased an airplane with a brand-new, fully automated cockpit, capable of navigating and flying autonomously for every phase of flight except the first and last 30 seconds (takeoff and landing).
Not only was the negative effect on my skills as a pilot immediate, but the complacency that set in within a few weeks was alarming. My vigilance in monitoring the autopilot's actions, cross-checking the instruments to detect failures, and so on declined rapidly, followed by a decrease in my ability to fly manually when needed. As a result, I've chosen to periodically fly the airplane with all automation aids disabled, and to conduct training and emergency maneuvers manually, to retain the skills needed to fly in the event of an automation failure.
My personal experience aside, this research in aviation reveals several additional important principles:
Another field with extensive study of the effects of automation, particularly of human performance when monitoring a sometimes-unreliable automated system, is autonomous vehicles. The results are similar. In one study in which subjects were asked to monitor a car capable of autonomous driving, their ability to detect and avoid hazards missed by the AI system decreased by 30% in just 45 minutes.
Notably, researchers have also found that autonomous vehicle capabilities did not reduce the driver's perceived workload or stress, as the cognitive demand to remain vigilant and monitor the AI exceeded the workload of manually completing the task.
Anyone who has worked in a clinical setting can see the parallels here (a multi-task environment with partial automation of some tasks) and can identify how the use of GenAI-based automation in clinical settings could create a similar environment, and similar risks.
By applying this research to a specific clinical scenario, we can hypothesize an expected outcome. Consider a scenario where GenAI is used to summarize a patient's visit into a clinical note, an activity the doctor is expected to monitor and review before signing. Additionally, the GenAI may make recommendations about a diagnosis based on the information gathered from the patient.
We can expect that within hours the doctor's vigilance will begin to decay, and the rate at which they detect errors will fall from around 80% toward as low as 20%. Depending on their perception of the automation's reliability (and on other factors, like their manual workload), they will begin to develop a bias toward accepting what the GenAI said, even when other information in the patient's chart is contradictory. Over the long term, the doctor's ability to listen critically to the patient and arrive at a correct diagnosis and treatment plan will begin to degrade.
How severe this effect will be is currently unstudied. Before the explosion of ChatGPT's popularity and the rush to monetize LLMs, humans had never interacted with an automation tool this believable and human-like.
Understanding the human factor
Why does this effect exist? It seems self-evident that using AI to automate labor-intensive tasks is a clear win that will improve the lives of both the patient and the clinician. But the research shows it isn't. This facet of the "automation paradox" stems from factors in human cognitive performance, and researchers have developed frameworks to better understand it.
First is the tendency of humans to choose the path of least cognitive effort in decision making, which researchers describe as the "cognitive-miser hypothesis". We humans will, when possible, make choices based on a simple heuristic or decision rule rather than on a comprehensive analysis of the available information. This is especially true when the human is uncertain about their own capabilities.
A second factor is the tendency of humans to perceive automated aids as powerful agents with superior analytical capability. This is especially true when the complexity of the automation is high and its inner workings are not clear. Researchers find that this perception is greatest when the method of interaction with the automation is human-like (via human speech, or human-like text). As early as the late 1990s and early 2000s, researchers warned against creating automation aids that interact with their users in human-like ways, because studies show that this quickly leads to an over-reliance on the aid and an overestimation of its capabilities and accuracy.
A third factor relates to the diffusion of responsibility that occurs when humans work in a group setting. Sharing monitoring and decision-making tasks with an automated aid appears to lead to the same psychological effects that occur when humans share tasks or decisions with a group of other humans. An individual tends to perceive themselves as less responsible for the decision or task and, in the worst case, this results in “social loafing”, wherein the individual (intentionally or unintentionally) reduces their own effort and allows other members of the group to take the burden.
In short, we seek ways to make our lives easier, and an automation aid that feels human-like (if GenAI excels at anything, it's mimicking human communication) and handles complex tasks or decisions is something we will tend to trust. And when it makes mistakes, we will tend to miss them. The human intended to be in-the-loop is quickly asleep at the wheel.
Solutions
I don't like to present problems without at least suggesting solutions. I'm not advocating that we abandon all efforts to automate clinical tasks and sell our GenAI-related stock holdings (although you may want to consider the latter as the hype crests). However, some significant changes are needed to the industry's breakneck scramble to put AI in the patient exam room and the hospital room.
First, the human performance factors discussed here, and documented in the existing literature, need to be better studied and mitigated before probabilistic automation aids enter the clinical care realm. It is not enough to measure the performance of the GenAI-based automation aid; we must also evaluate how reliably the human monitor catches errors when they do occur.
Intuitively, monitoring the accuracy of medical documentation created by an LLM is a far more cognitively burdensome task than monitoring an aircraft autopilot or ensuring an autonomous vehicle avoids pedestrians. It stands to reason that vigilance will be significantly harder to maintain. This needs to be studied and understood.
Second, standards for accuracy and efficacy need to be established for probabilistic tools and automation aids intended for use in clinical care. Regulatory oversight and certification are needed to audit and enforce those standards. Regulators should ensure that the vendors of these solutions retain skin in the game and bear liability for the reliability of the tools they create.
In aviation, for example, a rigorous, multi-year testing and certification process applies to any new device or technology before it can be used in flight. It is worth noting that, as of yet, probabilistic automation is not used in safety-critical flight systems. Automation aids for critical systems are deterministic and often doubly or triply redundant to reduce failure/error rates to the extremely low levels that regulators require (as low as 1 failure per 10 billion flight hours for some components).
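To get a rough feel for the gulf between these two worlds, a one-line calculation is enough. It reuses the 1-in-20 task error rate and the 1-failure-per-10-billion-hours certification target cited above; the assumption of one automated task per hour is mine, chosen only to put both figures on a per-hour basis.

```python
# Rough order-of-magnitude comparison; the tasks-per-hour figure is an
# illustrative assumption, not a measured workload.
llm_errors_per_task = 1 / 20                      # error rate cited earlier
tasks_per_hour = 1                                # assumed, to normalize units
aviation_failures_per_hour = 1 / 10_000_000_000   # certification target cited above

ratio = (llm_errors_per_task * tasks_per_hour) / aviation_failures_per_hour
print(f"LLM error rate is ~{ratio:.0e} times the aviation certification target")
# prints: LLM error rate is ~5e+08 times the aviation certification target
```

Even allowing generous error bars on every assumption, the gap is eight to nine orders of magnitude, which is why comparing today's GenAI deployments to certified flight automation is not a like-for-like comparison.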
Lastly, we need to establish in the public an appropriate level of trust in these tools, and ensure that the behavior of, and the messaging about, these tools does not create inappropriate expectations or trust on the part of the user. This starts with more transparency and (dare I say) honesty from vendors.
Past research provides guidelines for establishing the appropriate level of trust in an automation aid:
To summarize this article in a paragraph: Humans are very unreliable when placed "in-the-loop". If the technology can't be fully trusted, and we humans cannot be trusted to keep an eye on it, we need to slow down before we hurt someone.
References
Manzey, D., Reichenbach, J., & Onnasch, L. (2012). Human performance consequences of automated decision aids: The impact of degree of automation and system experience. Journal of Cognitive Engineering and Decision Making, 6(1), 57-87.
Greenlee, E. T., DeLucia, P. R., & Newton, D. C. (2018). Driver vigilance in automated vehicles: Hazard detection failures are a matter of time. Human Factors, 60(4), 465-476.
Parasuraman, R., & Manzey, D. H. (2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52(3), 381-410.
Wickens, C. D., & Hollands, J. G. (2000). Engineering psychology and human performance. New York, NY: Prentice Hall.
Lee, J. D., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46(1), 50-80.
Karau, S. J., & Williams, K. D. (1993). Social loafing: A meta-analytic review and theoretical integration. Journal of Personality and Social Psychology, 65, 681-706.
Ebbatson, M., Harris, D., Huddlestone, J., & Sears, R. (2010). The relationship between manual handling performance and recent flying experience in air transport pilots. Ergonomics, 53(2), 268-277.
Disclaimer: Opinions expressed are my own and not representative of any employer's opinion.