With Generative AI, Overconfidence Is a Vice
MIT Sloan Management Review
Transforming how people lead and innovate
It’s a generative AI conundrum: People want to use options like ChatGPT to save time, yet that can lead to embarrassing errors. New research from MIT and Accenture shows that people often overestimate their ability to spot generative AI mistakes — but tools that nudge us to review LLM-generated content can improve accuracy. Best of all, these nudges don’t involve a significant time penalty. MIT Sloan School of Management’s Renée Richardson Gosline and her coauthors share lessons from this research in the full article below.
For more advice on generative AI, watch our short video, The Generative AI Blind Spot Leaders Have Now, featuring interviews with AI experts and CIOs at the MIT Sloan CIO Symposium.
— Laurianne McLaughlin, senior editor, digital, MIT Sloan Management Review
Further Reading on This Topic:
Nudge Users to Catch Generative AI Errors
Renée Richardson Gosline, Yunhao Zhang, Haiwen Li, Paul Daugherty, Arnab Chakraborty, Philippe Roussiere, and Patrick Connolly
OpenAI’s ChatGPT has generated excitement since its release in November 2022, but it has also created new challenges for managers. On the one hand, business leaders understand that they cannot afford to overlook the potential of generative AI large language models (LLMs). On the other hand, apprehensions surrounding issues such as bias, inaccuracy, and security breaches loom large, limiting trust in these models.
In such an environment, responsible approaches to using LLMs are critical to the safe adoption of generative AI. Consensus is building that humans must remain in the loop (a scenario in which human oversight and intervention place the algorithm in the role of a learning apprentice) and that responsible AI principles must be codified. Without a proper understanding of AI models and their limitations, users could place too much trust in AI-generated content. Accessible and user-friendly interfaces like ChatGPT, in particular, can present errors with confidence while lacking transparency, warnings, or any communication of their own limitations to users. A more effective approach must assist users with identifying the parts of AI-generated content that require affirmative human choice, fact-checking, and scrutiny.
In a recent field experiment, we explored a way to assist users in this endeavor. We provided global business research professionals at Accenture with a tool developed at Accenture’s Dock innovation center, designed to highlight potential errors and omissions in LLM content. We then measured the extent to which adding this layer of friction had the intended effect of reducing the likelihood of uncritical adoption of LLM content and bolstering the benefits of having humans in the loop.
The findings revealed that consciously adding some friction to the process of reviewing LLM-generated content can lead to increased accuracy — without significantly increasing the time required to complete the task. This has implications for how companies can deploy generative AI applications more responsibly.
Experiment With Friction
Friction has a bad name in the realm of digital customer experience, where companies strive to eliminate any roadblocks to satisfying user needs. But recent research suggests that organizations should embrace beneficial friction in AI systems to improve human decision-making. Our experiment set out to test this hypothesis in the field by measuring the efficiency and accuracy trade-offs of adding targeted friction, or cognitive and procedural speed bumps, to LLM outputs in the form of error highlighting. We tested whether intentional, structurally embedded resistance to the uninterrupted, automatic application of AI would slow users down and make potential errors more likely to be noticed. We believed that this would encourage participants to engage in what behavioral economics calls System 2 thinking, a more conscious and deliberative type of cognitive processing than intuitive System 1 thinking, akin to the accuracy nudges studied in misinformation research.
The study, a collaborative effort between MIT and Accenture, aimed to explore the integration of an LLM into a task familiar to business research professionals. The objective was to complete and submit two executive summaries of company profiles (Task 1 and Task 2) within a 70-hour time frame, seeking out and referencing any available sources to simulate real work conditions. The research participants were given text output from ChatGPT, along with the corresponding prompts, and were told that they could use as much or as little of the content as they saw fit.
Passages from the provided ChatGPT output and prompts were highlighted in different colors. Participants were informed that the highlighting features were part of a hypothetical tool Accenture could potentially develop and that the highlights conveyed different meanings depending on the color. Text highlighted in purple matched terms used in the prompt and terms in internal databases and publicly available information sources; text highlighted in orange indicated potentially untrue statements that should be considered for removal or replacement; text that was in the prompt but omitted in the output was indicated below the generated output and highlighted in blue; and text that was not identified as belonging to any of these categories was left unhighlighted.
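To make the color scheme concrete, here is a minimal sketch, in Python, of how the four highlight categories might be represented for review. The class names, fields, and helper function are illustrative assumptions, not part of the tool used in the study.

```python
# Illustrative sketch only; not the Accenture tool described in the article.
from dataclasses import dataclass
from enum import Enum


class Highlight(Enum):
    VERIFIED = "purple"    # matches prompt terms and internal/public sources
    SUSPECT = "orange"     # potentially untrue; consider removing or replacing
    OMITTED = "blue"       # present in the prompt but missing from the output
    UNLABELED = "none"     # not identified as belonging to any category


@dataclass
class Span:
    text: str
    label: Highlight


def spans_needing_review(spans: list[Span]) -> list[Span]:
    """Return the spans a human reviewer should scrutinize first."""
    return [s for s in spans if s.label in (Highlight.SUSPECT, Highlight.OMITTED)]
```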
Ideally, this hypothetical tool would combine natural language processing (NLP) techniques and an AI model to query all outputs against a predefined source of truth to highlight potential errors or omissions, but for the purposes of this experiment, the highlighting was done using a combination of algorithmic and human inputs. In addition, we purposely baked in some attention-check errors (nonhighlighted) to measure the circumstances under which adding friction in LLM use led to greater error detection (and improved accuracy) by participants.
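The article does not describe the matching logic behind this hypothetical tool, but a rough sketch helps convey the idea of querying outputs against a predefined source of truth. The sentence splitting, tokenization, and overlap threshold below are assumptions chosen purely for illustration.

```python
# Illustrative only: flag generated sentences with little lexical support in a
# trusted reference text. The 0.5 threshold is an arbitrary assumption.
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def flag_unsupported(output: str, source_of_truth: str, threshold: float = 0.5) -> list[str]:
    """Return output sentences whose token overlap with the source falls below threshold."""
    source_tokens = tokens(source_of_truth)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", output.strip()):
        sent_tokens = tokens(sentence)
        if not sent_tokens:
            continue
        overlap = len(sent_tokens & source_tokens) / len(sent_tokens)
        if overlap < threshold:
            flagged.append(sentence)  # candidate for orange highlighting
    return flagged
```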
Participants were randomly assigned to one of three experimental conditions, with varying levels of cognitive speed bumps in the form of highlighting: a no-highlight control condition, in which the ChatGPT output was presented without any highlighting; an error-highlight condition, in which only potentially untrue statements and omissions were highlighted (medium friction); and a full-highlight condition, in which all of the highlighting categories described above were shown (high friction).
Our findings revealed that introducing friction that nudges users to more carefully scrutinize LLM-generated text can help them catch inaccuracies and omissions. Participants in the no-highlight control condition missed more errors than those in either of the conditions with error labeling (31% more in Task 1 and 10% more in Task 2). Moreover, the proportion of omissions detected was 17% in the no-highlight condition but 48% in the full-highlight condition and 54% in the error-highlight condition.
As anticipated, these improvements did come with a trade-off: Participants in the full-highlight group saw a statistically significant increase (an average of 43% and 61% in Tasks 1 and 2, respectively) in the time required to complete the tasks versus the control group. However, in the error-only highlight condition, the average difference in the time taken versus the control was not statistically significant. Considering that each task typically took one to two hours on average without the assistance of generative AI, this trade-off was considered acceptable. Thus, the second condition, which involved medium friction, demonstrated a way to optimize the balance between accuracy and efficiency.
Three Behavioral Insights
The results of our field experiment point to actions organizations can take to help employees more effectively incorporate generative AI tools into their work and be more likely to recognize potential errors and biases.
Ensure thoughtfulness in crafting the prompt — a touch point for beneficial friction — given users’ tendency toward cognitive anchoring on generative AI output. Participants’ final submissions were lexically very similar to the LLM-generated content (60% to 80% identical content, as measured by NLP similarity scores). This suggests that the participants anchored on that output, even when they were asked to consider it as merely an input to their own writing. This underscores the importance of being thoughtful about the prompt provided to the LLM, since its output can set the trajectory for the final version of the content. Recent research suggests that anchoring may prove beneficial under some circumstances when generative AI content is perceived as high in quality and can play a compensatory role for an error-prone writer. But, given our findings of high similarity between the LLM-generated text and the final submissions from human participants, it could also lead a user down the wrong path.
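The article does not name the similarity metric behind the 60% to 80% figure; the sketch below uses the Python standard library's SequenceMatcher ratio purely as an illustrative proxy for lexical overlap between an LLM draft and a final submission.

```python
# Illustrative proxy for an "NLP similarity score"; the study's actual metric is not specified.
from difflib import SequenceMatcher


def lexical_similarity(llm_output: str, final_submission: str) -> float:
    """Return a 0-1 similarity score between the LLM draft and the submitted text."""
    return SequenceMatcher(None, llm_output, final_submission).ratio()


# Example: a score near 0.7 would fall inside the 60%-80% range reported in the study.
print(lexical_similarity("The company grew revenue 10% in 2022.",
                         "The company grew revenue roughly 10% in 2022."))
```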
Recognize that confidence is a virtue but overconfidence is a vice. Highlighting errors did indeed draw participants’ attention and improved accuracy via error correction. Yet participants across the three conditions self-reported virtually no difference in response to the follow-up survey item “I am more aware of the types of errors to look for when using GenAI.” This presents a reason to be cautious: Users may overestimate their ability to identify AI-generated errors. A tool that adds friction by making potential errors more conspicuous could help users calibrate their trust in generative AI content by mitigating overconfidence.
Additionally, our findings suggest that highlighting errors had no significant impact on participants’ self-reported trust in LLM tools or their willingness to use them.
Experiment, experiment, experiment. Before AI tools and models are deployed, it is imperative to test how humans interact with them and how they impact accuracy, speed, and trust. As indicated above, we observed a difference in self-reported attitudes and actual error detection. We urge organizations to adopt experiments as a means of understanding how best to elevate the role of employees in human-in-the-loop systems and to measure the impact on their understanding, behaviors, and biases.
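As a concrete illustration of the kind of pre-deployment experiment urged here, the sketch below compares error-detection rates between a control condition and a friction condition with a two-proportion z-test. The counts are placeholders, not data from the study.

```python
# Minimal sketch: compare error-detection rates across two experimental conditions.
from math import sqrt, erf


def two_proportion_ztest(hits_a: int, n_a: int, hits_b: int, n_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for the difference in detection rates."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    p_pool = (hits_a + hits_b) / (n_a + n_b)
    z = (p_a - p_b) / sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Hypothetical counts: 17 of 100 omissions caught without highlighting vs. 54 of 100 with highlighting.
print(two_proportion_ztest(17, 100, 54, 100))
```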
The ease of use and broad availability of LLMs have enabled their rapid spread through many organizations, even as issues with their accuracy remain unresolved. We must seek ways to enhance humans’ ability to improve accuracy and efficiency when working with AI-generated outputs. Our study suggests that humans in the loop can play an important interventional role in AI-enabled systems and that beneficial friction can nudge users to exercise their responsibility for the quality of their organization’s content.
ABOUT THE AUTHORS
Renée Richardson Gosline is head of the Human-First AI Group at MIT’s Initiative on the Digital Economy and a senior lecturer and research scientist at the MIT Sloan School of Management. Yunhao Zhang is a postdoctoral fellow at the Psychology of Technology Institute. Haiwen Li is a doctoral candidate at the MIT Institute for Data, Systems, and Society. Paul Daugherty is chief technology and innovation officer at Accenture. Arnab D. Chakraborty is the global responsible AI lead and a senior managing director at Accenture. Philippe Roussiere is global lead, Paris, for research innovation and AI at Accenture. Patrick Connolly is global responsible AI/generative AI research manager at Accenture Research, Dublin.