Under-thinking in LLMs: Understanding the Phenomenon and Its Implications
Large Language Models (LLMs) have shown impressive reasoning abilities by generating step-by-step “chains of thought.” However, recent research has identified a counterintuitive limitation in some of the most advanced reasoning models, including DeepSeek-R1 (a 671B-parameter RL-trained model), OpenAI’s emerging o3 (and related GPT-4 class models), and Qwen (Alibaba’s open LLM). The issue is termed “underthinking” – where an LLM fails to think deeply enough along a single line of reasoning despite producing a lot of reasoning text. This article explores the theoretical underpinnings of underthinking in LLMs, how different architectures and training methods contribute to it, and (almost) real-world case studies where underthinking leads to mistakes.
TL;DR
When deploying advanced LLMs for complex tasks, be aware that more reasoning doesn’t always mean better reasoning. Underthinking can lurk beneath a verbose response. To ensure reliability, incorporate safeguards: use enhanced decoding methods, consider secondary checks (like verifier models or consistency checks), and favor model designs that enforce focused reasoning. By doing so, you can tap into the phenomenal capabilities of models like GPT-4, Qwen, and DeepSeek without falling victim to their “thoughts being all over the place.” In short, encourage your AI to not just think out loud, but to think things through.
Market Insights on Competing Reasoning Models: Recent evaluations [2][3][4] have shown that while DeepSeek R1 and Google’s Gemini 2.0 Flash Thinking exhibit impressive performance and cost efficiency, OpenAI’s o1-pro continues to excel in both accuracy and reasoning coherence across a wide range of tasks. In rigorous multi-step puzzles and complex code-generation challenges, o1-pro consistently outperforms its rivals, demonstrating a superior ability to maintain focused, deep reasoning. Although accessing o1-pro comes at a premium, its performance sets a high benchmark that drives market competition. This robust performance has made o1-pro the gold standard for high-stakes applications.
I think there’s no question that o1-pro is the best reasoning model on the market right now. Gemini 2.0 Flash Thinking is probably a better model than R1, but that seems more debatable. R1 and Gemini 2.0 Flash Thinking have different reasoning styles, and as a result one model will be a better choice for some problems and the other will be better for others.
I have been collecting a bunch of Reddit posts and I will soon post a summary of my findings with the links to the posts.
Defining “Underthinking” in Large Language Models
Underthinking is a recently identified phenomenon in advanced reasoning LLMs (sometimes called “o1-like” models in research). In simple terms, underthinking happens when a model prematurely abandons promising lines of thought and jumps to new approaches too frequently. Instead of fully working through one strategy, the model keeps switching tactics – for example, starting to solve a math problem one way, then saying “Alternatively, let’s try this…” and switching to a different method, and so on. This frequent reset of reasoning means the model never digs deep enough into any single approach to reach the correct answer. The result is shallow reasoning in each attempt, even though the overall response may be long-winded.
Researchers observed underthinking by studying top-tier reasoning models like OpenAI’s “o1” (a model built specifically for extended chain-of-thought reasoning) and its open-source replicas (e.g. Qwen, DeepSeek, Kimi). These models are designed for intensive chain-of-thought reasoning – they can generate very long thought processes, mimicking human-like deliberation to solve hard problems. Paradoxically, the question “Are they thinking deeply enough?” arose when analyses showed that on hard tasks (like challenging math or logic problems), these models often switch thoughts too quickly and fail to fully explore a promising idea. In other words, the model’s “thoughts are all over the place,” hopping from one line of reasoning to another without reaching a conclusion in each.
Several patterns characterize underthinking:
- Frequent strategy switching: the response pivots with phrases like “Alternatively, let’s try…” before the current approach has been worked through.
- Promising threads abandoned early: a line of reasoning that could have led to the correct answer appears, but is dropped before it reaches a conclusion.
- Long but shallow output: incorrect responses tend to be much longer and contain far more switches than correct ones, without gaining depth in any single attempt.
Notably, researchers quantified underthinking with a metric called UT (Underthinking) score, which essentially measures how inefficiently a model used its tokens in an incorrect solution. They found that in wrong answers, over 70% of responses contained at least one valid reasoning thread that was dropped too early. The wrong answers also used 225% more tokens and changed strategies 418% more often than correct answers. In contrast, when these same models got the answer right, they tended to stick to one line of reasoning and solve it more succinctly. Figure 1 of the study illustrates this stark difference: on average, these “deep reasoning” models consumed far more tokens on incorrect solutions due to excessive thought switching.
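To make the metric concrete, here is a minimal sketch of how such a score could be computed. I’m assuming the paper’s basic idea – measure the fraction of tokens generated after the first correct-but-abandoned thought, averaged over incorrect responses – and the field names below are my own, not the authors’ reference implementation.

```python
# Minimal sketch of an underthinking (UT) score: a response "wasted" the tokens
# that came after its first correct (but abandoned) thought.
# Assumes each incorrect response is annotated with:
#   total_tokens       - length of the full response in tokens
#   first_correct_end  - token index where the first correct thought ends,
#                        or None if no correct thought appeared at all.

def ut_score(responses):
    """Average token inefficiency over incorrect responses (0 = efficient, 1 = very wasteful)."""
    scores = []
    for r in responses:
        if r["first_correct_end"] is None:
            continue  # no correct thought to measure against; skip (a modeling choice)
        wasted_fraction = 1.0 - r["first_correct_end"] / r["total_tokens"]
        scores.append(wasted_fraction)
    return sum(scores) / len(scores) if scores else 0.0

incorrect_runs = [
    {"total_tokens": 1200, "first_correct_end": 300},   # good idea appeared early, then abandoned
    {"total_tokens": 800,  "first_correct_end": 760},   # stayed on course almost to the end
    {"total_tokens": 500,  "first_correct_end": None},  # never hit a correct thought
]
print(f"UT score: {ut_score(incorrect_runs):.2f}")  # higher = more underthinking
```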
It’s important to distinguish underthinking from its opposite, often called overthinking. Overthinking in LLMs refers to wasting computation on unnecessary steps – for example, taking many steps to answer a trivial question like 2+3=?. Underthinking, on the other hand, is about insufficient depth on necessary steps – the model should think longer in one direction but doesn’t. In practice, an advanced model might even exhibit both: it could overthink easy tasks (using overly elaborate reasoning with minimal benefit) and underthink hard tasks (jumping between partial solutions without finishing any). Both are forms of reasoning inefficiency, but underthinking is particularly pernicious for complex problem-solving because the model essentially gives up on good ideas prematurely.
Why Does Underthinking Occur? (Architecture and Training Factors)
Underthinking seems to be a side-effect of the very features that give modern LLMs their reasoning power. Models like DeepSeek-R1, OpenAI’s o3, GPT-4, and Qwen are designed with architectures or inference strategies that allow iterative, long chain-of-thought reasoning. For example, some use tree-of-thought or scratchpad techniques, others use self-reflection loops, and DeepSeek in particular was trained via reinforcement learning explicitly to enhance reasoning depth. These designs enable the model to explore multiple solution strategies – but without proper guidance, the model may thrash between strategies instead of pursuing one to completion. Think of it as giving the model a lot of “mental agility” but not enough “patience” to stick with one idea.
A key factor is how these models are trained and decoded at inference time:
- Chain-of-thought prompting and long generation budgets give the model room to open many lines of reasoning, but nothing forces it to close any of them.
- RL-based training that rewards only the final answer can inadvertently encourage strategy-hopping, since abandoning a partially developed (but promising) thread is never directly penalized.
- Insufficient step-level supervision means no signal tells the model which intermediate threads were worth continuing.
- Standard decoding applies no penalty to “switch” phrases such as “Alternatively,” so abrupt pivots are as likely as any other continuation.
One revealing observation from the underthinking study is how different models behaved on hard questions versus easier ones. The researchers compared “o1-like” models (those with long chain-of-thought capabilities) to more conventional LLMs on the same tasks. They found that for models like QwQ-32B (a reasoning-intensive model) and DeepSeek-R1-671B, the incorrect solutions were much longer on average than the correct ones, filled with many shifts in strategy. Meanwhile, a more traditional model like Qwen-Math-72B or Llama3.3-70B (which don’t heavily engage in multi-step reasoning by themselves) showed no significant length difference between their correct and incorrect answers. In other words, a conventional model either solved the problem in a straightforward way or failed quickly – it didn’t waste time thrashing about. The advanced models, by contrast, would produce very long wrong answers. This suggests that the very ability to think in many steps (a strength) became a weakness when not managed properly. The architecture/training gave them a bigger “search space” for solutions, but without enough discipline they wandered in that space.
To illustrate, consider GPT-4 vs. a smaller code model. If asked a tricky coding question, GPT-4 might start explaining one approach, then reconsider and outline a different approach, and so on – ending with a lot of text but maybe not a runnable solution. A smaller code model might just try one approach to the best of its ability and stop. GPT-4’s rich reasoning training means it has the capacity to try multiple angles (which is why it often succeeds where others fail), yet if it’s going to fail, it might do so in an “underthinking” fashion – an elaborate attempt that ultimately missed the mark. In fact, even with factual questions, GPT-4 can give very detailed answers that sound logical but include unsupported claims if it didn’t stick to verified facts. A Stanford study on medical QA found that even GPT-4 (with retrieval) had about 30% of its statements unsupported by the sources it provided, and nearly half of its answers contained at least one unsupported claim. This indicates that GPT-4 sometimes doesn’t thoroughly check whether each part of its reasoning is correct, which is analogous to underthinking (jumping to a plausible statement without fully verifying the line of reasoning).
In summary, underthinking is most pronounced in LLMs explicitly trained or engineered for multi-step reasoning – DeepSeek and Qwen being prime examples from the open-source world, and GPT-4 (and upcoming models like “o3”) in the proprietary realm. Different training methodologies contribute to the phenomenon: chain-of-thought prompting enables it, RL-based reward schedules can inadvertently encourage it, and insufficient step-level supervision fails to rein it in. Understanding this helps us contextualize why underthinking arises and sets the stage for how to address it.
Real-World Case Studies: Underthinking in Action
Underthinking isn’t just a theoretical quirk; it has real implications when LLMs are deployed in applications. Below I explore a few domains – coding assistants, medical AI, and autonomous decision-making – where underthinking-like behavior has led to suboptimal reasoning, incorrect outputs, or other failures.
Coding Assistants and Code Generation
Developers using AI coding assistants (like GitHub Copilot or GPT-4’s code mode) may have encountered the model writing a lot of code that almost works but ultimately fails. Often, the AI will start implementing one idea and then midway decide to tweak the approach, resulting in code that’s inconsistent or incomplete. This is a form of underthinking: the model didn’t fully think through the initial solution before switching. For example, it might begin with a dynamic programming approach to a problem, then abruptly shift to a greedy method – leaving behind remnants of the first approach in the code. The final code might have unused variables or half-implemented logic from the abandoned path.
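As a purely hypothetical illustration (not output from any particular model), here is what that kind of underthought code can look like: a dynamic-programming table is set up and then abandoned in favor of a greedy loop that doesn’t actually solve the problem.

```python
# Hypothetical example of "underthought" generated code: the model set up a
# dynamic-programming table for a coin-change problem, then switched to a
# greedy strategy mid-way and never returned to the DP approach.
def min_coins(coins, amount):
    # Remnant of the abandoned DP approach: built but never consulted.
    dp = [float("inf")] * (amount + 1)
    dp[0] = 0

    # Greedy approach the model switched to (incorrect for e.g. coins=[1, 3, 4], amount=6).
    remaining = amount
    count = 0
    for coin in sorted(coins, reverse=True):
        while remaining >= coin:
            remaining -= coin
            count += 1
    return count if remaining == 0 else -1

print(min_coins([1, 3, 4], 6))  # greedy returns 3 (4+1+1); the true optimum is 2 (3+3)
```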
Studies of LLM-generated code errors back this up. Common issues include logical errors and missing pieces in the code. One guide on using LLMs for code generation notes that “LLMs often misinterpret the logical requirements of a task, leading to incorrect or nonsensical code behavior,” and sometimes important sections of code are simply left out. These “incomplete code” errors suggest the model started writing a solution but didn’t follow through on every part – akin to not fully exploring the code path it started. In practice, this might manifest as the AI producing a function that handles one case (or one part of the input) and neglects other cases, because it jumped to considering a different angle too soon.
For instance, a coding assistant asked to implement a complex algorithm might produce a verbose explanation and a chunk of code. If it’s underthinking, the explanation could enumerate multiple strategies (“We could do X, or possibly Y...”), and the code might include fragments of multiple approaches merged incorrectly. The developer then finds that the code doesn’t run or fails tests because the logic is internally inconsistent.
Another telltale sign is when the AI provides an overly convoluted solution for something that has a straightforward answer – it’s as if the model lost track of the straightforward path by continuously exploring side routes. This can waste a lot of a programmer’s time, as they sift through a meandering AI-generated solution that should have been simple.
In summary, underthinking in coding assistants leads to bloated, half-baked code. The model writes a lot, but the depth of reasoning in any given segment is shallow. Developers have observed AI solutions with extra steps that aren’t needed, or solutions that try two methods at once and succeed at neither. Recognizing this pattern can help users prompt the model better (e.g., “Stick to one approach”) or know when to double-check critical sections of AI-written code.
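One lightweight safeguard, sketched below with my own illustrative marker list (not something from the underthinking paper), is to count strategy-switch phrases in a response and re-prompt with a “pick one approach and finish it” instruction when the count gets high.

```python
import re

# Phrases that often signal a strategy switch in chain-of-thought output.
# This list is illustrative, not exhaustive.
SWITCH_MARKERS = [
    r"\balternatively\b",
    r"\blet'?s try (a |another )?different\b",
    r"\bon second thought\b",
    r"\bwait,",
    r"\binstead,? let'?s\b",
]

def count_thought_switches(text: str) -> int:
    """Count apparent strategy switches in a model response."""
    return sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in SWITCH_MARKERS)

def needs_refocus(text: str, threshold: int = 3) -> bool:
    """Heuristic: flag responses that switch approaches too often."""
    return count_thought_switches(text) >= threshold

response = "We could use DP. Alternatively, a greedy pass... wait, instead let's sort first."
if needs_refocus(response, threshold=2):
    print("Likely underthinking - consider re-prompting with 'pick one approach and finish it'.")
```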
Medical AI and Diagnostic Reasoning
Medical applications of LLMs range from answering health questions to assisting in diagnosis. These are high-stakes tasks where reasoning needs to be both deep and correct. Underthinking here can lead to plausible-sounding but incorrect or unverified medical advice, which is dangerous.
Imagine an LLM-driven medical assistant analyzing a patient’s symptoms. An underthinking failure might look like this: the model begins to consider one diagnosis (say, it starts explaining why it could be lupus), then halfway it switches: “Alternatively, these symptoms might point to Lyme disease.” It then gives a conclusion based on the second path without ever fully reconciling or completing the reasoning for the first possibility. If the first line of reasoning was actually the correct diagnosis, the model abandoned it prematurely. The patient (or doctor) reading the output gets a fragmented consultation – maybe a list of possible conditions without a well-argued conclusion for any. This could erode trust or, worse, lead to the wrong treatment if the model’s final answer is taken at face value.
There have been instances where LLMs gave confident medical answers that were later found to be incorrect or not supported by evidence. A study evaluating GPT-4’s responses on medical exam questions found various error types – some answers were just wrong despite sounding reasonable to medical professionals. More concretely, research on LLMs citing medical references showed that models often fail to back up their claims. In one evaluation, even the best model (GPT-4 with tools to retrieve sources) had nearly 30% of its statements not supported by any source it provided. That means the model asserted facts or reasoning steps in its answer that weren’t actually verified by the literature it cited. This is a subtle form of underthinking: the model looks up relevant info (good), starts giving an answer with citations (good), but then makes a claim that the sources don’t support (it has jumped to a conclusion without evidence). Essentially, the model didn’t fully think through whether its intermediate reasoning step was justified. It switched from “summarizing sources” to “injecting its own assumption” – a kind of thought switch that led to a partially incorrect answer.
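As a hedged sketch of the kind of check that evaluation implies, you can split an answer into claims and test whether each claim is supported by at least one retrieved source. The `is_supported` heuristic below is a stand-in for an entailment model or human review, not a real library call.

```python
# Sketch of a claim-level support check for a cited medical answer.

def is_supported(claim: str, source: str) -> bool:
    # Placeholder heuristic: treat a claim as supported if most of its content
    # words appear in the source. A real system would use an entailment model
    # or a human reviewer instead.
    claim_words = {w.lower().strip(".,") for w in claim.split() if len(w) > 3}
    if not claim_words:
        return False
    overlap = sum(w in source.lower() for w in claim_words)
    return overlap / len(claim_words) > 0.6

def unsupported_claims(answer_sentences, sources):
    """Return the sentences in an answer that no provided source supports."""
    return [s for s in answer_sentences
            if not any(is_supported(s, src) for src in sources)]

answer = [
    "Doxycycline is a first-line treatment for early Lyme disease.",
    "The rash always appears within 24 hours of the tick bite.",  # overreach: not in sources
]
sources = ["Guidelines recommend doxycycline as first-line treatment for early Lyme disease."]
print(unsupported_claims(answer, sources))  # flags the unsupported second claim
```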
In a real-world scenario, consider the case of ChatGPT being used by doctors or patients. There was a well-publicized story of a young boy’s illness being correctly diagnosed by ChatGPT after many human doctors were stumped – highlighting the potential of these models. But for each success, there could be many instances where the model’s underthought reasoning misled someone. If an AI health chatbot gives a patient a likely diagnosis with a certain medication, but it arrived at that by discarding a more thorough analysis of the patient’s history, it might overlook a dangerous contraindication or a rarer but critical condition. Medical experts warn that these AIs can “hallucinate” convincing but wrong answers, and without proper verification, that’s essentially an under-thought response presented as fact.
Thus, underthinking in medical AI can lead to incomplete diagnostic reasoning and unsupported medical advice. The model might enumerate a few possibilities but not really work through any (leaving the user with uncertainty or an incorrect final guess). Or it might give a single answer that sounds detailed, yet hides the fact that the reasoning process had gaps. This underscores why, in medicine, AI outputs must be treated carefully – they should ideally be checked by a human professional or by an automated system that verifies each step, to catch those leaps in logic.
Autonomous Decision-Making Agents
Another arena to consider is autonomous agents powered by LLMs – for example, systems like AutoGPT, BabyAGI, or other “agentic” AI that attempt to plan and execute tasks in the real world (or a simulated environment) by breaking them into steps. These agents use LLMs to reason about goals, make plans, and adjust their approach based on feedback. Underthinking in this context can result in the agent looping or thrashing without making progress.
A vivid example came from early experiments with AutoGPT (an open-source autonomous agent using GPT-4). Users observed that AutoGPT would often get stuck in a loop of planning without actually completing the task. One analysis described that AutoGPT “creates elaborate plans that are completely unnecessary” and even on a simple query (like retrieving a car’s turning radius), it “mostly takes it into a loop of trying to figure out what its goal is... by googling”. In other words, instead of executing the straightforward steps to get the answer, the agent kept revising its strategy, reinterpreting the goal, and essentially chasing its tail. This is a real-world manifestation of underthinking: the agent never fully commits to any one plan. It keeps thinking about the problem (and thinking about its thinking) without doing the problem. The result is zero productive output – a failure to accomplish the task due to incessant switching of focus.
Why does this happen? In autonomous decision-making, an LLM is often the brain generating possible actions. If that brain underthinks, it might, for example, start a plan, then second-guess and start a new plan, then another, without executing any plan long enough to yield results. We see this when an AI assistant starts to do something, then says “Actually, let me try a different approach,” and so on, until it runs out of time or hits some limit. The feedback loop inherent in these agents (they critique their own actions and replan) can actually exacerbate underthinking if not properly tuned. The agent might interpret lack of immediate success as “my approach must be wrong, let’s try a new approach” every single time, instead of refining the current approach.
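A minimal sketch of one way to damp that behavior (my own illustration, not AutoGPT’s actual control loop): require the agent to commit to its current plan for a few steps before it is allowed to replan, unless the environment reports a hard failure. `plan_fn`, `execute_step`, and `is_hard_failure` are placeholders for whatever planner and executor the agent uses.

```python
# Sketch of a "plan commitment" guard for an LLM agent loop.

def run_agent(goal, plan_fn, execute_step, is_hard_failure,
              min_steps_before_replan=3, max_iterations=20):
    plan = plan_fn(goal, feedback=None)
    steps_on_current_plan = 0

    for _ in range(max_iterations):
        if not plan:
            return "done"
        observation = execute_step(plan[0])
        plan = plan[1:]
        steps_on_current_plan += 1

        # Only allow a full replan after committing to the current plan for a
        # few steps, or when the environment signals a hard failure.
        if is_hard_failure(observation) or (
            steps_on_current_plan >= min_steps_before_replan and observation.get("stuck")
        ):
            plan = plan_fn(goal, feedback=observation)
            steps_on_current_plan = 0

    return "gave up"

# Trivial stubs to demonstrate the control flow.
if __name__ == "__main__":
    fake_plan = lambda goal, feedback: ["search", "read", "summarize"]
    fake_exec = lambda step: {"step": step, "stuck": False}
    never_fails = lambda obs: False
    print(run_agent("find the car's turning radius", fake_plan, fake_exec, never_fails))
```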
Consider an autonomous drone controlled by an LLM-based planner. If it’s trying to navigate an obstacle course and it underthinks, it might constantly switch strategies for pathfinding: go left, then midway decide to go right, then stop and try a completely different route – ultimately getting stuck or lost, whereas a well-reasoned single plan would have succeeded.
The consequences of such behavior range from inefficiency (wasting a lot of API calls, computation, or time) to critical failures (the agent not achieving an objective, or doing something unintended because it lost the thread of the plan). In safety-critical systems, this is obviously a big concern. You wouldn’t want an autonomous car’s AI to frequently change its mind about whether to brake or swerve in an emergency.
In summary, underthinking in autonomous agents leads to erratic, looped, or incomplete task execution. The agent appears busy (lots of “thought” output) but isn’t effectively moving toward the goal. This has been observed directly in systems like AutoGPT, where the agent’s verbose planning and re-planning ended up as a hindrance rather than a help. It highlights the need for mechanisms to keep an AI agent focused and to know when to carry an idea to completion versus when to truly change course.
Mitigating Underthinking: Strategies and Improvements
Although this section focuses largely on the model's training and architecture aspects, it is still early days for accurate and reproducible prompt strategies (think chain-of-thought approaches) that target this specific problem. More practical user insights that I have captured from Reddit and other forums will follow in a later post.
Underthinking is a challenge, but researchers and engineers are developing strategies to combat it. Broadly, solutions fall into two categories: improving the model’s decoding process (how it generates answers) and improving the model’s training/architecture (how it learns to reason). Here I have outlined several practical approaches to mitigate underthinking:
- Thought-switch penalty (TIP): a decoding-time penalty on “switch” phrases (e.g., “Alternatively”) early in generation, nudging the model to develop its current line of reasoning further before pivoting.
- Laconic Decoding: sample several complete answers and prefer the shortest, exploiting the observation that correct solutions tend to be more succinct than underthought ones.
- Step-level verification: use a verifier model or cross-sample consistency checks to confirm intermediate steps before the model moves on or switches approaches.
- Training and architecture changes: process supervision or step-level rewards that explicitly credit carrying a promising thread to completion, rather than rewarding only the final answer.
Each of these strategies has its own trade-offs. Some (like TIP and Laconic Decoding) are easy to apply and don’t require changing the model, but they rely on correlations that hold in many cases, not guarantees – there might be instances where a longer answer is actually correct or where sticking to a single approach too long is harmful. Others, like step verification or training changes, require more engineering effort and computational cost, but promise a more fundamental fix. In practice, a combination of methods can be used. For example, one could use TIP during generation and then apply Laconic filtering on multiple outputs, while also having a verifier double-check the final reasoning. Indeed, the research community is actively experimenting with such combinations to yield LLMs that think both broadly and deeply – exploring different ideas when needed, but fully working through the promising ones.
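To make the decoding-side pieces concrete, here is a minimal sketch of a thought-switch penalty combined with Laconic-style selection. The window `alpha`, penalty `beta`, and the switch-token list are illustrative values of my own, and `sample_response` / `token_logits` stand in for whatever inference stack you actually use; this is a sketch of the idea, not the paper’s implementation.

```python
import random

SWITCH_TOKENS = {"Alternatively", "Wait", "Instead"}

def penalize_switch_tokens(token_logits: dict, position: int,
                           alpha: int = 600, beta: float = 3.0) -> dict:
    """Subtract `beta` from the logits of switch tokens within the first `alpha` decoded positions."""
    if position >= alpha:
        return token_logits
    return {tok: (logit - beta if tok in SWITCH_TOKENS else logit)
            for tok, logit in token_logits.items()}

def laconic_select(sample_response, prompt: str, n: int = 5) -> str:
    """Sample n full responses and return the shortest one (a proxy for the most focused)."""
    candidates = [sample_response(prompt) for _ in range(n)]
    return min(candidates, key=len)

# Tiny demonstration with a fake sampler; in practice sample_response would call your
# model, with penalize_switch_tokens applied inside its decoding loop.
fake_sampler = lambda prompt: "step " * random.randint(5, 50) + "answer: 42"
print(laconic_select(fake_sampler, "Solve the puzzle."))
```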
Sources