Under-thinking in LLMs: Understanding the Phenomenon and Its Implications
Large Language Models (LLMs) have shown impressive reasoning abilities by generating step-by-step “chains of thought.” However, recent research has identified a counterintuitive limitation in some of the most advanced reasoning models, including DeepSeek-R1 (a 671B-parameter RL-trained model), OpenAI’s emerging o3 (and related GPT-4 class models), and Qwen (Alibaba’s open LLM). The issue is termed “underthinking” – where an LLM fails to think deeply enough along a single line of reasoning despite producing a lot of reasoning text. This article explores the theoretical underpinnings of underthinking in LLMs, how different architectures and training methods contribute to it, and (almost) real-world case studies where underthinking leads to mistakes.
TL;DR
When deploying advanced LLMs for complex tasks, be aware that more reasoning doesn’t always mean better reasoning. Underthinking can lurk beneath a verbose response. To ensure reliability, incorporate safeguards: use enhanced decoding methods, consider secondary checks (like verifier models or consistency checks), and favor model designs that enforce focused reasoning. By doing so, you can tap into the phenomenal capabilities of models like GPT-4, Qwen, and DeepSeek without falling victim to their “thoughts being all over the place.” In short, encourage your AI to not just think out loud, but to think things through.
Market Insights on Competing Reasoning Models: Recent evaluations [2][3][4] have shown that while DeepSeek R1 and Google’s Gemini 2.0 Flash Thinking exhibit impressive performance and cost efficiency, OpenAI’s o1-pro continues to excel in both accuracy and reasoning coherence across a wide range of tasks. In rigorous multi-step puzzles and complex code-generation challenges, o1-pro consistently outperforms its rivals, demonstrating a superior ability to maintain focused, deep reasoning. Although accessing o1-pro comes at a premium, its performance sets a high benchmark that drives market competition. This robust performance has made o1-pro the gold standard for high-stakes applications.
I think there’s no question that o1-pro is the best reasoning model on the market right now. Gemini 2.0 Flash Thinking is probably a better model than R1, but that seems more debatable. R1 and Gemini 2.0 Flash Thinking have different reasoning styles, and as a result one model will be a better choice for some problems and the other will be better for others.
I have been collecting a bunch of Reddit posts and I will soon post a summary of my findings with the links to the posts.
Defining “Underthinking” in Large Language Models
Underthinking is a recently identified phenomenon in advanced reasoning LLMs (sometimes called “o1-like” models in research). In simple terms, underthinking happens when a model prematurely abandons promising lines of thought and jumps to new approaches too frequently. Instead of fully working through one strategy, the model keeps switching tactics – for example, starting to solve a math problem one way, then saying “Alternatively, let’s try this…” and switching to a different method, and so on. This frequent reset of reasoning means the model never digs deep enough into any single approach to reach the correct answer. The result is shallow reasoning in each attempt, even though the overall response may be long-winded.
Researchers observed underthinking by studying top-tier reasoning models like OpenAI’s “o1” (a model built specifically for extended chain-of-thought reasoning) and its open-source replicas (e.g. Qwen, DeepSeek, Kimi). These models are designed for intensive chain-of-thought reasoning – they can generate very long thought processes, mimicking human-like deliberation to solve hard problems. Paradoxically, the question “Are they thinking deeply enough?” arose when analyses showed that on hard tasks (like challenging math or logic problems), these models often switch thoughts too quickly and fail to fully explore a promising idea. In other words, the model’s “thoughts are all over the place,” hopping from one line of reasoning to another without reaching a conclusion in each.
Several patterns characterize underthinking:
- Frequent strategy switching: the response pivots with phrases like “Alternatively, let’s try…” before the current approach has been worked through.
- Promising threads abandoned early: a line of reasoning that could have led to the correct answer appears, but is dropped before it reaches a conclusion.
- Long but shallow output: incorrect responses tend to be much longer and contain far more switches than correct ones, without gaining depth in any single attempt.
Notably, researchers quantified underthinking with a metric called UT (Underthinking) score, which essentially measures how inefficiently a model used its tokens in an incorrect solution. They found that in wrong answers, over 70% of responses contained at least one valid reasoning thread that was dropped too early. The wrong answers also used 225% more tokens and changed strategies 418% more often than correct answers. In contrast, when these same models got the answer right, they tended to stick to one line of reasoning and solve it more succinctly. Figure 1 of the study illustrates this stark difference: on average, these “deep reasoning” models consumed far more tokens on incorrect solutions due to excessive thought switching.
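To make the metric concrete, here is a minimal sketch of how such a score could be computed. I’m assuming the paper’s basic idea – measure the fraction of tokens generated after the first correct-but-abandoned thought, averaged over incorrect responses – and the field names below are my own, not the authors’ reference implementation.

```python
# Minimal sketch of an underthinking (UT) score: a response "wasted" the tokens
# that came after its first correct (but abandoned) thought.
# Assumes each incorrect response is annotated with:
#   total_tokens       - length of the full response in tokens
#   first_correct_end  - token index where the first correct thought ends,
#                        or None if no correct thought appeared at all.

def ut_score(responses):
    """Average token inefficiency over incorrect responses (0 = efficient, 1 = very wasteful)."""
    scores = []
    for r in responses:
        if r["first_correct_end"] is None:
            continue  # no correct thought to measure against; skip (a modeling choice)
        wasted_fraction = 1.0 - r["first_correct_end"] / r["total_tokens"]
        scores.append(wasted_fraction)
    return sum(scores) / len(scores) if scores else 0.0

incorrect_runs = [
    {"total_tokens": 1200, "first_correct_end": 300},   # good idea appeared early, then abandoned
    {"total_tokens": 800,  "first_correct_end": 760},   # stayed on course almost to the end
    {"total_tokens": 500,  "first_correct_end": None},  # never hit a correct thought
]
print(f"UT score: {ut_score(incorrect_runs):.2f}")  # higher = more underthinking
```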
It’s important to distinguish underthinking from its opposite, often called overthinking. Overthinking in LLMs refers to wasting computation on unnecessary steps – for example, taking many steps to answer a trivial question like 2+3=?. Underthinking, on the other hand, is about insufficient depth on necessary steps – the model should think longer in one direction but doesn’t. In practice, an advanced model might even exhibit both: it could overthink easy tasks (using overly elaborate reasoning with minimal benefit) and underthink hard tasks (jumping between partial solutions without finishing any). Both are forms of reasoning inefficiency, but underthinking is particularly pernicious for complex problem-solving because the model essentially gives up on good ideas prematurely.
Why Does Underthinking Occur? (Architecture and Training Factors)
Underthinking seems to be a side-effect of the very features that give modern LLMs their reasoning power. Models like DeepSeek-R1, OpenAI’s o3, GPT-4, and Qwen are designed with architectures or inference strategies that allow iterative, long chain-of-thought reasoning. For example, some use tree-of-thought or scratchpad techniques, others use self-reflection loops, and DeepSeek in particular was trained via reinforcement learning explicitly to enhance reasoning depth. These designs enable the model to explore multiple solution strategies – but without proper guidance, the model may thrash between strategies instead of pursuing one to completion. Think of it as giving the model a lot of “mental agility” but not enough “patience” to stick with one idea.
A key factor is how these models are trained and decoded at inference time:
- Chain-of-thought prompting and long generation budgets give the model room to open many lines of reasoning, but nothing forces it to close any of them.
- RL-based training that rewards only the final answer can inadvertently encourage strategy-hopping, since abandoning a partially developed (but promising) thread is never directly penalized.
- Insufficient step-level supervision means no signal tells the model which intermediate threads were worth continuing.
- Standard decoding applies no penalty to “switch” phrases such as “Alternatively,” so abrupt pivots are as likely as any other continuation.
One revealing observation from the underthinking study is how different models behaved on hard questions versus easier ones. The researchers compared “o1-like” models (those with long chain-of-thought capabilities) to more conventional LLMs on the same tasks. They found that for models like QwQ-32B (a reasoning-intensive model) and DeepSeek-R1-671B, the incorrect solutions were much longer on average than the correct ones, filled with many shifts in strategy. Meanwhile, a more traditional model like Qwen-Math-72B or Llama3.3-70B (which don’t heavily engage in multi-step reasoning by themselves) showed no significant length difference between their correct and incorrect answers. In other words, a conventional model either solved the problem in a straightforward way or failed quickly – it didn’t waste time thrashing about. The advanced models, by contrast, would produce very long wrong answers. This suggests that the very ability to think in many steps (a strength) became a weakness when not managed properly. The architecture/training gave them a bigger “search space” for solutions, but without enough discipline they wandered in that space.
To illustrate, consider GPT-4 vs. a smaller code model. If asked a tricky coding question, GPT-4 might start explaining one approach, then reconsider and outline a different approach, and so on – ending with a lot of text but maybe not a runnable solution. A smaller code model might just try one approach to the best of its ability and stop. GPT-4’s rich reasoning training means it has the capacity to try multiple angles (which is why it often succeeds where others fail), yet if it’s going to fail, it might do so in an “underthinking” fashion – an elaborate attempt that ultimately missed the mark. In fact, even with factual questions, GPT-4 can give very detailed answers that sound logical but include unsupported claims if it didn’t stick to verified facts. A Stanford study on medical QA found that even GPT-4 (with retrieval) had about 30% of its statements unsupported by the sources it provided, and nearly half of its answers contained at least one unsupported claim. This indicates that GPT-4 sometimes doesn’t thoroughly check whether each part of its reasoning is correct, which is analogous to underthinking (jumping to a plausible statement without fully verifying the line of reasoning).
In summary, underthinking is most pronounced in LLMs explicitly trained or engineered for multi-step reasoning – DeepSeek and Qwen being prime examples from the open-source world, and GPT-4 (and upcoming models like “o3”) in the proprietary realm. Different training methodologies contribute to the phenomenon: chain-of-thought prompting enables it, RL-based reward schedules can inadvertently encourage it, and insufficient step-level supervision fails to rein it in. Understanding this helps us contextualize why underthinking arises and sets the stage for how to address it.
Real-World Case Studies: Underthinking in Action
Underthinking isn’t just a theoretical quirk; it has real implications when LLMs are deployed in applications. Below I explore a few domains – coding assistants, medical AI, and autonomous decision-making – where underthinking-like behavior has led to suboptimal reasoning, incorrect outputs, or other failures.
Coding Assistants and Code Generation
Developers using AI coding assistants (like GitHub Copilot or GPT-4’s code mode) may have encountered the model writing a lot of code that almost works but ultimately fails. Often, the AI will start implementing one idea and then midway decide to tweak the approach, resulting in code that’s inconsistent or incomplete. This is a form of underthinking: the model didn’t fully think through the initial solution before switching. For example, it might begin with a dynamic programming approach to a problem, then abruptly shift to a greedy method – leaving behind remnants of the first approach in the code. The final code might have unused variables or half-implemented logic from the abandoned path.
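As a purely hypothetical illustration (not output from any particular model), here is what that kind of underthought code can look like: a dynamic-programming table is set up and then abandoned in favor of a greedy loop that doesn’t actually solve the problem.

```python
# Hypothetical example of "underthought" generated code: the model set up a
# dynamic-programming table for a coin-change problem, then switched to a
# greedy strategy mid-way and never returned to the DP approach.
def min_coins(coins, amount):
    # Remnant of the abandoned DP approach: built but never consulted.
    dp = [float("inf")] * (amount + 1)
    dp[0] = 0

    # Greedy approach the model switched to (incorrect for e.g. coins=[1, 3, 4], amount=6).
    remaining = amount
    count = 0
    for coin in sorted(coins, reverse=True):
        while remaining >= coin:
            remaining -= coin
            count += 1
    return count if remaining == 0 else -1

print(min_coins([1, 3, 4], 6))  # greedy returns 3 (4+1+1); the true optimum is 2 (3+3)
```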
Studies of LLM-generated code errors back this up. Common issues include logical errors and missing pieces in the code. One guide on using LLMs for code generation notes that “LLMs often misinterpret the logical requirements of a task, leading to incorrect or nonsensical code behavior,” and sometimes important sections of code are simply left out. These “incomplete code” errors suggest the model started writing a solution but didn’t follow through on every part – akin to not fully exploring the code path it started. In practice, this might manifest as the AI producing a function that handles one case (or one part of the input) and neglects other cases, because it jumped to considering a different angle too soon.
For instance, a coding assistant asked to implement a complex algorithm might produce a verbose explanation and a chunk of code. If it’s underthinking, the explanation could enumerate multiple strategies (“We could do X, or possibly Y...”), and the code might include fragments of multiple approaches merged incorrectly. The developer then finds that the code doesn’t run or fails tests because the logic is internally inconsistent.
Another telltale sign is when the AI provides an overly convoluted solution for something that has a straightforward answer – it’s as if the model lost track of the straightforward path by continuously exploring side routes. This can waste a lot of a programmer’s time, as they sift through a meandering AI-generated solution that should have been simple.
In summary, underthinking in coding assistants leads to bloated, half-baked code. The model writes a lot, but the depth of reasoning in any given segment is shallow. Developers have observed AI solutions with extra steps that aren’t needed, or solutions that try two methods at once and succeed at neither. Recognizing this pattern can help users prompt the model better (e.g., “Stick to one approach”) or know when to double-check critical sections of AI-written code.
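One lightweight safeguard, sketched below with my own illustrative marker list (not something from the underthinking paper), is to count strategy-switch phrases in a response and re-prompt with a “pick one approach and finish it” instruction when the count gets high.

```python
import re

# Phrases that often signal a strategy switch in chain-of-thought output.
# This list is illustrative, not exhaustive.
SWITCH_MARKERS = [
    r"\balternatively\b",
    r"\blet'?s try (a |another )?different\b",
    r"\bon second thought\b",
    r"\bwait,",
    r"\binstead,? let'?s\b",
]

def count_thought_switches(text: str) -> int:
    """Count apparent strategy switches in a model response."""
    return sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in SWITCH_MARKERS)

def needs_refocus(text: str, threshold: int = 3) -> bool:
    """Heuristic: flag responses that switch approaches too often."""
    return count_thought_switches(text) >= threshold

response = "We could use DP. Alternatively, a greedy pass... wait, instead let's sort first."
if needs_refocus(response, threshold=2):
    print("Likely underthinking - consider re-prompting with 'pick one approach and finish it'.")
```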
Medical AI and Diagnostic Reasoning
Medical applications of LLMs range from answering health questions to assisting in diagnosis. These are high-stakes tasks where reasoning needs to be both deep and correct. Underthinking here can lead to plausible-sounding but incorrect or unverified medical advice, which is dangerous.
Imagine an LLM-driven medical assistant analyzing a patient’s symptoms. An underthinking failure might look like this: the model begins to consider one diagnosis (say, it starts explaining why it could be lupus), then halfway it switches: “Alternatively, these symptoms might point to Lyme disease.” It then gives a conclusion based on the second path without ever fully reconciling or completing the reasoning for the first possibility. If the first line of reasoning was actually the correct diagnosis, the model abandoned it prematurely. The patient (or doctor) reading the output gets a fragmented consultation – maybe a list of possible conditions without a well-argued conclusion for any. This could erode trust or, worse, lead to the wrong treatment if the model’s final answer is taken at face value.
There have been instances where LLMs gave confident medical answers that were later found to be incorrect or not supported by evidence. A study evaluating GPT-4’s responses on medical exam questions found various error types – some answers were just wrong despite sounding reasonable to medical professionals. More concretely, research on LLMs citing medical references showed that models often fail to back up their claims. In one evaluation, even the best model (GPT-4 with tools to retrieve sources) had nearly 30% of its statements not supported by any source it provided. That means the model asserted facts or reasoning steps in its answer that weren’t actually verified by the literature it cited. This is a subtle form of underthinking: the model looks up relevant info (good), starts giving an answer with citations (good), but then makes a claim that the sources don’t support (it has jumped to a conclusion without evidence). Essentially, the model didn’t fully think through whether its intermediate reasoning step was justified. It switched from “summarizing sources” to “injecting its own assumption” – a kind of thought switch that led to a partially incorrect answer.
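As a hedged sketch of the kind of check that evaluation implies, you can split an answer into claims and test whether each claim is supported by at least one retrieved source. The `is_supported` heuristic below is a stand-in for an entailment model or human review, not a real library call.

```python
# Sketch of a claim-level support check for a cited medical answer.

def is_supported(claim: str, source: str) -> bool:
    # Placeholder heuristic: treat a claim as supported if most of its content
    # words appear in the source. A real system would use an entailment model
    # or a human reviewer instead.
    claim_words = {w.lower().strip(".,") for w in claim.split() if len(w) > 3}
    if not claim_words:
        return False
    overlap = sum(w in source.lower() for w in claim_words)
    return overlap / len(claim_words) > 0.6

def unsupported_claims(answer_sentences, sources):
    """Return the sentences in an answer that no provided source supports."""
    return [s for s in answer_sentences
            if not any(is_supported(s, src) for src in sources)]

answer = [
    "Doxycycline is a first-line treatment for early Lyme disease.",
    "The rash always appears within 24 hours of the tick bite.",  # overreach: not in sources
]
sources = ["Guidelines recommend doxycycline as first-line treatment for early Lyme disease."]
print(unsupported_claims(answer, sources))  # flags the unsupported second claim
```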
In a real-world scenario, consider the case of ChatGPT being used by doctors or patients. There was a well-publicized story of a young boy’s illness being correctly diagnosed by ChatGPT after many human doctors were stumped – highlighting the potential of these models. But for each success, there could be many instances where the model’s underthought reasoning misled someone. If an AI health chatbot gives a patient a likely diagnosis with a certain medication, but it arrived at that by discarding a more thorough analysis of the patient’s history, it might overlook a dangerous contraindication or a rarer but critical condition. Medical experts warn that these AIs can “hallucinate” convincing but wrong answers, and without proper verification, that’s essentially an under-thought response presented as fact.
Thus, underthinking in medical AI can lead to incomplete diagnostic reasoning and unsupported medical advice. The model might enumerate a few possibilities but not really work through any (leaving the user with uncertainty or an incorrect final guess). Or it might give a single answer that sounds detailed, yet hides the fact that the reasoning process had gaps. This underscores why, in medicine, AI outputs must be treated carefully – they should ideally be checked by a human professional or by an automated system that verifies each step, to catch those leaps in logic.
Autonomous Decision-Making Agents
Another arena to consider is autonomous agents powered by LLMs – for example, systems like AutoGPT, BabyAGI, or other “agentic” AI that attempt to plan and execute tasks in the real world (or a simulated environment) by breaking them into steps. These agents use LLMs to reason about goals, make plans, and adjust their approach based on feedback. Underthinking in this context can result in the agent looping or thrashing without making progress.
A vivid example came from early experiments with AutoGPT (an open-source autonomous agent using GPT-4). Users observed that AutoGPT would often get stuck in a loop of planning without actually completing the task. One analysis described that AutoGPT “creates elaborate plans that are completely unnecessary” and even on a simple query (like retrieving a car’s turning radius), it “mostly takes it into a loop of trying to figure out what its goal is... by googling”. In other words, instead of executing the straightforward steps to get the answer, the agent kept revising its strategy, reinterpreting the goal, and essentially chasing its tail. This is a real-world manifestation of underthinking: the agent never fully commits to any one plan. It keeps thinking about the problem (and thinking about its thinking) without doing the problem. The result is zero productive output – a failure to accomplish the task due to incessant switching of focus.
Why does this happen? In autonomous decision-making, an LLM is often the brain generating possible actions. If that brain underthinks, it might, for example, start a plan, then second-guess and start a new plan, then another, without executing any plan long enough to yield results. We see this when an AI assistant starts to do something, then says “Actually, let me try a different approach,” and so on, until it runs out of time or hits some limit. The feedback loop inherent in these agents (they critique their own actions and replan) can actually exacerbate underthinking if not properly tuned. The agent might interpret lack of immediate success as “my approach must be wrong, let’s try a new approach” every single time, instead of refining the current approach.
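A minimal sketch of one way to damp that behavior (my own illustration, not AutoGPT’s actual control loop): require the agent to commit to its current plan for a few steps before it is allowed to replan, unless the environment reports a hard failure. `plan_fn`, `execute_step`, and `is_hard_failure` are placeholders for whatever planner and executor the agent uses.

```python
# Sketch of a "plan commitment" guard for an LLM agent loop.

def run_agent(goal, plan_fn, execute_step, is_hard_failure,
              min_steps_before_replan=3, max_iterations=20):
    plan = plan_fn(goal, feedback=None)
    steps_on_current_plan = 0

    for _ in range(max_iterations):
        if not plan:
            return "done"
        observation = execute_step(plan[0])
        plan = plan[1:]
        steps_on_current_plan += 1

        # Only allow a full replan after committing to the current plan for a
        # few steps, or when the environment signals a hard failure.
        if is_hard_failure(observation) or (
            steps_on_current_plan >= min_steps_before_replan and observation.get("stuck")
        ):
            plan = plan_fn(goal, feedback=observation)
            steps_on_current_plan = 0

    return "gave up"

# Trivial stubs to demonstrate the control flow.
if __name__ == "__main__":
    fake_plan = lambda goal, feedback: ["search", "read", "summarize"]
    fake_exec = lambda step: {"step": step, "stuck": False}
    never_fails = lambda obs: False
    print(run_agent("find the car's turning radius", fake_plan, fake_exec, never_fails))
```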
Consider an autonomous drone controlled by an LLM-based planner. If it’s trying to navigate an obstacle course and it underthinks, it might constantly switch strategies for pathfinding: go left, then midway decide to go right, then stop and try a completely different route – ultimately getting stuck or lost, whereas a well-reasoned single plan would have succeeded.
The consequences of such behavior range from inefficiency (wasting a lot of API calls, computation, or time) to critical failures (the agent not achieving an objective, or doing something unintended because it lost the thread of the plan). In safety-critical systems, this is obviously a big concern. You wouldn’t want an autonomous car’s AI to frequently change its mind about whether to brake or swerve in an emergency.
In summary, underthinking in autonomous agents leads to erratic, looped, or incomplete task execution. The agent appears busy (lots of “thought” output) but isn’t effectively moving toward the goal. This has been observed directly in systems like AutoGPT, where the agent’s verbose planning and re-planning ended up as a hindrance rather than a help. It highlights the need for mechanisms to keep an AI agent focused and to know when to carry an idea to completion versus when to truly change course.
Mitigating Underthinking: Strategies and Improvements
Although this section focuses largely on the model's training and architecture aspects, it is still early days for accurate and reproducible prompt strategies (think chain-of-thought approaches) that target this specific problem. More practical user insights that I have captured from Reddit and other forums will follow in a later post.
Underthinking is a challenge, but researchers and engineers are developing strategies to combat it. Broadly, solutions fall into two categories: improving the model’s decoding process (how it generates answers) and improving the model’s training/architecture (how it learns to reason). Here I have outlined several practical approaches to mitigate underthinking:
- Thought-switch penalty (TIP): a decoding-time penalty on “switch” phrases (e.g., “Alternatively”) early in generation, nudging the model to develop its current line of reasoning further before pivoting.
- Laconic Decoding: sample several complete answers and prefer the shortest, exploiting the observation that correct solutions tend to be more succinct than underthought ones.
- Step-level verification: use a verifier model or cross-sample consistency checks to confirm intermediate steps before the model moves on or switches approaches.
- Training and architecture changes: process supervision or step-level rewards that explicitly credit carrying a promising thread to completion, rather than rewarding only the final answer.
Each of these strategies has its own trade-offs. Some (like TIP and Laconic Decoding) are easy to apply and don’t require changing the model, but they rely on correlations that hold in many cases, not guarantees – there might be instances where a longer answer is actually correct or where sticking to a single approach too long is harmful. Others, like step verification or training changes, require more engineering effort and computational cost, but promise a more fundamental fix. In practice, a combination of methods can be used. For example, one could use TIP during generation and then apply Laconic filtering on multiple outputs, while also having a verifier double-check the final reasoning. Indeed, the research community is actively experimenting with such combinations to yield LLMs that think both broadly and deeply – exploring different ideas when needed, but fully working through the promising ones.
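To make the decoding-side pieces concrete, here is a minimal sketch of a thought-switch penalty combined with Laconic-style selection. The window `alpha`, penalty `beta`, and the switch-token list are illustrative values of my own, and `sample_response` / `token_logits` stand in for whatever inference stack you actually use; this is a sketch of the idea, not the paper’s implementation.

```python
import random

SWITCH_TOKENS = {"Alternatively", "Wait", "Instead"}

def penalize_switch_tokens(token_logits: dict, position: int,
                           alpha: int = 600, beta: float = 3.0) -> dict:
    """Subtract `beta` from the logits of switch tokens within the first `alpha` decoded positions."""
    if position >= alpha:
        return token_logits
    return {tok: (logit - beta if tok in SWITCH_TOKENS else logit)
            for tok, logit in token_logits.items()}

def laconic_select(sample_response, prompt: str, n: int = 5) -> str:
    """Sample n full responses and return the shortest one (a proxy for the most focused)."""
    candidates = [sample_response(prompt) for _ in range(n)]
    return min(candidates, key=len)

# Tiny demonstration with a fake sampler; in practice sample_response would call your
# model, with penalize_switch_tokens applied inside its decoding loop.
fake_sampler = lambda prompt: "step " * random.randint(5, 50) + "answer: 42"
print(laconic_select(fake_sampler, "Solve the puzzle."))
```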
Sources