Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Today's paper presents a comprehensive survey on efficient reasoning for Large Language Models (LLMs). It addresses the "overthinking phenomenon," in which reasoning models generate unnecessarily verbose outputs, leading to computational inefficiency. The paper systematically categorizes existing approaches to efficient reasoning and explores methods to optimize reasoning length while maintaining accuracy.
Overview
The paper organizes efficient reasoning approaches into three main categories: model-based, reasoning output-based, and input prompts-based methods.
Model-based efficient reasoning focuses on fine-tuning LLMs to improve their intrinsic ability to reason concisely. This category includes two main approaches. First, Reinforcement Learning (RL) with length reward design, where models are trained using rewards that favor shorter, correct answers while penalizing lengthy or incorrect ones. Various length reward formulations are explored, such as cosine rewards, length-harmonizing rewards, and exceed length penalties. Second, Supervised Fine-Tuning (SFT) with variable-length Chain-of-Thought (CoT) data, which involves constructing datasets with varying reasoning lengths and fine-tuning models on these datasets. Methods for collecting short CoT data include post-reasoning compression (reducing redundant steps after full-length reasoning) and obtaining compressed data during reasoning.
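To make the length-reward idea concrete, here is a minimal sketch of a cosine-shaped length reward with an exceed-length penalty. This is an illustrative formulation in the spirit of the rewards the survey describes, not the exact function from any cited paper; the function name, the value ranges, and the `max_len` budget are all assumptions.

```python
import math

def length_shaped_reward(correct: bool, length: int, max_len: int = 512) -> float:
    """Illustrative length-aware reward (not the exact formulation from
    the survey): correct short answers score highest, rewards decay
    smoothly (cosine) as length approaches the budget, and generations
    exceeding the budget receive a flat exceed-length penalty."""
    if length > max_len:
        return -1.0  # exceed-length penalty
    t = length / max_len                       # 0.0 (short) -> 1.0 (at budget)
    shape = 0.5 * (1 + math.cos(math.pi * t))  # 1.0 -> 0.0, smooth
    if correct:
        return 0.5 + 0.5 * shape    # in [0.5, 1.0]: correct, shorter is better
    return -0.5 * (1 - shape)       # in [-0.5, 0.0]: wrong, longer is worse
```

Any correct answer still outscores any incorrect one, so the policy is never pushed to trade accuracy for brevity; length only breaks ties within each outcome.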
Reasoning output-based efficient reasoning modifies the output paradigm to enhance reasoning efficiency. One approach compresses reasoning steps into fewer latent representations, treating the final-layer hidden states of an LLM as "continuous thought" to replace traditional discrete tokens. This can be achieved by training LLMs to leverage latent representations or using auxiliary models. Another approach implements dynamic reasoning paradigms during inference, using criteria such as rewards, confidence/certainty, or consistency to guide the reasoning strategy. For example, Speculative Rejection generates multiple responses until memory limits are reached, then discards low-quality outputs based on evaluation by a reward model.
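The generate-then-discard loop behind Speculative Rejection can be sketched as best-of-N sampling with periodic pruning. The sketch below is a simplified illustration under assumptions: the hypothetical `generate` and `score` callables stand in for an LLM sampler and a reward model, and the real method triggers rejection at memory limits rather than on a fixed round count.

```python
def speculative_rejection(prompt, generate, score, batch_size=8,
                          keep_frac=0.5, rounds=3):
    """Illustrative sketch of best-of-N with early rejection: generate
    candidate continuations, rank them with a reward model (`score`),
    discard the lowest-scoring fraction each round, and extend only the
    survivors, so compute is concentrated on promising responses."""
    candidates = [generate(prompt) for _ in range(batch_size)]
    for _ in range(rounds):
        ranked = sorted(candidates, key=score, reverse=True)
        candidates = ranked[:max(1, int(len(ranked) * keep_frac))]
        candidates = [generate(c) for c in candidates]  # extend survivors
    return max(candidates, key=score)
```

With toy stand-ins, e.g. `generate = lambda s: s + "a"` and `score = len`, the loop keeps halving the pool while extending the highest-scoring candidates.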
Input prompts-based efficient reasoning focuses on enforcing length constraints or routing between LLMs based on input prompt characteristics. Prompt-guided efficient reasoning explicitly instructs LLMs to generate fewer reasoning steps through prompts like "Let's think step by step and use less than X tokens" or "Be concise." Prompt attribute-driven reasoning routing dynamically determines how language models handle queries based on their complexity, routing simpler queries to faster but less reasoning-capable LLMs while directing more complicated queries to stronger reasoning LLMs.
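Both ideas combine naturally, as in the hypothetical router below: a cheap complexity check picks the model, and the prompt itself carries the token budget. The keyword heuristic, the 30-word threshold, and the callable-model interface are all illustrative assumptions, not mechanisms from the paper.

```python
def route_query(query: str, small_model, large_model,
                budget_tokens: int = 256) -> str:
    """Hypothetical prompt-attribute router: a crude complexity
    heuristic decides between a fast model and a stronger reasoning
    model, and the prompt asks for an answer within a token budget."""
    # Toy heuristic: long queries or math/proof cues count as "hard".
    hard = len(query.split()) > 30 or any(
        cue in query.lower() for cue in ("prove", "integral", "algorithm"))
    model = large_model if hard else small_model
    prompt = (f"Let's think step by step and use less than "
              f"{budget_tokens} tokens.\n{query}")
    return model(prompt)
```

A production router would replace the keyword check with a learned classifier over query embeddings, but the control flow is the same.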
Key findings
The survey reveals that efficient reasoning approaches can significantly reduce computational costs while maintaining reasoning accuracy. For example, RL-based methods with length rewards can mitigate overthinking in reasoning-capable LLMs, achieving nearly lossless alignment with original reasoning capabilities while reducing token usage. SFT with variable-length CoT data enables LLMs to learn compact reasoning chains that encapsulate effective knowledge. Latent reasoning approaches improve both accuracy and efficiency by reducing the number of intermediate "thinking" tokens.
The paper also highlights that smaller language models can retain strong reasoning capabilities through appropriate distillation and compression techniques. Quantization preserves reasoning performance remarkably well, whereas pruning tends to cause severe degradation in reasoning quality. This suggests that compressing a capable model is more effective than training a small language model from scratch.
In practical applications, efficient reasoning LLMs offer significant benefits across various domains, including healthcare diagnostics, autonomous driving, embodied AI systems, and financial algorithmic trading, by enabling quicker and more resource-efficient decision-making.
Conclusion
This comprehensive survey provides the first structured overview of efficient reasoning in LLMs, categorizing existing approaches and discussing their strengths and limitations. By addressing the overthinking phenomenon, efficient reasoning methods offer practical benefits such as reduced computational costs and improved responsiveness for real-world applications. For more information, please consult the full paper.
Congrats to the authors for their work!
Sui, Yang, et al. "Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models." arXiv preprint arXiv:2503.16419 (2025).