Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Today's paper introduces DeltaBench, a comprehensive benchmark for evaluating the ability of Large Language Models (LLMs) to detect errors in long Chain-of-Thought (CoT) reasoning. The authors analyze the quality of long CoTs generated by o1-like models and assess the critique abilities of existing LLMs, providing valuable insights into the limitations of current models and potential areas for improvement.
Method Overview
DeltaBench is constructed through a multi-stage process that begins with collecting diverse queries from various open-source datasets across domains including mathematics, programming, physics, chemistry, biology, and general reasoning. The queries undergo clustering, deduplication, difficulty filtering, and subcategory sampling to ensure diversity and balance.
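Below is a minimal sketch of what such a curation pipeline might look like, assuming each query carries a subcategory label and a precomputed difficulty score; the semantic clustering step and all thresholds are simplifications of my own, not the paper's actual procedure.

```python
import random
from collections import defaultdict

def build_query_pool(queries, per_subcategory=50, seed=0):
    """Hypothetical sketch of DeltaBench-style query curation:
    deduplicate, filter by difficulty, then sample per subcategory.

    `queries` is assumed to be a list of dicts with "text",
    "subcategory", and a "difficulty" score in [0, 1].
    """
    # 1. Deduplicate on normalized text (the paper also clusters
    #    semantically similar queries; omitted here for brevity).
    seen, unique = set(), []
    for q in queries:
        key = " ".join(q["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(q)

    # 2. Keep only sufficiently difficult queries (the threshold is
    #    an assumption, not a value from the paper).
    hard = [q for q in unique if q["difficulty"] >= 0.5]

    # 3. Sample evenly across subcategories to balance the benchmark.
    by_cat = defaultdict(list)
    for q in hard:
        by_cat[q["subcategory"]].append(q)
    rng = random.Random(seed)
    pool = []
    for items in by_cat.values():
        rng.shuffle(items)
        pool.extend(items[:per_subcategory])
    return pool
```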
The benchmark then generates long CoT solutions using several o1-like models such as QwQ-32B-Preview, DeepSeek-R1, and Gemini 2.0 Flash Thinking. Rather than evaluating these solutions at the step level, which can be overly granular, DeltaBench divides each long CoT into multiple sections representing independent sub-tasks. This approach aligns better with human cognitive patterns and facilitates more meaningful annotation.
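As a toy illustration only: one naive way to approximate this sectioning is to split at discourse markers that often signal a strategy shift or a new sub-task. DeltaBench's actual sections are delimited during annotation rather than by any fixed rule, so the marker list and splitting logic below are assumptions.

```python
import re

# Illustrative markers that often open a new sub-task or strategy
# shift in o1-style reasoning traces; not the paper's method.
MARKERS = r"(?:Alternatively|Wait|Let me try|Another approach|Hmm)"

def split_into_sections(cot: str):
    """Split a long CoT into candidate sections at paragraph
    boundaries that begin with a marker, keeping each marker
    attached to the section it opens."""
    parts = re.split(rf"\n(?={MARKERS})", cot)
    return [p.strip() for p in parts if p.strip()]
```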
Each section is then annotated with tags for strategy shift (whether it introduces a new approach), reasoning usefulness (whether it contributes to solving the problem), reasoning correctness (whether it contains errors), and reflection efficiency (whether it includes effective self-reflection). The annotation process uses human annotators (Master's and Ph.D. graduates from various disciplines) to ensure high-quality evaluations.
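As an illustration, the per-section annotations could be represented with a record like the following; the field names are hypothetical and do not reflect the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SectionAnnotation:
    """Hypothetical record mirroring the per-section tags
    described above; field names are illustrative."""
    section_id: int
    strategy_shift: bool        # introduces a new approach
    reasoning_useful: bool      # contributes to solving the problem
    reasoning_correct: bool     # free of errors
    reflection_effective: bool  # self-reflection that actually helps
    error_explanation: str = ""  # annotator's note when incorrect
```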
For evaluation, DeltaBench uses recall, precision, and macro-F1 scores to measure how well Process Reward Models (PRMs) and critic models identify erroneous sections. PRMs flag sections via Z-score-based outlier detection over their section scores, while critic models are prompted directly to identify all erroneous sections within a long CoT.
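To make the evaluation protocol concrete, here is a minimal sketch assuming one PRM score per section; the Z-score threshold and helper names are illustrative, not values from the paper. Macro-F1 would then be the mean of per-example F1 scores.

```python
import statistics

def flag_error_sections(section_scores, z_threshold=-1.5):
    """Z-score outlier rule as described above: sections whose PRM
    score falls far below the mean are flagged as erroneous.
    The threshold is an assumption; the paper's cutoff may differ."""
    mu = statistics.mean(section_scores)
    sigma = statistics.pstdev(section_scores) or 1e-8  # avoid /0
    return [i for i, s in enumerate(section_scores)
            if (s - mu) / sigma <= z_threshold]

def precision_recall_f1(predicted, gold):
    """Section-level precision/recall/F1 against annotated error
    sections; macro-F1 averages F1 across benchmark examples."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```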
Results
The key findings fall into two parts. On the quality of long CoTs themselves, the annotated data shows that significant limitations are common in current o1-like models, including fundamental errors, low reflection efficiency, and redundant reasoning. On critique ability, both existing critic models and PRMs struggle to reliably identify erroneous sections in long CoTs, indicating substantial room for improvement in error detection.
Conclusion
DeltaBench provides a comprehensive framework for evaluating both the quality of long CoT reasoning and the critique abilities of LLMs. The findings highlight significant limitations in current models, including the prevalence of fundamental errors, low reflection efficiency, and redundant reasoning processes. For more information, please consult the full paper.
Congrats to the authors for their work!
He, Yancheng, et al. "Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?" arXiv preprint arXiv:2502.19361 (2025).