Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Credit: https://arxiv.org/pdf/2502.19361

Today's paper introduces DeltaBench, a comprehensive benchmark for evaluating the ability of Large Language Models (LLMs) to detect errors in long Chain-of-Thought (CoT) reasoning. The authors analyze the quality of long CoTs generated by o1-like models and assess the critique abilities of existing LLMs, providing valuable insights into the limitations of current models and potential areas for improvement.

Method Overview

DeltaBench is constructed through a multi-stage process that begins with collecting diverse queries from various open-source datasets across domains including mathematics, programming, physics, chemistry, biology, and general reasoning. The queries undergo clustering, deduplication, difficulty filtering, and subcategory sampling to ensure diversity and balance.
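The paper does not spell out every detail of this filtering pipeline, but as a rough illustration, a near-duplicate filter over sentence embeddings might look like the sketch below. The embedding model and the 0.9 similarity threshold are assumptions for illustration, not the authors' choices.

```python
# Hypothetical sketch of embedding-based query deduplication (not DeltaBench's exact pipeline).
# Assumes the sentence-transformers package; model name and threshold are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate_queries(queries, threshold=0.9):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(queries, normalize_embeddings=True)  # unit-norm vectors
    kept, kept_vecs = [], []
    for query, vec in zip(queries, embeddings):
        # With normalized vectors, cosine similarity is just a dot product.
        if all(np.dot(vec, kv) < threshold for kv in kept_vecs):
            kept.append(query)
            kept_vecs.append(vec)
    return kept
```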

The benchmark then generates long CoT solutions using several o1-like models such as QwQ-32B-Preview, DeepSeek-R1, and Gemini 2.0 Flash Thinking. Rather than evaluating these solutions at the step level, which can be overly granular, DeltaBench divides each long CoT into multiple sections representing independent sub-tasks. This approach aligns better with human cognitive patterns and facilitates more meaningful annotation.
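How exactly the CoT is segmented is not detailed here; a simple heuristic that captures the idea (split at paragraph boundaries, then merge fragments that are too short to stand alone) might look like this sketch. It is not the paper's actual sectioning procedure.

```python
# Illustrative heuristic for splitting a long CoT into sections; the 200-character
# minimum is an arbitrary assumption for the example.
def split_into_sections(cot_text, min_chars=200):
    paragraphs = [p.strip() for p in cot_text.split("\n\n") if p.strip()]
    sections = []
    for para in paragraphs:
        if sections and len(sections[-1]) < min_chars:
            sections[-1] += "\n\n" + para  # merge short fragment into previous section
        else:
            sections.append(para)
    return sections
```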

Each section is then annotated with tags for strategy shift (whether it introduces a new approach), reasoning usefulness (whether it contributes to solving the problem), reasoning correctness (whether it contains errors), and reflection efficiency (whether it includes effective self-reflection). The annotation process uses human annotators (Master's and Ph.D. graduates from various disciplines) to ensure high-quality evaluations.
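One way to picture the resulting annotation record is a per-section structure carrying the four tags. The field names below mirror the dimensions described above but are otherwise illustrative, not the benchmark's actual schema.

```python
# Possible representation of one annotated CoT section (illustrative schema).
from dataclasses import dataclass
from typing import Optional

@dataclass
class SectionAnnotation:
    section_id: int
    text: str
    strategy_shift: bool                   # does this section introduce a new approach?
    reasoning_useful: bool                 # does it contribute to solving the problem?
    reasoning_correct: bool                # is it free of errors?
    reflection_effective: Optional[bool]   # only meaningful if the section contains self-reflection
    error_explanation: str = ""            # annotator's note on what went wrong, if anything
```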

For evaluation, DeltaBench uses recall, precision, and macro-F1 scores to measure how well Process Reward Models (PRMs) and critic models can identify erroneous sections. PRMs use an outlier detection technique based on Z-Score to make predictions, while critic models are prompted to identify all erroneous sections within a long CoT.
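As a rough sketch of this evaluation: given per-section reward scores from a PRM, sections whose score falls far below the mean (a strongly negative Z-score) are flagged as erroneous, and the flagged set is compared against the human error labels. The threshold, the score convention, and the toy numbers below are assumptions for illustration; the paper's macro-F1 would average the per-sample F1 across the benchmark.

```python
import numpy as np

def flag_errors_by_zscore(section_scores, z_threshold=-1.0):
    """Flag sections whose PRM reward score is an unusually low outlier.
    The -1.0 threshold is an illustrative choice, not the paper's setting."""
    scores = np.asarray(section_scores, dtype=float)
    z = (scores - scores.mean()) / (scores.std() + 1e-8)
    return [i for i, zi in enumerate(z) if zi < z_threshold]

def precision_recall_f1(predicted_error_ids, gold_error_ids):
    predicted, gold = set(predicted_error_ids), set(gold_error_ids)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: a PRM scores six sections; annotators marked sections 2 and 5 as erroneous.
pred = flag_errors_by_zscore([0.91, 0.88, 0.32, 0.85, 0.90, 0.40])
print(pred, precision_recall_f1(pred, gold_error_ids=[2, 5]))
```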

Results

The key findings are:

  1. Fundamental errors (calculation errors, syntax errors, format errors) are common in existing o1-like models, accounting for approximately 25% of errors in QwQ-32B-Preview and 23% in Gemini 2.0 Flash Thinking.
  2. The proportion of effective reflection is very low: approximately 67.8% of reflections in the collected long CoT responses are ineffective.
  3. Long CoT reasoning processes contain significant redundancy, with an average of 27% of reasoning sections being unnecessary.

When evaluating the critique abilities of LLMs and PRMs, the paper found:

  1. The ability to identify errors in long CoT reasoning is limited even for top-performing models. GPT-4-turbo-128k achieved the highest F1-score of only 40.8%.
  2. o1-like models showed no advantage over non-o1-like models in critique abilities, with o1-preview performing worse than GPT-4o-mini.
  3. Models demonstrated weaker self-critique abilities compared to their critique abilities on other models' outputs. For example, DeepSeek-R1 showed a 36% reduction in self-critique performance.
  4. Critic models' performance degraded significantly with longer contexts, while PRMs maintained more consistent evaluation capability across varying lengths.
  5. Models generally performed better at identifying calculation errors but struggled with strategy errors, indicating limited generalization across error types.

Conclusion

DeltaBench provides a comprehensive framework for evaluating both the quality of long CoT reasoning and the critique abilities of LLMs. The findings highlight significant limitations in current models, including the prevalence of fundamental errors, low reflection efficiency, and redundant reasoning processes. For more information, please consult the full paper.

Congrats to the authors for their work!

He, Yancheng, et al. "Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?" arXiv preprint arXiv:2502.19361 (2025).
