Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Today's paper introduces DeltaBench, a comprehensive benchmark for evaluating the ability of Large Language Models (LLMs) to detect errors in long Chain-of-Thought (CoT) reasoning. The authors analyze the quality of long CoTs generated by o1-like models and assess the critique abilities of existing LLMs, providing valuable insights into the limitations of current models and potential areas for improvement.
Method Overview
DeltaBench is constructed through a multi-stage process that begins with collecting diverse queries from various open-source datasets across domains including mathematics, programming, physics, chemistry, biology, and general reasoning. The queries undergo clustering, deduplication, difficulty filtering, and subcategory sampling to ensure diversity and balance.
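Below is a minimal sketch of what such a curation pipeline might look like, assuming each query carries a subcategory label and a precomputed difficulty score; the semantic clustering step and all thresholds are simplifications of my own, not the paper's actual procedure.

```python
import random
from collections import defaultdict

def build_query_pool(queries, per_subcategory=50, seed=0):
    """Hypothetical sketch of DeltaBench-style query curation:
    deduplicate, filter by difficulty, then sample per subcategory.

    `queries` is assumed to be a list of dicts with "text",
    "subcategory", and a "difficulty" score in [0, 1].
    """
    # 1. Deduplicate on normalized text (the paper also clusters
    #    semantically similar queries; omitted here for brevity).
    seen, unique = set(), []
    for q in queries:
        key = " ".join(q["text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(q)

    # 2. Keep only sufficiently difficult queries (the threshold is
    #    an assumption, not a value from the paper).
    hard = [q for q in unique if q["difficulty"] >= 0.5]

    # 3. Sample evenly across subcategories to balance the benchmark.
    by_cat = defaultdict(list)
    for q in hard:
        by_cat[q["subcategory"]].append(q)
    rng = random.Random(seed)
    pool = []
    for items in by_cat.values():
        rng.shuffle(items)
        pool.extend(items[:per_subcategory])
    return pool
```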
The benchmark then generates long CoT solutions using several o1-like models such as QwQ-32B-Preview, DeepSeek-R1, and Gemini 2.0 Flash Thinking. Rather than evaluating these solutions at the step level, which can be overly granular, DeltaBench divides each long CoT into multiple sections representing independent sub-tasks. This approach aligns better with human cognitive patterns and facilitates more meaningful annotation.
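As a toy illustration only: one naive way to approximate this sectioning is to split at discourse markers that often signal a strategy shift or a new sub-task. DeltaBench's actual sections are delimited during annotation rather than by any fixed rule, so the marker list and splitting logic below are assumptions.

```python
import re

# Illustrative markers that often open a new sub-task or strategy
# shift in o1-style reasoning traces; not the paper's method.
MARKERS = r"(?:Alternatively|Wait|Let me try|Another approach|Hmm)"

def split_into_sections(cot: str):
    """Split a long CoT into candidate sections at paragraph
    boundaries that begin with a marker, keeping each marker
    attached to the section it opens."""
    parts = re.split(rf"\n(?={MARKERS})", cot)
    return [p.strip() for p in parts if p.strip()]
```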
Each section is then annotated with tags for strategy shift (whether it introduces a new approach), reasoning usefulness (whether it contributes to solving the problem), reasoning correctness (whether it contains errors), and reflection efficiency (whether it includes effective self-reflection). The annotation process uses human annotators (Master's and Ph.D. graduates from various disciplines) to ensure high-quality evaluations.
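As an illustration, the per-section annotations could be represented with a record like the following; the field names are hypothetical and do not reflect the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SectionAnnotation:
    """Hypothetical record mirroring the per-section tags
    described above; field names are illustrative."""
    section_id: int
    strategy_shift: bool        # introduces a new approach
    reasoning_useful: bool      # contributes to solving the problem
    reasoning_correct: bool     # free of errors
    reflection_effective: bool  # self-reflection that actually helps
    error_explanation: str = ""  # annotator's note when incorrect
```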
For evaluation, DeltaBench uses recall, precision, and macro-F1 scores to measure how well Process Reward Models (PRMs) and critic models identify erroneous sections. PRMs flag sections via Z-score-based outlier detection over their section scores, while critic models are prompted directly to identify all erroneous sections within a long CoT.
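To make the evaluation protocol concrete, here is a minimal sketch assuming one PRM score per section; the Z-score threshold and helper names are illustrative, not values from the paper. Macro-F1 would then be the mean of per-example F1 scores.

```python
import statistics

def flag_error_sections(section_scores, z_threshold=-1.5):
    """Z-score outlier rule as described above: sections whose PRM
    score falls far below the mean are flagged as erroneous.
    The threshold is an assumption; the paper's cutoff may differ."""
    mu = statistics.mean(section_scores)
    sigma = statistics.pstdev(section_scores) or 1e-8  # avoid /0
    return [i for i, s in enumerate(section_scores)
            if (s - mu) / sigma <= z_threshold]

def precision_recall_f1(predicted, gold):
    """Section-level precision/recall/F1 against annotated error
    sections; macro-F1 averages F1 across benchmark examples."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```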
Results
The key findings fall into two parts. On the quality of long CoTs themselves, the annotated data shows that significant limitations are common in current o1-like models, including fundamental errors, low reflection efficiency, and redundant reasoning. On critique ability, both existing critic models and PRMs struggle to reliably identify erroneous sections in long CoTs, indicating substantial room for improvement in error detection.
Conclusion
DeltaBench provides a comprehensive framework for evaluating both the quality of long CoT reasoning and the critique abilities of LLMs. The findings highlight significant limitations in current models, including the prevalence of fundamental errors, low reflection efficiency, and redundant reasoning processes. For more information, please consult the full paper.
Congrats to the authors for their work!
He, Yancheng, et al. "Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?" arXiv preprint arXiv:2502.19361 (2025).