START: Self-taught Reasoner with Tools
Credit: https://arxiv.org/pdf/2503.04625

Today's paper introduces START (Self-taught Reasoner with Tools), a novel approach that enhances large language models' reasoning capabilities by integrating external tools, particularly Python code execution, into the reasoning process. The method addresses the limitations of existing large reasoning models (LRMs) which often suffer from hallucinations and computational inaccuracies by enabling models to perform complex computations, self-check their work, explore diverse methods, and self-debug through code execution.

Method Overview

START introduces a self-learning framework that teaches LLMs to utilize external tools (specifically Python interpreters) during their reasoning process. The approach consists of two key techniques: Hint-infer and Hint Rejection Sampling Fine-Tuning (Hint-RFT).

Hint-infer works by inserting artificially designed hints (such as "Wait, maybe using Python here is a good idea") at strategic points during a model's reasoning process. These hints are inserted either after specific conjunction words (like "Alternatively" or "Wait") that typically indicate the model is reconsidering its approach, or before the model's stop token. When the model encounters these hints, it's prompted to use external tools like Python code execution to verify calculations or test solutions.
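The hint-insertion logic described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the hint text, the trigger set, and the stop token are assumptions based on the description above.

```python
# Hypothetical hint; the paper's Hint-Library contains multiple hints
# tailored to different scenarios (math, debugging, etc.).
HINT = "Wait, maybe using Python here is a good idea."

# Conjunction words that typically signal the model reconsidering its
# approach (the exact trigger set used in the paper is an assumption here).
TRIGGERS = ("Alternatively", "Wait")

def hint_infer(reasoning_text: str, stop_token: str = "</think>") -> str:
    """Insert a tool-use hint after the first trigger word, or just
    before the stop token if no trigger appears."""
    for trigger in TRIGGERS:
        idx = reasoning_text.find(trigger)
        if idx != -1:
            end = idx + len(trigger)
            return reasoning_text[:end] + ". " + HINT + reasoning_text[end:]
    # No trigger found: place the hint right before the stop token.
    if stop_token in reasoning_text:
        return reasoning_text.replace(stop_token, HINT + " " + stop_token, 1)
    return reasoning_text + " " + HINT
```

In practice the insertion happens during decoding, so the model then continues generating from the hint and typically emits a Python code block, which is executed and its output fed back into the context.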

Hint-RFT builds upon Hint-infer by creating a self-training framework. First, the authors create a "Hint-Library" containing various types of hints tailored to different reasoning scenarios (mathematical reasoning, code debugging, etc.). They then apply Hint-infer to generate reasoning trajectories that include tool invocation. These trajectories are scored, filtered, and modified to create a seed dataset (Dseed). The base model (QwQ-32B) is fine-tuned on this dataset to create START-0, which can use tools without explicit hints. Finally, START-0 is used to generate more diverse training data through rejection sampling, resulting in the final START model after another round of fine-tuning.
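The scoring-and-filtering step that produces Dseed can be sketched as below. The function names, the tool-invocation marker, and the filtering rule (keep trajectories that both invoke the tool and reach the correct answer, then deduplicate) are illustrative assumptions drawn from the description above, not the authors' implementation.

```python
def build_seed_dataset(problems, generate_with_hints, answer_of):
    """Collect (problem, trajectory) pairs for fine-tuning.

    problems:            list of {"question": ..., "answer": ...} dicts
    generate_with_hints: yields Hint-infer rollouts for a problem
    answer_of:           extracts the final answer from a trajectory
    """
    seen = set()
    seed = []
    for problem in problems:
        for traj in generate_with_hints(problem):
            used_tool = "```python" in traj          # tool-call marker (assumed)
            correct = answer_of(traj) == problem["answer"]
            if used_tool and correct and traj not in seen:
                seen.add(traj)                       # deduplicate trajectories
                seed.append({"problem": problem["question"],
                             "trajectory": traj})
    return seed
```

Fine-tuning QwQ-32B on a dataset curated this way yields START-0, which invokes the interpreter without needing explicit hints; the same filtering idea is then reused in the rejection-sampling round that produces the final START model.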

The training process involves two main phases: first applying Hint-RFT to teach the model to use tools when appropriate, then using RFT (Rejection sampling Fine-Tuning) to enhance the diversity and quality of the model's tool usage patterns. This approach allows the model to learn when and how to use external tools during complex reasoning tasks without requiring human-annotated examples of tool use.

Results

START demonstrates significant improvements across various challenging reasoning benchmarks compared to its base model (QwQ) and achieves performance comparable to state-of-the-art models:

  • On PhD-level science QA (GPQA), START achieves 63.6% accuracy, a 5.5% absolute improvement over QwQ.
  • On mathematical benchmarks, START shows substantial gains: 3.8% improvement on MATH500 (94.4%), 15.0% on AMC23 (95.0%), 16.7% on AIME24 (66.7%), and 7.1% on AIME25 (47.1%).
  • On the programming benchmark LiveCodeBench, START achieves 47.3%, a 5.9% improvement over QwQ.

The paper also demonstrates that these improvements come primarily from the tool integration capability rather than just additional training data. When comparing START to a version of QwQ fine-tuned on the same data but without tool integration (QwQ-RFT), START consistently outperforms across all benchmarks.

Conclusion

START combines long chain-of-thought reasoning with tool integration. By teaching models to leverage external tools like Python interpreters, START addresses key limitations of existing reasoning models, particularly in tasks requiring complex computations or code execution. The self-learning framework (Hint-RFT) provides an effective way to train models to use tools without requiring human demonstrations. For more information, please consult the full paper.

Congrats to the authors for their work!

Li, Chengpeng, et al. "START: Self-taught Reasoner with Tools." arXiv preprint arXiv:2503.04625 (2025).
