START: Self-taught Reasoner with Tools
Credit: https://arxiv.org/pdf/2503.04625

Today's paper introduces START (Self-taught Reasoner with Tools), a novel approach that enhances large language models' reasoning capabilities by integrating external tools, particularly Python code execution, into the reasoning process. The method addresses the limitations of existing large reasoning models (LRMs) which often suffer from hallucinations and computational inaccuracies by enabling models to perform complex computations, self-check their work, explore diverse methods, and self-debug through code execution.

Method Overview

START introduces a self-learning framework that teaches LLMs to utilize external tools (specifically Python interpreters) during their reasoning process. The approach consists of two key techniques: Hint-infer and Hint Rejection Sampling Fine-Tuning (Hint-RFT).

Hint-infer works by inserting artificially designed hints (such as "Wait, maybe using Python here is a good idea") at strategic points during a model's reasoning process. These hints are inserted either after specific conjunction words (like "Alternatively" or "Wait") that typically indicate the model is reconsidering its approach, or before the model's stop token. When the model encounters these hints, it's prompted to use external tools like Python code execution to verify calculations or test solutions.
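The hint-insertion logic described above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the hint text, the trigger set, and the stop token are assumptions based on the description above.

```python
# Hypothetical hint; the paper's Hint-Library contains multiple hints
# tailored to different scenarios (math, debugging, etc.).
HINT = "Wait, maybe using Python here is a good idea."

# Conjunction words that typically signal the model reconsidering its
# approach (the exact trigger set used in the paper is an assumption here).
TRIGGERS = ("Alternatively", "Wait")

def hint_infer(reasoning_text: str, stop_token: str = "</think>") -> str:
    """Insert a tool-use hint after the first trigger word, or just
    before the stop token if no trigger appears."""
    for trigger in TRIGGERS:
        idx = reasoning_text.find(trigger)
        if idx != -1:
            end = idx + len(trigger)
            return reasoning_text[:end] + ". " + HINT + reasoning_text[end:]
    # No trigger found: place the hint right before the stop token.
    if stop_token in reasoning_text:
        return reasoning_text.replace(stop_token, HINT + " " + stop_token, 1)
    return reasoning_text + " " + HINT
```

In practice the insertion happens during decoding, so the model then continues generating from the hint and typically emits a Python code block, which is executed and its output fed back into the context.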

Hint-RFT builds upon Hint-infer by creating a self-training framework. First, the authors create a "Hint-Library" containing various types of hints tailored to different reasoning scenarios (mathematical reasoning, code debugging, etc.). They then apply Hint-infer to generate reasoning trajectories that include tool invocation. These trajectories are scored, filtered, and modified to create a seed dataset (Dseed). The base model (QwQ-32B) is fine-tuned on this dataset to create START-0, which can use tools without explicit hints. Finally, START-0 is used to generate more diverse training data through rejection sampling, resulting in the final START model after another round of fine-tuning.
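The scoring-and-filtering step that produces Dseed can be sketched as below. The function names, the tool-invocation marker, and the filtering rule (keep trajectories that both invoke the tool and reach the correct answer, then deduplicate) are illustrative assumptions drawn from the description above, not the authors' implementation.

```python
def build_seed_dataset(problems, generate_with_hints, answer_of):
    """Collect (problem, trajectory) pairs for fine-tuning.

    problems:            list of {"question": ..., "answer": ...} dicts
    generate_with_hints: yields Hint-infer rollouts for a problem
    answer_of:           extracts the final answer from a trajectory
    """
    seen = set()
    seed = []
    for problem in problems:
        for traj in generate_with_hints(problem):
            used_tool = "```python" in traj          # tool-call marker (assumed)
            correct = answer_of(traj) == problem["answer"]
            if used_tool and correct and traj not in seen:
                seen.add(traj)                       # deduplicate trajectories
                seed.append({"problem": problem["question"],
                             "trajectory": traj})
    return seed
```

Fine-tuning QwQ-32B on a dataset curated this way yields START-0, which invokes the interpreter without needing explicit hints; the same filtering idea is then reused in the rejection-sampling round that produces the final START model.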

The training process involves two main phases: first applying Hint-RFT to teach the model to use tools when appropriate, then using RFT (Rejection sampling Fine-Tuning) to enhance the diversity and quality of the model's tool usage patterns. This approach allows the model to learn when and how to use external tools during complex reasoning tasks without requiring human-annotated examples of tool use.

Results

START demonstrates significant improvements across various challenging reasoning benchmarks compared to its base model (QwQ) and achieves performance comparable to state-of-the-art models:

  • On PhD-level science QA (GPQA), START achieves 63.6% accuracy, a 5.5% absolute improvement over QwQ.
  • On mathematical benchmarks, START shows substantial gains: 3.8% improvement on MATH500 (94.4%), 15.0% on AMC23 (95.0%), 16.7% on AIME24 (66.7%), and 7.1% on AIME25 (47.1%).
  • On the programming benchmark LiveCodeBench, START achieves 47.3%, a 5.9% improvement over QwQ.

The paper also demonstrates that these improvements come primarily from the tool integration capability rather than just additional training data. When comparing START to a version of QwQ fine-tuned on the same data but without tool integration (QwQ-RFT), START consistently outperforms across all benchmarks.

Conclusion

START combines long chain-of-thought reasoning with tool integration. By teaching models to leverage external tools like Python interpreters, START addresses key limitations of existing reasoning models, particularly in tasks requiring complex computations or code execution. The self-learning framework (Hint-RFT) provides an effective way to train models to use tools without requiring human demonstrations. For more information, please consult the full paper.

Congrats to the authors for their work!

Li, Chengpeng, et al. "START: Self-taught Reasoner with Tools." arXiv preprint arXiv:2503.04625 (2025).
