SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Credit: https://arxiv.org/pdf/2502.18449

Today's paper introduces SWE-RL, an approach that uses reinforcement learning to enhance large language models' reasoning capabilities for software engineering tasks. The method leverages software evolution data (like GitHub pull requests) and rule-based rewards to train LLMs to solve real-world software issues. SWE-RL enables open-source models to achieve competitive performance on software engineering benchmarks.

Method Overview

SWE-RL is a reinforcement learning framework that trains LLMs to solve software engineering tasks using real-world software evolution data. The process begins with curating a comprehensive dataset of GitHub pull requests (PRs), which includes issue descriptions, code contexts, and the corresponding patches that fixed those issues. This data serves as the foundation for the reinforcement learning process.

SWE-RL trains a policy LLM to generate code changes through reasoning. For each issue in the training data, the model is presented with the issue description and relevant code context. It then attempts to solve the issue by generating a patch. The quality of this patch is evaluated using a simple rule-based reward function: if the format is incorrect, the model receives a negative reward; otherwise, it receives a reward based on the similarity between the predicted patch and the oracle (ground truth) patch.
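To make this concrete, here is a minimal sketch of such a rule-based reward in Python. It assumes a fixed penalty of -1 for malformed output and uses difflib.SequenceMatcher as the similarity metric; the exact penalty value and metric used in the paper may differ.

```python
import difflib

def swe_rl_reward(pred_patch: str | None, oracle_patch: str) -> float:
    """Rule-based reward: penalize malformed output, otherwise score the
    predicted patch by its textual similarity to the oracle patch."""
    if pred_patch is None:  # rollout could not be parsed into a valid patch format
        return -1.0         # assumed penalty for a format violation
    # Continuous similarity in [0, 1] between predicted and ground-truth patches.
    return difflib.SequenceMatcher(None, pred_patch, oracle_patch).ratio()
```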

What makes SWE-RL unique is its approach to providing context. The model is given the complete content of each file in the input prompt, which implicitly teaches it to reason about precise fault locations before suggesting repair edits. This forces the model to develop both bug diagnosis and repair generation capabilities. The training uses Group Relative Policy Optimization (GRPO), where multiple rollouts are generated for each problem, and the policy is updated based on the normalized rewards within each group.
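A minimal sketch of the group-relative idea follows: several rollouts are sampled for the same issue, their rewards are standardized within the group, and the standardized values serve as advantages for the policy update. The epsilon term and the omission of any KL regularization are simplifications for illustration, not the paper's exact objective.

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style normalization: each rollout's advantage is its reward
    standardized against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one issue, scored by the rule-based reward above.
print(group_advantages([0.92, 0.40, -1.0, 0.75]))
```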

Importantly, SWE-RL only requires the model to generate repair edits during training, yet the resulting model can generalize to other related tasks like file-level fault localization and test generation. This emergent capability demonstrates how reinforcement learning can help models develop broader reasoning skills beyond the specific training objective.

Results

The paper's main result is Llama3-SWE-RL-70B, a model trained with SWE-RL on top of Llama-3.3-70B-Instruct. This model achieves a 41.0% solve rate on SWE-bench Verified, a human-verified collection of real-world GitHub issues. This performance represents the best result among medium-sized language models (<100B parameters) and is comparable to leading proprietary models like GPT-4o.

When compared to a supervised fine-tuning (SFT) baseline trained on the same data, Llama3-SWE-RL-70B demonstrates superior performance not only on SWE-bench but also on out-of-domain tasks. The model shows improved results on five different categories: function coding, library use, code reasoning, mathematics, and general language understanding. This indicates that SWE-RL helps the model develop generalized reasoning skills that transfer across domains, whereas the SFT approach tends to overfit to specific task distributions.

The paper also demonstrates that using a continuous reward function (based on sequence similarity) outperforms a discrete reward function (exact match only) in SWE-RL. The continuous reward better captures partial correctness and allows for more nuanced learning of repair strategies.
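A hypothetical comparison illustrates the difference: an exact-match reward gives no signal for a near-correct patch, while a similarity-based reward still provides graded credit. The example patches below are invented for illustration.

```python
import difflib

def discrete_reward(pred: str, oracle: str) -> float:
    # Exact-match reward: 1 only if the oracle patch is reproduced verbatim.
    return 1.0 if pred == oracle else 0.0

def continuous_reward(pred: str, oracle: str) -> float:
    # Similarity-based reward: partial credit for near-miss patches.
    return difflib.SequenceMatcher(None, pred, oracle).ratio()

pred = "-    return a - b\n+    return a + b"
oracle = "-    return a - b\n+    return a + b  # fix sign"
print(discrete_reward(pred, oracle))    # 0.0 — no learning signal despite being close
print(continuous_reward(pred, oracle))  # ~0.85 — graded signal the policy can learn from
```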

Conclusion

SWE-RL leverages reinforcement learning on software evolution data, enabling models to develop strong reasoning capabilities for solving real-world software issues without relying on proprietary models. The resulting Llama3-SWE-RL-70B model achieves state-of-the-art performance among medium-sized models on SWE-bench. For more information, please consult the full paper.

Congrats to the authors for their work!

Wei, Yuxiang, et al. "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution." arXiv preprint arXiv:2502.18449 (2025).
