Alibaba Qwen QwQ-32B: Scaled Reinforcement Learning Showcase
Alibaba has unveiled Qwen QwQ-32B, a 32-billion-parameter AI model that demonstrates performance rivaling the much larger DeepSeek-R1. This breakthrough highlights the potential of scaling Reinforcement Learning (RL) on robust foundation models.
Advancing Reinforcement Learning in AI
The Qwen team has successfully integrated agent capabilities into the reasoning model, enabling it to think critically, utilize tools, and adapt based on environmental feedback.
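The announcement does not include implementation details, but the agent behavior it describes follows a familiar pattern: the model reasons, optionally calls a tool, observes the result, and continues. The sketch below illustrates that loop in Python; every name in it (the `<tool>` tag format, `parse_tool_call`, the `TOOLS` registry) is a hypothetical illustration, not the Qwen team's actual interface.

```python
import re

# Minimal sketch of an agentic reasoning loop: the model thinks, may
# request a tool, and adapts to the tool's output. The <tool> tag format,
# parse_tool_call, and TOOLS are hypothetical illustrations, not the
# Qwen team's published interface.

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy tool for demonstration
}

def parse_tool_call(text):
    """Extract a call written as <tool name="calculator">2+2</tool>."""
    m = re.search(r'<tool name="(\w+)">(.*?)</tool>', text, re.S)
    return {"name": m.group(1), "args": m.group(2)} if m else None

def agent_loop(model, prompt, max_turns=5):
    """Alternate between model reasoning and environment feedback."""
    transcript = prompt
    output = ""
    for _ in range(max_turns):
        output = model(transcript)   # model reasons over the transcript so far
        call = parse_tool_call(output)
        if call is None:
            break                    # no tool requested: output is the answer
        result = TOOLS[call["name"]](call["args"])
        # Append the tool result so the model can adapt its next step
        transcript += f"{output}\n[{call['name']} result] {result}\n"
    return output
```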
“Scaling RL has the potential to enhance model performance beyond conventional pretraining and post-training methods,” the team stated. “Recent studies have demonstrated that RL can significantly improve the reasoning capabilities of models.”
QwQ-32B achieves performance comparable to DeepSeek-R1, which has 671 billion parameters (37 billion activated per token). Matching a model with roughly twenty times as many total parameters demonstrates how effectively RL can narrow the gap between model size and capability.
Benchmark Performance
QwQ-32B was evaluated across key benchmarks assessing mathematical reasoning, coding proficiency, and problem-solving skills. The results highlight its competitiveness against other leading models, including DeepSeek-R1-Distilled-Qwen-32B, DeepSeek-R1-Distilled-Llama-70B, and o1-mini.
The RL Training Approach
The Qwen team implemented a multi-stage RL process using outcome-based rewards. The first stage focused on RL scaling for math and coding tasks, leveraging accuracy verifiers and code execution servers. The second stage expanded RL to general capabilities, integrating general reward models and rule-based verifiers.
“We find that this stage of RL training with a small number of steps can enhance general capabilities such as instruction following, alignment with human preference, and agent performance without significant performance drops in math and coding,” the team explained.
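The post describes the rewards only at this level of detail, but outcome-based rewards of this kind are straightforward to sketch: a response earns reward only if its final outcome checks out, via an answer verifier for math or actual execution against tests for code. The Python below is a minimal stand-in under those assumptions; the helper names and the pass/fail 1.0/0.0 reward scheme are illustrative, not the team's published implementation.

```python
import os
import subprocess
import tempfile

# Sketch of the outcome-based rewards described above: an accuracy
# verifier for math and real code execution for programming tasks.
# The 1.0/0.0 reward scheme and these helpers are illustrative
# assumptions, not the Qwen team's published implementation.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Accuracy verifier: reward only a verified-correct final answer."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(candidate_code: str, test_code: str, timeout: int = 10) -> float:
    """Code-execution stand-in: reward code that passes its test suite."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0  # outcome-only: pass/fail
    except subprocess.TimeoutExpired:
        return 0.0  # hanging code earns no reward
    finally:
        os.unlink(path)  # clean up the temporary script
```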
Open Access and Future Directions
QwQ-32B is open-weight and available on Hugging Face and ModelScope under the Apache 2.0 license. It can also be accessed via Qwen Chat.
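Because the weights are public, the model can be loaded with the Hugging Face `transformers` library. The sketch below uses the released model ID `Qwen/QwQ-32B`; the prompt and generation settings are illustrative, and running the full 32-billion-parameter model locally requires substantial GPU memory.

```python
# Sketch of loading the open weights from Hugging Face with transformers.
# "Qwen/QwQ-32B" is the public model ID; generation settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```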
The Qwen team views this as a foundational step in scaling RL for advanced reasoning. Their future work will explore deeper integration of RL with agent-based learning for long-horizon reasoning.
“As we work towards developing the next generation of Qwen, we believe that combining stronger foundation models with RL powered by scaled computational resources will bring us closer to achieving Artificial General Intelligence (AGI),” the team stated.