QwQ-32B: 20x smaller than DeepSeek-R1
In this issue:
1. QwQ-32B: Embracing the Power of Reinforcement Learning
2. LLM Post-Training: A Deep Dive into Reasoning Large Language Models
3. Fractal Generative Models
Accelerate your AI projects with Prolific. Claim $50 in free credits and get quality human data in minutes from 200,000+ taskers.
No setup cost, no subscription, no delay—get started, top up your account to claim your free credit, and test Prolific for yourself now.
Use code: LLM-WATCH-50
1. QwQ-32B: Embracing the Power of Reinforcement Learning
What problem does it solve? Traditional language model training often plateaus with conventional pretraining and post-training methods, limiting models' reasoning capabilities. The Qwen team's research explores how to effectively scale Reinforcement Learning (RL) to enhance large language model intelligence beyond these limitations. This is particularly challenging because applying RL at scale has been primarily the domain of large, proprietary models with massive parameter counts. The research tackles the fundamental question: can a relatively small, open model leverage RL techniques effectively enough to compete with much larger models?
How does it solve the problem? The Qwen team implemented a multi-stage RL scaling approach driven by outcome-based rewards rather than traditional reward models. Starting with a cold-start checkpoint, they first scaled RL specifically for math and coding tasks, using an accuracy verifier for math problems and a code execution server to assess code correctness against test cases. This allowed the model to receive direct feedback based on actual outcomes rather than proxy reward models. After this initial stage showed continuous improvement in performance, they added a second stage of RL training for general capabilities, using a combination of general reward models and rule-based verifiers, enhancing instruction following and alignment with human preferences without sacrificing the specialized capabilities gained in the first stage.
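To make the outcome-based idea concrete, here is a minimal sketch (not the Qwen team's actual code) of what such verifiers could look like: a math reward that simply compares final answers, and a coding reward that executes the generated program against test cases. The function names and the 5-second timeout are illustrative assumptions.

```python
import subprocess
import tempfile

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Outcome-based reward for math: 1.0 if the final answer matches
    the reference, else 0.0 - no learned reward model involved."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(generated_code: str, test_cases: list[tuple[str, str]]) -> float:
    """Outcome-based reward for coding: the fraction of test cases the
    generated program passes when actually executed."""
    passed = 0
    for stdin_text, expected_stdout in test_cases:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code)
            path = f.name
        try:
            result = subprocess.run(
                ["python", path], input=stdin_text,
                capture_output=True, text=True, timeout=5,
            )
            if result.stdout.strip() == expected_stdout.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # a hanging program earns no credit for this case
    return passed / max(len(test_cases), 1)
```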
What are the key findings? QwQ-32B, with just 32 billion parameters, achieves performance comparable to DeepSeek-R1, which has 671 billion parameters (with 37 billion activated). This represents a remarkable, roughly 20x reduction in total parameter count while maintaining similar capabilities. The research demonstrates that RL can significantly enhance a model's reasoning abilities when applied to foundation models pretrained on extensive world knowledge. Additionally, they successfully integrated agent-related capabilities into the reasoning model, enabling critical thinking while utilizing tools and adapting reasoning based on environmental feedback.
Why does it matter? It challenges the assumption that achieving cutting-edge AI capabilities requires massive model scaling. By demonstrating that a relatively small open-weight model can perform comparably to much larger models through targeted RL techniques, QwQ-32B opens a path toward more efficient and accessible AI development. This approach could democratize access to high-performing AI systems by reducing the enormous computational resources typically required. Furthermore, the successful integration of agent capabilities suggests a path toward more general intelligence that can reason adaptively and learn from its environment, potentially accelerating progress toward artificial general intelligence while making such technology more widely available under open licenses.
2. LLM Post-Training: A Deep Dive into Reasoning Large Language Models
What problem does it solve? Large Language Models (LLMs) have transformed the natural language processing landscape, but despite impressive pretraining capabilities, they still suffer from critical shortcomings like hallucinations, logical inconsistencies, and misalignment with human values. While pretraining on vast web-scale data establishes broad linguistic foundations, it's insufficient for ensuring robust reasoning, factual accuracy, and ethical alignment. This comprehensive survey systematically explores post-training methodologies—techniques applied after initial pretraining—and how they can refine LLMs' capabilities, addressing the research gap between model pretraining and deployment readiness.
How does it solve the problem? The authors provide a systematic taxonomy of post-training approaches organized into three complementary categories: fine-tuning, reinforcement learning (RL), and test-time scaling (TTS). For fine-tuning, they analyze various adaptation strategies from full-model tuning to parameter-efficient methods like LoRA. The RL section examines reward modeling techniques and policy optimization algorithms—from conventional approaches like PPO to newer methods such as GRPO (Group Relative Policy Optimization) and DPO (Direct Preference Optimization). For test-time scaling, they categorize techniques like Chain-of-Thought prompting, Tree-of-Thoughts, Monte Carlo Tree Search, and self-consistency methods that enhance reasoning during inference without model updates. Throughout, the authors provide practical benchmarks, evaluation metrics, and identify emerging research directions while highlighting the synergies between these complementary approaches.
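As an example of the test-time scaling category, here is a minimal sketch of self-consistency, assuming a generic generate() call that returns a final answer string (a placeholder, not any particular library's API): sample several reasoning chains at non-zero temperature and keep the majority answer, with no weight updates involved.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for any LLM call that returns a final answer string."""
    raise NotImplementedError  # swap in your model or API of choice

def self_consistency(prompt: str, n_samples: int = 8) -> str:
    """Sample several reasoning chains and return the majority answer.
    Pure test-time compute: the model itself is never updated."""
    answers = [generate(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```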
What are the key findings? The survey reveals that combining multiple post-training approaches yields optimal results, with process-based rewards generally outperforming outcome-based rewards for complex reasoning tasks. Remarkably, smaller models with effective test-time compute allocation can sometimes outperform much larger models (up to 14× bigger) on intermediate difficulty tasks while reducing inference costs by 4×. Novel RL methods like GRPO can simplify training by eliminating separate value functions while maintaining performance.
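The value-function-free trick behind GRPO can be sketched in a few lines: each prompt gets a group of sampled responses, their outcome rewards are normalized within the group, and those normalized scores serve as advantages. This is a simplified illustration of the published idea, not a full training loop.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each response's reward against the
    mean and std of its own group, removing the need for a separate critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# e.g. four sampled answers to the same prompt, scored 1.0 (correct) or 0.0
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```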
Why does it matter? These findings are important because they provide a unified framework for improving LLMs beyond pretraining, offering researchers and practitioners a systematic approach to navigate computational trade-offs. The insights about compute-optimal scaling suggest that we don't always need bigger models—sometimes smarter inference strategies deliver better results more efficiently.
3. Fractal Generative Models
What problem does it solve? Generative AI models have become incredibly powerful, but they still face challenges when dealing with very high-dimensional data like pixel-by-pixel image generation. Traditional approaches either require tokenizers that compress images (losing information) or become computationally prohibitive when working directly with pixels. This paper introduces a novel framework called "fractal generative models" that tackles the fundamental question: how can we build more efficient generative models by abstracting existing ones into modular components that can be recursively called?
How does it solve the problem? Taking inspiration from fractal patterns in nature, the researchers developed a divide-and-conquer approach where generative models recursively call themselves to create self-similar architectures across different levels. Think of it like Russian nesting dolls, but for AI! They instantiated this framework using autoregressive models (the kind that predict one token at a time) as their atomic building blocks. For image generation, they first model relationships between 16×16 patches, then within each patch, they model 4×4 sub-patches, continuing down to individual pixels. This hierarchical approach dramatically reduces computational complexity: modeling a 256×256 image requires only twice the computation of a 64×64 image.
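The recursive structure is easier to see in code. The toy sketch below is schematic only (the actual FractalMAR levels and patch sizes differ, and each level is a learned autoregressive module conditioned on the coarser one): a single generator splits its region into a grid of patches and calls itself on each patch until it reaches single pixels.

```python
import numpy as np

def fractal_generate(size: int, grid: int = 4, rng=None) -> np.ndarray:
    """Schematic fractal generation: split a size x size region into a
    grid x grid arrangement of patches and recurse into each patch until
    single pixels remain. A real model would sample each level
    autoregressively, conditioned on the level above; here we just fill
    in random values to show the call structure."""
    rng = rng or np.random.default_rng(0)
    if size == 1:
        return rng.random((1, 1))      # base case: one pixel value
    sub = size // grid
    image = np.zeros((size, size))
    for i in range(grid):
        for j in range(grid):
            # the same generator handles every sub-patch, one level down
            image[i*sub:(i+1)*sub, j*sub:(j+1)*sub] = fractal_generate(sub, grid, rng)
    return image

img = fractal_generate(256)  # recursion: 256 -> 64 -> 16 -> 4 -> 1
print(img.shape)             # (256, 256)
```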
What are the key findings? Their "FractalMAR" model successfully generates high-quality 256×256 images pixel-by-pixel with competitive metrics compared to traditional approaches. It achieves an FID score of 6.15 and an Inception Score of 348.9, with generation taking just 1.29 seconds per image. On likelihood estimation tasks (measuring how well the model captures the true data distribution), it achieved 3.14 bits per dimension, significantly outperforming previous autoregressive models. The method also shows promising results for conditional image editing tasks like inpainting, outpainting, and class-conditional editing.
Why does it matter? This fractal approach represents a fundamental shift in how we can design generative models. Rather than scaling up monolithic architectures, it offers a way to build increasingly complex models through recursive composition of smaller, more manageable components. This is particularly important for modeling data with intrinsic hierarchical structures, which extends far beyond images to domains like molecular configurations, protein structures, and biological neural networks. The paper demonstrates that this approach can be both more computationally efficient and more effective than traditional methods, potentially opening new avenues for generative AI in fields where high-dimensional structured data is common.
Papers of the Week:
If you enjoyed this article, give it a like and share it with your peers.