Quantum-Enhanced AI - It's Here
In this issue:
1. Quantum-Enhanced LLM Efficient Fine-Tuning
2. DAPO: An Open-Source LLM Reinforcement Learning System at Scale
3. SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
Watching: Quantum-Enhanced LLMs (paper)
What problem does it solve? Large Language Models (LLMs) have brought significant advancements to natural language processing, but their massive size creates deployment challenges. Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) address this by enabling efficient adaptation through trainable low-rank matrices. However, these methods face an inherent "expressive bottleneck" - their low-rank representation capacity is constrained when handling complex tasks or high-rank dependencies, potentially limiting model adaptability and performance in sophisticated applications.
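To make the baseline concrete, here is a minimal sketch of a LoRA-style linear layer, assuming the usual additive low-rank update; the rank, shapes, and scaling below are illustrative and not taken from the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = W x + (alpha / r) * B A x.

    The pre-trained weight W stays frozen; only the low-rank factors A and B
    (rank r) are trained, which is where the "expressive bottleneck" lives.
    """
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)               # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # low-rank down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))          # low-rank up-projection (zero init)
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(d_in=4096, d_out=4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters vs. ~16.8M in the frozen weight
```

Whatever the adapter learns has to fit inside rank r - that cap is exactly the bottleneck QWTHN tries to lift.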
How does it solve the problem? Quantum Weighted Tensor Hybrid Networks (QWTHNs) cleverly combine quantum neural networks (QNNs) with tensor networks based on Matrix Product Operators (MPO). This hybrid approach decomposes pre-trained weights into quantum neural network and tensor network representations, leveraging quantum superposition to break through classical rank limitations. QWTHN multiplies the QNN outputs element-wise with the classical neural network outputs, enhancing expressive power in low-rank spaces while maintaining a lightweight parameter footprint. What's particularly noteworthy is that this is the first time LLM fine-tuning has been run with its quantum components performing inference on real quantum hardware.
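Structurally, you can picture the classical low-rank branch being gated element-wise by the quantum branch's output. The sketch below stands in a small classical surrogate for the actual parameterized quantum circuit (and omits the MPO decomposition), so all module names and shapes are illustrative rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class QNNSurrogate(nn.Module):
    """Stand-in for the quantum branch. In QWTHN this would be a small
    parameterized quantum circuit (a handful of qubits) run on a simulator or
    real hardware; here a tiny tanh MLP mimics its bounded output."""
    def __init__(self, d: int, n_qubits: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, n_qubits), nn.Tanh(),
            nn.Linear(n_qubits, d), nn.Tanh(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class HybridAdapter(nn.Module):
    """Hybrid adapter sketch: frozen base layer + classical low-rank update
    modulated element-wise by the (surrogate) quantum branch."""
    def __init__(self, d_in: int, d_out: int, r: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)      # frozen pre-trained weight
        self.down = nn.Linear(d_in, r, bias=False)  # classical low-rank branch
        self.up = nn.Linear(r, d_out, bias=False)
        self.qnn = QNNSurrogate(d_out)              # quantum branch (surrogate here)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        classical = self.up(self.down(x))
        # element-wise product of the classical and (surrogate) quantum outputs
        return self.base(x) + classical * self.qnn(classical)

x = torch.randn(2, 64)
print(HybridAdapter(64, 64)(x).shape)  # torch.Size([2, 64])
```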
What are the key findings? The experiments demonstrated QWTHN's remarkable efficiency - reducing trainable parameters by 76% compared to LoRA while simultaneously cutting training loss by up to 15% on datasets like CPsyCounD and R1-Distill-SFT. The approach achieved consistent improvements across text generation quality metrics and showed an 8.4% performance boost on test datasets.
Why does it matter? These findings validate quantum computing's potential to address computational bottlenecks in LLM development, establishing an engineering-ready foundation for quantum-enhanced AI systems. With LLM sizes and computational demands growing exponentially, approaches like QWTHN could make advanced language models more accessible while improving their capabilities - potentially changing how we approach the scaling challenges facing modern AI.
2. DAPO: An Open-Source LLM Reinforcement Learning System at Scale
What problem does it solve? While inference-time scaling techniques like those used in OpenAI's o1 and DeepSeek's R1 have revolutionized LLMs' reasoning abilities, the key reinforcement learning (RL) methods powering these advances remain largely undisclosed in technical reports. This lack of transparency creates significant reproducibility challenges for the broader AI community. Initial attempts to implement these methods encounter substantial obstacles including entropy collapse (where model outputs become too similar), reward noise, and training instability. The paper aims to democratize access to state-of-the-art LLM reasoning capabilities by revealing the critical components needed for successful large-scale RL training.
How does it solve the problem? DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) incorporates four key technical innovations. First, "Clip-Higher" uses asymmetric clipping thresholds to maintain exploration diversity and prevent entropy collapse. Second, "Dynamic Sampling" filters out prompts with perfect or zero accuracy to ensure effective gradient signals. Third, they implemented token-level policy gradient loss to properly weight tokens in long reasoning chains. Finally, "Overlong Reward Shaping" addresses the challenges of truncated samples through length-aware penalties. These techniques collectively enable stable and efficient reinforcement learning for complex mathematical reasoning.
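A condensed sketch of the first two ingredients, assuming a standard PPO-style surrogate loss: decoupled clipping bounds ("Clip-Higher") and a filter that keeps only prompt groups with mixed outcomes ("Dynamic Sampling"). The threshold values and tensor shapes below are illustrative:

```python
import torch

def clip_higher_loss(log_probs, old_log_probs, advantages,
                     eps_low: float = 0.2, eps_high: float = 0.28):
    """PPO-style surrogate with a decoupled clip range. A larger upper bound
    (eps_high > eps_low) gives low-probability tokens room to grow, which
    helps counter entropy collapse. Averaging over tokens (rather than per
    sample) corresponds to the token-level loss mentioned above."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.mean(torch.minimum(unclipped, clipped))

def dynamic_sampling_mask(group_rewards: torch.Tensor) -> torch.Tensor:
    """Keep only prompts whose sampled group is neither all-correct nor
    all-wrong; uniform groups contribute no useful gradient signal.
    group_rewards: (num_prompts, group_size) tensor of 0/1 rewards."""
    acc = group_rewards.float().mean(dim=1)
    return (acc > 0) & (acc < 1)

rewards = torch.tensor([[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]])
print(dynamic_sampling_mask(rewards))  # tensor([False,  True, False])
```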
What are the key findings? DAPO achieved remarkable results, scoring 50 points on the AIME 2024 mathematical competition using a Qwen2.5-32B base model, surpassing DeepSeek-R1-Zero-Qwen-32B (47 points) while requiring only half the training steps. The progressive application of each technique demonstrated meaningful improvements over the naive GRPO baseline, which only reached 30 points. Interestingly, the researchers observed the emergence of reflective reasoning behaviors not initially present in the base model, such as checking and correcting previous steps - capabilities that developed organically through the reinforcement learning process.
Why does it matter? DAPO democratizes access to cutting-edge LLM reasoning capabilities by fully open-sourcing the algorithm, training code, and datasets. Prior to this work, organizations like OpenAI and DeepSeek kept critical technical details private, creating a significant barrier to entry for many researchers and smaller organizations. By revealing the "secret sauce" of successful large-scale LLM RL, DAPO enables broader experimentation and innovation in the field. The paper demonstrates that with the right technical approaches, the research community can independently produce state-of-the-art reasoning models without relying on proprietary systems or undisclosed techniques.
3. SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks
What problem does it solve? Training large language model (LLM) agents for multi-turn interactions like collaborating on code or design presents unique challenges that existing reinforcement learning algorithms struggle to address. While single-turn RLHF methods like DPO or PPO can be applied, they don't perform explicit credit assignment across turns, resulting in high variance and poor sample efficiency for complex sequential decision-making. Meanwhile, value function learning approaches that train task-specific heads on LLM representations often generalize poorly with limited fine-tuning data. The authors also note that existing benchmarks for LLM agents lack the right combination of task diversity, reasoning complexity, and engineering simplicity needed to validate multi-turn RL algorithms.
How does it solve the problem? The authors first created ColBench, a benchmark featuring realistic collaborative tasks where agents interact with human simulators to produce artifacts like code or web pages. To solve the credit assignment problem, they developed SWEET-RL (RL with Step-WisE Evaluation from Training-Time Information), which uses an asymmetric actor-critic approach where the critic has access to additional training-time information (like reference solutions) that the actor doesn't see. Rather than learning a value function to predict expected utility, SWEET-RL directly learns the advantage function by parameterizing it using the mean log probability of actions and training through a Bradley-Terry objective at the trajectory level. This approach better aligns with pre-trained LLMs' capabilities compared to adding a value head on top of hidden states.
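A minimal sketch of that training objective, assuming a pair of trajectories where one is preferred over the other: each turn's "advantage" is scored as the critic's mean log-probability of the action (with the critic free to condition on training-only context such as the reference solution), and the trajectory-level scores are compared with a Bradley-Terry loss. The names and pairing scheme here are simplified for illustration:

```python
import torch
import torch.nn.functional as F

def turn_advantage(critic_token_logprobs: torch.Tensor) -> torch.Tensor:
    """Advantage proxy for one turn: the critic's mean log-probability over
    the tokens of that action. The critic may see training-time information
    (e.g. a reference solution) that the actor never does."""
    return critic_token_logprobs.mean()

def bradley_terry_loss(better_traj, worse_traj):
    """Trajectory-level Bradley-Terry objective: the summed per-turn
    advantages of the preferred trajectory should exceed those of the
    dispreferred one."""
    s_better = torch.stack([turn_advantage(t) for t in better_traj]).sum()
    s_worse = torch.stack([turn_advantage(t) for t in worse_traj]).sum()
    return -F.logsigmoid(s_better - s_worse)

# Toy example: two trajectories of two turns each, with fake per-token log-probs.
better = [torch.randn(5, requires_grad=True), torch.randn(7, requires_grad=True)]
worse = [torch.randn(5, requires_grad=True), torch.randn(7, requires_grad=True)]
loss = bradley_terry_loss(better, worse)
loss.backward()
print(float(loss))
```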
What are the key findings? SWEET-RL achieved a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT-4o in realistic collaborative content creation. The experiments revealed that providing the critic with asymmetric information during training significantly enhanced credit assignment capabilities. The paper also showed that the standard practice of training a value function generalizes poorly to unseen tasks compared to SWEET-RL's approach of directly learning an advantage function parameterized by mean log probability. Additionally, the authors observed that multi-turn collaboration substantially improved performance across all models: success rates roughly doubled when agents could gather more information through multiple interaction turns.
Why does it matter? Their findings advance our understanding of how to effectively train LLM agents for realistic multi-turn tasks where humans and AI collaborate to create something useful. The SWEET-RL algorithm offers a practical solution that allows smaller models to perform competitively with much larger ones through better optimization, potentially reducing deployment costs and widening access to effective AI assistants. The asymmetric actor-critic approach provides a general methodology for leveraging information that's available during training but not during deployment—a pattern that appears in many real-world scenarios.
Papers of the Week:
If you enjoyed this article, give it a like and share it with your peers.