DeepSeek and Test-Time Scaling: The New Frontier in AI
“There are decades where nothing happens; and there are weeks where decades happen.” – (Often attributed to Vladimir Lenin)
In the world of artificial intelligence, this sentiment rings truer than ever. We can go months without any major breakthroughs, only to see years’ worth of progress crammed into a few frenetic weeks. For the better part of last year, much of AI’s progress was driven by post-training breakthroughs—fine-tuning techniques like Supervised Fine-Tuning (SFT), LoRA, and Reinforcement Learning from Human Feedback (RLHF) that allowed large pre-trained models to be adapted for specific tasks. This post-training revolution reshaped AI deployment, enabling companies to refine foundation models without the massive cost of training from scratch.
But the frontier is shifting. The new challenge isn’t just about training better models—it’s about scaling inference efficiently. Underpinning these pivotal shifts, marked by DeepSeek and the rise of reasoning-first models, are three key resources—memory, FLOPs, and communication bandwidth—that constrain how far and fast these innovations can scale.
Test-time scaling is where the next battles in AI performance will be won or lost.
Test-time scaling is now the dominant concern. We are realizing that deploying a powerful model means more than just training it well. The real bottlenecks emerge at inference, where memory constraints, FLOPs per token, and communication bandwidth limitations dictate real-world performance. Whether serving million-token contexts in chat models, powering real-time reasoning systems, or running multi-step, iterative AI workflows, the constraints of test-time execution are the next frontier.
This is why DeepSeek-R1 and the rise of Reasoning-First architectures mark such a pivotal moment. These models don’t just store knowledge in parameters—they actively compute solutions, iterating through multiple steps. This makes their test-time scaling dynamics fundamentally different from GPT-style knowledge-first models, with distinct memory, compute, and communication demands.
I have covered the emergence of reasoning-first models in a previous article on Knowledge-First vs. Reasoning-First models.
In this article, I explore how the troika of memory, FLOPs, and bandwidth defines both knowledge-first and reasoning-first models and why solving test-time scaling challenges is crucial for enterprise AI deployments. Along the way, we’ll unpack DeepSeek-R1-Zero’s reinforcement learning (GRPO), the latest attention optimizations (FlashAttention, GQA), and how hardware advancements (NVIDIA H100, Blackwell) are shaping AI’s next era.
1. The Troika Explained
All AI models grapple with three fundamental resource constraints:
- Memory: the capacity needed to hold parameters, KV caches, and intermediate activations, whether on a single GPU or spread across many.
- FLOPs: the raw compute spent per training step and per generated token.
- Communication bandwidth: the rate at which GPUs and nodes can exchange parameters, gradients, and partial states over NVLink, InfiniBand, and collectives such as NCCL.
Every technique discussed below is, at bottom, a different way of trading one of these resources against the others.
2. Knowledge-First vs. Reasoning-First: Contrasting Architectures
Knowledge-First Models
GPT-like architectures exemplify the knowledge-first paradigm. They rely on huge parameter counts to store factual and linguistic patterns “in the weights.” Training these models demands enormous FLOPs, heavily leveraging parallelism strategies such as pipeline or tensor parallelism to distribute the computational load.
At inference, these models employ a KV cache to avoid recomputing the entire self-attention mechanism for every new token. This technique significantly reduces FLOPs per token but causes memory usage to grow linearly with sequence length. In multi-GPU settings, the communication overhead can also become substantial, as partial states might need to be synchronized using libraries like NCCL (NVIDIA Collective Communications Library - aka "Nickel").
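To make that linear growth concrete, here is a back-of-envelope sketch; the layer count, head count, and head dimension are illustrative assumptions rather than any particular model’s published configuration:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes held by the KV cache: K and V tensors for every layer and every cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Illustrative GPT-style dimensions (assumed for the sake of the example).
cfg = dict(num_layers=80, num_kv_heads=64, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens -> ~{gib:.0f} GiB of FP16 KV cache")
```

At roughly 32k tokens this illustrative cache alone would fill the 80 GB of HBM on a single H100, which is exactly when sharding across GPUs, and therefore communication, enters the picture.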
Reasoning-First Models
Reasoning-first systems focus on iterative or symbolic manipulation. Instead of storing all knowledge in billions of weights, they maintain a smaller parameter set but use “scratchpads” or repeated computation steps to solve tasks.
At training time, the total FLOPs may remain moderate—assuming the base model is smaller—but memory usage can get very high for expanded intermediate representations. During inference, these models can repeatedly update or reuse an internal cache, akin to the KV cache, especially if they are iterating through multiple reasoning steps. Depending on the architecture, communication requirements may be lower than in knowledge-first systems if the iterative modules do not constantly exchange large volumes of data across GPUs.
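The shape of such an iterative loop can be sketched as follows; generate_step and is_good_enough are hypothetical placeholders rather than any real API, and the point is only that compute is paid again on every step while the cache keeps growing:

```python
# Hypothetical sketch of a scratchpad-style reasoning loop. The callables are
# placeholders: `generate_step` runs the model over the prompt plus the
# scratchpad so far (reusing a KV-cache-like structure), `is_good_enough`
# decides when to stop iterating.
def reason(prompt, generate_step, is_good_enough, max_steps=8):
    scratchpad = []   # intermediate reasoning kept across steps
    cache = None      # reused attention state, analogous to a KV cache

    for _ in range(max_steps):
        thought, cache = generate_step(prompt, scratchpad, cache)  # FLOPs paid per step
        scratchpad.append(thought)                                 # memory grows per step
        if is_good_enough(scratchpad):
            break
    return scratchpad
```

Each extra step buys accuracy at the price of more FLOPs and a larger cache, which is why test-time behavior, not parameter count alone, dominates the cost profile of these models.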
3. The Post-Training Revolution: SFT and Alignment
Until recently, much attention was paid to “post-training” methods that further refine a large pre-trained model. These include Supervised Fine-Tuning (SFT) for domain-specific tasks and alignment methods like Reinforcement Learning from Human Feedback (RLHF). By focusing on data curation and reward signals after a model is already pre-trained, these techniques sparked a wave of performance gains without needing to train an entirely new foundation model from scratch.
Full Supervised Fine-Tuning (SFT)
Full SFT updates all model parameters on a curated dataset. Because the entire network is trainable, memory and FLOPs demands can be quite high, and communication overhead grows substantially when sharding parameters or gradients across multiple GPUs.
In an effort to manage complexity, frameworks like DeepSpeed implement optimizations (e.g., ZeRO) that shard optimizer states, gradients, and parameters. Additionally, low-level techniques at the CUDA or PTX (Parallel Thread Execution) level—such as kernel fusion—can further reduce overhead. FlashAttention and Grouped Query Attention (GQA) can also help by computing attention operations in a more memory-efficient manner, mitigating the quadratic blow-up when context lengths increase.
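As a rough illustration, a ZeRO-3 run in DeepSpeed is driven by a config along these lines; the values are illustrative assumptions, and the full option set lives in the DeepSpeed documentation:

```python
import deepspeed  # assumes DeepSpeed is installed alongside PyTorch

# Illustrative ZeRO stage-3 config: optimizer states, gradients, and parameters
# are sharded across data-parallel ranks; bf16 halves activation memory.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,          # overlap communication with compute
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}

# `model` would be a standard PyTorch nn.Module defined elsewhere:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```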
Parameter-Efficient Fine-Tuning (PEFT)
PEFT methods like prefix tuning or prompt tuning modify only a small subset of parameters, leaving the bulk of the network frozen. This dramatically reduces gradient storage and lowers the communication required for multi-GPU setups. Because fewer parameters are updated, overall FLOPs also decrease compared to full SFT.
When dealing with large contexts, these parameter-efficient approaches pair nicely with advanced attention kernels (FlashAttention, GQA), which reduce the memory footprint by recalculating attention in a streaming or grouped manner. This synergy helps handle longer inputs more gracefully without fully incurring a quadratic cost.
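A minimal sketch with the Hugging Face peft library shows the idea; the model choice and token count are arbitrary, and the exact argument names should be checked against the peft documentation:

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

# Small base model purely for illustration; in practice this would be a large LLM.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prompt tuning: train only a handful of "virtual token" embeddings that are
# prepended to every input, leaving the entire base model frozen.
peft_config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
model = get_peft_model(model, peft_config)

# Roughly 15k trainable parameters versus ~124M frozen ones for GPT-2: tiny
# gradients, tiny optimizer state, and far less multi-GPU traffic.
model.print_trainable_parameters()
```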
LoRA (Low-Rank Adaptation)
A specialized form of PEFT, LoRA inserts learnable low-rank matrices into existing weight tensors. This strategy keeps the main network parameters frozen, drastically shrinking the portion that must be trained. The result is lower memory usage and reduced communication overhead—plus easier synergy with advanced CUDA kernels, which can be fine-tuned for these low-rank updates.
In practice, LoRA is especially effective for large, knowledge-first transformers. It allows them to adapt to new domains or tasks (medical or legal for example) without the burden of a full model update.
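A minimal PyTorch sketch of the LoRA idea, illustrative rather than any library’s or DeepSeek’s actual implementation: the frozen weight is augmented with a trainable low-rank product, so only r x (d_in + d_out) parameters ever receive gradients or need to be synchronized.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = base(x) + scale * x @ A^T @ B^T."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # freeze the original weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Wrap a 4096x4096 projection: ~65k trainable parameters versus ~16.8M frozen.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```

Because only A and B are trained, optimizer state and gradient all-reduce traffic shrink by the same factor, which is where the memory and bandwidth savings come from.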
4. Large-Scale Reinforcement Learning and GRPO (DeepSeek-R1-Zero)
While supervised data has proven indispensable, DeepSeek heralds the promise of large-scale reinforcement learning (RL) for enhancing reasoning capabilities—sometimes without any supervised fine-tuning at all. This approach is exemplified by DeepSeek-R1-Zero, which applies RL directly to a base model, and DeepSeek-R1, which starts from a base checkpoint cold-started on thousands of long chain-of-thought (CoT) examples. Ultimately, the learned reasoning ability can even be distilled into smaller, more compact networks.
Group Relative Policy Optimization (GRPO)
One cornerstone of DeepSeek-R1-Zero is Group Relative Policy Optimization (GRPO), a technique that forgoes a traditional critic network and instead estimates advantages from a group of sampled outputs. By dispensing with a separate critic model—often comparable in size to the policy—it slashes both memory usage and FLOPs.
Mathematically, GRPO samples a group of outputs from the old policy π_θ_old, then optimizes a new policy π_θ by maximizing an objective that depends on the ratio of the new and old policies’ probabilities, group-based advantage estimates, and a KL-divergence regularization term.
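For concreteness, the objective has roughly the following form, reconstructed here from the published descriptions of GRPO (notation may differ in detail from the original papers):

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{\,q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O\mid q)}
\left[
\frac{1}{G}\sum_{i=1}^{G}
\min\!\left(
\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\,A_i,\;
\operatorname{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right) A_i
\right)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
\right],
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
```

The group-normalized advantage A_i is exactly what replaces the critic: the group itself supplies the baseline.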
Linking this to the troika:
- Memory: no critic network, often as large as the policy itself, needs to live in GPU memory during training.
- FLOPs: advantages come from simple statistics over a group of sampled outputs, so the critic’s forward and backward passes disappear.
- Communication bandwidth: fewer trainable components mean fewer parameters, gradients, and optimizer states to shard and synchronize across GPUs.
RLHF Is Not Pure Reinforcement Learning
Reinforcement Learning (RL) is fundamentally about trial-and-error learning through direct interaction with an environment, where policies evolve based on feedback signals. However, not all methods labeled as “RL” adhere to this core principle.
Traditional Reward Modeling (RM) + Proximal Policy Optimization (PPO), widely used in RLHF (Reinforcement Learning from Human Feedback), deviates from pure RL in a crucial way: it introduces a static, pre-trained reward model, effectively turning RLHF into supervised fine-tuning with a learned preference signal rather than direct environment-driven learning.
The model is not interacting with an environment in real time; instead, it is fine-tuned on human-labeled preferences, making the training loop closer to gradient-based optimization than true RL exploration.
In contrast, Group Relative Policy Optimization (GRPO), as used in DeepSeek-R1-Zero, aligns more closely with true RL principles because it directly updates the policy without requiring a separate critic model or pre-trained reward network. Instead of relying on a reward model trained on human-annotated preferences, GRPO estimates advantage values dynamically from a group of sampled outputs. This removes the reliance on a learned, externally trained preference model and allows the model to evolve from its own policy-driven interactions.
The absence of a pre-trained reward model ensures that the model’s learning is emergent and self-guided, making GRPO a closer embodiment of reinforcement learning’s original formulation—where agents improve by interacting with an environment rather than being fine-tuned on fixed preference labels.
In essence, while RM + PPO in RLHF is more akin to guided fine-tuning, GRPO represents a true reinforcement learning framework, where policies adapt dynamically based on the model’s own iterative decision-making process.
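To ground the group-relative estimate described above, here is a minimal sketch; real implementations work with per-token log-probabilities and add the clipping and KL terms, and the rule-based reward shown is only an assumed example:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled output relative to its own group:
    no learned critic, just normalization against the group's reward statistics."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Six sampled answers to one prompt, scored by a simple rule-based reward
# (e.g., 1.0 if the final answer checks out, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
# Correct samples receive positive advantages, incorrect ones negative,
# with no reward model or critic anywhere in the loop.
```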
5. Context Lengths Are a Bottleneck: From Quadratic to Sub-Linear Complexity
When context lengths grow large—think thousands or tens of thousands of tokens—the attention mechanism in a Transformer becomes a costly bottleneck, since standard attention scales quadratically with sequence length (a quick illustration of that blow-up follows the list below). Several next-generation methods aim to reduce this cost:
- FlashAttention computes attention in tiles that stay in fast on-chip memory, never materializing the full attention matrix.
- Grouped Query Attention (GQA) shares key/value heads across groups of query heads, shrinking the KV cache and the bandwidth needed to move it.
- Sparse, sliding-window, and linear-attention variants restrict or approximate which tokens attend to which, trading some exactness for near-linear cost.
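To see why the quadratic term hurts, here is a minimal sketch (sequence lengths and FP16 element size are illustrative assumptions) of the memory needed just to materialize one head’s full attention matrix, which tiled kernels like FlashAttention never build explicitly:

```python
# Memory to materialize the full S x S attention-score matrix for ONE head of
# ONE layer in FP16. FlashAttention avoids this by streaming tiles through
# on-chip SRAM instead of writing the matrix to GPU memory.
def naive_attention_matrix_gib(seq_len: int, bytes_per_elem: int = 2) -> float:
    return seq_len * seq_len * bytes_per_elem / 2**30

for seq_len in (4_096, 32_768, 131_072):
    print(f"{seq_len:>7} tokens -> ~{naive_attention_matrix_gib(seq_len):.2f} GiB per head, per layer")
```

At 131k tokens that is roughly 32 GiB for a single head of a single layer, which is why tiled and grouped attention kernels are not optional at long contexts.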
Even with these techniques, distributing inference over multiple GPUs can still be tricky. Every time you split a long context across GPUs, you must decide how to handle cross-GPU attention.
If you pass partial attention states around too often, the communication overhead undermines all the speed gains. So, the troika once again rears its head: bigger contexts can yield better performance, but you must juggle memory, FLOPs, and interconnect bandwidth.
6. NVIDIA GPUs at Scale: From A100 and H100 to Blackwell
NVIDIA’s A100 and H100 GPUs are the backbone of large-scale AI training and deployment. They combine high-capacity memory (HBM2/3) with advanced features like mixed precision (FP16/BF16) and, in the H100’s case, new support for FP8 computations. By scaling out with NVLink or InfiniBand, massive clusters can handle the enormous parameter counts of GPT-sized models or the repeated iterative steps of reasoning-first systems.
Yet even these advanced GPUs can reach their limits if the KV cache grows too large or if multi-GPU inference saturates NCCL bandwidth. The next-generation Blackwell architecture promises further enhancements, offering more efficient memory layouts, higher compute density, and possibly improved interconnect technology that can mitigate some communication bottlenecks. However, the same fundamental issues remain—exchanging large chunks of data frequently is inherently expensive, and as context lengths or multi-step reasoning expand, the troika remains the limiting factor.
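A back-of-envelope calculation shows why. The bandwidth figures below are rough, assumed orders of magnitude rather than official specifications, but the conclusion is insensitive to the exact numbers:

```python
# Approximate time to move a 10 GiB slice of KV cache or activations between
# GPUs over different links. Bandwidths are assumed, order-of-magnitude values.
payload_bytes = 10 * 2**30
links_gb_per_s = {
    "NVLink (intra-node)": 900,
    "PCIe Gen5 x16": 64,
    "InfiniBand NDR (inter-node)": 50,
}

for name, bw in links_gb_per_s.items():
    ms = payload_bytes / (bw * 1e9) * 1000
    print(f"{name:<28} ~{ms:.0f} ms per transfer")
```

A dozen milliseconds inside a node versus a couple of hundred milliseconds across nodes is the difference between a responsive system and one whose speedups evaporate into the interconnect.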
7. Moving from the Post-Training Revolution to Test-Time Scaling
The explosion of post-training methods—SFT, alignment, RLHF, parameter-efficient fine-tuning—enabled a single, huge pre-trained model to be adapted for countless tasks. These methods ushered in a revolution where building a specialized system no longer required training from scratch.
But as enterprise demands grow, we’re witnessing a shift toward test-time scaling. Businesses want to serve million-token contexts, run real-time dialogue with advanced reasoning, or deploy specialized RL-based systems that adapt on the fly. Meeting these demands requires efficient inference optimizations that deftly handle large caches, advanced attention kernels, and multi-GPU parallelism without saturating memory or bandwidth.
In an enterprise context, the stakes are high:
- Serving million-token contexts or long-running reasoning sessions without exhausting GPU memory.
- Meeting latency and cost-per-token targets for real-time dialogue and agentic workflows.
- Scaling across GPUs and nodes without letting communication overhead erase the gains.
8. Conclusion: What It Takes to Get Test-Time Scaling Right
We’ve moved beyond a world where “post-training” was the final frontier. Today’s challenges revolve around test-time scaling: how to serve massive contexts, iterative reasoning steps, or real-time RL-based policies at scale. Achieving this requires:
- Memory-efficient inference, keeping KV caches and intermediate reasoning state within each GPU’s budget.
- Compute-efficient attention and decoding, with kernels such as FlashAttention and GQA taming the quadratic cost of long contexts.
- Communication-aware parallelism, sharding models and caches across GPUs without saturating NVLink, InfiniBand, or NCCL collectives.
- Hardware used with clear eyes, from A100 and H100 to Blackwell, deployed with an understanding of where its limits lie.
The reason for all this complexity is simple: AI has become a competitive differentiator, and organizations that can deploy advanced reasoning or massive knowledge-driven models at scale gain a decisive edge. Whether it’s a chatbot that handles thousand-token dialogues fluidly or a specialized domain reasoning system that can solve intricate tasks without human intervention, test-time scaling is where the future battles in AI performance will be won or lost.
In essence, the “weeks where decades happen” in AI are upon us. As memory, FLOPs, and communication bandwidth remain the bedrock limits, we’ll continue to see rapid innovation in architectures, post-training techniques, and GPU-level optimizations—all aimed at unlocking new possibilities for both knowledge-first and reasoning-first paradigms at inference time.