DeepSeek and Test-Time Scaling: The New Frontier in AI
Staring down the test-time scaling frontier (image generated with DALL·E)


“There are decades where nothing happens; and there are weeks where decades happen.” – (Often attributed to Vladimir Lenin)

In the world of artificial intelligence, this sentiment rings truer than ever. We can go months without any major breakthroughs, only to see years’ worth of progress crammed into a few frenetic weeks. For the better part of last year, much of AI’s progress was driven by post-training breakthroughs—fine-tuning techniques like Supervised Fine-Tuning (SFT), LoRA, and Reinforcement Learning from Human Feedback (RLHF) that allowed large pre-trained models to be adapted for specific tasks. This post-training revolution reshaped AI deployment, enabling companies to refine foundation models without the massive cost of training from scratch.

But the frontier is shifting. The new challenge isn’t just about training better models—it’s about scaling inference efficiently. Underpinning these pivotal shifts, marked by DeepSeek and the rise of reasoning-first models, are three key resources—memory, FLOPs, and communication bandwidth—that constrain how far and fast these innovations can scale.

Test-time scaling is where the next battles in AI performance will be won or lost.

Test-time scaling is now the dominant concern. We are realizing that deploying a powerful model means more than just training it well. The real bottlenecks emerge at inference, where memory constraints, FLOPs per token, and communication bandwidth limitations dictate real-world performance. Whether serving million-token contexts in chat models, powering real-time reasoning systems, or running multi-step, iterative AI workflows, the constraints of test-time execution are the next frontier.

This is why DeepSeek-R1 and the rise of Reasoning-First architectures mark such a pivotal moment. These models don’t just store knowledge in parameters—they actively compute solutions, iterating through multiple steps. This makes their test-time scaling dynamics fundamentally different from GPT-style knowledge-first models, with distinct memory, compute, and communication demands.

I covered the emergence of reasoning-first models in a previous article on knowledge-first vs. reasoning-first models.

In this article, I explore how the troika of memory, FLOPs, and bandwidth defines both knowledge-first and reasoning-first models and why solving test-time scaling challenges is crucial for enterprise AI deployments. Along the way, we’ll unpack DeepSeek-R1-Zero’s reinforcement learning (GRPO), the latest attention optimizations (FlashAttention, GQA), and how hardware advancements (NVIDIA H100, Blackwell) are shaping AI’s next era.


1. The Troika Explained

All AI models grapple with three fundamental resource constraints:

  • Memory: This refers to GPU memory (or any accelerator memory), CPU RAM, and specialized storage like high-bandwidth memory (HBM). Neural networks store not just parameters but also intermediate activations, gradients (during training), and context caches (during inference). As context windows expand or as models adopt iterative reasoning, memory usage can skyrocket.
  • FLOPs (Floating-Point Operations): The computational workload of a neural model is often described by FLOPs. Larger models—and longer sequence lengths—mean more matrix multiplications or attention calculations. Iterative, reasoning-centric processes can also inflate FLOPs if each inference pass involves multiple computational steps.
  • Communication Bandwidth: Modern AI training and inference often span multiple GPUs or even multiple servers. Communication bandwidth (e.g., via NVIDIA’s NCCL, pronounced "Nickel", or over NVLink or InfiniBand) measures how quickly data moves between these devices. Insufficient bandwidth can nullify the benefit of adding more GPUs if large tensors or caches must be synchronized frequently. (A back-of-the-envelope sketch of the memory and FLOPs budgets follows this list.)
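To make these budgets tangible, here is a rough Python sketch. The model dimensions are illustrative assumptions (broadly in the range of a 70B-parameter, GQA-style decoder), not measurements of any specific system:

```python
# Back-of-the-envelope resource estimates for a decoder-only transformer.
# All model dimensions below are illustrative assumptions, not measurements.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Memory held by the KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

def approx_flops_per_token(n_params):
    """Common rule of thumb: roughly 2 FLOPs per parameter per generated token."""
    return 2 * n_params

# Hypothetical 70B-parameter model serving a 32k-token context in FP16.
cache_gb = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                          seq_len=32_000, batch=1) / 1e9
print(f"KV cache: ~{cache_gb:.1f} GB; FLOPs per token: ~{approx_flops_per_token(70e9):.1e}")
```

Even this crude estimate shows how a single long-context request can consume around ten gigabytes of cache on top of the model weights, which is why the three constraints have to be budgeted together.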


2. Knowledge-First vs. Reasoning-First: Contrasting Architectures

Knowledge-First Models

GPT-like architectures exemplify the knowledge-first paradigm. They rely on huge parameter counts to store factual and linguistic patterns “in the weights.” Training these models demands enormous FLOPs, heavily leveraging parallelism strategies such as pipeline or tensor parallelism to distribute the computational load.

At inference, these models employ a KV cache to avoid recomputing the entire self-attention mechanism for every new token. This technique significantly reduces FLOPs per token but causes memory usage to grow linearly with sequence length. In multi-GPU settings, the communication overhead can also become substantial, as partial states might need to be synchronized using libraries like NCCL (NVIDIA Collective Communications Library - aka "Nickel").
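As a rough illustration, here is a single-head sketch of KV caching (deliberately simplified; production implementations are batched, multi-headed, and fused into optimized kernels):

```python
import torch

# Each decoding step appends the new token's key/value to a cache, so attention
# for the new token is computed against the cache instead of re-running the
# whole sequence. FLOPs per token grow ~O(seq_len); cache memory grows linearly.
def decode_step(x_new, w_q, w_k, w_v, cache_k, cache_v):
    q = x_new @ w_q                                        # query for the new token only
    cache_k = torch.cat([cache_k, x_new @ w_k], dim=0)     # append new key
    cache_v = torch.cat([cache_v, x_new @ w_v], dim=0)     # append new value
    attn = torch.softmax(q @ cache_k.T / cache_k.shape[-1] ** 0.5, dim=-1)
    return attn @ cache_v, cache_k, cache_v

d = 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache_k = cache_v = torch.empty(0, d)
for _ in range(5):                                          # generate 5 tokens
    out, cache_k, cache_v = decode_step(torch.randn(1, d), w_q, w_k, w_v, cache_k, cache_v)
```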

Reasoning-First Models

Reasoning-first systems focus on iterative or symbolic manipulation. Instead of storing all knowledge in billions of weights, they maintain a smaller parameter set but use “scratchpads” or repeated computation steps to solve tasks.

At training time, the total FLOPs may remain moderate—assuming the base model is smaller—but memory usage can climb quickly as intermediate representations expand. During inference, these models repeatedly update or reuse an internal cache, akin to the KV cache, especially when iterating through multiple reasoning steps. Depending on the architecture, communication requirements may be lower than in knowledge-first systems if the iterative modules do not constantly exchange large volumes of data across GPUs.
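To make the iterative pattern concrete, here is a deliberately simplified sketch. The generate_step function is a hypothetical stand-in for a model call; real reasoning-first systems interleave this loop with cache reuse and sampling:

```python
# Illustrative sketch of reasoning-first inference: the model's own intermediate
# steps are appended to a scratchpad and fed back in, so compute and cache size
# grow with the number of reasoning steps, not just the prompt length.
def solve(prompt, generate_step, max_steps=8):
    scratchpad = []
    for _ in range(max_steps):
        step = generate_step(prompt, scratchpad)        # one reasoning step
        scratchpad.append(step)                         # the context keeps growing
        if step.startswith("FINAL:"):                   # the model signals it is done
            return step.removeprefix("FINAL:").strip(), scratchpad
    return None, scratchpad                             # reasoning budget exhausted
```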


3. The Post-Training Revolution: SFT and Alignment

Until recently, much attention was paid to “post-training” methods that further refine a large pre-trained model. These include Supervised Fine-Tuning (SFT) for domain-specific tasks and alignment methods like Reinforcement Learning from Human Feedback (RLHF). By focusing on data curation and reward signals after a model is already pre-trained, these techniques sparked a wave of performance gains without needing to train an entirely new foundation model from scratch.

Full Supervised Fine-Tuning (SFT)

Full SFT updates all model parameters on a curated dataset. Because the entire network is trainable, memory and FLOPs demands can be quite high, and communication overhead grows substantially when sharding parameters or gradients across multiple GPUs.

In an effort to manage complexity, frameworks like DeepSpeed implement optimizations (e.g., ZeRO) that shard optimizer states, gradients, and parameters. Additionally, low-level techniques at the CUDA or PTX (Parallel Thread Execution) level—such as kernel fusion—can further reduce overhead. FlashAttention and Grouped Query Attention (GQA) can also help by computing attention operations in a more memory-efficient manner, mitigating the quadratic blow-up when context lengths increase.
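For illustration, a ZeRO-style DeepSpeed configuration might look roughly like the sketch below. The specific keys and values are assumptions chosen for the example; the options actually available depend on your DeepSpeed version:

```python
# Hedged sketch of a DeepSpeed ZeRO config for full SFT (illustrative values).
ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},          # mixed precision to cut activation/optimizer memory
    "zero_optimization": {
        "stage": 2,                     # shard optimizer states and gradients across GPUs
        "overlap_comm": True,           # overlap gradient all-reduce with the backward pass
    },
}

# Typical wiring (assumes `model` is a torch.nn.Module and deepspeed is installed):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```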

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods like prefix tuning or prompt tuning modify only a small subset of parameters, leaving the bulk of the network frozen. This dramatically reduces gradient storage and lowers the communication required for multi-GPU setups. Because fewer parameters are updated, overall FLOPs also decrease compared to full SFT.

When dealing with large contexts, these parameter-efficient approaches pair nicely with advanced attention kernels (FlashAttention, GQA), which reduce the memory footprint by recalculating attention in a streaming or grouped manner. This synergy helps handle longer inputs more gracefully without fully incurring a quadratic cost.

LoRA (Low-Rank Adaptation)

A specialized form of PEFT, LoRA inserts learnable low-rank matrices into existing weight tensors. This strategy keeps the main network parameters frozen, drastically shrinking the portion that must be trained. The result is lower memory usage and reduced communication overhead—plus easier synergy with advanced CUDA kernels, which can be fine-tuned for these low-rank updates.

In practice, LoRA is especially effective for large, knowledge-first transformers, allowing them to adapt to new domains or tasks (medical or legal, for example) without the burden of a full model update.
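Here is a minimal sketch of the LoRA idea (a toy version, not the peft library's implementation): the pretrained weight stays frozen, and only a low-rank update B·A is trained on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (toy sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                  # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
y = layer(torch.randn(2, 4096))   # only A and B (a tiny fraction of parameters) get gradients
```

Because only r × (d_in + d_out) parameters are trained instead of d_in × d_out, the optimizer state, gradient traffic, and checkpoints for the adapted portion shrink by orders of magnitude.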


4. Large-Scale Reinforcement Learning and GRPO (DeepSeek-R1-Zero)

While supervised data has proven indispensable, DeepSeek heralds the promise of large-scale reinforcement learning (RL) for enhancing reasoning capabilities—sometimes without any supervised fine-tuning at all. This approach is exemplified by DeepSeek-R1-Zero, which applies RL directly to a base model, and DeepSeek-R1, which starts from a checkpoint fine-tuned on thousands of long chain-of-thought (CoT) examples. Ultimately, the learned reasoning ability can even be distilled into smaller, more compact networks.

Group Relative Policy Optimization (GRPO)

One cornerstone of DeepSeek-R1-Zero is Group Relative Policy Optimization (GRPO), a technique that forgoes a traditional critic network and instead estimates advantages from a group of sampled outputs. By dispensing with a separate critic model—often comparable in size to the policy—it slashes both memory usage and FLOPs.

Mathematically, GRPO samples a group of outputs from the old policy and then optimizes the new policy by maximizing an objective built from policy probability ratios, group-based advantage estimates, and a KL-divergence regularization term.
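In simplified, sequence-level form (token-level averaging and other details are omitted for brevity), the objective looks roughly like this:

```latex
A_i = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}

J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G}
\Big( \min\big( \rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\varepsilon,\, 1+\varepsilon)\, A_i \big)
- \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \Big) \right]
```

Here G is the group size, r_i is the (often rule-based) reward for sampled output o_i to prompt q, and π_ref is a frozen reference policy; the group statistics take the place of the critic's value baseline.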

Linking this to the troika:

  • Memory is saved because there is no need to maintain a large critic network in parallel.
  • FLOPs are reduced because each training iteration updates only one model, using group-based baseline estimates.
  • Communication overhead can also drop in multi-GPU RL setups, since the group advantage requires fewer global synchronizations than a full-blown actor-critic architecture.
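To make the critic-free baseline concrete, here is a tiny sketch of the group-relative advantage computation (illustrative only; a real training loop also handles token-level credit assignment, ratio clipping, and the KL term):

```python
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each sampled output's reward against its own group's statistics."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g., rule-based correctness rewards for G = 8 sampled answers to one prompt
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
adv = group_advantages(rewards)   # positive for above-average outputs, negative otherwise
```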

RLHF is not pure Reinforcement Learning

Reinforcement Learning (RL) is fundamentally about trial-and-error learning through direct interaction with an environment, where policies evolve based on feedback signals. However, not all methods labeled as “RL” adhere to this core principle.

Traditional Reward Modeling (RM) + Proximal Policy Optimization (PPO), widely used in RLHF (Reinforcement Learning from Human Feedback), deviates from pure RL in a crucial way: it introduces a static, pre-trained reward model, effectively turning RLHF into supervised fine-tuning with a learned preference signal rather than direct environment-driven learning.

The model is not interacting with an environment in real time; instead, it is fine-tuned on human-labeled preferences, making the training loop closer to gradient-based optimization than true RL exploration.

In contrast, Group Relative Policy Optimization (GRPO), as used in DeepSeek-R1-Zero, aligns with true RL principles because it directly updates the policy without requiring a separate critic model or pre-trained reward network. Instead of relying on human-annotated reward labels, GRPO estimates advantage values dynamically from a group of sampled outputs. This removes the reliance on externally imposed reward functions and allows the model to evolve purely from its own policy-driven interactions.

The absence of a pre-trained reward model ensures that the model’s learning is emergent and self-guided, making GRPO a closer embodiment of reinforcement learning’s original formulation—where agents improve by interacting with an environment rather than being fine-tuned on fixed preference labels.

In essence, while RM + PPO in RLHF is more akin to guided fine-tuning, GRPO represents a true reinforcement learning framework, where policies adapt dynamically based on the model’s own iterative decision-making process.


5. Context Lengths Are a Bottleneck: From Quadratic to Sub-Linear Complexity

When context lengths grow large—think thousands or tens of thousands of tokens—the attention mechanism in a Transformer can become a costly bottleneck, since standard attention scales quadratically with sequence length. Several next-generation methods aim to reduce this cost:

  • FlashAttention: Imagine you want to calculate attention but only have so much space on your “scratchpad.” FlashAttention processes chunks of the input at a time (streaming the queries, keys, and values) so it never needs to store the full attention matrix in memory all at once. This is like washing dishes one stack at a time instead of trying to fit every dish in the sink. By doing so, it saves memory and often runs faster, too.
  • Grouped Query Attention (GQA): Instead of giving every query head its own set of key and value heads, GQA lets groups of query heads share the same key/value projections. You can think of it as a group discount: less duplicated key/value state means a smaller KV cache, fewer memory accesses, and a lighter footprint for large sequences, with only a modest impact on quality. (A minimal usage sketch of these kernels follows this list.)
  • Sparse / Blockwise / Sliding-Window Attention: Sometimes, a model doesn’t need to attend to every token in a long text. If the task is local or hierarchical, the model can limit each token’s attention to a small window of neighbors or process chunks of the text in blocks. While not always perfect for tasks needing truly global context, these methods shrink the memory and compute cost significantly—and if you only do partial synchronization across GPUs, you reduce bandwidth needs too.
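For a feel of how these kernels are used, here is a minimal PyTorch sketch; shapes and sizes are illustrative. PyTorch's scaled_dot_product_attention dispatches to a fused, FlashAttention-style kernel when the hardware and dtypes allow it, and GQA is emulated here by sharing a small set of KV heads across groups of query heads:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# (batch, heads, seq_len, head_dim): 32 query heads but only 8 KV heads (GQA),
# so the KV cache is 4x smaller than full multi-head attention.
q = torch.randn(1, 32, 1024, 128, dtype=dtype, device=device)
k = torch.randn(1, 8, 1024, 128, dtype=dtype, device=device)
v = torch.randn(1, 8, 1024, 128, dtype=dtype, device=device)

k = k.repeat_interleave(4, dim=1)   # each KV head serves a group of 4 query heads
v = v.repeat_interleave(4, dim=1)

# On supported GPUs this runs a fused kernel that avoids materializing the full
# seq_len x seq_len score matrix; on CPU it falls back to the standard path.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```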

Even with these techniques, distributing inference over multiple GPUs can still be tricky. Every time you split a long context across GPUs, you must decide how to handle cross-GPU attention.

If you pass partial attention states around too often, the communication overhead undermines all the speed gains. So, the troika once again rears its head: bigger contexts can yield better performance, but you must juggle memory, FLOPs, and interconnect bandwidth.


6. NVIDIA GPUs at Scale: From A100 and H100 to Blackwell

NVIDIA’s A100 and H100 GPUs are the backbone of large-scale AI training and deployment. They combine high-capacity memory (HBM2/3) with advanced features like mixed precision (FP16/BF16) and, in the H100’s case, new support for FP8 computations. By scaling out with NVLink or InfiniBand, massive clusters can handle the enormous parameter counts of GPT-sized models or the repeated iterative steps of reasoning-first systems.

Yet even these advanced GPUs can reach their limits if the KV cache grows too large or if multi-GPU inference saturates NCCL bandwidth. The next-generation Blackwell architecture promises further enhancements, offering more efficient memory layouts, higher compute density, and possibly improved interconnect technology that can mitigate some communication bottlenecks. However, the same fundamental issues remain—exchanging large chunks of data frequently is inherently expensive, and as context lengths or multi-step reasoning expand, the troika remains the limiting factor.


7. Moving from the Post-Training Revolution to Test-Time Scaling

The explosion of post-training methods—SFT, alignment, RLHF, parameter-efficient fine-tuning—enabled a single, huge pre-trained model to be adapted for countless tasks. These methods ushered in a revolution where building a specialized system no longer required training from scratch.

But as enterprise demands grow, we’re witnessing a shift toward test-time scaling. Businesses want to serve million-token contexts, run real-time dialogue with advanced reasoning, or deploy specialized RL-based systems that adapt on the fly. Meeting these demands requires efficient inference optimizations that deftly handle large caches, advanced attention kernels, and multi-GPU parallelism without saturating memory or bandwidth.

In an enterprise context, the stakes are high:

  • Latency: Interactive applications can’t afford multi-second delays.
  • Cost: GPUs are expensive, and running them around the clock for inference requires maximum utilization and minimal waste.
  • Reliability: Models must handle large or variable-length inputs without crashing the system or timing out.


8. Conclusion: What It Takes to Get Test-Time Scaling Right

We’ve moved beyond a world where “post-training” was the final frontier. Today’s challenges revolve around test-time scaling: how to serve massive contexts, iterative reasoning steps, or real-time RL-based policies at scale. Achieving this requires:

  • Architectural Innovations: Techniques like FlashAttention, GQA, and sparse attention to tame quadratic complexity.
  • Parallelism and Communication: Smart partitioning that prevents GPU interconnects (e.g., NCCL) from becoming a bottleneck, coupled with next-gen hardware like H100 and Blackwell for higher throughput.
  • Low-Level Optimizations: Custom CUDA kernels and PTX-level tweaks to fuse operations, reduce memory overhead, and streamline multi-step computations.
  • Refined Reinforcement Learning: Approaches like DeepSeek-R1-Zero and GRPO that improve reasoning without massive supervised datasets, potentially keeping FLOPs and memory usage in check by removing a large critic.
  • Enterprise-Ready Serving Infrastructure: Architectures that deliver reliability, cost-effectiveness, and real-time latency while balancing the troika for every inference call.

The reason for all this complexity is simple: AI has become a competitive differentiator, and organizations that can deploy advanced reasoning or massive knowledge-driven models at scale gain a decisive edge. Whether it’s a chatbot that handles thousand-token dialogues fluidly or a specialized domain reasoning system that can solve intricate tasks without human intervention, test-time scaling is where the future battles in AI performance will be won or lost.

In essence, the “weeks where decades happen” in AI are upon us. As memory, FLOPs, and communication bandwidth remain the bedrock limits, we’ll continue to see rapid innovation in architectures, post-training techniques, and GPU-level optimizations—all aimed at unlocking new possibilities for both knowledge-first and reasoning-first paradigms at inference time.
