With NVIDIA's stock falling so drastically, does it mean AI computing power is becoming useless?
Let’s start with the conclusion: it is still useful. Now let’s walk through, scientifically and rationally, why it remains useful.
Why did the stock plunge so much?
Is it because of Deepseek R1? Actually, no—it’s because of Deepseek V3.
When V3 was first released, I wrote a technical analysis (and posted it on Weibo) explaining its value in detail. Normally, for a model of this scale, if we want to do pretraining, we’d need about ten thousand GPUs. Yet V3 managed to get it done with a bit over 2,000 GPUs, and the results are quite good. Hence, it’s very valuable.
But does this mean that anyone can replicate the same pretraining level with 2,000+ GPUs simply by following their paper?
The answer is, most likely not.
In my view, most people currently do not have the engineering know-how to properly tune an MoE (Mixture of Experts) model. If you train it with 256 experts, or even just 8 experts, you might find that disabling a couple of them makes no difference at all—let alone pulling off those extreme optimizations. Even if you have the paper, who knows how much of it you can truly implement.
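To make that tuning difficulty concrete, here is a minimal, illustrative sketch of a top-k MoE layer in PyTorch (the sizes, top-2 routing, and auxiliary loss are generic textbook choices, not DeepSeek's actual design). Without a well-tuned load-balancing term, the router tends to collapse onto a handful of experts, which is exactly why switching a few experts off can make no visible difference:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-2 MoE layer (hypothetical sizes, not DeepSeek's design)."""
    def __init__(self, d_model=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, d_model]
        logits = self.router(x)                        # [tokens, n_experts]
        probs = logits.softmax(dim=-1)
        top_p, top_i = probs.topk(self.top_k, dim=-1)  # route each token to 2 experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e
                if mask.any():
                    out[mask] += top_p[mask, k:k+1] * expert(x[mask])
        # Auxiliary load-balancing term: without tuning it well, a few experts
        # soak up most tokens and the rest become nearly dead weight.
        load = F.one_hot(top_i, num_classes=len(self.experts)).float().mean(dim=(0, 1))
        importance = probs.mean(dim=0)
        aux_loss = (load * importance).sum() * len(self.experts)
        return out, aux_loss
```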
In fact, the biggest cost-saver in V3 is FP8, but the precondition is that you have hardware that supports it—specifically H-series GPUs—and you must really understand how to combine mixed precision with FP8. If large-scale FP8 training were straightforward, Meta wouldn’t only be using it for small-model quantization while still relying on BF16 for actual training. Similarly, for Deepseek V3 or R1, the open-sourced materials are the weights, not the training code or the essential hyperparameters like the learning rate. They’re definitely not going to share those.
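To give a flavor of what "mixed precision with FP8" involves, here is a conceptual sketch that emulates an FP8 forward matmul in PyTorch. It only imitates the precision loss via casting; real FP8 training runs the GEMMs in hardware FP8 on Hopper-class GPUs and needs far more careful scaling and accumulation machinery than shown here:

```python
import torch

def fp8_emulated_matmul(x_bf16, w_master_fp32):
    """Emulate an FP8 forward matmul: cast activations and weights to FP8 with
    per-tensor scales, then compute. Master weights stay in FP32 for the
    optimizer, which is the core idea of FP8 mixed-precision training."""
    fp8 = torch.float8_e4m3fn                                # requires PyTorch >= 2.1
    # Per-tensor scaling so values fit FP8's narrow dynamic range (e4m3 max ~448).
    x_scale = x_bf16.abs().max().clamp(min=1e-12) / 448.0
    w_scale = w_master_fp32.abs().max().clamp(min=1e-12) / 448.0
    x_fp8 = (x_bf16 / x_scale).to(fp8)
    w_fp8 = (w_master_fp32 / w_scale).to(fp8)
    # On H100/H800 the GEMM itself runs in FP8; here we up-cast to BF16 just to
    # emulate the rounding loss on any hardware.
    y = (x_fp8.to(torch.bfloat16) @ w_fp8.to(torch.bfloat16).t()) * (x_scale * w_scale)
    return y.to(torch.bfloat16)

x = torch.randn(4, 512, dtype=torch.bfloat16)
w = torch.randn(1024, 512, dtype=torch.float32)              # FP32 master weights
y = fp8_emulated_matmul(x, w)
```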
But if you can reach even half of what they’ve achieved, congratulations—you’ve just saved your company 5,000 GPUs for pretraining, and that alone might earn you an “excellent” rating by year’s end. From this perspective, V3’s extreme optimizations really do pave a lower-cost path to pretraining (which I’ve discussed in previous articles, so I won’t repeat it here).
From this angle, the advent of V3 is bad news for virtually all computing hardware companies in the world (except Groq).
Moving on to R1
What is R1 for? It’s post-training—that is, training applied on top of the V3 base model.
A model’s training cycle doesn’t end with just pretraining. Once you finish pretraining (we call that the “base model,” such as V3—though V3 is presumably already SFT’d to some extent; otherwise it wouldn’t be useful—let’s just take it as an example), the model has learned “knowledge.” Note that knowledge is not the same as reasoning.
If we really wanted a comparison: you could say the model has memorized all the words, sentences, and associations from textbooks, but it might not speak coherently. That’s when you need SFT (Supervised Fine-Tuning), which basically teaches the model to communicate like a human.
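As a minimal sketch of what SFT boils down to in practice (a toy stand-in model and made-up shapes, not any real training recipe): the model sees prompt plus response, but the next-token loss is computed only on the response tokens:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Supervised fine-tuning step: the model sees prompt + response, but the
    loss is computed only on the response tokens (prompt labels are masked)."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)   # [B, T]
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100                     # ignore prompt positions
    logits = model(input_ids)                                   # [B, T, vocab]
    # Standard causal-LM shift: predict token t+1 from positions <= t.
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy usage with a stand-in "model" (embedding + linear head), just to show shapes.
vocab, d = 100, 32
toy = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
prompt = torch.randint(0, vocab, (1, 6))
response = torch.randint(0, vocab, (1, 10))
loss = sft_loss(toy, prompt, response)
loss.backward()
```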
But is just speaking like a human enough?
Possibly not. As I said, knowledge is not equivalent to reasoning. How do we train it so that its output isn’t just human-like but also thinks more like a human—maybe even at a human level? That’s where RL (reinforcement learning) methods come in.
RL for preference alignment started with RLHF (RL from human feedback); nowadays there is also RLAIF (RL from AI feedback). In either case, the first step is usually to train a reward model that reflects human preferences.
Human Preferences: An Example
Say someone asks you to rank who’s the most beautiful among Fan Bingbing, Li Xiaolu, Bai Baihe, and your wife. You would come up with one ordering, while another person would almost certainly rank them differently.
The point is to accommodate everyone’s unique human preferences, so we don’t skew the model too much. By using a ranking approach, we essentially assign an implicit weight to each option. By the law of large numbers, perhaps Fan Bingbing might come out on top, so the LLM would answer “Fan Bingbing” if asked, “Who is the most beautiful?”
But what’s the catch?
The catch is that we only have the final answer for our reward. That is, you might get 4 points for “Fan Bingbing,” 1 point for “your wife,” and so on, relying solely on the final outcome to align the model’s preferences.
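In code, a reward model trained on such rankings typically uses a pairwise (Bradley-Terry style) loss: decompose each ranked list into (preferred, dispreferred) pairs and push the preferred response’s score higher. A minimal sketch with a toy stand-in reward model (the shapes and model are made up for illustration):

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_model, better_ids, worse_ids):
    """Pairwise preference loss: push r(better) above r(worse).
    Rankings only constrain relative order, so each ranked list is first
    decomposed into (better, worse) pairs before computing this loss."""
    r_better = reward_model(better_ids)        # [B] scalar reward per response
    r_worse = reward_model(worse_ids)          # [B]
    return -F.logsigmoid(r_better - r_worse).mean()

# Toy usage: a stand-in reward model that maps token ids to a scalar score.
vocab, d = 100, 32
embed = torch.nn.Embedding(vocab, d)
head = torch.nn.Linear(d, 1)
toy_rm = lambda ids: head(embed(ids).mean(dim=1)).squeeze(-1)
better = torch.randint(0, vocab, (4, 12))      # preferred responses
worse = torch.randint(0, vocab, (4, 12))       # dispreferred responses
loss = reward_ranking_loss(toy_rm, better, worse)
loss.backward()
```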
That’s straightforward for Q&A-type interactions—just give an answer and you get a reward. But now imagine being asked a hard, multi-step math problem instead.
Could you answer immediately? You might just say, “Don’t bother me—go away!” But if you really tried to solve it, you’d need a good amount of time to think. Your prefrontal cortex would distribute tasks to different parts of your brain—some handling matrix operations, some handling equations, the cerebellum for precision, the temporal lobe for symbolic representation, and so on (not literally, but for the sake of analogy).
LLMs face similar challenges for complex problems. If they can break a complicated problem down into smaller subproblems, the success rate improves significantly. Early approaches like MetaGPT and AutoGPT tried using external “agents” to guide an LLM through each subtask.
Another angle is that LLMs themselves might have an internal Chain of Thought (CoT). You can attempt to activate it using prompts like “step by step answer my question,” which can help reveal the implicit reasoning pathway.
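In practice, "activating" the implicit CoT can be as simple as changing the prompt; a tiny sketch (the question is made up, and no particular model or API is assumed):

```python
question = ("A train leaves at 9:00 at 60 km/h; another leaves at 10:00 at 90 km/h. "
            "When does the second train catch up?")

# Direct prompt: the model is likely to jump straight to an answer.
direct_prompt = f"{question}\nAnswer:"

# Appending an instruction like this often elicits the model's implicit CoT,
# making the intermediate reasoning visible in the output.
cot_prompt = (f"{question}\n"
              "Let's think step by step, and give the final answer at the end.")
```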
How does Implicit CoT form?
Summarizing a few key points:
The Problem with Implicit CoT
Given (1) and (2), implicit reasoning is difficult to control. So we can make it explicit. This is where the “O1” approach and “R1” approach both come into play as explicit CoT training methodologies.
I’ve talked about “O1” extensively before, so I won’t reiterate. Let’s focus on R1 and what’s different about it.
R1 doesn’t rely on a PRM (Process Reward Model) or do any stepwise supervision (such as MCTS-style search). Instead, it simply has the model generate its own CoT. The model is trained to “always keep thinking,” so it generates more and more steps until it hits an “Aha!” moment. It’s essentially repeated RL on top of your policy (here, “policy” refers to the base V3 model). Because it’s an online RL approach, it continuously optimizes the policy so that the output approaches the reward’s maximum. (Strictly speaking, it doesn’t even use a separately trained reward model: it uses GRPO (Group Relative Policy Optimization), which, much like DPO, works from relative preferences, scoring each sampled answer against the others in its group rather than against an absolute learned reward.)
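The “group relative” part can be made concrete with a small sketch: sample a group of answers to the same prompt, score each with a simple rule-based check (e.g., correct final answer in the required format), and use each answer’s reward relative to the group as its advantage. This is a simplified illustration of the idea, not DeepSeek’s training code (which, as noted above, isn’t public):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: for G sampled answers to the same prompt,
    each answer's advantage is its reward normalized by the group's mean/std.
    No value network (critic) and no learned reward model are required."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 6 sampled answers to one prompt, scored by a rule-based checker
# (1.0 = correct final answer in the required format, 0.0 = otherwise).
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)
# Answers with above-average reward get a positive advantage and are reinforced;
# the rest are pushed down. The policy gradient then reweights each answer's
# token log-probabilities by its advantage (with a PPO-style clipped ratio).
print(adv)
```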
When the policy converges to that “Aha!” moment, the model can solve other variants of the same type of problem in a similar manner. If you look at R1’s intermediate outputs, you’ll find it repeatedly reflecting on its previous answers—a form of self-play.
Advantages:
It’s a bit like how AlphaGo learned from existing game records or how you might train reinforcement learning to play Super Mario by self-play.
Drawbacks:
Heavier Compute Load
Under slow, step-by-step reasoning conditions, neither R1 nor O1 is going to save compute resources in the inference phase. They’ll both likely consume more as time goes on. You may see something like “O3” in the future that solves a single problem for $3,000 in compute time.
Moreover, R1 can require even more GPU memory than “PRM+search” methods, because the length of its intermediate reasoning is not well controlled.
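A back-of-the-envelope estimate shows why uncontrolled CoT length translates directly into GPU memory via the KV cache (the model dimensions below are hypothetical, not R1’s real configuration):

```python
def kv_cache_gib(seq_len, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Rough KV-cache size for one sequence: 2 (K and V) * layers * kv_heads *
    head_dim * seq_len * bytes. Assumes a BF16/FP16 cache and GQA-style KV heads."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

for tokens in (1_000, 8_000, 32_000, 64_000):
    print(f"{tokens:>6} CoT tokens -> ~{kv_cache_gib(tokens):.2f} GiB of KV cache per sequence")
```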
Will R1 impact AI compute demand?
Conclusion
All in all, the release of R1 isn’t going to reduce the global appetite for compute—if anything, it will increase it in certain areas.