With NVIDIA's stock falling so drastically, does it mean AI computing power is becoming useless?

Let’s start with the conclusion: it is still useful. Now let’s discuss, scientifically and rationally, just how useful it really is.


Why did the stock plunge so much?

Is it because of Deepseek R1? Actually, no—it’s because of Deepseek V3.

When V3 was first released, I wrote a technical analysis (and posted it on Weibo) explaining its value in detail. Normally, for a model of this scale, if we want to do pretraining, we’d need about ten thousand GPUs. Yet V3 managed to get it done with a bit over 2,000 GPUs, and the results are quite good. Hence, it’s very valuable.

But does this mean that anyone can replicate the same pretraining level with 2,000+ GPUs simply by following their paper?

The answer is, most likely not.

In my view, most people currently do not have the engineering know-how to properly tune an MoE (Mixture of Experts) model. If you train it with 256 experts, or even just 8 experts, you might find that disabling a couple of them makes no difference at all—let alone pulling off those extreme optimizations. Even if you have the paper, who knows how much of it you can truly implement.
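To make the MoE point concrete, here is a minimal top-k routing sketch in PyTorch. The expert count, hidden sizes, and the absence of any load-balancing mechanism are illustrative simplifications, not DeepSeek’s actual configuration. When the router collapses onto a few experts, the rest end up barely used, which is exactly the “disable a couple of experts and nothing changes” symptom.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k MoE layer; all dimensions are illustrative only."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)       # routing distribution
        gate, idx = probs.topk(self.top_k, dim=-1)      # pick k experts per token
        gate = gate / gate.sum(dim=-1, keepdim=True)    # renormalize the chosen gates
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += gate[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Real MoE training also needs an auxiliary load-balancing term (or a bias-based
# balancing trick) so the router actually spreads tokens across all experts.
tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([10, 64])
```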

In fact, the biggest cost-saver in V3 is FP8, but the precondition is that you have hardware that supports it—specifically H-series GPUs—and you must really understand how to combine mixed precision with FP8. If large-scale FP8 training were straightforward, Meta wouldn’t only be using it for small-model quantization while still relying on BF16 for actual training. Similarly, for Deepseek V3 or R1, the open-sourced materials are the weights, not the training code or the essential hyperparameters like the learning rate. They’re definitely not going to share those.
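As a rough sketch of what “FP8 with scaling” means (a simplified simulation, not DeepSeek’s actual recipe): rescale each tensor so its values fit FP8’s narrow dynamic range, cast down, and reapply the scales around the matmul. Real FP8 training runs the GEMMs themselves in FP8 with fused, finer-grained scaling and keeps higher-precision master weights, which is where the engineering difficulty lives.

```python
import torch  # needs a recent PyTorch (>= 2.1) for the float8_e4m3fn dtype

E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3

def to_fp8_with_scale(t: torch.Tensor):
    """Per-tensor scaling so values fit e4m3's range, then cast down."""
    scale = t.abs().max().clamp(min=1e-12) / E4M3_MAX
    return (t / scale).to(torch.float8_e4m3fn), scale

def fp8_matmul_sim(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Simulated FP8 GEMM: quantize both inputs, multiply in bf16, reapply scales.
    Real FP8 training keeps the GEMM itself in FP8 via fused scaled kernels."""
    a8, sa = to_fp8_with_scale(a)
    b8, sb = to_fp8_with_scale(b)
    return (a8.to(torch.bfloat16) @ b8.to(torch.bfloat16)) * (sa * sb)

a, b = torch.randn(128, 256), torch.randn(256, 64)
err = (fp8_matmul_sim(a, b) - a @ b).abs().mean().item()
print(f"mean abs error vs full-precision matmul: {err:.4f}")
```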

But if you can reach even half of what they’ve achieved, congratulations—you’ve just saved your company 5,000 GPUs for pretraining, and that alone might earn you an “excellent” rating by year’s end. From this perspective, V3’s extreme optimizations really do pave a lower-cost path to pretraining (which I’ve discussed in previous articles, so I won’t repeat it here).

From this angle, the advent of V3 is bad news for virtually all computing hardware companies in the world (except Groq).


Moving on to R1

What is R1 for? It’s post-training—that is, further training applied on top of the V3 base model.

A model’s training cycle doesn’t end with pretraining. Once pretraining is finished, you have what we call the “base model” (take V3 as the example here, even though the released V3 has presumably already been SFT’d to some extent; otherwise it wouldn’t be usable). At that point the model has learned “knowledge.” Note that knowledge is not the same as reasoning.

If we really wanted a comparison: you could say the model has memorized all the words, sentences, and associations from textbooks, but it might not speak coherently. That’s when you need SFT (Supervised Fine-Tuning), which basically teaches the model to communicate like a human.
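Mechanically, SFT is just next-token cross-entropy on (instruction, response) pairs, usually with the loss masked so that only the response tokens are supervised. A minimal sketch of that masking, with dummy token IDs instead of a real tokenizer and random logits standing in for a model:

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # cross_entropy skips positions labeled -100

# Dummy example: 4 prompt tokens followed by 3 response tokens.
prompt_ids   = torch.tensor([11, 12, 13, 14])
response_ids = torch.tensor([21, 22, 23])
input_ids = torch.cat([prompt_ids, response_ids])

# Supervise only the response: mask the prompt positions out of the loss.
labels = input_ids.clone()
labels[: len(prompt_ids)] = IGNORE

vocab_size = 100
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for the model's output

# Standard causal shift: the logits at position t predict the token at t + 1.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE)
print(loss.item())
```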

But is just speaking like a human enough?

Possibly not. As I said, knowledge is not equivalent to reasoning. How do we train it so that its output isn’t just human-like but also thinks more like a human—maybe even at a human level? That’s where RL (reinforcement learning) methods come in.

RL-based alignment started out as RLHF; nowadays there is also RLAIF, where AI feedback stands in for human labels. Either way, the first step is usually to train a reward model that reflects human preferences.


Human Preferences: An Example

Say someone asks you to rank who’s the most beautiful among Fan Bingbing, Li Xiaolu, Bai Baihe, and your wife. Let’s imagine your ranking is:

  1. Fan Bingbing
  2. Li Xiaolu
  3. Bai Baihe
  4. Your wife

Meanwhile, another person might rank them differently:

  1. Li Xiaolu
  2. Bai Baihe
  3. Fan Bingbing
  4. Your wife

The point is to accommodate everyone’s unique human preferences, so we don’t skew the model too much. By using a ranking approach, we essentially assign an implicit weight to each option. By the law of large numbers, perhaps Fan Bingbing might come out on top, so the LLM would answer “Fan Bingbing” if asked, “Who is the most beautiful?”
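Concretely, a ranking like the one above is usually broken into pairs, and the reward model is trained with a pairwise, Bradley-Terry-style loss: whichever answer is ranked higher should receive a higher scalar score. A minimal sketch with made-up scores (the numbers are placeholders, not anything a real reward model produced):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(preferred: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the preferred answer's score above the other's."""
    return -F.logsigmoid(preferred - rejected).mean()

# Pretend a reward model scored four answers that one annotator ranked 1..4.
scores = torch.tensor([1.8, 0.9, 0.4, -0.7], requires_grad=True)

# Expand the ranking into every "i is preferred over j" pair (i ranked above j).
pairs = [(i, j) for i in range(len(scores)) for j in range(i + 1, len(scores))]
pref = torch.stack([scores[i] for i, _ in pairs])
rej  = torch.stack([scores[j] for _, j in pairs])

loss = pairwise_reward_loss(pref, rej)
loss.backward()  # gradients would update the reward model's parameters
print(loss.item())
```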

But what’s the catch?

The catch is that we only have the final answer for our reward. That is, you might get 4 points for “Fan Bingbing,” 1 point for “your wife,” and so on, relying solely on the final outcome to align the model’s preferences.

That’s straightforward for Q&A-type interactions: give an answer, get a reward. But consider a genuinely hard, multi-step math problem.


Could you answer immediately? You might just say, “Don’t bother me—go away!” But if you really tried to solve it, you’d need a good amount of time to think. Your prefrontal cortex would distribute tasks to different parts of your brain—some handling matrix operations, some handling equations, the cerebellum for precision, the temporal lobe for symbolic representation, and so on (not literally, but for the sake of analogy).

LLMs face similar challenges for complex problems. If they can break a complicated problem down into smaller subproblems, the success rate improves significantly. Early approaches like MetaGPT and AutoGPT tried using external “agents” to guide an LLM through each subtask.

Another angle is that LLMs themselves might have an internal Chain of Thought (CoT). You can attempt to activate it with prompts like “answer my question step by step,” which can help surface the implicit reasoning pathway.
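As a tiny illustration of that prompt-level switch (the `chat` function below is a hypothetical stand-in for whatever inference API you actually use; only the prompt construction is the point):

```python
QUESTION = "A train leaves at 3:40 pm and the trip takes 2 h 35 min. When does it arrive?"

plain_prompt = QUESTION
cot_prompt = (
    "Answer the question step by step, showing your reasoning "
    "before stating the final answer.\n\n" + QUESTION
)

def chat(prompt: str) -> str:
    """Hypothetical stand-in: replace with your own LLM inference call."""
    raise NotImplementedError

# chat(plain_prompt) tends to return only a final answer;
# chat(cot_prompt) is more likely to surface the intermediate reasoning.
```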


How does Implicit CoT form?

Summarizing a few key points:

  1. Massive Data Pretraining: ensure exposure to a broad range of corpus data, including text that embeds hidden reasoning and logic.
  2. Multitask Learning: train on multiple reasoning tasks simultaneously to enhance generalization.
  3. Complex Task Fine-Tuning: design tasks that require implicit reasoning—like fill-in-the-blank or Q&A—to nudge the model into forming internal reasoning processes.
  4. Hierarchical Representations: through architectural adjustments or specialized regularization, strengthen the model’s internal logical representations.
  5. Sufficient Model Size: the classic “scaling law” effect; large models have more potential to learn implicit representations—including CoT, whose formation remains somewhat mysterious (“we’re not exactly sure how it’s formed”).

The Problem with Implicit CoT

  1. It’s hard to invoke. We didn’t specifically train it to produce step-by-step thoughts, so prompting can be hit-or-miss.
  2. It may not appear at all. Same reason as above—if the prompt doesn’t elicit it, you may not see it.
  3. It might not be optimal. Even if the CoT emerges, it’s not guaranteed to produce the best or most correct reasoning steps.

Given (1) and (2), implicit reasoning is difficult to control. So we can make it explicit. This is where the “O1” approach and “R1” approach both come into play as explicit CoT training methodologies.

I’ve talked about “O1” extensively before, so I won’t reiterate. Let’s focus on R1 and what’s different about it.

R1 doesn’t rely on a PRM (process reward model) or any stepwise supervision (such as MCTS-guided search). Instead, it simply has the model generate its own CoT. The model is trained to “keep thinking,” so it produces more and more steps until it hits an “Aha!” moment. It’s essentially repeated RL on top of your policy (here, “policy” refers to the V3 base model). Because it’s an online RL approach, it continuously updates the policy so that its outputs move toward higher reward. (Strictly speaking, the reasoning-oriented RL doesn’t even train a neural reward model: it uses GRPO, Group Relative Policy Optimization, a critic-free variant of PPO in which each sampled answer is scored by simple rule-based checks and its advantage is computed relative to the other samples in its group.)
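To make the GRPO part concrete: for each prompt the policy samples a group of G answers, each answer gets a scalar reward (for the reasoning RL these are rule-based checks such as “is the final answer correct” and “is the format right”), and each sample’s advantage is its reward normalized against the group, so no critic network is needed. Below is a minimal sketch of just that advantage computation, not the full clipped, KL-regularized objective:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each sample's reward against its own group,
    which removes the need for a separate value/critic network."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, a group of G = 6 sampled answers, rule-based 0/1 correctness rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
adv = group_relative_advantages(rewards)
print(adv)  # correct samples get positive advantages, incorrect ones negative

# These advantages then weight the token log-probabilities in a clipped,
# KL-regularized policy-gradient update of the policy (the V3 base model).
```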

When the policy converges to that “Aha!” moment, the model can solve other variants of the same type of problem in a similar manner. If you look at R1’s intermediate outputs, you’ll find it repeatedly reflecting on its previous answers—a form of self-play.

Advantages:

  • Simple approach.
  • No need for PRM or other reward models that track intermediate steps.
  • The model effectively “self-generates” CoT data, then trains on it repeatedly, improving in an upward spiral, bootstrapping off its own previous outputs until it reaches the “Aha!” moment.

It’s a bit like how AlphaGo bootstrapped from existing game records and then kept improving through self-play, or how you might train an RL agent to play Super Mario.

Drawbacks:

  1. Very limited control over intermediate steps (R1’s paper mentions how it’s tough to quantify the loss on those intermediate outputs, and it’s not easy to train a PRM for them). This is partly an AI engineering challenge but also very much a data-engineering problem.
  2. Longer CoT sequences often get picked. Generating more tokens within the context limit gives the model more to reference and more thorough reflection, but it also uses more GPU memory, because you’re storing and manipulating many more tokens (see the rough memory estimate after this list).
  3. Not necessarily optimal in the sense that each step might not be the “shortest path.” R1 only ensures correctness of the final answer (the global optimum), not necessarily the local optimum at every step.
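A rough back-of-the-envelope view of why longer CoT costs memory: the KV cache grows linearly with the number of generated tokens. The model shape below (layers, KV heads, head dimension) is illustrative, not any particular model’s architecture:

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 60, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2, batch: int = 1) -> int:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch

for tokens in (1_000, 8_000, 32_000):
    gb = kv_cache_bytes(tokens) / 1e9
    print(f"{tokens:>6} generated tokens -> ~{gb:.1f} GB of KV cache per sequence")
```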

Heavier Compute Load

Under slow, step-by-step reasoning, neither R1 nor O1 is going to save compute in the inference phase; if anything, both will consume more as reasoning depth grows. You may well see something like “O3” in the future that solves a single problem for $3,000 worth of compute time.

However, R1 can require even more GPU memory than “PRM+search” methods because the intermediate results are not well-controlled.

Will R1 impact AI compute demand?

  1. Impact on Pretraining: yes, and it’s not just Deepseek’s doing—NVIDIA’s Jensen Huang (Huang Renxun) has been promoting FP8 for a while. If the hardware supports it, you can roughly halve your GPU usage. Of course, he might be upset that Deepseek is already using FP8 to cut the cost of training V3.
  2. Impact on Post-Training: R1-style methods will consume more resources than simpler “PRM + search” solutions.
  3. Impact on Inference: whether you choose R-series or O-series approaches, you’ll end up using more resources as reasoning depth increases.

Conclusion

  • V3 has already shaken up the world of pretraining, lowering costs significantly (bad news for almost everyone except specialized chipmakers like Groq).
  • Meanwhile, R1-style post-training actually increases the resource demands during training, and potentially in inference, because it leans on the model’s own iterative generation of chain-of-thought.
  • O-series methods also consume more compute when tackling complex tasks.

All in all, the release of R1 isn’t going to reduce the global appetite for compute—if anything, it will increase it in certain areas.
