DeepSeek-R1 is fascinating because it prioritizes fundamentals and clever architecture over novelty. Here's a technical summary.
Q/ What was DeepSeek's strategy to circumvent export restrictions on hardware?
A/ DeepSeek did not circumvent the export restrictions; they worked within them. The H800s they used are export-compliant chips with reduced chip-to-chip bandwidth compared to the H100, so DeepSeek wrote highly optimized low-level code to manage memory and inter-GPU communication as efficiently as possible, overlapping communication with computation so the weaker interconnect never became the bottleneck. That let them maintain performance without needing the restricted high-end chips.
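A minimal sketch of that general principle (overlapping communication with computation), shown here with PyTorch's asynchronous collectives rather than DeepSeek's actual low-level kernels; the single-process "gloo" setup exists only so the snippet runs standalone.

```python
import torch
import torch.distributed as dist

# Single-process process group purely so the example runs standalone;
# real training would span many GPUs/nodes with the NCCL backend.
dist.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1
)

grads = torch.randn(4096, 4096)        # gradients to synchronize across devices
activations = torch.randn(4096, 4096)
weights = torch.randn(4096, 4096)

# Launch the collective asynchronously: it returns immediately with a handle.
work = dist.all_reduce(grads, op=dist.ReduceOp.SUM, async_op=True)

# Do useful compute while the communication is still in flight.
hidden = activations @ weights

# Only block once the synchronized gradients are actually needed.
work.wait()
print(hidden.shape, grads.shape)

dist.destroy_process_group()
```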
Q/ What method did DeepSeek employ to replicate o1's performance?
A/ Reinforcement learning. Plain and simple. They trained on hard questions whose answers can be automatically verified, mostly math and coding, and rewarded the model when its final answer checked out. Those reward signals drove the updates that refined the model's reasoning.
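A toy sketch of the "easily verified" reward idea: a rule-based check that compares the model's final answer against a known ground truth. The function and answer format here are hypothetical, not DeepSeek's actual reward code.

```python
import re

def math_reward(model_output: str, ground_truth: str) -> float:
    """Rule-based reward: 1.0 if the boxed final answer matches, else 0.0."""
    # Hypothetical convention: the model writes its final answer as \boxed{...}.
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Example: a correct and an incorrect completion.
print(math_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(math_reward(r"... so the result is \boxed{41}", "42"))  # 0.0
```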
Q/ How did DeepSeek reduce the cost of inference?
A/ DeepSeek achieved cheaper inference by compressing the Key-Value (KV) cache via multi-head latent attention, a technique introduced in their earlier DeepSeek-V2 work. Instead of storing full keys and values for every attention head, the model caches a small latent vector per token, which sharply reduces memory overhead during inference and, with it, the cost of serving the model.
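A simplified sketch of that low-rank KV compression idea: cache a compact latent per token and up-project it to keys and values at attention time. The dimensions below are made up, and details of the real multi-head latent attention (e.g. the decoupled rotary-embedding path) are omitted.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 8, 128, 64  # illustrative sizes

# Down-projection producing the compact latent that actually gets cached.
w_down_kv = nn.Linear(d_model, d_latent, bias=False)
# Up-projections that reconstruct per-head keys and values from the latent.
w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

hidden = torch.randn(1, 16, d_model)            # (batch, seq_len, d_model)

kv_latent = w_down_kv(hidden)                   # (1, 16, d_latent) -> cached
keys = w_up_k(kv_latent).view(1, 16, n_heads, d_head)
values = w_up_v(kv_latent).view(1, 16, n_heads, d_head)

full_cache = 2 * n_heads * d_head               # floats/token, standard KV cache
latent_cache = d_latent                         # floats/token, caching the latent
print(f"per-token cache: {full_cache} floats -> {latent_cache} floats "
      f"({full_cache / latent_cache:.0f}x smaller)")
```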
Q/ How many GPU hours did DeepSeek-V3 require for its full training?
A/ DeepSeek-V3 was trained on H800 GPUs and required a total of 2.788 million H800 GPU hours: 2.664 million GPU hours for pre-training, 119K GPU hours for context-length extension, and 5K GPU hours for post-training.
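The components add up; a quick sanity check:

```python
pretrain, context_ext, post_train = 2_664_000, 119_000, 5_000  # H800 GPU hours
print(f"{pretrain + context_ext + post_train:,}")              # 2,788,000
```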
Q/ How does the efficiency of DeepSeek in terms of GPU hours compare to typical expectations for training such models?
A/ DeepSeek-V3's training was notably efficient at 2.788M GPU hours, far less than is typical for models at this performance level. At an assumed rental price of $2 per GPU hour, that works out to $5.576 million for the full training run (excluding prior research and ablation experiments). This efficiency is attributed to innovations such as auxiliary-loss-free load balancing and multi-token prediction.
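The headline dollar figure is just GPU hours times the assumed rental rate:

```python
gpu_hours = 2_788_000        # total H800 GPU hours for DeepSeek-V3
rate_per_hour = 2.0          # assumed $2 rental price per GPU hour
print(f"${gpu_hours * rate_per_hour:,.0f}")  # $5,576,000
```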
Q/ How did DeepSeek manage to train their model more efficiently than others?
A/ DeepSeek-V3 uses a Mixture-of-Experts (MoE) architecture with 671B total parameters, of which only 37B, roughly 5%, are activated per token, significantly reducing the compute per token. They also adopted FP8 precision for key computations, cutting memory and compute costs further, while multi-head latent attention and dynamic expert routing add more speed and efficiency gains. Because only a small fraction of the model does work on each token, the GPU budget for training is a fraction of what a comparable dense model, like Meta's Llama, would typically need.
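A minimal sketch of the sparse-activation idea behind MoE (not DeepSeek's actual DeepSeekMoE implementation, which adds fine-grained and shared experts plus auxiliary-loss-free balancing): a router picks the top-k experts per token, so only a small slice of the total parameters runs for any given token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts layer; sizes are illustrative only."""

    def __init__(self, d_model=256, d_ff=512, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):   # send each token only to its chosen experts
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(4, 256)
print(moe(tokens).shape)                         # torch.Size([4, 256])
# Only 2 of 16 expert FFNs run per token, i.e. ~1/8 of the expert parameters are active.
```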
Bottom line: all the #AI labs should be freaking out, and they're likely already folding DeepSeek's architectural ideas into their newer models. Exciting times, but remember: data remains king.