This week, we will explore DeepSeek, a Chinese AI lab that has rapidly gained recognition for its innovative LLM development approach. Unlike many well-funded US tech companies, DeepSeek has achieved remarkable performance and efficiency with comparatively limited resources.
DeepSeek-V3 was developed in just two months on a $5.58M budget by a team led by hedge fund managers. Its API is currently about 100x cheaper than ChatGPT's.
DeepSeek-R1 is one of the top models in the LMSYS Chatbot Arena. It is tied with ChatGPT and Gemini on most benchmarks and is the only MIT-licensed open-source model on the leaderboard. Oh, and "DeepSeek Buzz Puts Tech Stocks on Track for a $1 Trillion Drop."
Special thanks to Ouyang Ruofei for assisting with the research.
AI Podcast Discussion
This week's podcast provides an excellent summary, especially of the more challenging technical details and their significance to the AI industry.
Why This Tech Matters
- Challenging US Dominance: DeepSeek has emerged as a strong contender in the global AI race. It demonstrates that innovation isn't solely dependent on vast resources, and its rise has been called a 'Sputnik moment' for the US.
- Democratizing AI: Their open-source models and cost-effective approaches make advanced AI accessible to more developers and companies.
- Pushing Boundaries of LLMs: The breakthroughs in reasoning capabilities demonstrated by the DeepSeek-R1 series and the efficiency gains of DeepSeek-V3 significantly expand the possibilities of LLMs.
- Future Research: DeepSeek's pioneering techniques, such as auxiliary loss-free load balancing, multi-token prediction, and reinforcement learning without supervised fine-tuning, are setting the standard for future AI research.
- Global Impact: Their work is not just a win for China; it's a wake-up call for the worldwide tech industry. It demonstrates that innovation can thrive under constraints and that the future of AI is decentralized and collaborative.
DeepSeek's Efficiency Advantage: A Multi-faceted Approach
DeepSeek's ability to outperform many of its competitors while spending significantly less comes down to several factors:
Optimized Architectures: DeepSeek has developed architectures specifically designed for efficient training and inference.
- Multi-head Latent Attention (MLA): This key component reduces memory demands by compressing the attention keys and values into a small latent vector. By shrinking the KV cache during inference, MLA allows DeepSeek to achieve faster performance without requiring excessive memory (a minimal sketch of the compression idea follows this list).
- DeepSeekMoE: This mixture-of-experts architecture uses finer-grained experts plus a shared expert, distributing the computational load and allowing for more economical training. This differs from many traditional MoE models, which use larger experts and less flexible load balancing. Furthermore, only a small fraction of parameters is activated for each token, making training efficient. In effect, DeepSeek selects a subset of relevant experts to tackle each token instead of activating every expert.
- Auxiliary-Loss-Free Load Balancing: Instead of relying on auxiliary loss functions, DeepSeek balances expert load with an adjustment to the routing scores, avoiding the performance degradation that load-balancing losses typically cause and allowing experts to specialize better within their domains.
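To make the MLA idea concrete, here is a minimal sketch of the low-rank key-value compression it relies on. The dimensions, layer names, and the omission of the decoupled rotary-embedding path are simplifications for illustration, not DeepSeek's actual implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch of the low-rank KV compression idea behind Multi-head Latent
# Attention. Dimensions and layer names are illustrative, not DeepSeek's code.
d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

W_down_kv = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
W_up_k    = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head keys
W_up_v    = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head values

def kv_from_cache(hidden_states: torch.Tensor):
    """Cache only the small latent c_kv; reconstruct K and V on the fly."""
    c_kv = W_down_kv(hidden_states)   # (batch, seq, d_latent) -- this is what gets cached
    k = W_up_k(c_kv).view(*c_kv.shape[:2], n_heads, d_head)
    v = W_up_v(c_kv).view(*c_kv.shape[:2], n_heads, d_head)
    return c_kv, k, v

x = torch.randn(1, 16, d_model)
c_kv, k, v = kv_from_cache(x)
# In this toy configuration, the KV cache per token shrinks from
# 2 * n_heads * d_head = 8192 values to d_latent = 512 values.
```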
Advanced Training Techniques: DeepSeek uses several innovative training techniques to improve efficiency further.
- Multi-Token Prediction (MTP): The models are trained to predict several future tokens at each position rather than only the next one, densifying the training signal and improving overall performance. This changes the training objective and is an approach not widely adopted by their competitors (a toy version of the objective is sketched after this list).
- FP8 Training: This mixed precision training using FP8 data format significantly reduces the memory footprint during training. This contrasts with traditional training methods that may use higher precision and be more resource-intensive.
- DualPipe: This pipeline parallelism algorithm overlaps computation and communication, reducing pipeline bubbles and optimizing resource usage.
- Memory Optimisation: They carefully optimize the memory footprint during training, avoiding the need for expensive tensor parallelism. DeepSeek has designed the system to share parameters and gradients to further enhance memory efficiency.
- Reinforcement Learning: The company has shown that reinforcement learning can enhance reasoning abilities in LLMs with little or no dependence on large quantities of supervised data.
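As a toy illustration of the multi-token prediction objective, the sketch below adds extra prediction heads at increasing offsets and averages their losses. DeepSeek-V3's actual MTP modules are chained sequentially and share the embedding and output layers, which this simplified version does not reproduce; all sizes and names here are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative multi-token prediction objective: besides the usual next-token
# loss, extra heads predict tokens further ahead of the current position.
vocab, d_model, depth = 32000, 1024, 2   # depth = number of extra future tokens

heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(1 + depth)])

def mtp_loss(hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """hidden: (batch, seq, d_model) trunk outputs; tokens: (batch, seq) token ids."""
    total = 0.0
    for k, head in enumerate(heads):          # k = 0 is the standard next-token head
        offset = k + 1
        logits = head(hidden[:, :-offset])    # positions with a target `offset` steps ahead
        target = tokens[:, offset:]
        total = total + F.cross_entropy(logits.reshape(-1, vocab), target.reshape(-1))
    return total / len(heads)

hidden = torch.randn(2, 12, d_model)
tokens = torch.randint(0, vocab, (2, 12))
loss = mtp_loss(hidden, tokens)
```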
Strategic Resource Utilisation: DeepSeek has effectively leveraged less powerful hardware and focused on algorithmic and structural innovation rather than brute-force scaling.
- Due to US sanctions, they had to use Nvidia H800 GPUs rather than the more powerful H100 GPUs that many US tech companies use. This shows that DeepSeek has optimized its software stack and training methodologies to produce results on lower-power hardware. They have also optimized their communications infrastructure to improve throughput and reduce latency.
- Open-Source Philosophy: DeepSeek's open-source approach enables global collaboration and quicker innovation. By publicly releasing their models (under an MIT license with a full technical report), they foster collaboration worldwide and accelerate the development of future models.
- They also offer their models at a much lower inference cost (100x cheaper than ChatGPT), making their technology accessible to a wider user base.
Deep Dive into DeepSeek's Key Models
DeepSeek-V3: The Cost-Effective Powerhouse
Architecture Details: DeepSeek-V3 employs a mixture of experts (MoE) architecture with 671 billion parameters, but only a portion (37 billion) is activated for each token.
- It is structured with 256 routed experts and 1 shared expert per MoE layer. The model employs Multi-Head Latent Attention (MLA) for efficient inference; the core of MLA is a low-rank joint compression of the attention keys and values that shrinks the Key-Value (KV) cache during inference. Keys and values are reconstructed from the compressed latent and combined with the queries through standard attention to yield the final output.
- Using shared experts and routed experts allows the model to distribute computation and learn specialized skills within each expert.
- DeepSeek-V3's design choices, such as using sigmoid functions to compute expert affinity scores, allow for more precise gating values. They have also introduced an auxiliary-loss-free strategy to prevent the performance degradation caused by load-balancing efforts. This is an innovative approach compared to other models, which rely on auxiliary losses (a toy sketch of this routing follows this list).
- They also incorporate a complementary sequence-wise balance loss, an additional feature that ensures balanced expert loading within sequences.
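Below is a toy sketch of the routing idea described above: sigmoid affinity scores select the top-k experts, and a per-expert bias is nudged after each step to rebalance load without any auxiliary loss. Expert counts, the bias update rule, and all variable names are assumptions for illustration, not DeepSeek's code:

```python
import torch

# Toy sketch: sigmoid affinity scores plus a per-expert bias that is adjusted
# after each step to rebalance load without an auxiliary loss term.
n_experts, top_k, d_model = 8, 2, 64
bias = torch.zeros(n_experts)                 # used for balancing only
centroids = torch.randn(n_experts, d_model)   # per-expert routing weights

def route(x: torch.Tensor):
    affinity = torch.sigmoid(x @ centroids.T)                    # (tokens, n_experts)
    chosen = torch.topk(affinity + bias, top_k, dim=-1).indices  # bias affects selection only
    gates = affinity.gather(-1, chosen)                          # gates come from raw affinities
    return chosen, gates / gates.sum(-1, keepdim=True)

def update_bias(chosen: torch.Tensor, gamma: float = 0.001):
    load = torch.bincount(chosen.flatten(), minlength=n_experts).float()
    overloaded = load > load.mean()
    bias[overloaded]  -= gamma                # push tokens away from busy experts
    bias[~overloaded] += gamma                # and toward underused ones

tokens = torch.randn(32, d_model)
chosen, gates = route(tokens)
update_bias(chosen)
```

Because the bias only influences which experts are selected, not the gating weights applied to their outputs, load can be steered without distorting the model's learned affinities.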
Training Details: The model is trained on 14.8 trillion high-quality tokens, with a strong focus on mathematical and programming samples. During training, documents are packed together into sequences without cross-sample attention masking.
- They also implement a Fill-in-the-Middle (FIM) strategy, in which the model learns to predict a middle span of text from the surrounding prefix and suffix (a rough sketch of the data construction follows this list).
- The training is remarkably stable, with no loss spikes or rollbacks. The models are trained in just 55 days.
- The models also undergo a context length extension (first to 32K and then to 128K) by applying YaRN after pre-training.
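A rough sketch of how Fill-in-the-Middle training data can be constructed in a prefix-suffix-middle layout is shown below. The sentinel token names and the FIM rate are placeholders; DeepSeek's exact sentinels and rate are not reproduced here:

```python
import random

# Rough sketch of FIM data construction in a prefix-suffix-middle (PSM) layout.
# Sentinel tokens and the FIM rate below are placeholders, not DeepSeek's values.
FIM_RATE = 0.1

def maybe_fim(document: str) -> str:
    if random.random() > FIM_RATE:
        return document                      # most documents stay in plain order
    i, j = sorted(random.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model sees the prefix and suffix first, then learns to generate the middle.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

print(maybe_fim("def add(a, b):\n    return a + b\n"))
```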
Performance: DeepSeek-V3 performs strongly across a wide range of benchmarks, outperforming other open-source models and matching closed-source models, including GPT-4o and Claude-3.5-Sonnet. It is particularly strong in mathematics, code, and reasoning tasks.
Training Cost: DeepSeek-V3's total training cost, including pre-training, context extension, and post-training, is approximately $5.6 million. This is significantly lower than the billions spent by some US companies.
Impact: DeepSeek-V3's performance and low training cost are pushing the boundaries of what's possible with large models and challenging the US dominance in the AI space.
DeepSeek-R1: The Reasoning Specialist
Unique Training Approach: DeepSeek-R1 is focused on reasoning and is trained through an innovative reinforcement learning (RL) pipeline. The R1-Zero variant is trained purely through RL, without any supervised fine-tuning (SFT).
- DeepSeek-R1-Zero: This model was trained using pure RL, which allowed it to develop complex chain-of-thought (CoT) reasoning abilities without supervised data. After thousands of RL steps, the model demonstrated significant improvement on reasoning tasks such as AIME 2024, where its pass@1 score increased from 15.6% to 71.0%.
- DeepSeek-R1 builds upon R1-Zero with a multi-stage training pipeline incorporating a small amount of cold-start data and two RL stages. This model also adds a language consistency reward during the RL phase. After fine-tuning on the newly collected data, it undergoes another RL phase.
- Performance: DeepSeek-R1 achieves powerful results in mathematical problem-solving (AIME 2024, MATH-500), coding, and other tasks requiring reasoning. It exceeds the performance of DeepSeek-V3 and matches the performance of OpenAI's o1 series on specific benchmarks.
DeepSeek trains its R1 models with Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) algorithm used to improve the reasoning capabilities of large language models (LLMs). Here's how GRPO is employed in R1:
- Cost-Effective RL: GRPO is used to save on the training costs of RL, as it does not use a critic model that is typically the same size as the policy model. Instead, GRPO estimates the baseline from group scores.
- Baseline Estimation: For each question, GRPO samples a group of outputs from the old policy and computes advantages from the rewards of the outputs within that group. Rather than using a critic model, the baseline is estimated from the group's scores (see the sketch after this list).
- Objective Maximization: GRPO optimizes the policy model by maximizing an objective function that includes a clipped policy ratio and a Kullback–Leibler divergence term. This helps stabilize the training process.
- Reward System: The reward system is rule-based and consists of accuracy rewards and format rewards. For DeepSeek-R1, a language consistency reward, calculated as the proportion of target-language words in the CoT, is also introduced.
- Self-Evolution: Using GRPO, DeepSeek-R1-Zero demonstrates a self-evolution process where the model learns to solve complex reasoning tasks using extended test-time computation. This leads to the spontaneous development of sophisticated behaviors, such as reflection and exploring alternative problem-solving approaches.
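The sketch below shows the group-relative advantage at the core of GRPO: score a group of sampled outputs for one question with the rule-based reward, then normalize by the group's mean and standard deviation instead of querying a critic model. The clipped-ratio and KL terms of the full objective are only referenced in comments, and the reward values are illustrative:

```python
import torch

# Minimal sketch of GRPO's group-relative baseline: advantages come from
# normalizing each output's reward against its own group, with no critic model.
def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group_size,) rule-based rewards for one question's sampled outputs."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # e.g. correctness rewards
advantages = group_relative_advantages(rewards)
# Each token of output i is then reinforced in proportion to advantages[i],
# with the usual PPO-style clipped ratio and a KL penalty toward the reference model.
```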
Conclusion
DeepSeek's rapid rise as an AI leader is a testament to its strategic and innovative approach. They have redefined how AI models are built and trained, proving that high performance can be achieved with limited resources. Their commitment to open-source and focus on efficient, innovative solutions position them as a major force in the global AI landscape. They've demonstrated that the future of AI will be shaped by those who innovate the fastest and most efficiently rather than by those with the largest budgets.