This week, we will explore DeepSeek, a Chinese AI lab that has rapidly gained recognition for its innovative LLM development approach. Unlike many well-funded US tech companies, DeepSeek has achieved remarkable performance and efficiency with comparatively limited resources.
DeepSeek-V3 was developed in just two months on a $5.58M budget by a team led by hedge fund managers. Its API is currently about 100x cheaper than ChatGPT's.
DeepSeek-R1 is one of the top models in the LMSYS Chatbot Arena. It is tied with ChatGPT and Gemini on most benchmarks and is the only MIT-licensed open-source model on the leaderboard. Oh, and "DeepSeek Buzz Puts Tech Stocks on Track for a $1 Trillion Drop."
Special thanks to Ouyang Ruofei for assisting with the research.
AI Podcast Discussion
This week's podcast provides an excellent summary, especially of the more challenging technical details and their significance to the AI industry.
Why This Tech Matters
- Challenging US Dominance: DeepSeek has emerged as a strong contender in the global AI race. It demonstrates that innovation isn't solely dependent on vast resources, and its rise has been called a 'Sputnik moment' for the US.
- Democratizing AI: Their open-source models and cost-effective approaches make advanced AI accessible to more developers and companies.
- Pushing Boundaries of LLMs: The breakthroughs in reasoning capabilities demonstrated by the DeepSeek-R1 series and the efficiency gains of DeepSeek-V3 significantly expand the possibilities of LLMs.
- Future Research: DeepSeek's pioneering techniques, such as auxiliary loss-free load balancing, multi-token prediction, and reinforcement learning without supervised fine-tuning, are setting the standard for future AI research.
- Global Impact: Their work is not just a win for China; it's a wake-up call for the worldwide tech industry. It demonstrates that innovation can thrive under constraints and that the future of AI is decentralized and collaborative.
DeepSeek's Efficiency Advantage: A Multi-faceted Approach
DeepSeek's ability to outperform many of its competitors while spending significantly less comes down to several factors:
Optimized Architectures: DeepSeek has developed architectures specifically designed for efficient training and inference.
- Multi-head Latent Attention (MLA): This key component reduces memory demands by compressing the attention keys and values into a small latent vector. By shrinking the KV cache during inference, MLA allows DeepSeek to achieve faster performance without requiring excessive memory (a minimal sketch of the compression idea follows this list).
- DeepSeekMoE: This mixture-of-experts architecture uses finer-grained experts plus a shared expert, distributing the computational load and allowing for more economical training. This differs from many traditional MoE models, which use larger experts and less flexible load balancing. Furthermore, only a small fraction of parameters is activated for each token, making training efficient. In effect, DeepSeek selects a subset of relevant experts to tackle each token instead of activating every expert.
- Auxiliary-Loss-Free Load Balancing: Instead of relying on auxiliary loss functions, DeepSeek balances expert load with an adjustment to the routing scores, avoiding the performance degradation that load-balancing losses typically cause and allowing experts to specialize better within their domains.
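To make the MLA idea concrete, here is a minimal sketch of the low-rank key-value compression it relies on. The dimensions, layer names, and the omission of the decoupled rotary-embedding path are simplifications for illustration, not DeepSeek's actual implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch of the low-rank KV compression idea behind Multi-head Latent
# Attention. Dimensions and layer names are illustrative, not DeepSeek's code.
d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

W_down_kv = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
W_up_k    = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head keys
W_up_v    = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to per-head values

def kv_from_cache(hidden_states: torch.Tensor):
    """Cache only the small latent c_kv; reconstruct K and V on the fly."""
    c_kv = W_down_kv(hidden_states)   # (batch, seq, d_latent) -- this is what gets cached
    k = W_up_k(c_kv).view(*c_kv.shape[:2], n_heads, d_head)
    v = W_up_v(c_kv).view(*c_kv.shape[:2], n_heads, d_head)
    return c_kv, k, v

x = torch.randn(1, 16, d_model)
c_kv, k, v = kv_from_cache(x)
# In this toy configuration, the KV cache per token shrinks from
# 2 * n_heads * d_head = 8192 values to d_latent = 512 values.
```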
Advanced Training Techniques: DeepSeek uses several innovative training techniques to improve efficiency further.
- Multi-Token Prediction (MTP): The models are trained to predict several future tokens at each position rather than only the next one, densifying the training signal and improving overall performance. This changes the training objective and is an approach not widely adopted by their competitors (a toy version of the objective is sketched after this list).
- FP8 Training: This mixed precision training using FP8 data format significantly reduces the memory footprint during training. This contrasts with traditional training methods that may use higher precision and be more resource-intensive.
- DualPipe: This pipeline parallelism algorithm overlaps computation and communication, reducing pipeline bubbles and optimizing resource usage.
- Memory Optimisation: They carefully optimize the memory footprint during training, avoiding the need for expensive tensor parallelism. DeepSeek has designed the system to share parameters and gradients to further enhance memory efficiency.
- Reinforcement Learning: The company has shown that reinforcement learning can enhance reasoning abilities in LLMs with little or no dependence on large quantities of supervised data.
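As a toy illustration of the multi-token prediction objective, the sketch below adds extra prediction heads at increasing offsets and averages their losses. DeepSeek-V3's actual MTP modules are chained sequentially and share the embedding and output layers, which this simplified version does not reproduce; all sizes and names here are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative multi-token prediction objective: besides the usual next-token
# loss, extra heads predict tokens further ahead of the current position.
vocab, d_model, depth = 32000, 1024, 2   # depth = number of extra future tokens

heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(1 + depth)])

def mtp_loss(hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """hidden: (batch, seq, d_model) trunk outputs; tokens: (batch, seq) token ids."""
    total = 0.0
    for k, head in enumerate(heads):          # k = 0 is the standard next-token head
        offset = k + 1
        logits = head(hidden[:, :-offset])    # positions with a target `offset` steps ahead
        target = tokens[:, offset:]
        total = total + F.cross_entropy(logits.reshape(-1, vocab), target.reshape(-1))
    return total / len(heads)

hidden = torch.randn(2, 12, d_model)
tokens = torch.randint(0, vocab, (2, 12))
loss = mtp_loss(hidden, tokens)
```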
Strategic Resource Utilisation: DeepSeek has effectively leveraged less powerful hardware and focused on algorithmic and structural innovation rather than brute-force scaling.
- Due to US sanctions, they had to use Nvidia H800 GPUs rather than the more powerful H100 GPUs that many US tech companies use. This shows that DeepSeek has optimized its software stack and training methodologies to produce results on lower-power hardware. They have also optimized their communications infrastructure to improve throughput and reduce latency.
- Open-Source Philosophy: DeepSeek's open-source approach enables global collaboration and quicker innovation. By publicly releasing their models (under an MIT license with a full technical report), they foster collaboration worldwide and accelerate the development of future models.
- They also offer their models at a much lower inference cost (100x cheaper than ChatGPT), making their technology accessible to a wider user base.
Deep Dive into DeepSeek's Key Models
DeepSeek-V3: The Cost-Effective Powerhouse
Architecture Details: DeepSeek-V3 employs a mixture of experts (MoE) architecture with 671 billion parameters, but only a portion (37 billion) is activated for each token.
- It is structured with 256 routed experts and 1 shared expert per MoE layer. The model employs Multi-Head Latent Attention (MLA) for efficient inference; the core of MLA is a low-rank joint compression of the attention keys and values that shrinks the Key-Value (KV) cache during inference. Keys and values are reconstructed from the compressed latent and combined with the queries through standard attention to yield the final output.
- Using shared experts and routed experts allows the model to distribute computation and learn specialized skills within each expert.
- DeepSeek-V3's design choices, such as using sigmoid functions to compute expert affinity scores, allow for more precise gating values. They have also introduced an auxiliary-loss-free strategy to prevent the performance degradation caused by load-balancing efforts. This is an innovative approach compared to other models, which rely on auxiliary losses (a toy sketch of this routing follows this list).
- They also incorporate a complementary sequence-wise balance loss, an additional feature that ensures balanced expert loading within sequences.
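Below is a toy sketch of the routing idea described above: sigmoid affinity scores select the top-k experts, and a per-expert bias is nudged after each step to rebalance load without any auxiliary loss. Expert counts, the bias update rule, and all variable names are assumptions for illustration, not DeepSeek's code:

```python
import torch

# Toy sketch: sigmoid affinity scores plus a per-expert bias that is adjusted
# after each step to rebalance load without an auxiliary loss term.
n_experts, top_k, d_model = 8, 2, 64
bias = torch.zeros(n_experts)                 # used for balancing only
centroids = torch.randn(n_experts, d_model)   # per-expert routing weights

def route(x: torch.Tensor):
    affinity = torch.sigmoid(x @ centroids.T)                    # (tokens, n_experts)
    chosen = torch.topk(affinity + bias, top_k, dim=-1).indices  # bias affects selection only
    gates = affinity.gather(-1, chosen)                          # gates come from raw affinities
    return chosen, gates / gates.sum(-1, keepdim=True)

def update_bias(chosen: torch.Tensor, gamma: float = 0.001):
    load = torch.bincount(chosen.flatten(), minlength=n_experts).float()
    overloaded = load > load.mean()
    bias[overloaded]  -= gamma                # push tokens away from busy experts
    bias[~overloaded] += gamma                # and toward underused ones

tokens = torch.randn(32, d_model)
chosen, gates = route(tokens)
update_bias(chosen)
```

Because the bias only influences which experts are selected, not the gating weights applied to their outputs, load can be steered without distorting the model's learned affinities.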
Training Details: The model is trained on 14.8 trillion high-quality tokens, with a strong focus on mathematical and programming samples. During training, documents are packed together into sequences without cross-sample attention masking.
- They also implement a Fill-in-the-Middle (FIM) strategy, in which the model learns to predict a middle span of text from the surrounding prefix and suffix (a rough sketch of the data construction follows this list).
- The training is remarkably stable, with no loss spikes or rollbacks. The models are trained in just 55 days.
- The models also undergo a context length extension (first to 32K and then to 128K) by applying YaRN after pre-training.
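A rough sketch of how Fill-in-the-Middle training data can be constructed in a prefix-suffix-middle layout is shown below. The sentinel token names and the FIM rate are placeholders; DeepSeek's exact sentinels and rate are not reproduced here:

```python
import random

# Rough sketch of FIM data construction in a prefix-suffix-middle (PSM) layout.
# Sentinel tokens and the FIM rate below are placeholders, not DeepSeek's values.
FIM_RATE = 0.1

def maybe_fim(document: str) -> str:
    if random.random() > FIM_RATE:
        return document                      # most documents stay in plain order
    i, j = sorted(random.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model sees the prefix and suffix first, then learns to generate the middle.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

print(maybe_fim("def add(a, b):\n    return a + b\n"))
```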
Performance: DeepSeek-V3 performs strongly across a wide range of benchmarks, outperforming other open-source models and matching closed-source models, including GPT-4o and Claude-3.5-Sonnet. It is particularly strong in mathematics, code, and reasoning tasks.
Training Cost: DeepSeek-V3's total training cost, including pre-training, context extension, and post-training, is approximately $5.6 million. This is significantly lower than the billions spent by some US companies.
Impact: DeepSeek-V3's performance and low training cost are pushing the boundaries of what's possible with large models and challenging the US dominance in the AI space.
DeepSeek-R1: The Reasoning Specialist
Unique Training Approach: DeepSeek-R1 is focused on reasoning and is trained through an innovative reinforcement learning (RL) pipeline. The R1-Zero variant is trained purely through RL, without any supervised fine-tuning (SFT).
- DeepSeek-R1-Zero: This model was trained using pure RL, which allowed it to develop complex chain-of-thought (CoT) reasoning abilities without supervised data. After thousands of RL steps, the model demonstrated significant improvement on reasoning tasks such as AIME 2024, where its pass@1 score increased from 15.6% to 71.0%.
- DeepSeek-R1 builds upon R1-Zero with a multi-stage training pipeline incorporating a small amount of cold-start data and two RL stages. This model also adds a language consistency reward during the RL phase. After fine-tuning on the newly collected data, it undergoes another RL phase.
- Performance: DeepSeek-R1 achieves powerful results in mathematical problem-solving (AIME 2024, MATH-500), coding, and other tasks requiring reasoning. It exceeds the performance of DeepSeek-V3 and matches the performance of OpenAI's o1 series on specific benchmarks.
DeepSeek trains its R1 models with Group Relative Policy Optimization (GRPO), a reinforcement learning (RL) algorithm used to improve the reasoning capabilities of large language models (LLMs). Here's how GRPO is employed in R1:
- Cost-Effective RL: GRPO is used to save on the training costs of RL, as it does not use a critic model that is typically the same size as the policy model. Instead, GRPO estimates the baseline from group scores.
- Baseline Estimation: For each question, GRPO samples a group of outputs from the old policy and computes advantages from the rewards of the outputs within that group. Rather than using a critic model, the baseline is estimated from the group's scores (see the sketch after this list).
- Objective Maximization: GRPO optimizes the policy model by maximizing an objective function that includes a clipped policy ratio and a Kullback–Leibler divergence term. This helps stabilize the training process.
- Reward System: The reward system is rule-based and consists of accuracy rewards and format rewards. For DeepSeek-R1, a language consistency reward, calculated as the proportion of target-language words in the CoT, is also introduced.
- Self-Evolution: Using GRPO, DeepSeek-R1-Zero demonstrates a self-evolution process where the model learns to solve complex reasoning tasks using extended test-time computation. This leads to the spontaneous development of sophisticated behaviors, such as reflection and exploring alternative problem-solving approaches.
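The sketch below shows the group-relative advantage at the core of GRPO: score a group of sampled outputs for one question with the rule-based reward, then normalize by the group's mean and standard deviation instead of querying a critic model. The clipped-ratio and KL terms of the full objective are only referenced in comments, and the reward values are illustrative:

```python
import torch

# Minimal sketch of GRPO's group-relative baseline: advantages come from
# normalizing each output's reward against its own group, with no critic model.
def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (group_size,) rule-based rewards for one question's sampled outputs."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # e.g. correctness rewards
advantages = group_relative_advantages(rewards)
# Each token of output i is then reinforced in proportion to advantages[i],
# with the usual PPO-style clipped ratio and a KL penalty toward the reference model.
```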
Conclusion
DeepSeek's rapid rise as an AI leader is a testament to its strategic and innovative approach. They have redefined how AI models are built and trained, proving that high performance can be achieved with limited resources. Their commitment to open-source and focus on efficient, innovative solutions position them as a major force in the global AI landscape. They've demonstrated that the future of AI will be shaped by those who innovate the fastest and most efficiently rather than by those with the largest budgets.