Many people have asked for my thoughts on DeepSeek AI. As I know some of the core folks involved across Silicon Valley, China, AI, and quant, here’s my quick take:
1. Strong Engineer-First Culture
DeepSeek was bootstrapped in 2023 and quickly gained recognition by replicating large-model techniques. With limited resources, they concentrate on core breakthroughs (like reasoning) rather than trying to tackle everything at once. Their engineering and research teams primarily hail from top Chinese universities, often with limited work experience.
2. Resource Efficiency
They’re known for innovative strategies to reduce GPU usage and training costs; that kind of resource efficiency is an essential skill in quant, where efficient feature engineering and model development are key to alpha generation. Rather than brute-forcing with massive compute, they prioritize high-quality data and targeted fine-tuning.
3. Minimal SFT (Supervised Fine-Tuning)
They have shown that robust reasoning can emerge with relatively little supervised data. They leverage a strong base model to generate data and then apply focused SFT to achieve impressive results.
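To make that concrete, here is a minimal sketch of a generate-then-fine-tune loop in PyTorch and Hugging Face Transformers. The model names, the prompt, and the quality filter are placeholders I made up for illustration; this is not DeepSeek's actual pipeline or data.

```python
# Minimal sketch: a strong base model synthesizes reasoning traces, which are
# filtered and used for a short SFT pass on a smaller model.
# Model names, prompts, and the filter below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Generate candidate training examples with a strong base model.
gen_tok = AutoTokenizer.from_pretrained("strong-base-model")          # placeholder name
gen_model = AutoModelForCausalLM.from_pretrained("strong-base-model").to(device)

prompts = ["Prove that the sum of two even numbers is even."]         # placeholder prompt set
synthetic = []
for p in prompts:
    inputs = gen_tok(p, return_tensors="pt").to(device)
    out = gen_model.generate(**inputs, max_new_tokens=256)
    text = gen_tok.decode(out[0], skip_special_tokens=True)
    if "therefore" in text.lower():      # stand-in for a real quality/correctness filter
        synthetic.append(text)

# 2) Focused SFT: plain causal-LM loss on the small curated set.
sft_tok = AutoTokenizer.from_pretrained("small-model")                 # placeholder name
sft_model = AutoModelForCausalLM.from_pretrained("small-model").to(device)
optimizer = torch.optim.AdamW(sft_model.parameters(), lr=1e-5)

sft_model.train()
for text in synthetic:
    batch = sft_tok(text, return_tensors="pt", truncation=True).to(device)
    loss = sft_model(**batch, labels=batch["input_ids"]).loss          # next-token prediction loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```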
4. Distillation: Short-Term Gains, Long-Term Trade-Offs
Distilling large models into smaller ones yields solid performance boosts. However, it may cap the model’s overall potential if genuinely new architectures aren’t explored.
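For readers who haven't seen it, the standard recipe looks roughly like the loss below: the student is trained to match the teacher's softened output distribution while still fitting the hard labels. The temperature and mixing weight are illustrative defaults, not any particular lab's recipe.

```python
# Sketch of a standard logit-distillation loss in PyTorch.
# temperature and alpha are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher guidance) with hard-label cross-entropy."""
    # Soften both distributions, then push the student toward the teacher.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kl = kl * temperature ** 2  # conventional scaling to keep gradient magnitudes comparable
    # Ordinary cross-entropy on the ground-truth tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))
    return alpha * kl + (1.0 - alpha) * ce

# Toy usage with random tensors (batch of 2, sequence of 4, vocab of 10).
student = torch.randn(2, 4, 10, requires_grad=True)
teacher = torch.randn(2, 4, 10)
labels = torch.randint(0, 10, (2, 4))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```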
5. Open Source vs. Closed Source
DeepSeek’s open releases show that smaller teams can still innovate at a high level. That puts pressure on closed-source labs to justify their high costs and massive resource usage.
6. Implications for Builders & 2025
Expect more architectural diversity beyond transformers, along with more advanced RL and agent applications. Many companies will opt for cost-efficient models (achieving 90–95% of state-of-the-art performance at less than 10% of the cost) for real-world products rather than continually chasing ever-larger, more resource-intensive SOTA models.
Bottom Line
DeepSeek proves you don’t always need massive GPU farms to make large language models shine. The core drivers of AGI—data, architecture, and compute—are showing diminishing returns from brute-force scaling, so we can anticipate more innovation in architectures and data synthesis. This may reduce NVIDIA’s training-centric advantage (as demand shifts toward inference), but it will likely accelerate overall AGI adoption—an instance of the Jevons Paradox in action.
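A back-of-the-envelope illustration of that Jevons point (all numbers hypothetical): if inference gets 10x cheaper and usage grows more than 10x in response, total spend on compute rises rather than falls.

```python
# Hypothetical numbers only: cheaper inference + elastic demand => more total compute spend.
cost_per_query_old = 1.00                        # arbitrary cost units
cost_per_query_new = cost_per_query_old / 10     # assume a 10x efficiency gain

queries_old = 1_000_000
queries_new = queries_old * 25                   # assume demand grows 25x as the price drops

print(queries_old * cost_per_query_old)   # 1,000,000.0 -> total spend before
print(queries_new * cost_per_query_new)   # 2,500,000.0 -> total spend after, 2.5x higher
```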