DeepSeek: When Innovation Shines

Note: This section is part of a longer blog post on Model Size and its impact on performance and inference (to be published next week)

It would be remiss of me not to mention why DeepSeek is causing such a stir and why people and financial markets are losing their minds over it.

Training LLMs is expensive, and companies like OpenAI and Anthropic spend between $70M and $150M (rumoured) on compute per model. Let that sink in - that's just for compute. They need massive data centres packed with tens of thousands of GPUs to make this happen. Everyone assumed that better AI models needed more compute, which meant hundreds of millions in investment. Until now.

DeepSeek flips this script by building a model that matches or even beats GPT-4 and Claude on many tasks - and they did it for just under $6M (see footnote 1). That's like getting a Ferrari's performance for the price of a Toyota. They pulled this off with several clever innovations:

  • Think of traditional AI training as storing every number with 32 bits of precision - like writing your bank balance out to a long string of decimal places. DeepSeek found a way to do much of the arithmetic accurately with just 8 bits (FP8) instead, which means they need far less memory to get the same results (a toy sketch of the idea follows this list).
  • They took the mixture-of-experts idea - sparse models built from specialised experts - to another level: instead of one big model trying to know everything, only the experts a given token actually needs wake up and do any work. It's very similar in spirit to Mistral's Mixtral models (the routing is sketched below).
  • Their multi-token prediction is like reading multiple words at once instead of one at a time - imagine taking in "New York City" as one chunk rather than three separate words. The extra tokens it predicts are accepted roughly 90% of the time, which makes generation about 2x faster - a huge difference when you're processing billions of tokens (there's a toy sketch of this after the list too).
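
To make the low-precision point concrete, here's a minimal sketch in plain NumPy. It is not DeepSeek's FP8 pipeline - the real training uses hardware FP8 formats with much finer-grained scaling - but a simple symmetric 8-bit scheme shows why 8 bits can be enough for typical weight values:

```python
import numpy as np

# Toy stand-in for low-precision training: store weights in 8 bits instead
# of 32, using one scale factor per tensor (symmetric quantisation).
# DeepSeek's real pipeline uses hardware FP8 with per-block scaling; this
# only illustrates why 8 bits is often "good enough".

def quantize_int8(weights: np.ndarray):
    """Map float32 weights onto int8 values plus a single scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the 8-bit representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)  # typical weight scale

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print("memory: 32-bit =", w.nbytes // 1024, "KiB, 8-bit =", q.nbytes // 1024, "KiB")
print("mean absolute error:", float(np.abs(w - w_hat).mean()))
```

A quarter of the memory (and much cheaper matrix maths on hardware with native 8-bit support) for an error that is tiny relative to the weights themselves.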

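The mixture-of-experts idea is perhaps easiest to see in code. The sketch below is a toy router - made-up sizes, a plain softmax gate, NumPy only, nothing like DeepSeek's actual implementation - but it shows the key trick: for any one token, only 2 of the 8 experts do any work:

```python
import numpy as np

rng = np.random.default_rng(1)

NUM_EXPERTS, TOP_K, D_MODEL, D_HIDDEN = 8, 2, 64, 256  # toy sizes

# Each "expert" is its own small feed-forward network.
experts = [
    (rng.normal(0, 0.02, (D_MODEL, D_HIDDEN)), rng.normal(0, 0.02, (D_HIDDEN, D_MODEL)))
    for _ in range(NUM_EXPERTS)
]
gate_w = rng.normal(0, 0.02, (D_MODEL, NUM_EXPERTS))  # router weights

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ gate_w
    top = np.argsort(logits)[-TOP_K:]                           # chosen experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over chosen only
    out = np.zeros_like(x)
    for w_mix, idx in zip(weights, top):
        w1, w2 = experts[idx]
        out += w_mix * (np.maximum(x @ w1, 0.0) @ w2)           # only 2 of 8 experts run
    return out

token = rng.normal(size=D_MODEL)
print(moe_layer(token).shape)  # (64,) -- same output shape, a quarter of the expert compute
```

The total parameter count can keep growing with the number of experts, while the compute per token stays pinned to the few experts the router picks - which is how a very large model can still be cheap to run.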
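The multi-token point can be sketched as a tiny speculative loop: something cheap proposes a few tokens at once, a verifier keeps the longest prefix it agrees with, and decoding advances several words per step. The token lists and "models" below are made up - this is a generic illustration of the acceptance idea, not DeepSeek's multi-token head:

```python
# Toy speculative decoding: a draft guesses several tokens per step, a
# verifier keeps the longest prefix it agrees with. Tokens and models are
# invented; the point is the step count, not the text.

def draft(context, k=3):
    """Pretend draft model: guesses the next k tokens (sometimes wrong)."""
    guesses = {"New": ["York", "City", "is"], "is": ["a", "big", "apple"]}
    return guesses.get(context[-1], ["the"] * k)

def verify(context, proposed):
    """Pretend main model: accepts a prefix of the proposal, fixes the rest."""
    truth = {"New": ["York", "City", "is"], "is": ["a", "big", "city"]}
    correct = truth.get(context[-1], ["the"])
    accepted = []
    for p, t in zip(proposed, correct):
        if p != t:
            accepted.append(t)      # first mismatch: take the verifier's token
            break
        accepted.append(p)
    return accepted

context, steps = ["New"], 0
while len(context) < 7 and steps < 10:
    context += verify(context, draft(context))
    steps += 1

print(" ".join(context), f"| decoding steps: {steps}")
```

Six new tokens arrive in two verification passes instead of six; the higher the acceptance rate, the closer generation gets to that ideal speed-up.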
The cherry on top: it's open-source with a very permissive MIT license, which allows unrestricted commercial use. This could unleash a wave of innovation from developers, researchers, and creators who were previously priced out of the AI race. Sometimes the biggest breakthroughs come not from throwing more resources at a problem, but from fundamentally rethinking how we solve it.


  1. There is some contention around the reported $6 million training cost for DeepSeek-R1; it's likely being confused with DeepSeek-V3, released last December.

Sita Kunz SKR

Technology Consulting | People Connector | Salesforce Enthusiast

1 month

Thanks for sharing and summarising the long post for the readers. I like "That's like getting a Ferrari's performance for the price of a Toyota."

Ian Douglas

Driving Innovative Agentic AI Solutions with Global System Integrators and Top Consulting Partners @ Salesforce across US and Canada

1 month

The distillation of OpenAI's data quality into their own efficient data for learning is definitely a cool headline here, Anup Jadhav. The other thing that sticks out for me is the RL that produced its super-powered approach to CoT. That is going to be the creative spark from RL that all others incorporate to power reasoning. With it, the agentic era is really going to take off! I believe they also didn't even need to do RLHF!?!

Mark Wraith

Salesforce CTA - Slalom

1 month

If I understand it correctly, we are now in a world where LLMs are training LLMs, and what DeepSeek have achieved is a highly optimised architecture for doing that. I believe the data they trained the model on is in fact closed source, but it is known to include a mixture of outputs from other open-source and commercial models. The irony is OpenAI can hardly complain that their model was used to train DeepSeek, given OpenAI used copyrighted material for training.

Michael Gill

Chief Technical Officer | AWS Cloud Architect & Engineer | Salesforce Architect | MVP Hall of Fame

1 month

Thanks for this Anup Jadhav, perfect.

Thank you Anup for this summary. Attempting to use an analogy: are they offering something similar to columnar database benefits over relational, when columnar made massive strides in performance and cost - as long as we knew how to handle the nulls in the database?
