Hide and DeepSeek
Adrian De Luca
Technologist, Advisor, Investor & Director Cloud Acceleration at Amazon Web Services (AWS)
How quickly the world changes again! On my first week back from a six-week summer sabbatical, where my mind was mostly occupied with whether it was a beach or pool day, I return to work in the midst of the DeepSeek frenzy.
DeepSeek, or more precisely its R1 model, is of course the latest large language model to be unleashed onto the world from a Chinese firm, and it appears to offer remarkable reasoning capabilities, showing comparable results to OpenAI’s o1 models. Most of the public commentary over the past few weeks has swirled around the reported tiny $6M (USD) cost to train it, the tech sell-off on Wall Street, this being a “sputnik moment” for the US, and where the hell did the Chinese get all those GPUs anyway when there is supposed to be an embargo? In this hot take, I’m going to cover none of that and instead offer some of my initial experience and what I learned playing with it, but more importantly offer my perspective on why this is yet another important moment in the development of this disruptive branch of Artificial Intelligence, and how it will most likely thrust us forward even faster.
Firstly, I’m going to come right out and say that the emergence of DeepSeek R1 is genuinely a great thing for everyone. Putting aside any nationalist or political interests, whether the cost to train it was in fact true, and the AI boogie man, the release of this very capable reasoning model introduces several new advances in large language models that everyone should understand. I’m going to attempt to break them down in some very specific areas and hopefully not get overly techy.
How open is open?
Open source software has, by and large, been good for users and companies. Its principles, grounded in software being not only freely available but also freely distributable and modifiable, have proven to significantly accelerate the development of features, integrations and overall capabilities of the tools we use in our society today. Think about all the things we wouldn’t have today if Linus Torvalds had never open sourced the Linux operating system.
We are at the earliest phases of Generative AI technologies, and with LLMs showing so much promise, they must rapidly evolve in order to reach broadscale adoption. That means getting them into the hands of as many clever developers, scientists, engineers and builders as possible, so open source seems like a no brainer. Unfortunately, open source has been having a bit of a “moment”, with several high profile companies walking away from standard licenses, defining their own versions and muddying the waters. But in the world of AI, open source is much more nuanced. OpenAI, who as the name implies started out with noble intentions of being “open” with their technology, actually turned out to be quite closed about almost everything. For this development to occur, and for trust to be gained, transparency is needed at every layer, which means not just the source code of how it works, but the training data used and the weights applied. The recent announcement of the Open Weight Definition (OWD) is a sign that we are heading in this direction, but it ultimately comes down to the companies developing these models adopting such standards.
Now with DeepSeek R1, they have taken a decisive step ahead of other model providers by open sourcing the weights used for fine-tuning and the training code, although the underlying training data is still unavailable and largely unknown. While still not fully transparent, these assets together with R1’s whitepaper will give others the ability to inspect, learn, evolve and even adapt its efficiencies to other models. But let me be clear: it’s not “open source”, but it is the most open we’ve seen yet, and it will no doubt prompt other model creators to rethink their level of transparency.
A new approach to training
Most LLMs available today from companies like OpenAI, Anthropic, Mistral, Cohere and others use a supervised training approach with instruction-based fine-tuning. This means that in order to get better results you need to increase the number of parameters (now in the tens of billions), which leads to more complex number crunching, which ultimately translates to needing a lot more GPU horsepower to throw at it. This is what is making them so expensive to train.
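To make that concrete, here is a minimal sketch of the supervised, next-token objective these models are trained on. This is my own illustration in PyTorch, with random tensors standing in for a real model and tokenizer, not any vendor’s actual training code:

```python
# A minimal sketch of the supervised fine-tuning objective: the model is
# trained to reproduce a human-written response token by token, scored with
# cross-entropy loss. Illustrative only; not any vendor's actual code.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over an instruction/response sequence."""
    shifted_logits = logits[:, :-1, :]   # position t predicts token t+1
    shifted_targets = target_ids[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
    )

# Toy usage: random tensors stand in for a model's output and a tokenized batch.
batch, seq_len, vocab = 2, 16, 1000
logits = torch.randn(batch, seq_len, vocab)
targets = torch.randint(0, vocab, (batch, seq_len))
print(sft_loss(logits, targets))
```

Every additional parameter adds more of these matrix multiplications, which is why scaling this approach translates so directly into GPU spend.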
DeepSeek’s R1 is different: it uses a reinforcement learning training approach, which leverages an efficient rule-based reward system rather than a complex neural reward model, and all of this means it requires far fewer accelerators, or can be done with less capable GPUs. The architecture they use is actually called a Mixture of Experts (MoE), and with it they are able to exploit a phenomenon known as “sparsity”. I won’t go into all the details here (but read here if you are interested), but it can effectively “switch off” large parts of the model’s neural network weights, making it more efficient to train while not materially affecting the model’s output. They also pioneered a clever attention mechanism called Multi-Head Latent Attention, which compresses the memory-hungry key-value cache (again, read here for more details) to make long context inference more efficient. And they’ve found some neat ways of doing model distillation, compressing knowledge down to models as small as 1.5B parameters, making knowledge transfer more efficient.
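To illustrate the rule-based reward idea, here is a hypothetical sketch. DeepSeek’s paper describes its rewards only at a high level (an accuracy reward for verifiable answers, and a format reward that encourages reasoning inside <think> tags), so the checks below are my own simplification, not their actual code:

```python
# A hypothetical rule-based reward: cheap programmatic checks score a sampled
# answer, replacing an expensive learned neural reward model. Illustrative only.
import re

def rule_based_reward(response: str, expected_answer: str) -> float:
    reward = 0.0
    # Accuracy rule: for verifiable tasks like maths, exact-match the final
    # answer (here assumed to be wrapped in \boxed{...}).
    match = re.search(r"\\boxed\{(.+?)\}", response)
    if match and match.group(1).strip() == expected_answer:
        reward += 1.0
    # Format rule: reward responses that expose their chain of thought
    # inside <think>...</think> tags before giving the final answer.
    if re.search(r"<think>.+?</think>", response, re.DOTALL):
        reward += 0.2
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think> \\boxed{4}", "4"))  # 1.2
```

Because these rules cost almost nothing to evaluate, the reinforcement learning loop avoids training and running a second large neural network just to score outputs.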
Now I know that was all a bit heavy to pack in, and whether DeepSeek spent only $6M or more will remain a matter of debate; however, my experience with it was mostly impressive. Not only were the answers well structured and contextual, but its reasoning and chain of thought was very logical. What is more important is that we now have new techniques, and potentially new lines of research to explore, for the most expensive part of building frontier models: the training. And if more models can be trained this way, we can dramatically reduce the cost of developing a wide range of models, which would encourage us to apply them in significantly more use cases.
Are we back to Jevons Paradox?
Such a dramatic enhancement to efficiency has seemingly broken the cardinal rule that training frontier models is uber expensive, requiring hundreds of millions of dollars of investment, and suggested that the “great data center build out”, predominantly led by the large cloud hyperscalers, will actually not be needed, or at least not to the same extent, after all. Industry pundits and investors have at least signalled as much: they sent NVIDIA stock tumbling on Wall Street immediately after the news of DeepSeek, and even continued to dog the recent earnings reports of Amazon, Google and Microsoft as they stridently defended their infrastructure plans to the tune of tens of billions of dollars.
But these fears and concerns are largely unfounded, as we know from many generations of technological advancement, like the cloud. When such efficiencies are unlocked and the cost barrier is reduced, people will find more ways to use a technology, not fewer. Several people online have made reference to the Jevons Paradox, whereby better unit costs fuel greater and faster consumption, as a way to foresee what is most likely going to play out.
DeepSeek’s economics for running prompts are tantalisingly attractive: they charge just $0.55 per million input tokens and $2.19 per million output tokens, which is far more affordable when compared to OpenAI’s API pricing of $15 and $60 for the same. The fact is AI economics matter not only to consumers, but to companies committing to their AI projects and laying down their budgets. In 2025, as many companies who have experimented with and validated Generative AI look to transition those projects into production, this is going to be a very critical factor.
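To put those prices in perspective, here is a quick back-of-the-envelope calculation. The workload figures are purely hypothetical; only the per-million-token prices come from the paragraph above:

```python
# Back-of-the-envelope cost comparison using the quoted per-million-token
# prices. The workload volumes below are hypothetical.
PRICES = {  # USD per 1M tokens
    "DeepSeek R1": {"input": 0.55, "output": 2.19},
    "OpenAI o1": {"input": 15.00, "output": 60.00},
}

input_tokens = 50_000_000    # e.g. 100,000 prompts averaging 500 tokens
output_tokens = 100_000_000  # e.g. responses averaging 1,000 tokens

for model, p in PRICES.items():
    cost = (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]
    print(f"{model}: ${cost:,.2f}")

# DeepSeek R1: $246.50
# OpenAI o1: $6,750.00
```

At those rates the same workload costs roughly 27 times less, and that is the kind of delta that changes which use cases get a budget at all.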
So can we trust it?
When it comes to data privacy, security and responsibility in Generative AI, there is no alternative to maintaining a very high bar. The nature of how we interact with LLMs through prompts means they have to be very good at blocking harmful behaviours, not contributing to illegal activities or cybercrime, and avoiding misinformation, just to name a few. Increasingly we are seeing models incorporate reasoning transparency as part of their output to help users. The consequences of wavering on any of these areas are simply too high to calculate when you think of use cases where they are being used to give people credible advice.
Unfortunately, this is an area where DeepSeek is already failing. We have seen a dreadful result on the HarmBench tests, with a 100% attack success rate, and cloud security provider Wiz.io’s research exposed a publicly accessible database with sensitive information, through which they were able to gain full control of operations. There is also well-founded evidence that information is being sent back to Chinese servers. And concerns about its Chinese origins have already led several countries around the world to call for restrictions on, or to outright block, its use at least within government.
All of this is quite worrying and DeepSeek really needs to get a handle on it; however, this is where my guidance for experimenting with any model comes in. Whether you are playing around, experimenting, piloting or putting it into production, you should always take measures to protect yourself. Since we can’t audit the training data, you should rigorously test model outputs for bias, reliability, and security vulnerabilities yourself. Implementing guardrails from model hosters to filter, monitor, and validate responses before they reach users is also a must.
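As a trivial illustration of that last point, here is a toy output filter. A real deployment would lean on a managed guardrail service from your model hoster rather than a hand-rolled keyword list, but the shape of the check is the same: validate every response before it reaches the user:

```python
# A toy guardrail: screen model output against simple rules before returning
# it to the user. Illustrative only; production systems should use a managed
# guardrail service and far more robust detection.
import re

BLOCKED_PATTERNS = [
    r"(?i)\bhow to (build|make) (a )?(bomb|weapon)\b",  # harmful instructions
    r"\b\d{3}-\d{2}-\d{4}\b",                           # US SSN-shaped PII
]

def validate_response(text: str) -> str:
    """Return the model's text, or a refusal if any blocked pattern matches."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, text):
            return "Sorry, I can't help with that."
    return text

print(validate_response("The capital of France is Paris."))
```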
In summary
Despite several shortcomings in this release of the DeepSeek R1 model, it deserves its place in the ever-growing line-up of frontier models. It also shows that those working in this brave new world of AI should seek to be open and transparent in the true pursuit of democratising access for everyone on the planet, and avoid closing off access regardless of demographics, means or sovereignty. As I stated at the beginning, I remain overwhelmingly optimistic about how this will push development further and faster into the future. My team and I will surely be experimenting with and evaluating it in our customer prototyping engagements this year, where I look forward to putting it to the test against real customer use cases. I’m also keen to hear other people’s perspectives as comments on this blog.
But what I have learned over my many decades in tech is that these are moments to embrace with vigour and curiosity, but not without well-founded knowledge and caution.
So bravely go forward and experiment, and you can get your hands on it here on AWS.