OpenAI vs. DeepSeek: The Distillation Battle Shaping Next-Gen AI
Vinay Ananth R.
Empowering businesses with innovative solutions | Sales | Generative AI & ML | IoT/ IIoT | Cloud | Presales | Product Owner
The latest hype in the AI sphere has included accusations surrounding two heavyweights: OpenAI and DeepSeek.
Released on January 20th, DeepSeek’s R1 made history within a week, triggering the largest single-day market-value loss for any company in US stock market history (a bad day for NVIDIA).
R1’s power play was the fact that it performed nearly as well as OpenAI's most advanced models, while reportedly costing only a tiny fraction of what GPT-4 cost to build.
Now, we have an idea of just how they did it.
OpenAI recently emerged with allegations that DeepSeek copied their models —specifically, through a process called model distillation. Model distillation is a deep learning technique that leverages powerful AI systems to create smaller, more efficient ones.
Distillation isn’t evil in its own right, but it does cross into some ethical gray areas, which is precisely where OpenAI has targeted its claims. OpenAI alleges that DeepSeek leveraged distillation to train R1 in a way that violated OpenAI’s Terms of Use.
The thing is, we know for a fact that DeepSeek uses distillation... but so do others (including OpenAI).
Let’s take an objective look at the distillation controversy today.
We’ll cover:
- What model distillation is and how it works
- The benefits of distillation
- Ethical concerns and OpenAI's accusations
- DeepSeek’s known uses of distillation
- What this controversy means for the future of AI
Let’s dive in.
What Is Model Distillation?
Outside of the context of LLMs, distillation is all about extracting the most valuable parts from a messy mixture. Think of how we turn seawater into fresh drinking water: boil off the salt and keep the good stuff. The key idea is separating what’s essential from the excess.
Model distillation draws this parallel in AI—where essential knowledge is extracted from a massive AI model and passed down to a smaller one.
Model distillation involves knowledge transfer between two models:
- A teacher model: a large, powerful model with broad capabilities
- A student model: a smaller, more efficient model that learns from the teacher
Distillation provides a powerful method for a student model to learn to replicate the teacher’s intelligence—while also cutting down on size, energy use, and computing power.
How Distillation Works
In distillation, the teacher model’s outputs become the student model’s training data:
- The teacher model generates responses to a set of prompts, often including its step-by-step reasoning
- The student model is trained on these input-output pairs, learning to reproduce the teacher’s behavior
As a result, the student model learns not only the answer but the reasoning process as well.
As the diagram below depicts, the teacher model also helps the student learn nuanced inter-class relationships by providing:
- Soft labels: full probability distributions over possible outputs, rather than a single hard answer
- Relative confidence signals that reveal how similar the teacher considers different options to be
This enables the student model to produce an efficient approximation of the teacher’s performance.
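To make the soft-label idea concrete, here is a minimal NumPy sketch of the classic distillation loss (Hinton et al.'s formulation): a temperature-softened KL divergence toward the teacher, blended with ordinary cross-entropy on hard labels. The temperature, blend weight, and toy logits are illustrative assumptions, not any lab's actual recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution,
    exposing the teacher's relative confidence across classes."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Blend of two terms:
    - KL(teacher || student) on temperature-softened distributions
    - standard cross-entropy against the ground-truth hard labels
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)), axis=-1)
    p_hard = softmax(student_logits)  # temperature 1 for the hard-label term
    ce = -np.log(p_hard[np.arange(len(hard_labels)), hard_labels] + 1e-12)
    # T^2 rescaling keeps the soft-target gradients on a comparable scale
    return np.mean(alpha * (temperature ** 2) * kl + (1 - alpha) * ce)

# Toy batch: 2 examples, 3 classes
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 1.0]])
student = np.array([[2.0, 1.5, 0.5], [0.5, 2.0, 1.5]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

A student whose logits already match the teacher’s incurs zero KL penalty, so training drives the student toward the teacher’s full output distribution, not just its top answer.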
Benefits of Distillation
A distilled model offers many benefits over larger models:
- Lower computational and memory requirements
- Faster inference and lower latency
- Reduced energy use and operating costs
- Easier deployment on resource-constrained hardware
One tradeoff of distillation is that you might sacrifice a sliver of accuracy, but that’s an acceptable compromise for many use cases.
Distillation has a distinct appeal over purely fine-tuning a pretrained LLM. Fine-tuning involves training an LLM for specific tasks on specialized datasets. However, large, fine-tuned LLMs boast billions of parameters, soaking up loads of computing power (and energy bills).
Distillation offers a clever, more cost-effective route: instead of fine-tuning a massive model directly, a small student model is trained on the teacher’s outputs, reaching comparable task performance at a fraction of the compute.
Often, distillation and fine-tuning are used in tandem to achieve effective LLMs with less demand on resources.
Real-world example
GPT-4o mini is distilled from GPT-4o, which is how this smaller model hits a sweet spot between performance and resource efficiency.
Ethical Concerns and OpenAI's Accusations
If a student model’s training data consists mostly of a teacher model’s outputs, would distillers need explicit permission to harvest that material?
As useful as this technique is, distillation wades into murky ethical waters, where it becomes difficult to draw the line between knowledge transfer and blatant copyright infringement.
While neither OpenAI nor Microsoft has disclosed evidence, OpenAI accuses DeepSeek of using unfair methods to achieve its low-cost models, specifically by harvesting outputs from OpenAI’s models to train R1, in violation of OpenAI’s Terms of Use.
There is some irony in this controversy, as OpenAI itself has also been under fire for scraping people’s data without permission to train ChatGPT. OpenAI considered this to be fair use, but not everyone agreed (and that’s precisely why it was fined 15 million euros by Italy’s data protection authority last December).
With that, we could entertain the possibility that OpenAI might be trying to position itself as an advocate of data privacy with this accusation.
DeepSeek has not commented on whether it distilled OpenAI’s models. The allegation remains speculative, but we do know that DeepSeek uses distillation to create its own models.
DeepSeek’s Known Uses of Distillation
While not related to the controversy, DeepSeek has documented how it has used distillation in its technical reports. We can explore these known methods, not for the sake of investigation but to learn a bit more about distillation in action.
In its December 2024 technical report, DeepSeek shared how it used distillation to transfer knowledge from DeepSeek-R1 to DeepSeek-V2.5, yielding significant improvements on the LiveCodeBench and MATH-500 benchmarks.
While distillation boosted performance, it also increased the average response length (as the table below shows). To strike a balance between accuracy and computational efficiency, DeepSeek ended up fine-tuning these distillation settings for DeepSeek-V3.
In its January 2025 DeepSeek-R1 paper, DeepSeek showed how it combined distillation and fine-tuning to create lean models with impressive logical capabilities.
DeepSeek applied supervised fine-tuning (SFT) to open-source base models from the Qwen and Llama families (ranging from 1.5B to 70B parameters), using reasoning data generated by DeepSeek-R1.
In the table below, we can see how DeepSeek’s distilled models outperformed comparable models on reasoning-related benchmarks.
With multiple distilled checkpoints now open-sourced, DeepSeek has demonstrated that smaller models are no longer strictly limited by their size: through distillation, they can inherit large-model reasoning and achieve remarkable performance on complex tasks.
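Mechanically, this distill-then-SFT recipe boils down to collecting the teacher’s generated reasoning traces and using them as fine-tuning targets for a small student. The sketch below illustrates only the data-curation step; `query_teacher` is a hypothetical stand-in (with canned responses so the example runs), not a real DeepSeek API.

```python
# Hypothetical sketch: turning teacher generations into SFT training pairs.

def query_teacher(prompt: str) -> str:
    """Stand-in for sampling from a large teacher model (e.g., DeepSeek-R1).
    Returns canned reasoning traces here so the sketch is self-contained."""
    canned = {
        "What is 7 * 8?": "<think>7 * 8 = 56.</think> The answer is 56.",
        "Is 17 prime?": "<think>17 has no divisors besides 1 and itself."
                        "</think> Yes, 17 is prime.",
    }
    return canned[prompt]

def build_sft_dataset(prompts: list[str]) -> list[dict]:
    """Collect (prompt, teacher response) pairs; a student model would then
    be fine-tuned with ordinary cross-entropy on these targets."""
    return [{"prompt": p, "target": query_teacher(p)} for p in prompts]

pairs = build_sft_dataset(["What is 7 * 8?", "Is 17 prime?"])
```

Because the targets include the teacher’s step-by-step traces, the student absorbs the reasoning style along with the final answers, which is what makes sequence-level distillation more than simple answer copying.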
While the question of whether DeepSeek inappropriately used OpenAI data remains speculative, there’s no doubt that it is using distillation to build powerful, efficient models.
Final Verdict for DeepSeek
The jury is still out on a few factors.
We don’t know the truth about:
- Whether DeepSeek trained R1 on outputs harvested from OpenAI’s models
- Whether any such use actually violated OpenAI’s Terms of Use
- How much of R1’s performance and low cost comes from distillation versus DeepSeek’s own optimizations
Despite uncertainties and speculations, DeepSeek continues to be a remarkable example of how innovation happens.
If DeepSeek still managed to optimize its model this well using distillation on inferior chips, that’s a great achievement. In fact, even if distillation only delivered the last ten percent of the model’s optimization, color me impressed.
As AI continues to advance, we will continue to see more controversy around data ownership, intellectual property rights, and fair use. The likely truth is that those topics will remain a gray area for some time to come, and some questions may never get an answer.
The DeepSeek Controversy and the Future of AI
Despite ethical uncertainties, the next few years will certainly see an accelerated AI arms race, where:
- Distillation makes capable models ever cheaper and more accessible
- Disputes over data ownership, intellectual property rights, and fair use only intensify