OpenAI vs. DeepSeek: The Distillation Battle Shaping Next-Gen AI

The latest hype in the AI sphere centers on accusations between two heavyweights: OpenAI and DeepSeek.

Released on January 20th, DeepSeek’s R1 made history within a week, triggering one of the largest single-day losses in US stock market history (a bad day for NVIDIA).

R1’s power play was that it performed nearly as well as OpenAI’s most advanced models while allegedly being 1,000x cheaper to build than GPT-4.

Now, we have an idea of just how they did it.

OpenAI recently emerged with allegations that DeepSeek copied its models, specifically through a process called model distillation. Model distillation is a deep learning technique that leverages powerful AI systems to create smaller, more efficient ones.

Distillation isn’t an evil in its own right, but it does cross into some ethical gray areas, which is precisely where OpenAI has targeted its claims. OpenAI alleges that DeepSeek leveraged distillation to train R1 in a way that violated OpenAI’s Terms of Use.

The thing is, we know for a fact that DeepSeek uses distillation... but so do others (including OpenAI).

Let’s take an objective look at the distillation controversy today.

We’ll cover:

  • What model distillation is (and how it works)
  • How DeepSeek uses distillation
  • What OpenAI's accusations mean for the future of AI

Let’s dive in.


What Is Model Distillation?

Outside of the context of LLMs, distillation is all about extracting the most valuable parts from a messy mixture. Think of how we turn seawater into fresh drinking water: boil the water off, condense the vapor, and leave the salt behind. The key idea is separating what’s essential from the excess.

Model distillation draws this parallel in AI—where essential knowledge is extracted from a massive AI model and passed down to a smaller one.


Model distillation involves knowledge transfer between:

  • A large, supercharged model (the teacher model)
  • A smaller, more efficient model (the student model)

Distillation provides a powerful method for a student model to learn to replicate the teacher’s intelligence, while also cutting down on size, energy use, and computing power.

How Distillation Works


Diagram: the student model receives the teacher model’s input data, label, and rationale


In distillation, the teacher model’s outputs become the student model’s training data:

  • The teacher model processes input data and generates a label (correct answer) and a rationale (explanation).
  • The student model then learns from this enhanced data, capturing both the decisions and reasoning of the teacher.

As a result, the student model learns not only the answer but also the reasoning process.

As the diagram below depicts, the teacher model also helps the student learn nuanced inter-class relationships by providing:

  • Soft labels: the teacher’s full probability distribution across classes
  • Hard labels: the single correct answers


Diagram: the teacher model helps a student model learn inter-class relationships

This enables the student model to yield an efficient approximation of the teacher’s performance.
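To make the soft-label idea concrete, here is a minimal sketch of the classic distillation loss (after Hinton et al., 2015), written in PyTorch. The temperature T and mixing weight alpha are illustrative hyperparameters, not values from any particular model’s recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T=2.0, alpha=0.5):
    """Blend the soft-label and hard-label objectives.

    student_logits, teacher_logits: (batch, num_classes) raw scores
    hard_labels: (batch,) ground-truth class indices
    T: temperature that softens both distributions
    alpha: weight on the soft-label (teacher-matching) term
    """
    # Soft labels: match the teacher's full probability distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients, as in the original paper

    # Hard labels: ordinary cross-entropy against the correct answers.
    hard = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft + (1 - alpha) * hard

# Tiny usage example with random tensors:
s = torch.randn(4, 10)           # student logits
t = torch.randn(4, 10)           # teacher logits
y = torch.randint(0, 10, (4,))   # ground-truth classes
loss = distillation_loss(s, t, y)
```

A higher temperature flattens both distributions, exposing more of the teacher’s knowledge about how classes relate to one another (e.g., that a cat looks more like a fox than a truck).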

Benefits of Distillation

A distilled model offers many benefits over larger models:

  • Efficiency and faster inference: Distilled models provide faster performance with lower latency (while retaining intelligence).
  • Reduced computational costs: Distilled models require less hardware and energy compared to large models.
  • Deployability on edge devices: These models can be deployed on devices with limited computing resources, such as smartphones and IoT devices.
  • Faster development of specialized LLMs: We can fine-tune distilled versions of general-purpose LLMs into smaller, domain-specific models for areas like medicine, coding, etc.

One tradeoff of distillation is that you might sacrifice a sliver of accuracy, but that’s an acceptable compromise for many use cases.

Distillation has a distinct appeal over purely fine-tuning a pretrained LLM. Fine-tuning involves training an LLM for specific tasks on specialized datasets. However, large, fine-tuned LLMs boast billions of parameters, soaking up loads of computing power (and energy bills).

Distillation offers a clever route that is more cost-effective:

  • Extract the model’s core intelligence and trim the excess.
  • Create a leaner, distilled model that’s faster and cheaper to deploy.
  • Fine-tune the smaller, distilled model for specialized use cases if needed.

Often, distillation and fine-tuning are used in tandem to achieve effective LLMs with less demand on resources.
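To make this route concrete, here is a minimal sketch of the first step, sequence-level distillation: collecting a teacher’s labels and rationales into a dataset the student can later be fine-tuned on. Note that query_teacher is a hypothetical placeholder, not a real API; a real pipeline would call whichever large model you have legitimate access to.

```python
import json

def query_teacher(prompt: str) -> dict:
    # Hypothetical placeholder: a real pipeline would call a large
    # teacher model here and parse the answer and reasoning out of its
    # response. This stub returns dummy text so the sketch runs as-is.
    return {"rationale": f"(teacher reasoning for: {prompt})",
            "answer": "(teacher answer)"}

prompts = ["What is 17 * 24?", "Explain why the sky is blue."]

# Each JSONL record pairs the input with the teacher's label AND its
# rationale, so the student later learns the process, not just the answer.
with open("distill_sft.jsonl", "w") as f:
    for prompt in prompts:
        out = query_teacher(prompt)
        f.write(json.dumps({"prompt": prompt,
                            "rationale": out["rationale"],
                            "answer": out["answer"]}) + "\n")
```

Running this kind of harvesting against a commercial API at scale is exactly the behavior that Terms of Use typically restrict, which is where the controversy begins.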

Real-world example

GPT-4o mini is reportedly distilled from GPT-4o, which would explain how this smaller model hits a sweet spot between performance and resource efficiency.

Ethical Concerns and OpenAI's Accusations

If a student model’s training data consists mostly of a teacher model’s outputs, would distillers need explicit permission to harvest that material?

As useful as this technique is, distillation wades into murky ethical waters, where it becomes difficult to draw the line between knowledge transfer and blatant copyright infringement.

While neither OpenAI nor Microsoft has disclosed evidence, OpenAI is accusing DeepSeek of using unfair methods to achieve its low-cost models:

  • DeepSeek allegedly used multiple unrelated accounts to query OpenAI’s models and perform distillation at scale.
  • In late 2024, Microsoft security researchers reportedly detected individuals exfiltrating data through the OpenAI API, activity they believe to be linked to DeepSeek.

There is some irony in this controversy, as OpenAI itself has been under fire for scraping people’s data without permission to train ChatGPT. OpenAI considered this fair use, but not everyone agreed (and that’s precisely why OpenAI was fined 15 million euros by Italy’s data-protection watchdog last December).

With that, we could entertain the possibility that OpenAI might be trying to position itself as an advocate of data privacy with this accusation.

DeepSeek has not commented on whether they leveraged OpenAI through distillation. This controversy is largely speculative, but we do know that DeepSeek does use distillation to create its models.

DeepSeek’s Known Uses of Distillation

While not related to the controversy, DeepSeek has documented how it has used distillation in its technical reports. We can explore these known methods, not for the sake of investigation but to learn a bit more about distillation in action.

In its December 2024 technical report, DeepSeek shared how it used distillation to transfer knowledge from DeepSeek-R1 to DeepSeek-V2.5. This approach yielded significant improvements on the LiveCodeBench and MATH-500 benchmarks.

While distillation boosted performance, it also increased the average response length (as the table below shows). To strike a balance between accuracy and computational efficiency, DeepSeek tuned these distillation settings for DeepSeek-V3.


Table: Performance at the cost of response length

In its January 2025 DeepSeek-R1 paper, DeepSeek showed how it combined distillation and fine-tuning to create lean models with impressive logical capabilities.

DeepSeek distilled R1’s reasoning by applying supervised fine-tuning (SFT), on data generated by R1, to models based on open-source architectures ranging from 1.5B to 70B parameters (see the sketch after this list):

  • Qwen (Qwen, 2024b)
  • Llama (AI@Meta, 2024)
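Below is a minimal sketch of what this kind of SFT on distilled reasoning traces can look like, using Hugging Face transformers. This is not DeepSeek’s actual pipeline: the base model name and the distill_sft.jsonl file (produced by the earlier generation sketch) are assumptions for illustration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "Qwen/Qwen2.5-1.5B"  # assumption: any small open base model works here

tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token is None:          # some tokenizers ship without a pad token
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Reuse the teacher-generated dataset from the earlier sketch.
ds = load_dataset("json", data_files="distill_sft.jsonl")["train"]

def to_features(row):
    # Fold prompt, rationale, and answer into one causal-LM training sequence.
    text = row["prompt"] + "\n" + row["rationale"] + "\n" + row["answer"]
    return tok(text, truncation=True, max_length=2048)

ds = ds.map(to_features, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilled-student",
                           per_device_train_batch_size=1,
                           num_train_epochs=2),
    train_dataset=ds,
    # mlm=False gives the standard next-token prediction objective.
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

Because the teacher’s rationale sits inside each training sequence, the student is optimized to reproduce the reasoning chain token by token, not just the final answer.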

In the table below, we can see how DeepSeek’s distilled models outperformed comparable models on reasoning-related benchmarks.


Table: benchmark comparisons

With multiple distilled checkpoints now open-sourced, DeepSeek demonstrated that smaller models are no longer as limited by their capacity: through distillation, they can inherit large-model reasoning and achieve remarkable performance across complex tasks.

While the question of whether DeepSeek inappropriately used OpenAI data remains speculative, there’s no doubt that it is using distillation to create powerful, efficient models.

Final Verdict for DeepSeek

The jury is still out on a few factors.

We don’t know the truth about:

  • Whether DeepSeek was indeed able to train R1 so cheaply
  • To what extent R1 used distillation

Despite uncertainties and speculations, DeepSeek continues to be a remarkable example of how innovation happens.

If DeepSeek was still able to optimize its model so well using distillation on inferior chips, that’s a great achievement. In fact, even if distillation only accounted for the last ten percent of the model’s optimization, color me impressed.

As AI continues to advance, we will continue to see more controversy around data ownership, intellectual property rights, and fair use. The likely truth is that those topics will remain a gray area for some time to come, and some questions may never get an answer.

The DeepSeek Controversy and the Future of AI

Despite ethical uncertainties, the next few years will certainly see an accelerated AI arms race, where:

  • The solution is not in trying to restrict access to infrastructure (as the US attempted with the chip ban against China).
  • The solution is to find faster, more efficient ways to create powerful and specialized LLMs.
