OpenAI Accuses DeepSeek of Copying Its Technology: Ethical and Legal Implications


OpenAI has accused DeepSeek of improperly using its data to train its open-source reasoning model, DeepSeek-R1.


Image source: Upstox News

In the past couple of days, there’s been controversy surrounding DeepSeek, a Chinese AI startup, and its alleged use of OpenAI’s proprietary models.


The issue surfaced after DeepSeek launched two new models, DeepSeek-V3 and DeepSeek-R1, which reportedly perform on par with OpenAI’s models at a fraction of the price.

OpenAI has accused DeepSeek of improperly using its data to train these models, which has set off a heated debate about intellectual property rights in the AI world and the ethics behind model distillation.

Model distillation, also known as knowledge distillation, is a method in machine learning where knowledge from a large, complex model (the "teacher") is transferred to a smaller, more efficient model (the "student").


A distilled model is basically a smaller model that performs similarly to the larger one but requires fewer computational resources.
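
To make the teacher-student idea concrete, here is a minimal sketch of the classic distillation loss from Hinton et al. (2015), written in PyTorch. It is illustrative only, not OpenAI’s or DeepSeek’s actual training code; the temperature and alpha defaults are arbitrary.

```python
# A minimal sketch of classic knowledge distillation, not any lab's real
# pipeline: the student learns from a blend of the teacher's softened
# output distribution (soft targets) and the ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: push the student toward the teacher's distribution.
    # Scaling by T^2 keeps gradients comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```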

If you’re interested in knowing how an OpenAI model is distilled, check out this documentation.


What Exactly Was Copied?

In the fall of 2024, Microsoft’s security researchers observed a group believed to be connected to DeepSeek extracting large amounts of data from OpenAI’s API.

This activity raised concerns that DeepSeek was using distillation to replicate OpenAI’s models without authorization. The excessive data retrieval was seen as a violation of OpenAI’s terms and conditions, which restrict the use of its API for developing competing models.

According to Mark Chen, Chief Research Officer at OpenAI, DeepSeek managed to independently find some of the core ideas OpenAI had used to build its o1 reasoning model.

Chen noted that the reaction to DeepSeek, which caused NVIDIA to lose nearly $600 billion in market value in a single day, might have been overblown.


However, I think the external response has been somewhat overblown, especially in narratives around cost. One implication of having two paradigms (pre-training and reasoning) is that we can optimize for a capability over two axes instead of one, which leads to lower costs. — Mark Chen


While OpenAI hasn't revealed all the details, it has confirmed that there's substantial evidence suggesting DeepSeek used distillation techniques to train its models.

In response, OpenAI and Microsoft have blocked access to OpenAI’s API for accounts suspected to be linked to DeepSeek. This action is part of a larger initiative by U.S. AI companies to protect their intellectual property and prevent unauthorized use of their models.

The situation has also raised national security concerns, prompting the White House to review the implications of such practices on the U.S. AI industry.

Model Distillation Is Legal

Model distillation itself is not inherently illegal. It is a widely used technique in the AI industry to create more efficient models by transferring knowledge from a larger model to a smaller one.

Take the Stanford Alpaca model as an example. Alpaca is a language model fine-tuned using supervised learning from a LLaMA 7B model on 52K instruction-following demonstrations generated from OpenAI’s text-davinci-003.


Image from Stanford Alpaca


The data generation process results in 52K unique instructions and the corresponding outputs, which cost less than $500 using the OpenAI API.

It demonstrates how distillation can be used to create smaller, more affordable models that still perform well.
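
For a sense of what that data-generation step looks like in practice, here is a minimal sketch using OpenAI’s official Python client. The prompt and model name are placeholders, not Alpaca’s actual pipeline (Alpaca used text-davinci-003, which has since been deprecated).

```python
# A hedged sketch of Alpaca-style data generation: ask a teacher model
# for a completion, then store the (instruction, output) pair as one
# fine-tuning example for the student. Model and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

instruction = "Explain model distillation in two sentences."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # stand-in; Alpaca used text-davinci-003
    messages=[{"role": "user", "content": instruction}],
)

example = {
    "instruction": instruction,
    "output": response.choices[0].message.content,
}

# Repeating this ~52K times yields an Alpaca-style instruction dataset.
with open("distillation_data.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```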

In fact, if you read DeepSeek’s whitepaper, the smaller R1 variants are themselves distilled models, built by transferring DeepSeek-R1’s reasoning into Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) backbones:

To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.


Based on DeepSeek’s findings, it appears that this straightforward distillation method significantly enhances the reasoning abilities of smaller models.

The controversy stems from allegations that DeepSeek used OpenAI’s model outputs to fine-tune their own models, which may be against OpenAI’s terms of service. This raises questions about fair use, data ownership, and the competitive landscape in the AI industry.


Running DeepSeek’s API Requires OpenAI’s Libraries


To use DeepSeek’s API from Node.js, you need to run ‘npm install openai.’


Image source: DeepSeek



Yep, you read that right. DeepSeek works with OpenAI’s client libraries! This is possible because DeepSeek’s REST API is fully compatible with OpenAI’s API.

Quite an interesting turn of events in the AI world! This compatibility has a few practical upsides:

  1. DeepSeek avoided spending weeks building Node.js and Python client libraries by reusing OpenAI’s code.
  2. Developers using OpenAI can easily try or switch to DeepSeek by just changing the base URL and API key (see the sketch after this list).
  3. If DeepSeek ever needs to make changes, they can simply fork the library and replace OpenAI with DeepSeek.
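
Here is a minimal sketch of point 2 using OpenAI’s official Python client (the Node.js client works the same way). The base URL and the deepseek-reasoner model name follow DeepSeek’s API docs, but treat them as subject to change.

```python
# Pointing OpenAI's official Python client at DeepSeek's
# OpenAI-compatible REST API. Only the API key and base URL change;
# the calling code is identical to a regular OpenAI integration.
from openai import OpenAI

client = OpenAI(
    api_key="<your-deepseek-api-key>",    # issued by DeepSeek, not OpenAI
    base_url="https://api.deepseek.com",  # per DeepSeek's API docs
)

response = client.chat.completions.create(
    model="deepseek-reasoner",            # DeepSeek's API name for R1
    messages=[{"role": "user", "content": "Hello from an OpenAI client!"}],
)
print(response.choices[0].message.content)
```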



As a developer, this is a good thing, and I don’t see it as a huge problem because this is a common practice for LLM providers and aggregators. OpenRouter, Ollama, DeepInfra, and a bunch of others do this too.

In terms of API access, DeepSeek claims that you can use the R1 API at a significantly lower cost than OpenAI’s offerings:


  • $0.14 per million input tokens (cache hit)
  • $0.55 per million input tokens (cache miss)
  • $2.19 per million output tokens


The cost for output tokens is nearly 30 times lower than o1’s $60 per million output tokens. This represents a significant reduction in expenses for companies managing extensive AI operations.
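
As a quick back-of-the-envelope check using the list prices quoted above (the 10-million-token workload is an arbitrary example):

```python
# Cost comparison for a hypothetical 10M-output-token workload,
# using the per-million list prices quoted above (USD).
O1_OUTPUT_PRICE = 60.00   # o1, per 1M output tokens
R1_OUTPUT_PRICE = 2.19    # DeepSeek R1, per 1M output tokens

millions_of_tokens = 10
print(f"o1: ${O1_OUTPUT_PRICE * millions_of_tokens:,.2f}")        # $600.00
print(f"R1: ${R1_OUTPUT_PRICE * millions_of_tokens:,.2f}")        # $21.90
print(f"R1 is {O1_OUTPUT_PRICE / R1_OUTPUT_PRICE:.1f}x cheaper")  # 27.4x
```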


Take a look at this visual comparison of DeepSeek’s R1 and OpenAI’s models.


Image source: DeepSeek


Switching to the R1 API would mean huge savings. You can learn more about DeepSeek’s API access here.


Final Remarks

DeepSeek was barely known outside research circles until last month, when it launched its V3 model. Since then, it has caused AI stocks to drop and even been called a “competitor” by OpenAI’s CEO. It’s unclear how things will play out for DeepSeek in the coming months, but it has caught the attention of both the public and major AI labs.

Ironically, it feels odd that OpenAI is accusing DeepSeek of IP theft, given its own history of copyright disputes. OpenAI scraped massive amounts of data from the internet to train its models, including copyrighted material, without seeking permission. This practice has drawn lawsuits from authors such as George R.R. Martin, and public complaints from Elon Musk over the use of Twitter data.

OpenAI may become even more closed off as a result. Remember when Musk shut down free API access to X (formerly Twitter) over data scraping? The chances of OpenAI doing the same are slim, but it’s not out of the question.

