So Close(d), No Matter How Far... (they claim to be Open Source)

Over the last two weeks, the entire AI legal and regulatory ecosystem has been eclipsed by Liang Wenfeng's DeepSeek-R1: how it sent shockwaves through the Magnificent Seven, splashed our Robinhood accounts red, and emerged as the most downloaded app on major app stores. Fast-forwarding past the hype, we have witnessed lawmakers maneuvering regulations to limit or completely ban its usage (here), concerns around the storage of personal information in China, and OpenAI's sudden epiphany about respecting third parties' IP (here). Without delving into the merits of these battles and geopolitical narratives, what really excites technology enthusiasts is the "Eureka" moment for the AI landscape.


DeepSeek and Open Source

DeepSeek offers several key differentiators, including comparable (and in some instances, far superior) performance, cost-effectiveness (roughly 2% of ChatGPT's cost, by some estimates), and training efficiency on lower-grade computing power (H800s). However, what stands out is High-Flyer's strategic decision to license DeepSeek-V3 and DeepSeek-R1 under the MIT License. By now, after reading a few articles and watching YouTube videos, most enthusiasts have a rudimentary understanding of what this broadly means: everyone can access the code, study how it works, and modify it for their own use.

Discussions around Open Source and AI are not new. We have long known about Musk's criticism of OpenAI for not really being "Open" (FWIW, OpenAI has no qualms about using Wikipedia data to train its LLMs), and Meta's professed commitment to Open Source through the release of Llama 2 and later Llama 3.1 405B under its Community License Agreement (here). However, DeepSeek's arrival has added fresh fuel to this discussion.

Yet, this "I'm bringing Sexy Back!" moment for Open Source raises a couple of interesting questions. Is Llama or DeepSeek really Open Source, and can it be? To figure that out, we need to take a few steps back and answer some more elementary questions:

1. What exactly is Open Source vis-à-vis AI?

2. Can AI genuinely be Open Source?

AI and Open Source

In August 2024, the Open Source Initiative (OSI) emphasized that an AI system can be considered Open Source if its terms grant "freedoms" akin to those of software, specifically, the four freedoms to:

1. Use (for any purpose, without seeking permission);

2. Study (how it works, and inspect its components);

3. Modify (for any purpose, including changing its output); and

4. Share (with others, with or without modifications, for any purpose).

Notably, these freedoms must apply both to the AI system as a whole and to each of its discrete components.

Having a definition is a positive development; however, the goblin grins behind the terms and conditions. So, what elements must be disclosed, and open to study, sharing, and modification, for a system to qualify as Open Source?

An AI system ideally includes the following components: (a) Model Weights; (b) Training Data; (c) Training Code (the scripts/algorithms used to train the model); (d) Fine-Tuning Code (the code used to adjust pre-trained models for specific workflows); and (e) AI Stacks (the technology, frameworks, and infrastructure used to facilitate the use and deployment of AI systems).


Do DeepSeek and Llama meet these criteria?

  1. License: Llama 3.1 405B is licensed under the Llama 3.1 Community License Agreement, which limits use and requires explicit prior permission for organizations with 700 Mn+ monthly active users, whereas DeepSeek-R1 is licensed under the MIT License.
  2. Model Weights and Architecture: Both Llama and DeepSeek have released their model weights, under a custom license and the MIT License, respectively.
  3. Training Data: Neither model discloses its training data, i.e., the data cannot be used, reproduced, or verified.
  4. Training Code: Llama has not released its training code, and DeepSeek has released only part of its training code, limiting users' ability to retrain the models.
  5. Fine-Tuning Code: Both models make limited fine-tuning code/tools available, i.e., users can adjust the models but not rebuild them.
  6. Stacks (Deployment Infrastructure): Llama is optimized for proprietary hardware, whereas DeepSeek is optimized for similar but more flexible hardware.

Both Llama and DeepSeek-R1 demonstrate some Open-Source characteristics, yet neither meets the OSI standard holistically. While their model weights are public, their training code and data remain closed. At best, they can be described as Checkered Open Source: providing some freedoms without complete transparency.
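The checklist above lends itself to a simple illustration. The toy script below encodes this article's (informal, non-authoritative) assessment of each model's component disclosures as data, then applies the OSI's "holistic" reading, under which every component must be open for the system to qualify. The component names and boolean values are assumptions drawn from the list above, not from any official audit.

```python
# Toy sketch: applying the OSI's "all components must be open" reading
# to the disclosure checklist discussed in this article. The booleans
# encode this article's informal assessment, not an official OSI audit.

OSI_COMPONENTS = ["weights", "training_data", "training_code",
                  "fine_tuning_code", "stack"]

models = {
    "Llama 3.1 405B": {"weights": True, "training_data": False,
                       "training_code": False, "fine_tuning_code": True,
                       "stack": False},
    "DeepSeek-R1":    {"weights": True, "training_data": False,
                       "training_code": False, "fine_tuning_code": True,
                       "stack": False},
}

def osi_holistic(disclosures: dict) -> bool:
    """True only if every component is open -- the holistic OSI reading."""
    return all(disclosures[c] for c in OSI_COMPONENTS)

for name, d in models.items():
    missing = [c for c in OSI_COMPONENTS if not d[c]]
    verdict = ("Open Source" if osi_holistic(d)
               else "Checkered Open Source (missing: " + ", ".join(missing) + ")")
    print(f"{name}: {verdict}")
```

Under these assumptions, both models print a "Checkered Open Source" verdict, which mirrors the conclusion above: partial freedoms, incomplete transparency.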

Even then, DeepSeek holds an edge over Llama, as its license imposes no purpose or prior-consent limitations on the use of its model weights.


Conclusion

While Llama and DeepSeek may not impress purists for whom the Open Source definition is binary, this imperfect release is a giant leap for innovation, as it represents a shift in the debate. It challenges the narrative that OpenAI, Anthropic, and Google have been brute-forcing, i.e., that closed, proprietary models are the only way to achieve efficiency!

What Llama started, DeepSeek has amplified by removing the purpose limitation on its model weights. This should enable researchers and small start-ups to build and improve fine-tuned models, leading to broader adoption, more competition, and a democratized AI landscape.

Critics argue that having "something" is better than "nothing", i.e., a strict Open-Source standard that no one can comply with will ultimately make itself redundant. While licensing standards must evolve and adapt to the technology, relaxing core Open-Source principles to accommodate corporate interests sets a dangerous precedent. It could allow companies to whitewash themselves by making a few components Open Source while keeping other critical components opaque.

As the debates around training data continue to be argued before different legal and regulatory forums, we should not accept its secrecy as the norm. At best, some wiggle room could be accorded, so long as the ultimate goals, i.e., democratizing innovation, ensuring better safety, and fostering wider adoption, are not compromised.

Open Source is a commitment, not a marketing tool!

