Open Source AI

This week's edition focused on open source, which has been top of mind since my visit to the All Things Open conference. If you want to read the whole newsletter, I suggest you check out the full edition of The Artificially Intelligent Enterprise.


The Open Source Initiative's (OSI) release of version 1.0 of the Open Source AI Definition (OSAID) at the All Things Open 2024 conference in Raleigh, North Carolina, may be well-intentioned, but it raises several practical and strategic concerns. The definition is touted as a breakthrough after years of collaboration among tech giants, yet official endorsements from those companies are noticeably missing from the endorsements page.

Brief Overview of Open Source

Think of open source like sharing a recipe: instead of keeping it secret, anyone can see, use, and improve it. Open source software is code available to everyone, allowing users to understand, modify, and enhance it. This is different from most software, where only the creator controls how it works.

Why Should You Care?

  1. Transparency and Trust: Open source software is open for anyone to inspect, so there are no hidden surprises or privacy concerns. This transparency builds user trust.
  2. Freedom and Control: With open source, you’re not tied to one company’s limitations. If you need to make changes or customize the software, you can do it (or have someone do it for you).
  3. Community and Quality: Open source thrives on global collaboration. Anyone can suggest improvements, meaning bugs are fixed quickly, new features are added, and quality improves over time.
  4. Cost-Effectiveness: Open source software is often free to use, providing high-quality tools without licensing fees, making it a great choice for individuals and organizations.
  5. Shared Innovation: Open source is a powerful strategy for sharing the development load on “non-differentiating” technology: the tools everyone needs but that don’t set companies apart. By collaborating on foundational technologies, industries can move forward faster, focusing their unique resources on what differentiates them.

Open Source in Your Daily Life

Open source is all around you:

  • Smartphones: Android, one of the most popular operating systems, is open source.
  • Web Browsers: Firefox and Chromium (used in Chrome) are open source, allowing secure browsing with a strong privacy focus.
  • Streaming Services: Behind the scenes, open source software powers Netflix, YouTube, and other streaming services, keeping them fast and reliable.
  • Smart Home Devices: Many smart home systems and routers use open source software to ensure flexibility and security.

How Open Source Set the Stage for AI

The LAMP stack (Linux, Apache, MySQL, and PHP) exemplifies how open source has allowed industries to innovate faster. By building the web’s foundation, LAMP helped companies and developers worldwide focus on creating new digital experiences without reinventing core technology. This shared framework enabled the rapid development of web services and cloud computing platforms, which are now crucial for artificial intelligence (AI).

Today, open source is helping AI progress in similar ways. Tools like TensorFlow and PyTorch, along with openly released models such as OpenAI’s Whisper, provide shared foundations that developers and companies can build upon, allowing the industry to move forward faster while supporting collaboration, transparency, and innovation.

By choosing open source, you’re supporting a system that accelerates development, shares knowledge, and prioritizes openness for the good of everyone.

Open Source has proven to be a powerful catalyst for innovation. It demonstrates that immense benefits accrue to everyone by removing barriers to learning, using, sharing, and improving software systems. These benefits arise from licenses that adhere to the Open Source Definition (OSD), granting key freedoms to use, study, modify, and distribute software without excessive restriction.

The same freedoms are essential for AI to enable developers, deployers, and end users to benefit from enhanced autonomy, transparency, frictionless reuse, and collaborative improvement. However, with the rise of large language models (LLMs) like Meta’s Llama 3, it’s becoming increasingly clear that applying traditional open source licensing to AI introduces unique challenges—revealing a need for an adapted framework tailored to AI.

The Complexity of Open Source in AI: Square Peg, Round Hole

The Open Source Definition was developed with software in mind, and it works best for applications with standard dependencies, accessible codebases, and achievable reproducibility. LLMs diverge from these norms in ways that make it challenging to apply the OSD to them:

  1. Data Transparency: One of the most significant hurdles is data transparency. For an AI model to be reproducible in the open source sense, it needs a complete Training Data Bill of Materials (TDBOM), similar to the Software Bill of Materials (SBOM) that is becoming popular in software supply chains. A TDBOM covers the data’s provenance, selection criteria, processing steps, and labeling processes. Releasing this information is difficult for many LLMs due to the proprietary nature of data sources, privacy issues, and the vast scale of data involved. Beyond that, it can be very difficult to identify and confirm the source of the vast amount of data used to train foundation models.
  2. Full Source Code: In traditional open source, access to the full source code facilitates modification and reproducibility. With LLMs, providing the complete code used in data preprocessing, tokenization, training configurations, and fine-tuning processes is not only resource-intensive but often reveals sensitive internal methodologies or optimizations, which companies guard as proprietary.
  3. Parameters and Computational Barriers: Traditional open source licensing assumes accessible software that users can run and modify with standard resources. In contrast, LLMs require extensive computational power to reproduce, even with access to model weights and architecture. This makes “openness” in AI much more resource-intensive and exclusive, limiting the practical freedoms intended by the OSD.
  4. Legal and Reproducibility Issues: Unlike with traditional software, simply releasing weights and source code doesn’t necessarily make a system accessible. Without the underlying data and a way to reproduce the results fully, LLMs fall short of open source’s transparency and accessibility goals.
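
To make the TDBOM idea from the first item more concrete, here is a minimal sketch in Python of what one such record might look like. There is no standard TDBOM schema implied here: the class names, field names, and example datasets are all assumptions, chosen only to mirror the elements listed above (provenance, selection criteria, processing steps, labeling process).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DatasetEntry:
    """One training data source in a hypothetical TDBOM."""
    name: str
    provenance: str              # where the data came from
    selection_criteria: str      # why and how it was chosen
    processing_steps: List[str] = field(default_factory=list)
    labeling_process: str = "none"
    license: str = "unknown"


@dataclass
class TrainingDataBOM:
    """A hypothetical bill of materials for one model's training data."""
    model_name: str
    datasets: List[DatasetEntry] = field(default_factory=list)

    def undocumented(self) -> List[str]:
        """Return names of datasets with missing provenance or license."""
        return [d.name for d in self.datasets
                if not d.provenance or d.license == "unknown"]

    def to_json(self) -> str:
        """Serialize the whole record so it can be published with a model."""
        return json.dumps(asdict(self), indent=2)


# Illustrative data only; these datasets are made up for the example.
bom = TrainingDataBOM(
    model_name="example-llm",
    datasets=[
        DatasetEntry(
            name="public-encyclopedia",
            provenance="public web dump",
            selection_criteria="English articles only",
            processing_steps=["deduplicate", "strip markup"],
            license="CC-BY-SA-4.0",
        ),
        DatasetEntry(
            name="scraped-forum-posts",
            provenance="",
            selection_criteria="posts over 50 words",
        ),
    ],
)

print(bom.undocumented())  # ['scraped-forum-posts']
```

Even a record this small shows where the difficulty lies: the second entry cannot state where its data came from or under what terms, which is exactly the gap that makes full reproducibility hard for large models.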

Case in Point: Meta’s Llama 3 License

Meta’s Llama 3 Community License Agreement offers a prominent example of how traditional open source licensing falls short in AI applications. While Meta has labeled Llama 3 as “open source,” its license diverges from OSD norms in several key ways, making the term “open source” somewhat misleading:

  1. Controlled Redistribution and Use: Meta’s license grants a non-exclusive, non-transferable, royalty-free license but mandates prominent branding (“Built with Meta Llama 3”) on any derivative works or products using Llama 3. It also restricts using the model to train or improve any other LLM. Such conditions are inconsistent with standard open source licenses, which typically avoid branding mandates and restrictions on derivative use.
  2. Proprietary Data and Reproducibility: The license allows users access to model weights and code but lacks data transparency—without which true reproducibility isn’t feasible. Open source emphasizes accessibility, but without a TDBOM, reproducing Llama 3’s training process is challenging, if not impossible.
  3. Limitations on Usage: Meta’s license restricts specific applications of the Llama 3 model, including its use in military, critical infrastructure, and other high-risk environments, as outlined in its Acceptable Use Policy. Such restrictions highlight the gap between open source’s unrestrictive ethos and LLM-specific requirements.
  4. Additional Licensing Complexity: Llama 3’s license withholds its rights from any organization whose products or services had more than 700 million monthly active users as of the model’s release date, unless Meta grants that organization a separate license. This condition further demonstrates the controlled nature of Llama’s distribution and falls short of true open source principles.

Lessons from the Past: Custom Licenses and Market Confusion

Meta is not the first to adopt a unique license structure and label it as “open source,” a practice that has historically caused market confusion. In the late 1990s, Sun Microsystems’ “Community Source License” for Java introduced additional restrictions, leading many to question its “open source” label. Similarly, SugarCRM’s “Sugar Public License” introduced restrictive terms inconsistent with the OSD, ultimately causing a backlash in the developer community. In both cases, these licenses sought to balance proprietary control with open source-like freedoms, creating confusion and fracturing trust in open source definitions. Meta’s “Community License” for Llama 3, by adding restrictions on use and redistribution, risks repeating these past mistakes, potentially confusing users and diluting the open source label.

Open Source Clarity: Avoiding Legal and Compliance Pitfalls

Many companies understand open source as software with specific freedoms and responsibilities, as defined by recognized bodies like the Open Source Initiative (OSI). This clarity has allowed developers, legal teams, and business leaders to use, modify, and distribute software within a well-defined legal framework for decades. However, as new AI models and frameworks emerge that do not comply with traditional open source definitions, the shift can lead to confusion or unintentional legal exposure.

The Risk of “Muddied” Open Source Definitions

  1. Legal Ambiguity: When frameworks position themselves as open source without adhering to standard OSI definitions, legal departments face uncertainty. Traditional open source offers a consistent understanding of licensing terms and usage rights, but new interpretations may carry limitations or conditions that could surprise compliance and legal teams. For companies with strict governance over software use, this can introduce unforeseen risks.
  2. Developer Implications: For developers accustomed to OSI-compliant open source, ambiguous licensing can lead to practices that might inadvertently violate usage terms. Developers rely on standard open source norms to integrate, modify, or distribute software without needing extensive legal reviews. Non-standard definitions can complicate this understanding, requiring additional time and legal oversight.
  3. Risk of Non-Compliance: Companies that misunderstand or misinterpret licensing could unintentionally violate terms, leading to potential legal challenges, reputational harm, and fines. In regulated industries, this could also disrupt compliance audits or limit flexibility with internal processes.

Clear Paths Forward for Legal and Development Teams

To navigate these complexities, legal and development teams should focus on tools that strictly adhere to recognized open source definitions or collaborate closely to review the terms of newer frameworks. Ensuring that these frameworks are used in a compliant and legally sound way requires aligning with a company’s open source policies or adjusting those policies to account for this evolving category of AI tooling.
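
One way a team might put this into practice is an automated gate that flags any dependency or model whose license is not on an approved list, so it goes to legal review before use. The Python sketch below is only an illustration: the approved set, the inventory, and the function name are hypothetical, and `LicenseRef-Llama-3-Community` is a placeholder label rather than an official license identifier.

```python
from typing import Dict, List

# Hypothetical company policy: licenses pre-approved for use without review.
APPROVED_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def needs_legal_review(inventory: Dict[str, str]) -> List[str]:
    """Return the components whose license is not on the approved list."""
    return sorted(
        name for name, license_id in inventory.items()
        if license_id not in APPROVED_LICENSES
    )


# Illustrative inventory mapping each component to its license label.
inventory = {
    "pytorch": "BSD-3-Clause",
    "tensorflow": "Apache-2.0",
    "llama-3-weights": "LicenseRef-Llama-3-Community",  # placeholder label
}

flagged = needs_legal_review(inventory)
print(flagged)  # ['llama-3-weights']
```

The point of the design is that a custom license never passes silently: anything outside the familiar set is routed to the people who can read its terms.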

Proposing a New Approach: “Responsible AI Openness”

Rather than forcing AI models like Llama into the open source category, the industry could benefit from a new framework that upholds core open source values while accommodating AI-specific needs. A “Responsible AI Openness” framework could provide a more realistic and transparent approach by focusing on the following elements:

  1. Transparent Methodology and TDBOM: This approach would mandate disclosure of data methodologies and high-level data attributes without requiring full, unrestricted data access. A TDBOM could enable sufficient reproducibility for researchers and developers to understand and improve AI models while respecting proprietary boundaries.
  2. Open Architecture and Weights: LLMs could be released with access to model architecture, weights, and configuration files, but only for non-commercial use, to encourage research and responsible development without opening proprietary components to misuse.
  3. Ethical Usage Requirements: This framework could explicitly encourage ethical usage and establish limits for high-risk applications without restricting core freedoms, balancing responsible use and openness.
  4. Community Governance and Feedback: Similar to open source projects, a community-driven governance model could be implemented to address model evolution, facilitate responsible development, and avoid unintended consequences.

Conclusion: Reimagining Open Source for AI

For AI to benefit from open source’s foundational principles, it needs a framework that respects the unique challenges of LLMs. Calling models like Meta’s Llama 3 “open source” despite restrictions confuses the market and risks diminishing trust in the open source community. Instead, a “Responsible AI Openness” approach, focused on transparency, ethical usage, and responsible distribution, would be better suited to the needs of large language models while preserving the spirit of open source.

By embracing a new licensing model that accommodates AI’s complex requirements, we can foster collaboration, innovation, and transparency without misusing the open source label—paving the way for a more accessible and ethically responsible AI ecosystem.

Anton Makohonov

CTO at if.team | Team Lead at Webnauts

1 week ago

It can be dangerous to open big models, because they can be used for attacks or to create something harmful. But on the other hand, it can help AI develop faster.

Lian Wee LOO

Business Operations Strategist | Digital Transformation Evangelist | AI Enthusiast | Tech Gadgets Lover | Foodie | Kindness

1 week ago

The OSI’s OSAID release is a step forward, but more clarity on transparency is needed.

Deugo Harold

Power engineer || Energy Market || Six sigma Green Belt SSGB

1 week ago

There’s no denying the need for AI-specific definitions in open source. This is complex.

Aliya Jasrai

Helping startups and existing businesses transform their passion into profitable ventures, creating strong brands that resonate with customers and drive sustainable growth | Increase awareness | Global reach

1 week ago

Traditional open-source models don’t address the complexities of proprietary data in AI.
