Open-Source AI Models: Navigating the Landscape of Transparency and Licensing
Darren Culbreath
Generative AI Leader / Digital Transformation & Cloud Modernization
A lot has changed since my March 2023 article on Moving from Search Engines to LLMs.
As leaders, we are witnessing a transformative shift in the artificial intelligence landscape. The emergence of open-source Large Language Models (LLMs) is reshaping how businesses approach AI development and deployment, presenting opportunities and challenges that demand our attention.
The Paradigm Shift
Historically, LLMs have been the domain of major tech companies with substantial resources. However, the open-source movement is democratizing access to these powerful AI tools and bringing several key advantages to the forefront.
Strategic Implementation
To leverage open-source LLMs effectively, organizations need a clear-eyed view of the licensing and transparency landscape discussed below.
As large tech companies vie for dominance in the AI landscape, the question of what constitutes true "openness" has become increasingly complex. [2,11]
The Shifting Landscape of Open-Source Licensing
The tech industry has long embraced common principles and well-understood open-source licenses approved by the Open Source Initiative (OSI). In recent years, however, there has been a growing divergence from this paradigm, with some companies shifting to commercial licenses or "open-ish" licenses that do not fully align with the OSI's definition of open source.
A prominent example is the set of large language models (LLMs) that are loosely described as "open source" even though their licenses do not meet the OSI's principles.
The issue of "open-ish" licenses has been particularly prevalent in the AI community. Companies may publish a model's architecture and weights but withhold the training data and code, or they may release the trained weights under a license that prohibits commercial use or restricts derivative works. This ambiguity around what is truly "open" can hinder AI adoption and create challenges for entrepreneurs and developers. [2]
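To make the license-diligence point concrete, here is a minimal sketch, assuming the huggingface_hub client library, of checking the license a publisher has declared for a model before adopting it. The repo id and the short allow-list are illustrative placeholders, and a declared tag is never a substitute for reading the actual license text.

```python
# A minimal sketch, assuming the huggingface_hub client library, of checking
# the license a publisher has declared for a model before adopting it.
# The repo id and the allow-list below are illustrative placeholders; a
# declared tag is no substitute for reading the actual license text.
from huggingface_hub import model_info

OSI_APPROVED = {"apache-2.0", "mit", "bsd-3-clause"}  # deliberately partial list

def declared_license(repo_id: str):
    """Return the license tag declared on the model's Hub card, if any."""
    info = model_info(repo_id)
    for tag in info.tags or []:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None

if __name__ == "__main__":
    repo = "org/model-name"  # placeholder; replace with a real repo id
    lic = declared_license(repo)
    if lic is None:
        print(f"{repo}: no license declared; treat as all rights reserved")
    elif lic in OSI_APPROVED:
        print(f"{repo}: declared license '{lic}' is on the allow-list")
    else:
        print(f"{repo}: declared license '{lic}' needs legal review")
```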
Navigating the Spectrum of Openness
The concept of "open source" has evolved, and projects may adopt various levels of openness. Hacker Noon's article "Shades of Open Source" explains that the value of open source manifests in different ways, and that a lack of assured independence can pose risks to ecosystem partners. [3]
Some popular open-source projects are closely tied to particular vendors, which can create vendor dependence. This has led to a more nuanced understanding of what "open" means in the context of software development and AI models. [3]
The Importance of Transparency and Reproducibility
Indeed, truly open AI models, where all the components, including training data, code, weights, architecture, technical reports, and evaluation code, are released under permissive licenses, promote transparency, reproducibility, and collaboration in developing and applying large language models. [2]
As Bernard Marr's article highlights, the split over the definition of "open" has fueled disagreement around some high-profile LLM releases, such as Meta's Llama and xAI's Grok. While these models make their weights and architecture publicly available, they do not reveal all of the training code or data, leading to skepticism about their open-source status. [4]
The Emergence of a "Herd of Models"
In response to the dominance of closed-source models like ChatGPT, researchers have explored the concept of a "herd of models" – a framework that leverages widely available open-source technology to compete against proprietary models. [5]
This approach aims to address the issues of access and scale by creating model repositories where users can upload model weights and quantized versions of models trained using different paradigms. By having a herd of open-source models, researchers can cover a significant portion of the deficit when proprietary models cannot answer a query. [5,20]
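As an illustration only (not the paper's actual implementation), the routing idea can be sketched as a priority-ordered walk over a registry of open models, accepting the first answer that clears a confidence threshold. The member names, confidence scorers, and threshold below are all assumed placeholders.

```python
# An illustrative sketch (not the paper's implementation) of the "herd" idea:
# try a sequence of open models and accept the first answer that clears a
# confidence threshold. The models here are stand-in callables; in practice
# each entry would wrap a locally hosted or quantized open-weights model.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class HerdMember:
    name: str
    generate: Callable[[str], str]           # returns an answer for a query
    confidence: Callable[[str, str], float]  # scores (query, answer) in [0, 1]

def answer_with_herd(query: str, herd: list[HerdMember],
                     threshold: float = 0.7) -> Optional[str]:
    """Walk the herd in priority order; return the first confident answer."""
    for member in herd:
        candidate = member.generate(query)
        if member.confidence(query, candidate) >= threshold:
            return f"[{member.name}] {candidate}"
    return None  # no member was confident; escalate or refuse

# Toy usage with dummy members standing in for real open models.
herd = [
    HerdMember("small-quantized-model", lambda q: "short answer",
               lambda q, a: 0.4),
    HerdMember("larger-open-model", lambda q: "detailed answer",
               lambda q, a: 0.9),
]
print(answer_with_herd("What license does model X use?", herd))
```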
The Risks and Challenges of Open-Source AI
While open-source AI models offer many benefits, they also carry risks and challenges that must be considered. As CIO's article highlights, just because a model is open source doesn't necessarily mean it provides the same level of transparency and information about its background and development. [6]
When a company downloads a model for its own use, that copy becomes even further removed from its sources, and the developers working with it may not be aware of fixes or issues that have since been addressed in the original base model. [6]
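One pragmatic mitigation is to pin and record provenance at download time. Below is a minimal sketch, assuming huggingface_hub's snapshot_download, that writes a small manifest tying an in-house copy to a specific upstream repository and revision; the repo id, revision, and manifest format are assumptions, not an established standard.

```python
# A minimal sketch of recording provenance when a model is pulled in-house,
# so downstream teams can trace their copy back to an exact upstream revision.
# The repo id, revision, and manifest layout are illustrative assumptions.
import json
import time
from huggingface_hub import snapshot_download

def download_with_manifest(repo_id: str, revision: str, manifest_path: str) -> str:
    """Download a pinned model snapshot and record where it came from."""
    local_dir = snapshot_download(repo_id, revision=revision)
    manifest = {
        "base_model": repo_id,
        "revision": revision,  # pin the exact upstream revision or commit
        "downloaded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "local_path": local_dir,
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return local_dir

# Example call (placeholder repo id and revision):
# download_with_manifest("org/model-name", "main", "model_manifest.json")
```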
The Debate Over Code Generation and AI Assistance
Using large language models (LLMs) for code generation has also sparked debate within the open-source community. Some FOSS (Free and Open-Source Software) projects, such as Gentoo Linux and NetBSD, have banned code generated with the assistance of LLM tools, citing concerns over license violations and ownership. [7,8]
The Debian project, however, has decided against joining these bans, recognizing the potential benefits of AI-assisted code generation while acknowledging the need to address the associated risks. [8]
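Projects that allow AI assistance under review, rather than banning it outright, could for example require contributors to declare it in commit metadata. The sketch below scans git history for a hypothetical "Assisted-by:" trailer and flags matching commits for license review; the trailer name and the workflow are assumptions, not a convention used by the projects mentioned above.

```python
# A sketch of flagging commits that declare AI assistance via an assumed,
# project-defined "Assisted-by:" trailer, so maintainers can give them
# extra license review. The trailer name is hypothetical.
import subprocess

def flagged_commits(rev_range: str = "origin/main..HEAD") -> list[str]:
    """Return the hashes of commits whose messages carry the trailer."""
    log = subprocess.run(
        ["git", "log", "--format=%H%x1f%B%x1e", rev_range],
        capture_output=True, text=True, check=True,
    ).stdout
    flagged = []
    for record in filter(None, log.split("\x1e")):
        sha, _, body = record.partition("\x1f")
        if "Assisted-by:" in body:
            flagged.append(sha.strip())
    return flagged

if __name__ == "__main__":
    for sha in flagged_commits():
        print(f"needs license review: {sha}")
```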
The Push for Transparency and Accountability
In response to the growing concerns over the openness of AI models, organizations like the Open Source Initiative (OSI) have taken steps to define what constitutes "Open Source AI." The OSI is embarking on a global series of workshops to solicit input from stakeholders on this issue, as there is currently no accepted way to determine whether an AI system is truly open source. [17]
The debate over open-source AI has also extended to datasets, which are crucial for developing large language models. High-quality open-source datasets are paramount to enabling innovation in open, generative models, and initiatives like Hugging Face's FineWeb dataset and Stability AI's Stable Audio Open are examples of efforts to promote transparency and collaboration in this space. [9,10]
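For example, an openly published corpus like FineWeb can be inspected directly with the datasets library. The sketch below streams a few records from what is, at the time of writing, the public Hub repository and sample configuration; both names and the record fields are assumptions that should be checked against the dataset card.

```python
# A minimal sketch of inspecting an openly licensed dataset such as FineWeb
# with the datasets library. The repo id, config name, and field names are
# taken from the public dataset card and may change; streaming avoids
# downloading the full multi-terabyte corpus.
from datasets import load_dataset

stream = load_dataset(
    "HuggingFaceFW/fineweb",  # assumed Hub repo id
    name="sample-10BT",       # assumed small sample configuration
    split="train",
    streaming=True,
)

for i, record in enumerate(stream):
    # Field names ("url", "text") per the dataset card at the time of writing.
    print(record["url"], len(record["text"]))
    if i == 4:
        break
```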
The Legal Landscape and Ongoing Challenges
The open-source AI landscape is not without its legal challenges. In one notable case, a lawsuit was filed against GitHub's AI code assistant, alleging that Microsoft (which owns GitHub) treated open-source code as if it were in the public domain rather than respecting the terms of its open-source licenses. While the lawsuit was ultimately dismissed, it highlights the ongoing legal complexities surrounding the use of open-source code in AI systems. [12]
Additionally, the rise of open-source cybersecurity tools has raised concerns about the potential risks to the internet as a whole, as vulnerabilities in these tools could be exploited by bad actors. This has led to calls for greater collaboration and accountability within the open-source community to address these security challenges. [18,19]
The Future of Open-Source AI
Despite the challenges, the future of open-source AI remains promising. Companies such as Meta and IBM are actively contributing to the open-source ecosystem, with initiatives like Meta's LLM Compiler, which aims to optimize code and revolutionize compiler design, and IBM's open-sourcing of its Granite AI models. [14,15]
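As a small illustration of why permissive releases matter in practice, an Apache-2.0 licensed model such as one of IBM's Granite code models can be pulled and run with standard tooling. The repository id below is an assumption and should be verified against the ibm-granite organization on the Hugging Face Hub.

```python
# A hedged sketch of loading one of the openly licensed Granite code models
# with the transformers library. The repo id is an assumption; verify it on
# the Hugging Face Hub before use. Requires enough memory for a ~3B model.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ibm-granite/granite-3b-code-base"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

prompt = "def fizzbuzz(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```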
As the open-source AI landscape continues to evolve, it will be crucial for stakeholders to work together to define clear standards, promote transparency, and ensure that the principles of open source are upheld in the development and deployment of these powerful technologies. [16,17]
My Enterprise Checklist: Key Considerations Before Implementing Open-Source LLMs
References:
[2] "Many Companies Are Launching Misleading "Open" AI Models — Here's Why That's Dangerous for Entrepreneurs," Entrepreneur, June 4, 2024. (Link)
[3] "Shades of Open Source - Understanding The Many Meanings of "Open"," Hacker Noon, June 17, 2024. (Link)
[4] "7 Essential Open-Source Generative AI Models Available Today," Bernard Marr, May 20, 2024. (Link)
[5] "How a Herd of Models Challenges ChatGPT's Dominance: Abstract and Introduction," Hacker Noon, June 5, 2024. (Link)
[6] "10 things to watch out for with open source gen AI," CIO, May 15, 2024. (Link)
[7] "Gentoo and NetBSD ban 'AI' code, but Debian doesn't – yet," The Register, May 18, 2024. (Link)
[8] "SAP publishes open source manifesto," CIO, June 27, 2024. (Link)
[9] "Datasets Matter: The Battle Between Open and Closed Generative AI is Not Only About Models Anymore," The Sequence, June 9, 2024. (Link)
[10] "What to Know About the Open Versus Closed Software Debate," The New York Times, May 29, 2024. (Link)
[11] "GitHub AI code assistant lawsuit dismissed," Boing Boing, July 9, 2024. (Link)
[12] "The 10 Coolest Open-Source Software Tools Of 2024 (So Far)," CRN, July 11, 2024. (Link)
[13] "Zephyr: Direct Distillation of LM Alignment: Related Work," Hacker Noon, July 3, 2024. (Link)
[14] "Meta's LLM Compiler is the latest AI breakthrough to change the way we code," VentureBeat, June 27, 2024. (Link)
[15] "Not all 'open source' AI models are actually open: here's a ranking," Nature, June 19, 2024. (Link)
[16] "Open Source Initiative tries to define Open Source AI," The Register, May 16, 2024. (Link)
[17] "Open-source cybersecurity could derail the internet as we know it," Quartz, May 9, 2024. (Link)
[18] "Open-Source Cybersecurity Is a Ticking Time Bomb," Gizmodo, May 8, 2024. (Link)
[19] "How a Herd of Models Challenges ChatGPT's Dominance: Conclusion, Discussion, and References," Hacker Noon, June 5, 2024. (Link)
[20] "IBM open-sources its Granite AI models - and they mean business," ZDNet, May 13, 2024. (Link)