Open Source AI vs. AI Model Openness

A single definition of open source AI is problematic because, unlike open source software, AI models include non-source-code artifacts that have their own nuances in terms of security, privacy, and intellectual property. Model developers are not always willing to disclose all of the information needed to exactly reproduce their models, and freely available AI models are not necessarily fully open source. Perhaps the best definition of open source software is that of the Apache Software Foundation:

Open source software is software with source code that anyone can view, edit, and contribute to. An open source project includes all aspects of creating, maintaining, and distributing open source software including community building and mentoring, communication, the release process, and everything in between.

Since AI models are the product of complex systems, the equivalent of open source would be an open architecture, open preprocessing and training code, open data, open weights, and open inference or serving code. According to the Open Source Initiative (OSI), AI models comprise (1) a model architecture, (2) model parameters (weights), and (3) inference code for running the model. Training data is not included, but “...information about the data used to train the system” is included “so that a skilled person can build a substantially equivalent system.”

The terms “skilled person” and “substantially equivalent” in the OSI definition point to a difference between open source software and partially open AI models. Software patents, for example, teach those skilled in the art how inventions work but do not include source code. By their nature, software patents reflect proprietary technology, which is not consistent with open source.

A definition of "open source AI" that categorizes AI models that are, in fact, only partially open as either open source or not open source is inherently imprecise. Measuring AI model openness may be a more practical approach than adapting established definitions of open source to fit AI models, which differ from traditional open source software.

Reproducibility

OSI does not define what “substantially equivalent” means. Open source software projects do more than provide information about software that can be used to create similar software. They include the source code needed to compile or run the software, and to exactly reproduce the results of other developers. The ability to reproduce and confirm experimental results is a principle of science, and AI models whose results cannot be reproduced do not fit scientific use cases.

There are many different users and use cases for AI models. Reproducing the training of a large model, especially a large language model (LLM), is resource intensive. Substantially equivalent models can be trained using the same general data sources, such as Common Crawl, and, at scale, model responses should fall within an acceptable error bound. In any case, access to model weights is a more important factor. Without a model’s original training data, it is possible to fine-tune the model using specialized data to adjust its weights, or to use low-rank adaptation (LoRA), which freezes the base model’s weights and trains small low-rank update matrices instead. In other words, meaningful R&D is still possible without access to the original training data.
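The low-rank adaptation idea can be sketched in a few lines of NumPy: the base weight matrix W stays frozen, and only two small matrices A and B (the low-rank update) would be trained. The dimensions below are illustrative, not from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base weight matrix (e.g., one attention projection); never modified.
d_out, d_in, rank = 64, 64, 4
W = rng.standard_normal((d_out, d_in))

# LoRA trains only A and B: rank * (d_in + d_out) parameters
# instead of the full d_out * d_in.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))  # B starts at zero, so the initial update is zero

def forward(x, B, A):
    """Base forward pass plus the low-rank adaptation (B @ A) @ x."""
    return W @ x + (B @ A) @ x

x = rng.standard_normal(d_in)
# Before any training, the adapted model matches the base model exactly.
print(np.allclose(forward(x, B, A), W @ x))  # True
```

In a real fine-tuning run, gradient updates would be applied to A and B only, which is why access to the weights, rather than to the original training data, is what enables this kind of work.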

Training Data

AI models can be trained using information that cannot legally be made public, or that model developers do not wish to make public, or using data available exclusively through commercial arrangements. OSI’s definition allows training data to be replaced by a description of the data used, but there is no currently accepted data bill of materials (DBOM) standard. The real issue for model developers is the risk and cost associated with making training data available when data sources may have different rights associated with them.

Training data is a controversial topic due to copyright and intellectual property concerns recently highlighted by DeepSeek. Excluding training data from the definition of open source AI avoids controversy at the cost of clarity. According to OSI, an open source AI model can be trained using “un-shareable data” and “...data obtainable from third parties […] for fee.” This creates two types of AI models: (1) partially open AI models trained on un-shareable or non-free data, and (2) fully open AI models trained on open data sets. Both types fall under OSI’s definition of open source AI.

Setting aside copyright and intellectual property concerns, sharing proprietary training data could expose AI model developers to other risks. For example, personally identifiable information could be anonymized, but it is not clear if anonymization can be done reliably at scale across diverse data sets or if anonymized data could later be de-anonymized. Alternatively, AI models can be developed entirely using open data sources like Common Crawl, OpenML, ML Commons, and more. Use of synthetic data could also limit AI model developer risk exposure.

Model Weights

To be open source, AI models must be open weight. Open weight means that model weights (parameters) are freely available. Training data and training algorithms produce model weights, which are used for inference. AI model research and development activities, like fine-tuning and interpretability research, depend on the analysis and manipulation of model weights. While AI models used for classification can be tested for output equivalency, generative AI models, such as LLMs, generate disparate outputs if there are differences in training or fine-tuning data.
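Output equivalency for classifiers can be checked empirically by comparing predictions on a shared test set. The sketch below uses synthetic labels and an invented 95% agreement threshold; both are assumptions for illustration, not an established standard for “substantially equivalent.”

```python
import numpy as np

rng = np.random.default_rng(1)

def agreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two models emit the same label."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a == preds_b))

# Two hypothetical classifiers evaluated on the same 1,000-example test set.
reference = rng.integers(0, 3, size=1000)
candidate = reference.copy()
flip = rng.random(1000) < 0.02            # candidate disagrees on ~2% of inputs
candidate[flip] = (candidate[flip] + 1) % 3

rate = agreement_rate(reference, candidate)
print(f"agreement: {rate:.1%}")
assert rate > 0.95  # an illustrative "substantially equivalent" threshold
```

No comparable point check exists for generative models, which is why differences in training or fine-tuning data surface as broadly divergent outputs rather than a measurable disagreement rate.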

OSI states that “parameters shall be made available under OSI-approved terms” but “does not require a specific legal mechanism [to assure] model parameters are freely available.” Model developers can make model weights available using established open source licenses like Apache 2.0, or offer their own licenses which may or may not be compatible with open source principles.

Licensing

After analyzing Llama 2 (Meta), Grok (X/Twitter), Phi-2 (Microsoft), and Mixtral (Mistral), OSI found that required components were not open or that their licenses were incompatible with OSI’s open source definition. Existing open source software licenses, like the Apache 2.0 license, do not distinguish between different AI model components (e.g., model weights versus training data), or between use cases that involve different risks. For these reasons, AI-specific licenses, like the Meta Llama 2 Community License Agreement and Responsible AI Licenses (RAIL), have been developed, but these licenses are not necessarily open source licenses.

Measuring Openness

Traditional open source software definitions and licenses do not neatly fit partially open AI models. Adapting existing definitions of open source to encompass AI model components with varying degrees of openness is problematic. The result is an ambiguous definition that potentially excludes some or all training data and that contemplates the various terms under which model weights are available on a case-by-case basis.

Rather than relying on imprecise categorizations (either open source or not open source), users of AI models should make their own data-driven decisions. Measuring AI model openness might be more effective than fitting AI models into existing definitions of open source, or altering the definition of open source. For example, the Model Openness Framework (MOF), supported by the Linux Foundation, categorizes AI models as (1) open science, (2) open tooling, and (3) open model, and sets forth clear criteria for each category. The MOF project publishes the Model Openness Tool (MOT), which shows, for each model, exactly what is open and what is not.
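The component-by-component idea behind such measurement can be sketched as a simple checklist. This is a hypothetical illustration, not the MOF's actual component list or criteria.

```python
# Hypothetical checklist, not the MOF's actual criteria: tally which
# components of a model release are available under open terms.
COMPONENTS = [
    "architecture",
    "training code",
    "training data",
    "model weights",
    "inference code",
]

def openness_report(released):
    """Return (open, closed) component lists for a release description."""
    open_parts = [c for c in COMPONENTS if released.get(c, False)]
    closed_parts = [c for c in COMPONENTS if c not in open_parts]
    return open_parts, closed_parts

# An "open weight" release that withholds training data and training code.
open_parts, closed_parts = openness_report(
    {"architecture": True, "model weights": True, "inference code": True}
)
print("open:", open_parts)
print("closed:", closed_parts)
```

A report like this makes the distinction the article draws explicit: an open-weight model can be useful for R&D while still being only partially open.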
