Licensing of datasets for machine learning and Open Source (Part I)

Licensing of datasets for machine learning and Open Source (Part I)

Over the years, the field of artificial intelligence (AI) has undergone rapid transformation. Starting in the 1950s, AI has evolved significantly, with milestones at nearly every decade. Initially, computing machines were large-scale calculators, but visionaries like Alan Turing imagined machines capable of expanding beyond their original programming. It was the Dartmouth Conference in 1956 that laid the foundation for AI research, and John McCarthy coined the term "artificial intelligence." Fast forward to today, AI technologies generates creative responses, such as text and images, at an unprecedented pace. AI models, especially machine learning and deep learning, play a pivotal role in advancing AI by learning from data and making predictions, revolutionizing fields like natural language processing, computer vision, and robotics[1].? AI models, powered by machine learning algorithms, learn from data to make predictions or identify patterns. These models can be supervised (requiring labeled input-output pairs) or unsupervised (processing raw, unlabeled data). While datasets serve as the training material for these models, covering diverse domains such as computer vision, finance, and more. High-quality datasets are essential for driving advancements in artificial intelligence. This article will be in two parts. I will first be delving into understanding the basic copyright principles then I will look at the various machine learning datasets and also look at the objectives of this article. Finally for this part I will delve with the challenges associated with licensing of these datasets. In the next part, at the end of the next quarter, I will start with the investigation around understanding the complexity of dataset licensing. Let’s begin!!!!

Introduction:

Over the past decade, the growth of machine learning systems has been exponential. Central to these systems are the models trained using datasets. Machine learning datasets serve as collections of data used for training and evaluating models. As the field continues to scale, the importance of datasets has surged. Collecting and developing high-quality datasets has become crucial. However, the landscape has evolved. Early dataset collection practices involved meticulous annotation, which was time-consuming. Nowadays, data collection tends toward unconstrained sources, including web data. These datasets originate from diverse channels, such as government agencies, research institutions, and private companies.

Traditional data collection and annotation are often slow and costly, leading developers to use script-like tools to gather data directly from the Internet. Unfortunately, many developers do not take data licensing seriously, resulting in datasets with unclear terms and conditions. Understanding the licensing terms is crucial for determining the commercial exploitability of these datasets. My study of these datasets revealed that most datasets have uncertain commercial use due to neglected licensing considerations.

Two primary laws govern datasets and the data within them: copyright law and contract law. Copyright law prohibits copying or modifying any part of a work without the express permission of the copyright holder, meaning copyrighted data cannot be used or distributed commercially without authorization [2]. Many publicly available datasets contain copyrighted data, and using these to build commercial AI software may result in copyright infringement. Contract law, on the other hand, allows the copyright holder to grant a license that outlines the rights and obligations of others. Exercising rights not granted by the license or failing to fulfill obligations can lead to a contract violation. Although the specific protections provided by these laws vary by country or region, they generally offer similar safeguards. Furthermore, the development of artificial intelligence has led to the introduction of several privacy and fairness-related laws governing the use of datasets. It is essential to understand the terms and conditions of the dataset licenses to ensure compliance with the creators' wishes and avoid legal or ethical violations.

Moreover, various licenses govern machine learning datasets, each with specific terms and conditions. Common licenses include the Open Data Commons Public Domain Dedication and License (PDDL), the Creative Commons Attribution License (CC BY), and the GNU General Public License (GPL). These licenses will be detailed in the next section. In this article, I categorize the licensing issues that lead to violations in machine learning systems into three main types:

1. Commercial vs. Non-Commercial Use: Datasets often contain images sourced from the Internet, which may or may not be intended for commercial use. For instance, a dataset licensed under CC BY-NC is designated for non-commercial use only.??

2. Dataset Withdrawal for Legal Reasons: Datasets may be retracted due to legal disputes or other concerns. An example is the MS-Celeb-1M dataset [3], which was withdrawn following a controversy over portrait rights.??

3. Purpose-Achieved Dataset Withdrawal: Sometimes, datasets are withdrawn once their original purpose has been fulfilled. For example, the MegaFace [4] dataset was removed from the public domain after the completion of its associated competition.

Understanding these licensing issues is crucial to ensure legal and ethical compliance in the use of machine learning datasets. In the first issue described above, the use of a non-commercial dataset for commercial purposes is considered to result in a license violation. In recent instances, if a machine learning system persists in using a withdrawn dataset, it may be deemed a violation of licensing terms.

Objective of this article:

My primary goal in this article is to investigate dataset licenses and develop a system to assist developers with license management. In pursuit of this ultimate objective, I also aim to achieve two additional objectives:

1. Clarify Dataset Licenses: Ensure it is easy to determine if the machine learning system uses datasets that are commercially permissible.

2. Ensure Dataset Traceability: Clarify the origin of the data within the dataset to facilitate future development and modifications by the dataset developer.

Upon achieving these objectives, this article makes the following contributions:

Initial Investigation and Organization: I have initially investigated and organized the licenses of frequently used datasets, or those with high usage frequency, on GitHub.

Collation and Analysis: While collating datasets, I also gathered their data sources and analyzed conflicting licenses between the datasets and the data they contain.

Licensing Compatibility: I investigated the compatibility of licenses between datasets and systems, analyzing and explaining the two ultimate goals of the GQM (Goal Question Metric) model for dataset investigations.

Community Appeal: Provide recommendations to the machine learning systems and datasets community the provision of a complete license-based system for datasets, especially in the absence of well-defined laws.

Studies on Licensing of machine learning datasets:

Before delving in the licensing aspect let us try to understand what Machine Learning for Software Engineering means. Machine Learning for Software Engineering typically involves applying machine learning methods to enhance various development processes or conduct research within software engineering. It is particularly common to use machine learning to test software systems. More relevant to this article, however, is Software Engineering for Machine Learning, which employs software engineering principles to standardize and measure the development of machine learning systems. This field focuses on designing, developing, and maintaining software systems that support the creation and deployment of machine learning models. As machine learning becomes increasingly prevalent across industries such as healthcare, finance, and transportation, the importance of this field grows.

A key challenge in Software Engineering for Machine Learning is managing the complexity of the models and the data they process. Machine learning models can be highly complex, with millions of parameters and numerous input variables, making it difficult to understand their predictions and troubleshoot issues. Additionally, delivering machine learning system solutions to production environments is often hindered by uncertainty. Determining the starting point of a machine learning application and ensuring its safety and correctness can be challenging. Recent work has also highlighted the impact of ML-based system applications on ISO security issues. In response to these challenges, both industry and academia have made efforts to automate the development process by creating frameworks and environments that support ML workflows and experimentation. Version control is another crucial tool in this domain, enabling developers to track code changes over time and collaborate effectively on large projects, where multiple developers may work on the same codebase simultaneously.

Software Engineering for Machine Learning extends beyond the development of machine learning system applications to influence data management, training, and coding processes. The development of Software Engineering for Machine Learning often involves data scientists working within software engineering teams responsible for applying machine learning systems. A significant challenge in this field is managing the large volumes of data required to train and evaluate models. This data can originate from various sources, including sensors, cameras, and other data collection devices, and come in multiple formats, such as images, videos, and audio recordings. Handling this data is time-consuming and complex, necessitating specialized software and hardware for storage and processing.

Now let us have a macro look at the software licensing. Software licensing is a legal agreement that outlines the terms and conditions for using, distributing, and modifying software. These licenses are crucial in the software development industry as they protect the rights of developers and ensure software is used according to their intentions. Copyright protection means software can only be used with the copyright owner's permission. Given that software is easy to copy but costly to create, there are two main copyright protection strategies: private and non-private. A private copyright policy restricts some or all potential users. In this model, code is protected by copyright and then distributed under a license that grants specific rights to users[5]. Proprietary licenses are the most common type, often used for commercial software, granting developers exclusive rights. These licenses typically limit software use to a single user or organization and prohibit modification or distribution without the developer’s consent. Non-private copyright strategies include placing software in the public domain or licensing it as open source. Public domain software waives copyright protection, allowing anyone to use and modify it without compensation. Richard Stallman of MIT pioneered the GNU Public License, a foundational element of the open-source software movement aimed at maximizing openness and reducing barriers to software use to foster innovation. Open-source licenses grant intellectual property rights, facilitating collaboration and exchanges among users and developers with diverse motivations. Freeware licenses allow users to use software for free while the developer retains all rights. Under such licenses, the software can be used and distributed but not modified or sold without the developer’s permission. Freeware is typically intended for personal or non-commercial use.

With the rise of machine learning, many private commercial machine learning systems have become profitable. However, numerous machine learning systems are available on open-source platforms like GitHub. Some of these publicly available systems have appropriate open-source licenses, while others do not. In my investigation, I aim to understand the licensing of machine learning systems and assess the compatibility of these systems with datasets by evaluating the commercial availability of their licenses.

Similar to software, machine learning datasets can be governed by various types of licenses, each with specific terms and conditions. Some standard licenses include the Open Data Commons Public Domain Dedication and License (PDDL), the Creative Commons Attribution License (CC BY), and the GNU General Public License (GPL).

- PDDL: This public domain dedication allows anyone to use the data for any purpose without restrictions or attribution requirements. It is often used for datasets created by government agencies or other public entities to ensure maximum accessibility and reuse.

?- CC BY: This license requires users to give attribution to the original creator. It is commonly used for datasets created by researchers or private individuals, allowing them to retain control over how their data is used.

- GPL: Commonly used for software and other forms of open-source code, this license allows users to access and use the data but requires any modifications to the dataset to be made publicly available under the same license. It is useful for datasets intended to be part of larger open-source projects.[6]

However, research and specifications for licensing machine learning datasets are scarce, and no previous studies have focused on licensing violations that occur when building commercial AI software using publicly available datasets[7]. Various issues within the dataset can affect its licensing, including fairness, ethical considerations, and personal privacy. These concepts intersect but all significantly impact the dataset's licensing.

- Fairness: The fairness of a dataset refers to how biases in dataset annotation can affect machine learning models. Machine learning's success relies on data-driven approaches with large amounts of data, making models vulnerable to biases present in the data itself.

- Ethical Considerations: Ethical challenges can arise during the creation and development of datasets. The prioritization of various processes in dataset creation often leads to lower prioritization of factors like data management and offensive labels, posing ethical concerns.

- Personal Privacy: Many datasets contain images or information collected from the Internet that may include personal data without the individual's consent. This raises significant privacy issues.

Addressing these issues is crucial for ensuring that the licensing of machine learning datasets complies with relevant laws and ethical standards, and for avoiding violations when using publicly available data in commercial AI software.

Numerous studies and tools aim to mitigate the issues caused by problematic datasets. For instance, to address privacy concerns with publicly available image datasets, developers often obfuscate private information before releasing the data. Researchers are also increasingly required to review the ethicality of datasets.? The heightened focus on dataset-related problems in both academia and industry has led to stricter controls over dataset creation and development. However, unlike the well-established licensing systems for open-source software, a comprehensive licensing framework for datasets remains undeveloped. Open-source licenses have significantly impacted open-source development, prompting extensive study within the field of software engineering. Consequently, the evaluation of open-source software licenses is well-documented.

Despite this, licensing datasets presents unique challenges. Dataset licensing is more complex than software licensing, and the specifications for dataset licenses are often incomplete. In response to this gap, Benjamin proposed the Montreal Data License (MDL) to standardize dataset licensing. Additionally, Barclay's study highlights the growing difficulty of tracking data usage and understanding data sources and contributions within increasingly complex data ecosystems. Hutchinson also proposed a framework to enhance transparency and accountability based on ethical practices in data development. These efforts underscore the need for a more robust and clear licensing system for datasets, ensuring legal and ethical use of data in machine learning and other applications.

With this I conclude Part I of this article. In the next part I will present my preliminary investigation as well as main investigation for understanding the complexity of dataset licensing.


[1] How has AI developed over the years and what's next?. https://www.weforum.org/agenda/2022/12/how-ai-developed-whats-next-digital-transformation/.

[2] Halina Kaminski and Mark Perry. Open source software licensing patterns. Computer Science Publications, page 10, 2007.

[3] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European conference on computer vision, pages 87–102. Springer, 2016.

[4] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4873–4882, 2016.

[5] Lawrence Rosen. Open source licensing. Software Freedom and Intellectual Property Law, 2005.

[6] Richard Stallman. Free software, free society: Selected essays of Richard M. Stallman. Lulu. com, 2002.

[7] Gopi Krishnan Rajbahadur, Erika Tuck, Li Zi, Zhang Wei, Dayi Lin, Boyuan Chen, Zhen Ming, Daniel Morales German, et al. Can i use this publicly available dataset to build commercial ai software? most likely not. arXiv preprint arXiv:2111.02374, 2021.

Abhay Kumar

Digital Transformation using Cloud Native solutions

2 个月

Insightful

回复

要查看或添加评论,请登录

Utsav Sharma的更多文章

社区洞察

其他会员也浏览了