Is the blocking of artificial intelligence systems’ web crawling legitimate?
Giulio Coraggio
Solving Legal Challenges of the Future | Head of Intellectual Property & Technology | Partner @ DLA Piper | IT, AI, Privacy, Cyber & Gaming Lawyer
The recent decision by some French media to block web crawling by artificial intelligence systems like ChatGPT raises questions about the legality of the practice.
It is recent news that some French media have decided to block the web crawling tool “GPTBot,” used by OpenAI to train its Generative AI systems such as GPT-4, which powers ChatGPT, from accessing their websites. As artificial intelligence systems become more widespread and are used in increasingly diverse areas, the collection of data takes on a very different weight for newspapers. Some decide to stop it, while others choose to exploit the opportunity.
The case between OpenAI and French media
OpenAI had long stated that it was using GPTBot to fuel the training of upcoming versions of its Generative Artificial Intelligence system, particularly GPT-5, which will thus be able to build on much broader knowledge. A crawler is software that reads all the contents of a web page or database in an automated manner, makes a copy of the documents it finds, and sorts them according to an index for ease of later use.
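Blocking of this kind is usually implemented through the website’s robots.txt file. According to OpenAI’s documentation, GPTBot identifies itself with the user-agent token “GPTBot,” so a publisher that wants to keep it out entirely can publish directives along these lines (a minimal illustration, not necessarily the exact configuration used by the French outlets):

    User-agent: GPTBot
    Disallow: /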
Since the release of the well-known ChatGPT system, OpenAI has disclosed that most of the data used to train the model came from the Internet, while indicating that the content covered extends only up to September 2021.
Issues relating to data quality for Generative AI systems
The issues associated with this data collection for GPT models relate primarily to the quality of the data collected and analyzed. Poor data quality is a phenomenon that has grown with the spread and increased use of Big Data and has long been an obstacle to the healthy development of AI systems. For example, data collected from social platforms is evidently of lower quality than data collected from articles published by newspapers, which are far more curated and therefore of higher value and quality.
The collection of data from the Internet also raises questions for supervisory and regulatory authorities. Data protection authorities have recently raised privacy concerns about the scraping of data from social media and other public websites, since information that is publicly accessible on the Internet remains subject to data protection laws in any case. This type of practice exposes users to risks such as cyber attacks, identity theft, unauthorized surveillance, and unwanted marketing.
The reaction to OpenAI’s web crawling
Faced with this indiscriminate collection of data, media outlets such as Radio France and TF1 cut off their sites’ availability to ChatGPT’s web crawler and subsequently proposed an agreement to OpenAI that would guarantee them compensation. Other media outlets around the world, such as The New York Times and CNN, also disabled GPTBot, wanting to protect their content and avoid copyright infringement, but above all wanting to exclude the possibility that other companies, using OpenAI’s products, could benefit from the intellectual work done by newspapers.
The American Journalism Project, a major U.S. philanthropic organization that aims to rebuild and sustain local news, has entered into an agreement with OpenAI to experiment with ways in which AI can support the news sector. The purpose of this partnership is to strengthen local news organizations, as, with the use of AI, newspapers could expand their capabilities.
Why web crawling by artificial intelligence systems might be legitimate
One of the main gaps identified in the current version of the EU AI Act is the lack of coordination with copyright and data protection laws.
When discussing the potential copyright violations by artificial intelligence systems, a key consideration is the applicability of the text and data mining (TDM) exception outlined in the Copyright Directive 2019/790/EU. This exception allows TDM activities on protected works such as software or databases, irrespective of the purpose or of who conducts them, provided that the content has been lawfully accessed and that the rightsholder has not expressly reserved the use of the works (the so-called opt-out).
However, the extent of this opt-out mechanism is influenced by how the rights holder reserves it. Article 4(3) of the Copyright Directive mandates that, for content made publicly available online, the reservation must be expressed in a machine-readable manner. Opting out can also be achieved by incorporating a clause in a contract, a point confirmed by the Directive itself, which does not make the Article 4 exception mandatory and therefore leaves it open to being overridden by contract.
Additionally, the validity of the reservation does not depend on the existence of technical mechanisms preventing data extraction; the reservation only serves an informative purpose. Hence, even where no protective measures are in place, adding a reservation to the website (for instance, in its terms and conditions) is enough.
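By way of illustration only, a crawler operator could verify such a machine-readable reservation programmatically. The short Python sketch below assumes the reservation is expressed in robots.txt form and that the crawler identifies itself as “GPTBot”; the publisher URL is purely hypothetical.

    from urllib import robotparser

    # Parse a reservation expressed in robots.txt form (no network call needed here).
    parser = robotparser.RobotFileParser()
    parser.parse(["User-agent: GPTBot", "Disallow: /"])

    # False: the rightsholder has reserved this content against the GPTBot user agent.
    print(parser.can_fetch("GPTBot", "https://www.example-publisher.com/articles/story"))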
The reservation can be expressed, for instance, through machine-readable means such as metadata, or through the terms and conditions of a website or service.
Another challenge is the retention of copies after the data mining. Reproductions can be held only as long as necessary for TDM and therefore cannot be kept for tasks beyond TDM, such as validating results. Some believe that copies made for data mining can be retained for AI training, but this depends on whether AI training falls under TDM or is a subsequent activity; if it is the former, then the copies might be kept during AI training.
The Directive does not address the use of data after the computational analysis. Some experts suggest that exploiting the results of data mining might require the copyright owner’s permission. If only segments of content are mined, it is essential to assess whether those segments are individually creative and protected. Some argue that using creative excerpts does not breach copyright if the author’s intended meaning becomes unrecognizable in the new setting.
In summary, developers looking to train AI systems using copyrighted data should: (i) verify that the content has been lawfully accessed; (ii) check whether the rightsholder has reserved TDM through machine-readable means or contractual terms; (iii) retain reproductions only for as long as necessary for the mining activity; and (iv) assess whether the intended use of the mining output requires the rightsholder’s permission.
The relevance of the TDM exception under the newly established regime of the EU AI Act
Under the current version of the EU AI Act, a disclosure of the IP-protected material used for the training of artificial intelligence systems is likely to be required. This obligation risks leading to major disputes, unless the disclosing party is able to defend the legality of its conduct, relying for instance on the above-mentioned TDM exception.
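Purely as an illustrative sketch (the file name, fields, and helper below are hypothetical and not prescribed by the EU AI Act), a provider could keep a simple provenance log of the protected sources used for training, together with the legal basis relied upon, so that a summary can be produced if disclosure is required:

    import csv
    from datetime import datetime, timezone

    def log_training_source(log_path: str, source_url: str, legal_basis: str) -> None:
        """Append one provenance row (source URL, claimed legal basis, timestamp) to a CSV log."""
        with open(log_path, "a", newline="", encoding="utf-8") as log_file:
            csv.writer(log_file).writerow(
                [source_url, legal_basis, datetime.now(timezone.utc).isoformat()]
            )

    # Hypothetical usage: record a source relied upon under the TDM exception.
    log_training_source(
        "training_sources.csv",
        "https://www.example-publisher.com/articles/story",
        "TDM exception - no machine-readable opt-out found",
    )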
These types of assessments are included in the compliance evaluations covered by DLA Piper’s PRISCA AI Compliance, a legal tech tool able to perform a maturity assessment of artificial intelligence solutions against the major regulatory obligations. You can read more on the topic HERE. There is no doubt that web crawling by artificial intelligence systems might lead to challenges, and companies exploiting AI therefore need a valid line of defense.
You may also find the following article interesting: “€ 20 million privacy fine against Clearview AI facial recognition system in Italy”.
Authors: Giulio Coraggio and Marco Guarna
Legal Tech Tools and Offerings
Prisca AI Compliance
Prisca AI Compliance is a turnkey solution to assess the maturity of artificial intelligence systems against the main regulations and technical standards, providing a compliance score and identifying corrective actions to be undertaken. Read more
Transfer - DLA Piper legal tech solution to support Transfer Impact Assessments
This presentation shows DLA Piper's legal tech tool named "Transfer", which supports our clients in performing a transfer impact assessment following the Schrems II case. Read more
DLA Piper Turnkey solution on NFT and Metaverse projects
You can have a look at DLA Piper's capabilities and practice areas for NFT and Metaverse projects. Read more