Is the blocking of artificial intelligence systems’ web crawling legitimate?
Giulio Coraggio
Solving Legal Challenges of the Future | Head of Intellectual Property & Technology | Partner @ DLA Piper | IT, AI, Privacy, Cyber & Gaming Lawyer
The recent decision by some French media to block web crawling by artificial intelligence systems like ChatGPT raises questions about the legality of the practice.
It is recent news that some French media have decided to block the web crawling tool “GPTBot,” used by OpenAI to train its Generative AI systems such as GPT-4, which powers ChatGPT, from accessing their websites. As artificial intelligence systems become more widespread and are used in increasingly diverse areas, the collection of data takes on a very different weight for newspapers. Some decide to stop it, while others choose to exploit the opportunity.
The case between OpenAI and French media
OpenAI had long stated that it was using GPTBot to fuel the training of upcoming versions of its Generative Artificial Intelligence system, particularly GPT-5, which will thus be able to build on much broader knowledge. A crawler is software that reads all the contents of a web page or database in an automated manner, makes a copy of the documents it finds, and sorts them according to an index for ease of later use.
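Blocking of this kind is usually implemented through the website’s robots.txt file. According to OpenAI’s documentation, GPTBot identifies itself with the user-agent token “GPTBot,” so a publisher that wants to keep it out entirely can publish directives along these lines (a minimal illustration, not necessarily the exact configuration used by the French outlets):

    User-agent: GPTBot
    Disallow: /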
Since the release of the well-known ChatGPT system, OpenAI has disclosed that most of the data used to train the model came from the Internet, while indicating that the content covered extends only up to September 2021.
Issues relating to data quality for Generative AI systems
The issues associated with this data collection for GPT models relate primarily to the quality of the data collected and analyzed. Poor data quality is a phenomenon that has grown with the spread and increased use of Big Data and has long been an obstacle to the healthy development of AI systems. For example, data collected from social platforms is evidently of lower quality than data collected from articles published by newspapers, which are far more curated and therefore of higher value and quality.
The collection of data from the Internet also raises questions for supervisory and regulatory authorities. Data protection authorities have recently raised privacy concerns about the scraping of data from social media and other public websites, since information that is publicly accessible on the Internet remains subject to data protection laws in any case. This type of practice exposes users to risks such as cyber attacks, identity theft, unauthorized surveillance, and unwanted marketing.
The reaction to OpenAI’s web crawling
Faced with this indiscriminate collection of data, media outlets such as Radio France and TF1 cut off their sites’ availability to ChatGPT’s web crawler and subsequently proposed an agreement to OpenAI that would guarantee them compensation. Other media outlets around the world, such as The New York Times and CNN, also disabled GPTBot, wanting to protect their content and avoid copyright infringement, but above all wanting to exclude the possibility that other companies, using OpenAI’s products, could benefit from the intellectual work done by newspapers.
The American Journalism Project, a major U.S. philanthropic organization that aims to rebuild and sustain local news, has entered into an agreement with OpenAI to experiment with ways in which AI can support the news sector. The purpose of this partnership is to strengthen local news organizations, as, with the use of AI, newspapers could expand their capabilities.
Why web crawling by artificial intelligence systems might be legitimate
One of the main gaps identified in the current version of the EU AI Act is the lack of coordination with copyright and data protection laws.
When discussing the potential copyright violations by artificial intelligence systems, a key consideration is the applicability of the text and data mining (TDM) exception outlined in the Copyright Directive 2019/790/EU. This exception allows TDM activities on protected works such as software or databases, irrespective of the purpose or of who conducts them, provided that the content has been lawfully accessed and that the rightsholder has not expressly reserved the use of the works (the so-called opt-out).
However, the extent of this opt-out mechanism is influenced by how the rights holder reserves it. Article 4(3) of the Copyright Directive mandates that, for content made publicly available online, the reservation must be expressed in a machine-readable manner. Opting out can also be achieved by incorporating a clause in a contract, a point confirmed by the Directive itself, which does not make the Article 4 exception mandatory and therefore leaves it open to being overridden by contract.
Additionally, the validity of the reservation does not depend on the existence of technical mechanisms preventing data extraction; the reservation only serves an informative purpose. Hence, even where no protective measures are in place, adding a reservation to the website (for instance, in its terms and conditions) is enough.
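By way of illustration only, a crawler operator could verify such a machine-readable reservation programmatically. The short Python sketch below assumes the reservation is expressed in robots.txt form and that the crawler identifies itself as “GPTBot”; the publisher URL is purely hypothetical.

    from urllib import robotparser

    # Parse a reservation expressed in robots.txt form (no network call needed here).
    parser = robotparser.RobotFileParser()
    parser.parse(["User-agent: GPTBot", "Disallow: /"])

    # False: the rightsholder has reserved this content against the GPTBot user agent.
    print(parser.can_fetch("GPTBot", "https://www.example-publisher.com/articles/story"))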
The reservation can be expressed, for instance, through machine-readable means such as metadata, or through the terms and conditions of a website or service.
Another challenge is the retention of copies after the data mining. Reproductions can be held only as long as necessary for TDM and therefore cannot be kept for tasks beyond TDM, such as validating results. Some believe that copies made for data mining can be retained for AI training, but this depends on whether AI training falls under TDM or is a subsequent activity; if it is the former, then the copies might be kept during AI training.
The Directive does not address the use of data after the computational analysis. Some experts suggest that exploiting the results of data mining might require the copyright owner’s permission. If only segments of content are mined, it is essential to assess whether those segments are individually creative and protected. Some argue that using creative excerpts does not breach copyright if the author’s intended meaning becomes unrecognizable in the new setting.
In summary, developers looking to train AI systems using copyrighted data should: (i) verify that the content has been lawfully accessed; (ii) check whether the rightsholder has reserved TDM through machine-readable means or contractual terms; (iii) retain reproductions only for as long as necessary for the mining activity; and (iv) assess whether the intended use of the mining output requires the rightsholder’s permission.
The relevance of the TDM exception under the newly established regime of the EU AI Act
Under the current version of the EU AI Act, a disclosure of the IP-protected material used for the training of artificial intelligence systems is likely to be required. This obligation risks leading to major disputes, unless the disclosing party is able to defend the legality of its conduct, relying for instance on the above-mentioned TDM exception.
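Purely as an illustrative sketch (the file name, fields, and helper below are hypothetical and not prescribed by the EU AI Act), a provider could keep a simple provenance log of the protected sources used for training, together with the legal basis relied upon, so that a summary can be produced if disclosure is required:

    import csv
    from datetime import datetime, timezone

    def log_training_source(log_path: str, source_url: str, legal_basis: str) -> None:
        """Append one provenance row (source URL, claimed legal basis, timestamp) to a CSV log."""
        with open(log_path, "a", newline="", encoding="utf-8") as log_file:
            csv.writer(log_file).writerow(
                [source_url, legal_basis, datetime.now(timezone.utc).isoformat()]
            )

    # Hypothetical usage: record a source relied upon under the TDM exception.
    log_training_source(
        "training_sources.csv",
        "https://www.example-publisher.com/articles/story",
        "TDM exception - no machine-readable opt-out found",
    )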
These types of assessments are included in the compliance evaluations covered by DLA Piper’s PRISCA AI Compliance, a legal tech tool able to perform a maturity assessment of artificial intelligence solutions against the major regulatory obligations. You can read more on the topic HERE. There is no doubt that web crawling by artificial intelligence systems might lead to challenges, and companies exploiting AI therefore need a valid line of defense.
You may also find the following article interesting: “€ 20 million privacy fine against Clearview AI facial recognition system in Italy”.
Authors: Giulio Coraggio and Marco Guarna
Legal Tech Tools and Offerings
Prisca AI Compliance
Prisca AI Compliance is a turnkey solution to assess the maturity of artificial intelligence systems against the main regulations and technical standards, providing a compliance score and identifying corrective actions to be undertaken. Read more
Transfer - DLA Piper legal tech solution to support Transfer Impact Assessments
This presentation shows DLA Piper's legal tech tool named "Transfer", which supports our clients in performing a transfer impact assessment following the Schrems II case. Read more
DLA Piper Turnkey solution on NFT and Metaverse projects
You can have a look at DLA Piper's capabilities and practice areas for NFT and Metaverse projects. Read more