The new OpenAI User Agent and its consequences
Pierluigi Vinciguerra
Co-Founder and CTO at Databoutique.com | Writing on The Web Scraping Club
The latest post by Gergely Orosz of The Pragmatic Engineer puts the spotlight on one of the major concerns about the spread of AI and LLMs: the copyright of the data used in their training.
In essence, he blocked the newly introduced OpenAI user agent in his robots.txt, so his blog will no longer be used for training GPT models. By scraping his content, OpenAI had effectively cut the bond between Gergely and his audience, serving his writing without any citation back to his blog and, on top of that, charging users for it.
Here's the link to the original post: https://www.dhirubhai.net/posts/gergelyorosz_i-updated-my-blogs-robotstxt-to-opt-out-activity-7094762821527171072-8DYn
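For reference, the opt-out itself takes just two lines in a site's robots.txt: OpenAI's crawler identifies itself with the GPTBot user agent token, and a blanket Disallow rule excludes it from the whole site.

```
User-agent: GPTBot
Disallow: /
```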
I totally agree with what he wrote.
It seems like AI companies treated blogs, social media platforms, and other companies like orchards: they came in at night and took fruit without telling anyone. In fact, only now, and only OpenAI, has created a user agent you can opt out of. And what was meant to be a public canteen for the benefit of the community (a non-profit organization) turned out to be a marmalade shop (a for-profit organization).
Metaphors aside, web scraping practices are as old as the web itself. It started as a Wild West where brute force was the only rule and, with the years passing and legal battle after legal battle, it has become a more regulated field.
Today, there are legal and ethical standards that every company in this industry selling products based on web data should follow.
Leaving aside the most technical ones, the general rules are:
- No personally identifiable information should be scraped
- Data should be public (no data behind a login)
- Data must not be protected by copyright
- Scraping should not harm the target website (see the sketch after this list)
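None of these rules requires exotic tooling. As a concrete illustration of the last point, here's a minimal sketch of a "polite" fetch loop in Python: it declares an honest user agent, checks robots.txt before fetching, and rate-limits itself so it can't harm the target site. The user agent string, delay, and URL are illustrative assumptions, not real endpoints.

```python
# A minimal "polite" fetch loop: honors robots.txt and rate-limits
# requests so the target site is not harmed.
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

USER_AGENT = "MyPoliteScraper/1.0"  # declare who you are (placeholder name)
CRAWL_DELAY = 5                     # seconds between requests (assumption)

def allowed(url: str) -> bool:
    """Check the site's robots.txt before fetching anything."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_fetch(urls):
    for url in urls:
        if not allowed(url):
            print(f"robots.txt disallows {url}, skipping")
            continue
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req) as resp:
            print(url, resp.status)
        time.sleep(CRAWL_DELAY)  # don't hammer the server

if __name__ == "__main__":
    polite_fetch(["https://example.com/public-page"])  # illustrative URL
```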
In the financial world, the Investment Data Standards Organization has drafted some best practices for web scraping (https://www.investmentdata.org/_files/ugd/c6ff57_a592a633d2b9446f8cb3539e1a77ac37.pdf).
Long story short: if today you run a business based on web-scraped data (competitive intelligence tools, sentiment analysis services, alternative data for financial markets), you need to play fair and follow the rules.
By doing so, you're adding value to the global economy, providing a product/service that doesn't damage others.
That's not what's actually happening with most AI companies. They're cannibalizing traffic from companies like StackOverflow, to mention one clear use case I covered in one of the latest posts on The Web Scraping Club.
There's already a court case in which OpenAI is alleged to have used copyrighted data, and we'll see how it ends: https://www.theguardian.com/books/2023/jul/05/authors-file-a-lawsuit-against-openai-for-unlawfully-ingesting-their-books
But even if ingesting blog posts or user-generated content to build an indirect competitor, without even mentioning the sources, may not be legally actionable, it's dangerous for the content creator economy and for the health of the web itself.
Paradoxically, the more damage is done to these businesses (such as lower revenues due to reduced traffic), the more the training of LLMs itself is put in danger. If blogging or keeping StackOverflow alive is no longer remunerative and these sites shut down, there will be less data available for training in the first place.
But I want to close my reflection with some hope:
- Having a clear user agent to exclude from your website is the first step toward bringing some order to this Wild West. Of course, for every declared user agent there will be a hundred undeclared ones, or ones camouflaged as genuine Google Chrome requests (see the sketch after this list for one way to spot those).
- Perplexity's approach is a great one: for each query, you get links to the sources used. A revenue-sharing model between content providers and the AI consuming their content could be a positive innovation for the web itself.
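On the first point: blocking a declared user agent is trivial, but camouflaged bots are not. One common defense is to verify that a request claiming to be a known crawler actually originates from that crawler's published IP ranges. A minimal sketch follows; the CIDR blocks are RFC 5737 documentation placeholders, not OpenAI's real published ranges, which you'd need to fetch from their official documentation.

```python
# A sketch of verifying that a request *claiming* to be GPTBot really is:
# the declared user agent must match AND the source IP must fall inside
# the crawler's published IP ranges. The ranges below are documentation
# placeholders (an assumption for illustration), not OpenAI's real list.
import ipaddress

GPTBOT_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # placeholder range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder range
]

def is_genuine_gptbot(user_agent: str, remote_ip: str) -> bool:
    """True only if the UA declares GPTBot and the IP matches a known range."""
    if "GPTBot" not in user_agent:
        return False
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in net for net in GPTBOT_RANGES)

# A camouflaged scraper: claims to be GPTBot but comes from an unknown IP.
print(is_genuine_gptbot("GPTBot/1.0", "203.0.113.7"))  # False
# A (placeholder) genuine hit: declared UA from a listed range.
print(is_genuine_gptbot("GPTBot/1.0", "192.0.2.10"))   # True
```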