The new OpenAI User Agent and its consequences

The latest post by Gergely Orosz of The Pragmatic Engineer highlights one of the major concerns about the spread of AI and LLMs: the copyright of the data used to train them.

He blocked the newly introduced OpenAI user agent in his blog's robots.txt, so that his content will no longer be used to train GPT models. By scraping his content, OpenAI effectively cut the bond between Gergely and his audience: it serves his work without any citation of his blog and, on top of that, makes users pay for it.

Here's the link to the original post: https://www.dhirubhai.net/posts/gergelyorosz_i-updated-my-blogs-robotstxt-to-opt-out-activity-7094762821527171072-8DYn
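To make the opt-out concrete, here is a minimal sketch of what such a robots.txt rule might look like and how a well-behaved crawler would read it. It assumes the user agent token is `GPTBot` (the token OpenAI documents for its crawler); the robots.txt content, the `is_allowed` helper, and the example.com URL are illustrative and not taken from Gergely's blog.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content (illustrative): block the GPTBot token
# everywhere, leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/blog/some-post"  # hypothetical URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Keep in mind that robots.txt is only a request: it works as long as the crawler chooses to honor it.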

I totally agree with what he wrote.

It seems AI companies have treated blogs, social media platforms, and other companies' websites like orchards: they got in at night and took some fruit without telling anyone. In fact, only now, and only OpenAI, has published a user agent that site owners can use to opt out. And what was meant to be a public canteen for the benefit of the community (a non-profit organization) turned out to be a marmalade shop (a for-profit organization).


Metaphor aside, web scraping is as old as the web itself. It started as a Wild West where brute force was the only rule and, as the years passed and one legal battle followed another, it became a more regulated field.

Today, there are legal and ethical standards that every company in this industry selling products based on web data should follow.

Without considering the most technical ones, the general rules are:

- No personally identifiable information (PII) should be scraped

- Data should be public (no data behind a login)

- Data must not be protected by copyright

- Scraping should not harm the target website

In the financial world, the Investment Data Standards Organization has drafted some best practices for web scraping (https://www.investmentdata.org/_files/ugd/c6ff57_a592a633d2b9446f8cb3539e1a77ac37.pdf).

Long story short: if today you have a business based on web-scraped data (competitive intelligence tools, sentiment analysis services, alternative data for the financial market), you need to play fair and follow the rules.

By doing so, you're adding value to the global economy, providing a product/service that doesn't damage others.

That's not what is happening with most AI companies. They're cannibalizing traffic from companies like StackOverflow, to mention just one clear case I covered in a recent post on The Web Scraping Club.

Image taken from a Reddit Discussion: https://www.reddit.com/r/ChatGPT/comments/15ai84c/chatgpt_was_trained_on_stackoverflow_data_and_is/

There's a pending lawsuit alleging that OpenAI used copyrighted data, and we'll see how it ends: https://www.theguardian.com/books/2023/jul/05/authors-file-a-lawsuit-against-openai-for-unlawfully-ingesting-their-books

But even ingesting blog posts or user-generated content to create an indirect competitor, without even mentioning the sources, may not be legally actionable, but it's dangerous for the content creator economy and the health of the web itself.

Paradoxically, the more damage is done to these businesses, such as lower revenue due to less traffic, the more the training of LLMs itself is in danger. If blogging or keeping StackOverflow alive no longer pays and these sites shut down, there will be less data available for the training itself.

But I want to close my reflection with some hope:

- Having a clear user agent to exclude from your websites is the first step toward bringing some order to this Wild West. Of course, for every declared user agent there will be 100 undeclared ones, or ones camouflaged as genuine Google Chrome requests.

- Perplexity's approach is a great one: for each query, you get links to the sources used. A revenue-sharing model between content providers and the AI products using their content could be a positive innovation for the web itself.
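On the first point, a small sketch can show why a declared user agent is only a first step. It assumes a site checks the User-Agent header of incoming requests against a list of declared crawler tokens; the `DECLARED_AI_CRAWLER_TOKENS` list, the `is_declared_ai_crawler` helper, and both sample headers are hypothetical and only for illustration.

```python
# Hypothetical list of declared crawler tokens; only GPTBot is assumed here.
DECLARED_AI_CRAWLER_TOKENS = ("GPTBot",)


def is_declared_ai_crawler(user_agent_header: str) -> bool:
    """Return True if the header contains one of the declared crawler tokens."""
    header = user_agent_header.lower()
    return any(token.lower() in header for token in DECLARED_AI_CRAWLER_TOKENS)


# Illustrative header of a crawler that identifies itself.
DECLARED = "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0)"

# Illustrative header of a scraper posing as a regular Chrome browser.
CAMOUFLAGED = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print(is_declared_ai_crawler(DECLARED))
    print(is_declared_ai_crawler(CAMOUFLAGED))
```

The check catches the crawler that declares itself and lets the camouflaged one through, which is exactly the limit of any opt-out based on user agents.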
