The new OpenAI User Agent and its consequences
Pierluigi Vinciguerra
Co-Founder and CTO at Databoutique.com | Writing on The Web Scraping Club
The latest post by Gergely Orosz of The Pragmatic Engineer puts the spotlight on one of the major concerns about the spread of AI and LLMs: the copyright of the data used in their training.
In essence, he blocked the newly introduced OpenAI user agent in his robots.txt, so his blog will no longer be used for training GPT models. By scraping his content, OpenAI had effectively cut the bond between Gergely and his audience, serving his writing without any citation back to his blog and, on top of that, charging users for it.
Here's the link to the original post: https://www.dhirubhai.net/posts/gergelyorosz_i-updated-my-blogs-robotstxt-to-opt-out-activity-7094762821527171072-8DYn
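For reference, the opt-out itself takes just two lines in a site's robots.txt: OpenAI's crawler identifies itself with the GPTBot user agent token, and a blanket Disallow rule excludes it from the whole site.

```
User-agent: GPTBot
Disallow: /
```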
I totally agree with what he wrote.
It seems like AI companies treated blogs, social media platforms, and other companies like orchards: they came in at night and took fruit without telling anyone. In fact, only now, and only OpenAI, has created a user agent you can opt out of. And what was meant to be a public canteen for the benefit of the community (a non-profit organization) turned out to be a marmalade shop (a for-profit organization).
Metaphors aside, web scraping practices are as old as the web itself. It started as a Wild West where brute force was the only rule and, with the years passing and legal battle after legal battle, it has become a more regulated field.
Today, there are legal and ethical standards that every company in this industry selling products based on web data should follow.
Leaving aside the most technical ones, the general rules are:
- No personally identifiable information should be scraped
- Data should be public (no data behind a login)
- Data must not be protected by copyright
- Scraping should not harm the target website (see the sketch after this list)
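None of these rules requires exotic tooling. As a concrete illustration of the last point, here's a minimal sketch of a "polite" fetch loop in Python: it declares an honest user agent, checks robots.txt before fetching, and rate-limits itself so it can't harm the target site. The user agent string, delay, and URL are illustrative assumptions, not real endpoints.

```python
# A minimal "polite" fetch loop: honors robots.txt and rate-limits
# requests so the target site is not harmed.
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

USER_AGENT = "MyPoliteScraper/1.0"  # declare who you are (placeholder name)
CRAWL_DELAY = 5                     # seconds between requests (assumption)

def allowed(url: str) -> bool:
    """Check the site's robots.txt before fetching anything."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

def polite_fetch(urls):
    for url in urls:
        if not allowed(url):
            print(f"robots.txt disallows {url}, skipping")
            continue
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req) as resp:
            print(url, resp.status)
        time.sleep(CRAWL_DELAY)  # don't hammer the server

if __name__ == "__main__":
    polite_fetch(["https://example.com/public-page"])  # illustrative URL
```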
In the financial world, the Investment Data Standards Organization has drafted some best practices for web scraping (https://www.investmentdata.org/_files/ugd/c6ff57_a592a633d2b9446f8cb3539e1a77ac37.pdf).
Long story short: if today you run a business based on web-scraped data (competitive intelligence tools, sentiment analysis services, alternative data for financial markets), you need to play fair and follow the rules.
By doing so, you're adding value to the global economy, providing a product/service that doesn't damage others.
That's not what's actually happening with most AI companies. They're cannibalizing traffic from companies like StackOverflow, to mention one clear use case I covered in one of the latest posts on The Web Scraping Club.
There's already a court case in which OpenAI is alleged to have used copyrighted data, and we'll see how it ends: https://www.theguardian.com/books/2023/jul/05/authors-file-a-lawsuit-against-openai-for-unlawfully-ingesting-their-books
But even if ingesting blog posts or user-generated content to build an indirect competitor, without even mentioning the sources, may not be legally actionable, it's dangerous for the content creator economy and for the health of the web itself.
Paradoxically, the more damage is done to these businesses (such as lower revenues due to reduced traffic), the more the training of LLMs itself is put in danger. If blogging or keeping StackOverflow alive is no longer remunerative and these sites shut down, there will be less data available for training in the first place.
But I want to close my reflection with some hope:
- Having a clear user agent to exclude from your website is the first step toward bringing some order to this Wild West. Of course, for every declared user agent there will be a hundred undeclared ones, or ones camouflaged as genuine Google Chrome requests (see the sketch after this list for one way to spot those).
- Perplexity's approach is a great one: for each query, you get links to the sources used. A revenue-sharing model between content providers and the AI consuming their content could be a positive innovation for the web itself.
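On the first point: blocking a declared user agent is trivial, but camouflaged bots are not. One common defense is to verify that a request claiming to be a known crawler actually originates from that crawler's published IP ranges. A minimal sketch follows; the CIDR blocks are RFC 5737 documentation placeholders, not OpenAI's real published ranges, which you'd need to fetch from their official documentation.

```python
# A sketch of verifying that a request *claiming* to be GPTBot really is:
# the declared user agent must match AND the source IP must fall inside
# the crawler's published IP ranges. The ranges below are documentation
# placeholders (an assumption for illustration), not OpenAI's real list.
import ipaddress

GPTBOT_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # placeholder range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder range
]

def is_genuine_gptbot(user_agent: str, remote_ip: str) -> bool:
    """True only if the UA declares GPTBot and the IP matches a known range."""
    if "GPTBot" not in user_agent:
        return False
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in net for net in GPTBOT_RANGES)

# A camouflaged scraper: claims to be GPTBot but comes from an unknown IP.
print(is_genuine_gptbot("GPTBot/1.0", "203.0.113.7"))  # False
# A (placeholder) genuine hit: declared UA from a listed range.
print(is_genuine_gptbot("GPTBot/1.0", "192.0.2.10"))   # True
```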