The business model of scraping
Vaclav Vincalek
Technology entrepreneur, CTO and technology advisor for startups and fast-growing companies. Creating strategic options with technology.
Scraping: a word that rose to prominence thanks to AI.
Yet it has been around since the first search engine's bot visited the first website.
Ever since, hundreds of bots have kept visiting every site, all the time. For what purpose? Some belong to a search engine like Google or Bing. Some scan for a particular type of content, or for changes in that content.
A bot watching for changes in product prices or for new press releases would be one example. Others probe for website vulnerabilities, so they can either break into those sites or conscript them into a botnet.
Everyone has a reason to scrape. There is even an entire industry built around optimizing your website to be visited, scraped and indexed so that it ranks as high as possible in Google search.
While providing all that information for free and begging Google to visit you, you hope that in exchange people will visit your website, where they can either buy something or be served ads (conveniently provided by Google).
That business model has been working for decades now, and for Google it was so successful that it led to several lawsuits by the US Department of Justice (DOJ). It also led to the demise of the newspaper industry. True, media organizations complained about it, but they were really the ones who failed to see the Internet as a threat.
That was then, and now we are in a new era of scraping. Welcome to the world of AI.
After AI companies scraped the internet and started using the content to train their models, everyone was up in arms that these companies had stolen the content and were using it for unauthorized purposes.
The concern is that AI will reproduce and replace all creative work, with no attribution and no remuneration.
People creating original content are the most vocal about it. They're demanding a stop to the practice of scraping their content and asking that their work not be used to train any model going forward.
As you can imagine, understanding the technology's capabilities, controlling it and enforcing any restrictions on it is beyond the means of most people.
Fortunately, we have technology companies which can do this on their behalf, for a fee. To humor you: one way you, as a website owner, can try to keep your content from being scraped by these bots is the file robots.txt, which lets you state explicitly which parts of the website may be scraped and by which agent.
Whether a particular scraping agent observes these rules is entirely up to that agent; compliance is purely voluntary. It is like putting a sign in front of a bank that says, 'Please, don't take any money from this big pile.'
Cute.
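To make the honor system concrete, here is a minimal sketch of both sides of it, using Python's standard urllib.robotparser. The rules, the example.com URLs and the bot names are purely illustrative (GPTBot is the user agent OpenAI's crawler announces; SomeBot is made up).

```python
from urllib import robotparser

# Rules a site might publish at https://example.com/robots.txt:
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

# A polite crawler parses those rules before fetching anything.
rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/article"))    # False: told to stay out entirely
print(rp.can_fetch("SomeBot", "https://example.com/article"))   # True: only /private/ is off limits
print(rp.can_fetch("SomeBot", "https://example.com/private/x")) # False
```

Note where the check runs: entirely inside the crawler. The site never learns whether it was consulted, which is exactly why the sign in front of the bank is just a sign.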
Another self-defense mechanism doomed to fail is poisoning your content.
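To give a flavor of what poisoning can mean in practice, here is a crude, hypothetical sketch: a server that returns scrambled text to any request whose User-Agent looks like an AI crawler, and the real article to everyone else. The crawler names and the article text are placeholders, and real poisoning tools are considerably more subtle.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot")  # user-agent substrings to match
ARTICLE = "Scraping has been around since the dawn of the first website."

def poison(text: str) -> str:
    """Shuffle the words so the page is useless as training data."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        # Suspected AI crawlers get the scrambled version.
        body = poison(ARTICLE) if any(bot in ua for bot in AI_CRAWLERS) else ARTICLE
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```

The flaw is the same one robots.txt has: a scraper that simply lies about its user agent gets the clean page like everyone else.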
The owners of the content and of the websites are now trying to work out, together with the developers of these AI models, what a new business model could look like.
So far, these attempts are just that: early attempts with no realistic outcomes.
A few such examples have been floated, but none of it will work. The tech is changing too fast, there are no production-ready (i.e. money-making) applications yet, and the way these AI models are trained will keep changing. Creating restrictions is a window-dressing exercise or, at best, a short-term monetary gain.
The recurrent pattern here? Once tech matures, the products will get defined and the business model will follow. And it will be a simple one. Right now, there is just too much noise.