The business model of scraping
Vaclav Vincalek
Technology entrepreneur, CTO and technology advisor for startups and fast-growing companies. Creating strategic options with technology.
Scraping: a word that rose to prominence thanks to AI.
Yet it has been around since the first search engine's bot visited the first website.
Ever since, hundreds of bots have kept visiting every site, all the time. For what purpose? Some belong to a search engine like Google or Bing. Some scan for a particular type of content, or for changes in that content.
A bot watching for changes in product prices or for new press releases would be one example. Others probe for website vulnerabilities, so they can either break into those sites or conscript them into a botnet.
Everyone has a reason to scrape. There is even an entire industry built around optimizing your website to be visited, scraped and indexed so that it ranks as high as possible in Google search.
While providing all that information for free and begging Google to visit you, you hope that in exchange people will visit your website, where they can either buy something or be served ads (conveniently provided by Google).
That business model has been working for decades now, and for Google it was so successful that it led to several lawsuits by the US Department of Justice (DOJ). It also led to the demise of the newspaper industry. True, media organizations complained about it, but they were really the ones who failed to see the Internet as a threat.
That was then, and now we are in a new era of scraping. Welcome to the world of AI.
After AI companies scraped the internet and started using the content to train their models, everyone was up in arms that these companies had stolen the content and were using it for unauthorized purposes.
The concern is that AI will reproduce and replace all creative work, with no attribution and no remuneration.
People creating original content are the most vocal about it. They're demanding a stop to the practice of scraping their content and asking that their work not be used to train any model going forward.
As you can imagine, understanding the technology's capabilities, controlling it and enforcing any restrictions on it is beyond the means of most people.
Fortunately, we have technology companies which can do this on their behalf, for a fee. To humor you: one way you, as a website owner, can try to keep your content from being scraped by these bots is the file robots.txt, which lets you state explicitly which parts of the website may be scraped and by which agent.
Whether a particular scraping agent observes these rules is entirely up to that agent; compliance is purely voluntary. It is like putting a sign in front of a bank that says, 'Please, don't take any money from this big pile.'
Cute.
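To make the honor system concrete, here is a minimal sketch of both sides of it, using Python's standard urllib.robotparser. The rules, the example.com URLs and the bot names are purely illustrative (GPTBot is the user agent OpenAI's crawler announces; SomeBot is made up).

```python
from urllib import robotparser

# Rules a site might publish at https://example.com/robots.txt:
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

# A polite crawler parses those rules before fetching anything.
rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/article"))    # False: told to stay out entirely
print(rp.can_fetch("SomeBot", "https://example.com/article"))   # True: only /private/ is off limits
print(rp.can_fetch("SomeBot", "https://example.com/private/x")) # False
```

Note where the check runs: entirely inside the crawler. The site never learns whether it was consulted, which is exactly why the sign in front of the bank is just a sign.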
Another self-defense mechanism doomed to fail is poisoning your content.
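To give a flavor of what poisoning can mean in practice, here is a crude, hypothetical sketch: a server that returns scrambled text to any request whose User-Agent looks like an AI crawler, and the real article to everyone else. The crawler names and the article text are placeholders, and real poisoning tools are considerably more subtle.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

AI_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot")  # user-agent substrings to match
ARTICLE = "Scraping has been around since the dawn of the first website."

def poison(text: str) -> str:
    """Shuffle the words so the page is useless as training data."""
    words = text.split()
    random.shuffle(words)
    return " ".join(words)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        # Suspected AI crawlers get the scrambled version.
        body = poison(ARTICLE) if any(bot in ua for bot in AI_CRAWLERS) else ARTICLE
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```

The flaw is the same one robots.txt has: a scraper that simply lies about its user agent gets the clean page like everyone else.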
The owners of the content and of the websites are now trying to work out, together with the developers of these AI models, what a new business model could look like.
So far, these attempts are just that: early attempts with no realistic outcomes.
A few such examples have been floated, but none of it will work. The tech is changing too fast, there are no production-ready (i.e. money-making) applications yet, and the way these AI models are trained will keep changing. Creating restrictions is a window-dressing exercise or, at best, a short-term monetary gain.
The recurrent pattern here? Once tech matures, the products will get defined and the business model will follow. And it will be a simple one. Right now, there is just too much noise.