Will GenAI starve from lack of data?

At least 44% of the top 50 news sites are blocking real-time web access from Google Bard, ChatGPT and Bing Chat.


Increased personal productivity is claimed as one of the top benefits of generative AI. Trained on the whole Internet (~2 trillion words, which would take a human almost 14,000 years to read), it lets you automate many tasks.

There are two broad categories of use cases for GenAI: using the knowledge the model acquired during training, and using real-time, current data. Based on the prompt you provide, the model draws on one or both and generates responses.

Using knowledge from the model's training process (up to its cut-off date), or your own data that you provide directly to the model, is straightforward and works brilliantly.

But interacting with real-time data is tricky, especially real-time data from the Internet: for example, summarising a long website. Websites have to make a living, and they depend on end-user traffic for that. If ChatGPT, Google Bard or any other AI were to front them, effectively decreasing unique visits and ad click-through ratios, it would be an existential threat to them.

Recently I was helping a sports content creator build a personal GPT to summarise sports news and noticed that many sites were blocking access. So I decided to perform a slightly broader study. I took a list of the top 50 news sites (the list actually contained 52 URLs) from Press Gazette (https://pressgazette.co.uk/) and checked which sites were blocking access from Google Bard, ChatGPT and Bing Chat (aka Microsoft Copilot). Responses from Google Bard, Bing Chat and ChatGPT were collected on December 3rd, 2023, using a ChatGPT Plus (paid) subscription.

I immediately realised that prompt engineering matters even here, and that Bing Chat appeared to be more opinionated than ChatGPT.

I started testing with https://bbc.com and used the following prompt on Bing Chat:

imagine you are a news journalist. your goal is to provide concise summaries of latest news. visit https://bbc.com and summarize latest news

Bing Chat came back negative:

[Screenshot: Bing Chat on bbc.com]

But phrasing it just a bit differently went through fine:

summarize news. use site https://bbc.com
[Screenshot: Bing Chat on bbc.com]

Two different prompts producing two different results suggests that it is not robots.txt or any other target-based restriction at play: robots.txt rules apply per crawler, not per prompt, so a pure robots.txt block would have rejected both prompts equally.
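For context, this is what a robots.txt-level block looks like: it names a crawler user agent and disallows paths, with no knowledge of prompts at all. A minimal sketch of checking such rules (the robots.txt content below is illustrative, not BBC's actual file; the user-agent tokens are the publicly documented ones for OpenAI's and Google's AI crawlers):

```python
from urllib import robotparser

# Illustrative robots.txt resembling what many news sites now publish:
# AI crawlers are disallowed site-wide, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

AI_AGENTS = ["GPTBot", "Google-Extended", "Bingbot"]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Which of the AI agents would be refused a given page?
blocked = [ua for ua in AI_AGENTS
           if not rp.can_fetch(ua, "https://example.com/news")]
print(blocked)  # → ['GPTBot', 'Google-Extended']
```

Because these rules are keyed only on the user agent, any restriction that reacts to the wording of a prompt has to live in the GenAI service itself, not in robots.txt.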

Google News was another peculiar target.

summarize news. use site https://news.google.com
[Screenshot: Bing Chat on news.google.com]

As you can see, the same prompt that worked on bbc.com did not work with news.google.com. The peculiar thing is that Bing Chat apparently did visit the site and started generating an answer, and the response was filtered only after that, which is different from the behaviour with bbc.com.

Could it be that Microsoft has special arrangements with some websites and puts additional guardrails on them, preventing users from extracting the most recent data in real time?

Next, I looked at Google Bard. Unfortunately, it refused to accept the same prompt that worked with ChatGPT and Bing Chat.

[Screenshot: Google Bard on bbc.com]

But of course, everyone knows it can access the web in real time. So, prompt engineering to the rescue:

what are the latest news from bbc.com?

Unfortunately, this meant I had to use two different prompts: one for ChatGPT and Bing Chat, and another for Google Bard.

Google Bard came in first, with "only" 44% of the sites on the top-50 list blocking it. ChatGPT came second, with 58% of the websites blocking its access. Bing Chat was restricted from accessing 71% of the sites.
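The percentages above are simple ratios over the access-check list. A minimal sketch of the tally, using a tiny illustrative excerpt and column names of my own invention (the real results file has 52 rows):

```python
import csv
import io

# Hypothetical excerpt of the access-check results; values and columns
# are illustrative, not taken from the downloadable CSV.
CSV_DATA = """\
site,bard,chatgpt,bing
bbc.com,ok,ok,blocked
news.google.com,ok,blocked,blocked
example-news.com,blocked,blocked,blocked
example-sports.com,ok,ok,ok
"""

rows = list(csv.DictReader(io.StringIO(CSV_DATA)))

# Share of sites (in %) that blocked each service.
share_blocked = {
    service: 100 * sum(r[service] == "blocked" for r in rows) / len(rows)
    for service in ("bard", "chatgpt", "bing")
}
for service, pct in share_blocked.items():
    print(f"{service}: {pct:.0f}% of sites blocked")
```

Running the same count over the full 52-URL file is what yields the 44% / 58% / 71% figures reported above.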

You can download the full CSV list with the access-check results here.


Main conclusions

  1. Generative AI is transformative: it puts the knowledge of the whole Internet at your fingertips. But it has limits, and knowing them lets you maximise your efficiency with GenAI. Real-time access to the public Internet is one example of such a limit.
  2. Content creators and publishers appear to be in a rush to make their own arrangements with GenAI vendors. There is no "standard" way of working at the moment. In addition to conventional robots.txt-based restrictions, GenAI vendor-led restrictions appear to be in place, because two different prompts on the same site produce two different results from the same GenAI service.
  3. For the time being, relying on real-time web access (as opposed to using your own data or retrieving from the knowledge a GenAI model acquired during training) is not reliable. Not only is a healthy amount of prompt engineering required to solicit responses, but you must also keep in mind that different GenAI services have access to different sets of websites. If you simply prompt for "recent news", you may get results, but behind the scenes they may not come from the set of news sources you expect. Check attributions!
  4. If the trend towards blocking GenAI services continues, the next GenAI models may have much less data for training. All major GenAI vendors seem to understand this very well, which explains the voluntary restrictions they are all applying. Most of them are working on using synthetic data to train models, but there are many challenges on that path. Unless GenAI vendors agree a drastically different business model with content creators and publishers, I expect they will trade away real-time access to websites (restricting it) in exchange for access to historical data for training the next models.
  5. For the time being, my recommendation is to focus GenAI implementations on improving user productivity and upskilling users, using your organisation's internal knowledge base (provided to the model directly) and the knowledge the model acquired during training.
