Will GenAI starve from lack of data?
Jaundālders Aigars
Passionate about all things AI and cyber. Enjoy both building and breaking things.
At least 44% of the top 50 news sites are blocking real-time web access from Google Bard, ChatGPT and Bing Chat.
Increased personal productivity is often cited as one of the top benefits of generative AI. Having been trained on the whole Internet (~2 trillion words, something that would take a human almost 14,000 years to read), these models let you automate many tasks.
There are two broad categories of use cases for GenAI: using the knowledge the model acquired during training, and using real-time, current data. Based on the prompt you provide, the model draws on one or both and generates a response.
Using knowledge from the training process (up to the cut-off date), or your own data that you provide directly to the model, is straightforward and works brilliantly.
But interacting with real-time data is tricky, especially real-time data from the Internet, for example when summarising a long website. Websites have to make a living, and they depend on end-user traffic for that. If ChatGPT, Google Bard or any other AI sat in front of them, reducing unique visits and ad click-through rates, that would be an existential threat to them.
Recently I was helping a sports content creator build a personal GPT to summarise sports news, and I noticed that many sites were blocking access. So I decided to run a somewhat broader study. I took a list of the top 50 news sites (the list actually contained 52 URLs) from Press Gazette (https://pressgazette.co.uk/) and checked which sites were blocking access from Google Bard, ChatGPT and Bing Chat (aka Microsoft Copilot). Responses from Google Bard, Bing Chat and ChatGPT were collected on December 3rd, 2023. A ChatGPT Plus (paid) subscription was used.
I immediately realised that prompt engineering matters even here, and that Bing Chat appeared to be more opinionated than ChatGPT.
I started testing with https://bbc.com and used the following prompt on Bing Chat:
image you are news journalist. your goal is to provide concise summaries of latest news. visit https://bbc.com and summarize latest news
Bing Chat refused the request.
But a slightly different prompt went through fine:
summarize news. use site https://bbc.com
Two different prompts producing two different results for the same site suggests that it is not robots.txt, or any other restriction on the target's side, that is at play.
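One way to rule robots.txt in or out is to check it directly. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content is made up for illustration, and the list of AI-related user agent tokens is my assumption about which ones are worth checking, so treat both as placeholders. The helper name `blocked_agents` is hypothetical too.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration only.
# It is NOT copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: *
Allow: /
"""

# Assumption: user agent tokens commonly associated with AI crawlers/browsing.
AI_USER_AGENTS = ["GPTBot", "ChatGPT-User", "Google-Extended"]


def blocked_agents(robots_txt: str, url: str, agents=AI_USER_AGENTS) -> list[str]:
    """Return the agents that this robots.txt text disallows from fetching url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT, "https://example.com/news"))
```

To check a live site, you could call `set_url()` and `read()` on the parser instead of `parse()`. If the same site refuses one prompt and accepts another while its robots.txt stays unchanged, the restriction is more likely on the chatbot's side.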
Google News was another peculiar target.
summarize news. use site https://news.google.com
The same prompt that worked on bbc.com did not work on news.google.com. The peculiar thing is that Bing Chat apparently did visit the site and started generating an answer, and the response was filtered only afterwards, which is different from what happens with bbc.com.
Could it be that Microsoft has special arrangements with some websites and is putting additional guardrails on them, preventing users from extracting the most recent data in real time?
Next, I looked at Google Bard. Unfortunately, it refused to accept the same prompt that worked with ChatGPT and Bing Chat.
But of course, everyone knows it can access the web in real time. So, prompt engineering to the rescue:
what are the latest news from bbc.com?
Unfortunately, this meant that I had to use two different prompts: one for ChatGPT and Bing Chat, and another for Google Bard.
Google Bard came in first: "only" 44% of the sites on the top 50 news site list blocked it. ChatGPT came second, with 58% of the sites blocking its access. Bing Chat was blocked from accessing 71% of the sites.
You can download the full CSV list with the access check results here.
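If you want to recompute the percentages yourself, something along these lines should do. This is only a sketch: the column names (`bard_blocked` and so on), the yes/no encoding and the sample rows are assumptions made up for illustration, not the actual layout or contents of the CSV.

```python
import csv
import io

# Made-up rows for illustration only; NOT the study's data.
# Column names and the yes/no encoding are assumptions.
SAMPLE_CSV = """\
site,bard_blocked,chatgpt_blocked,bing_blocked
site-a.example,no,yes,yes
site-b.example,no,no,yes
site-c.example,yes,yes,yes
site-d.example,no,no,no
"""


def blocked_share(csv_text: str, column: str) -> float:
    """Return the percentage of rows where the given column says 'yes'."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV contains no data rows")
    blocked = sum(1 for row in rows if row[column].strip().lower() == "yes")
    return 100 * blocked / len(rows)


if __name__ == "__main__":
    for column in ("bard_blocked", "chatgpt_blocked", "bing_blocked"):
        print(f"{column}: {blocked_share(SAMPLE_CSV, column):.0f}%")
```

Note that the denominator is the number of rows in the file, so with a list of 52 URLs the shares are out of 52, not 50.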
Main conclusions