Will GenAI starve from lack of data?

At least 44% of the top 50 news sites are blocking real-time web access from Google Bard, ChatGPT and Bing Chat.


Increased personal productivity is claimed as one of the top benefits of generative AI. Trained on the whole Internet (~2 trillion words, which would take a human almost 14,000 years to read), it lets you automate many tasks.

There are two broad categories of use cases for GenAI: using the knowledge the model acquired during training, and using real-time, current data. Based on the prompt you provide, the model draws on one or both and generates responses.

Using knowledge from the model's training process (up to its cut-off date), or your own data that you provide directly to the model, is straightforward and works brilliantly.

But interacting with real-time data is tricky, especially real-time data from the Internet: for example, summarising a long website. Websites have to make a living, and they depend on end-user traffic for that. If ChatGPT, Google Bard or any other AI were to front them, effectively decreasing unique visits and ad click-through ratios, it would be an existential threat to them.

Recently I was helping a sports content creator build a personal GPT to summarise sports news and noticed that many sites were blocking access. So I decided to perform a slightly broader study. I took a list of the top 50 news sites (the list actually contained 52 URLs) from Press Gazette (https://pressgazette.co.uk/) and checked which sites were blocking access from Google Bard, ChatGPT and Bing Chat (aka Microsoft Copilot). Responses from Google Bard, Bing Chat and ChatGPT were collected on December 3rd, 2023, using a ChatGPT Plus (paid) subscription.

I immediately realised that prompt engineering matters even here, and that Bing Chat appeared to be more opinionated than ChatGPT.

I started testing with https://bbc.com and used the following prompt on Bing Chat:

imagine you are a news journalist. your goal is to provide concise summaries of latest news. visit https://bbc.com and summarize latest news

Bing Chat came back negative:

[Screenshot: Bing Chat on bbc.com]

But phrasing it just a bit differently went through fine:

summarize news. use site https://bbc.com
[Screenshot: Bing Chat on bbc.com]

Two different prompts producing two different results suggests that it is not robots.txt or any other target-based restriction at play: robots.txt rules apply per crawler, not per prompt, so a pure robots.txt block would have rejected both prompts equally.
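For context, this is what a robots.txt-level block looks like: it names a crawler user agent and disallows paths, with no knowledge of prompts at all. A minimal sketch of checking such rules (the robots.txt content below is illustrative, not BBC's actual file; the user-agent tokens are the publicly documented ones for OpenAI's and Google's AI crawlers):

```python
from urllib import robotparser

# Illustrative robots.txt resembling what many news sites now publish:
# AI crawlers are disallowed site-wide, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

AI_AGENTS = ["GPTBot", "Google-Extended", "Bingbot"]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Which of the AI agents would be refused a given page?
blocked = [ua for ua in AI_AGENTS
           if not rp.can_fetch(ua, "https://example.com/news")]
print(blocked)  # → ['GPTBot', 'Google-Extended']
```

Because these rules are keyed only on the user agent, any restriction that reacts to the wording of a prompt has to live in the GenAI service itself, not in robots.txt.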

Google News was another peculiar target.

summarize news. use site https://news.google.com
[Screenshot: Bing Chat on news.google.com]

As you can see, the same prompt that worked on bbc.com did not work with news.google.com. The peculiar thing is that Bing Chat apparently did visit the site and started generating an answer, and the response was filtered only after that, which is different from the behaviour with bbc.com.

Could it be that Microsoft has special arrangements with some websites and puts additional guardrails on them, preventing users from extracting the most recent data in real time?

Next, I looked at Google Bard. Unfortunately, it refused to accept the same prompt that worked with ChatGPT and Bing Chat.

[Screenshot: Google Bard on bbc.com]

But of course, everyone knows it can access the web in real time. So, prompt engineering to the rescue:

what are the latest news from bbc.com?

Unfortunately, this meant I had to use two different prompts: one for ChatGPT and Bing Chat, and another for Google Bard.

Google Bard came in first, with "only" 44% of the sites on the top-50 list blocking it. ChatGPT came second, with 58% of the websites blocking its access. Bing Chat was restricted from accessing 71% of the sites.
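The percentages above are simple ratios over the access-check list. A minimal sketch of the tally, using a tiny illustrative excerpt and column names of my own invention (the real results file has 52 rows):

```python
import csv
import io

# Hypothetical excerpt of the access-check results; values and columns
# are illustrative, not taken from the downloadable CSV.
CSV_DATA = """\
site,bard,chatgpt,bing
bbc.com,ok,ok,blocked
news.google.com,ok,blocked,blocked
example-news.com,blocked,blocked,blocked
example-sports.com,ok,ok,ok
"""

rows = list(csv.DictReader(io.StringIO(CSV_DATA)))

# Share of sites (in %) that blocked each service.
share_blocked = {
    service: 100 * sum(r[service] == "blocked" for r in rows) / len(rows)
    for service in ("bard", "chatgpt", "bing")
}
for service, pct in share_blocked.items():
    print(f"{service}: {pct:.0f}% of sites blocked")
```

Running the same count over the full 52-URL file is what yields the 44% / 58% / 71% figures reported above.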

You can download the full CSV list with the access-check results here.


Main conclusions

  1. Generative AI is transformative: it puts the knowledge of the whole Internet at your fingertips. But it has limits, and knowing them lets you maximise your efficiency with GenAI. Real-time access to the public Internet is one example of such a limit.
  2. Content creators and publishers appear to be in a rush to make their own arrangements with GenAI vendors. There is no "standard" way of working at the moment. In addition to conventional robots.txt-based restrictions, GenAI vendor-led restrictions appear to be in place, because two different prompts on the same site produce two different results from the same GenAI service.
  3. For the time being, relying on real-time web access (as opposed to using your own data or retrieving from the knowledge a GenAI model acquired during training) is not reliable. Not only is a healthy amount of prompt engineering required to solicit responses, but you must also keep in mind that different GenAI services have access to different sets of websites. If you simply prompt for "recent news", you may get results, but behind the scenes they may not come from the set of news sources you expect. Check attributions!
  4. If the trend towards blocking GenAI services continues, the next GenAI models may have much less data for training. All major GenAI vendors seem to understand this very well, which explains the voluntary restrictions they are all applying. Most of them are working on using synthetic data to train models, but there are many challenges on that path. Unless GenAI vendors agree a drastically different business model with content creators and publishers, I expect they will trade away real-time access to websites (restricting it) in exchange for access to historical data for training the next models.
  5. For the time being, my recommendation is to focus GenAI implementations on improving user productivity and upskilling users, using your organisation's internal knowledge base (provided to the model directly) and the knowledge the model acquired during training.
