Is Your Content Being Used to Train the AI in Google Bard?

David Meerman Scott

Author of 12 books including NEW RULES OF MARKETING & PR and WSJ bestseller FANOCRACY | marketing & business growth speaker | advisor to emerging companies

Published: April 25, 2023

The Washington Post released an analysis of how AI chatbots gather content on the public Web. The report, "Inside the secret list of websites that make AI like ChatGPT sound smart" (subscription required), is a fascinating read.

I was especially interested to see that the analysis includes a tool to check if your own website data is included in Google’s C4 data set (Colossal Clean Crawled Corpus), a body of text used to train large language models, the kind of technology behind ChatGPT and Google Bard.

The analysis ranked roughly 10 million websites by how many “tokens” from each appeared in the data set. Tokens are small bits of text, typically a word or phrase, used to process disorganized information.
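To make the ranking idea concrete, here is a minimal sketch in Python. It is not the Washington Post's method: it assumes each whitespace-separated word counts as one token, and the function names and sample pages are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def count_tokens(text: str) -> int:
    # Simplifying assumption: one whitespace-separated word = one token.
    # Real tokenizers often split text into smaller pieces than whole words.
    return len(text.split())


def rank_sites_by_tokens(documents):
    """Take (url, text) pairs and return (domain, token_count) pairs, largest first."""
    totals = Counter()
    for url, text in documents:
        domain = urlparse(url).netloc.lower()
        totals[domain] += count_tokens(text)
    return totals.most_common()


# Hypothetical sample pages, for illustration only.
sample = [
    ("https://example.com/a", "one two three"),
    ("https://example.org/b", "one two"),
    ("https://example.com/c", "four five"),
]
ranking = rank_sites_by_tokens(sample)
print(ranking)
```

A site with many long pages ends up near the top of such a list simply because it contributes more text.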

Business and industrial websites made up the biggest category of content in Google’s C4 data set (16 percent of categorized tokens). The data set also includes more than half a million personal blogs (3.8 percent of categorized tokens).

Many people are concerned that these AI models harvest their data. They see it as “stealing” because the content is used without attribution. As a writer, I can certainly understand that.

Let’s dig a little deeper

While the sources of data used in AI-generated results aren’t yet reported, I firmly believe that over time AI companies will list where the data in a specific response comes from.

Perhaps governments will require reporting. Perhaps there will eventually be a way for a website owner to opt in or opt out of having their data used. I suspect that soon, AI companies will volunteer the source of data used in a response.

Chatbots are the new search

No matter how it happens, being part of chat responses will become valuable, just like being at the top of search engine results is valuable today.

Two of my URLs are included in the Google C4 dataset: DavidMeermanScott.com (where my blog is hosted) and newsjacking.com. I already knew that both sites are also included in the ChatGPT dataset because when I enter specific queries, the resulting answers clearly pull from my content.

Today, companies are investing billions into surfacing content on search engines like Google via paid search ads and optimized content.

In the future, if your content is used to train AI and the chatbots include the websites accessed in their responses, that becomes a new way to generate attention for your content.

AI presents a new world with many opportunities! It’s fun to think about what’s coming next and to play around with what’s available now.

The New Rules of Marketing

8,708 followers

Anthony Toigo

AI Automation-AAA - Writing Web3 - Business Consultant & More

1 year ago

How to design more in alignment with Web3? Had an idea last year about creating articles as NFTs with smart contracts. The best of all worlds will always be both evergreen and the long term royalties (recurring revenue) associated. (Why I built a phone company!) Everything at this moment is up for grabs as far as models and designs. Do we stick with centralized situations that drain everyone’s personal IP and creativity or get out of - design new boxes with the intent of proper distribution? Think big, think different. Everything is up for reinvention. So, let’s reinvent! We have to let go of to discover what’s possible and waiting.

Ash Roy CPA MBA

I help coaches & consultants attract better clients, & build brand authority | Marketer | Consultant to Emmy Nominated TV producer | Your 9-step business growth system (DM me) | Host: Productive Insights Podcast

1 year ago

Very interesting insight David. Makes perfect sense to me.

John Marrett

Helping mid-sized organizations increase sales and improve customer service since 1993 | #LinkedInLocal

1 year ago

Unless your content has been upvoted at least 3 times on Reddit (in which case OpenAI will place that content in a private dataset called WebText2), the primary source of crawled data in ChatGPT is Common Crawl, a public dataset that crawls the entire internet (like Google, Bing, etc.). If you want to opt out of new datasets created by the Common Crawl bot (called CCBot), you can do so by adding this to your robots.txt file:

    User-agent: CCBot
    Disallow: /

I found this info (and quite a lot more!) about the sources of data used by ChatGPT here: How to Block ChatGPT From Using Your Website Content (2023.02.02) https://www.searchenginejournal.com/how-to-block-chatgpt-from-using-your-website-content/478384/#close
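One way to check the effect of the robots.txt rule quoted in the comment above is Python's standard `urllib.robotparser` module. This is only a sketch: the helper name `is_blocked` and the example.com URL are hypothetical, and the rule only matters for crawlers that choose to honor robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# The two-line rule from the comment, as it would appear in a robots.txt file.
rules = """User-agent: CCBot
Disallow: /
"""

# Hypothetical page URL, for illustration only.
page = "https://example.com/blog/post"
ccbot_blocked = is_blocked(rules, "CCBot", page)
other_blocked = is_blocked(rules, "Googlebot", page)
print(ccbot_blocked, other_blocked)
```

Because the rule names only CCBot, other crawlers such as search engine bots are unaffected by it.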

Raj Khera

Founder MakeMEDIA | 3 exits to public firms | I make SEO easy for B2B | Podcast Host

1 year ago

David, I found my website morebusiness.com in that list as well. I'm not sure whether to be flattered that they consider my content useful for training their AI model for responses or frustrated because they don't openly cite the source in the response (unless asked by the user).

Stacey Hall

Sales Success Strategist/ Creator of 'The Alignment Marketing Formula' - Teams Make More Sales, Enjoy More Satisfaction and Achieve More Success! ^ TEDx and Keynote Presenter ^ Author of 5 #1 Global Best-SellerSales

1 year ago

"No matter how it happens, being part of chat responses will become valuable, just like being at the top of search engine results are valuable today." - such a positive perspective. Thank you!
