Is Your Content Being Used to Train the AI in Google Bard?
David Meerman Scott
Author of 12 books including NEW RULES OF MARKETING & PR and WSJ bestseller FANOCRACY | marketing & business growth speaker | advisor to emerging companies
The Washington Post released a fascinating analysis of how AI chatbots gather content on the public Web. The report, "Inside the secret list of websites that make AI like ChatGPT sound smart" (subscription required), is well worth reading.
I was especially interested to see that the analysis includes a tool to check whether your own website is included in Google’s C4 data set (Colossal Clean Crawled Corpus), a large collection of web text used to train large language models, the kind of AI behind ChatGPT and Google Bard.
The analysis ranked the roughly 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
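To make the ranking idea concrete, here is a minimal sketch of counting tokens per website and sorting sites by that count. It assumes a token is simply a whitespace-separated word, which is cruder than the tokenizers real AI systems use, and the function name `rank_sites_by_tokens` and the sample pages are hypothetical, not taken from the Washington Post analysis.

```python
from collections import Counter


def rank_sites_by_tokens(pages):
    """Rank domains by total token count.

    pages: iterable of (domain, text) pairs.
    Assumption: a "token" is a whitespace-separated word.
    """
    counts = Counter()
    for domain, text in pages:
        counts[domain] += len(text.split())
    # most_common() returns (domain, count) pairs, largest count first
    return counts.most_common()


# Hypothetical sample data for illustration only
sample_pages = [
    ("example.com", "AI chatbots learn from public web pages"),
    ("example.org", "Short post"),
    ("example.com", "Another page on the same site"),
]

ranking = rank_sites_by_tokens(sample_pages)
for domain, token_count in ranking:
    print(domain, token_count)
```

A site with many long pages contributes more tokens and therefore ranks higher, which is why large publishers tend to dominate lists like this.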
Business and industrial websites made up the biggest category of content in Google’s C4 data set (16 percent of categorized tokens). The data set also includes more than half a million personal blogs (3.8 percent of categorized tokens).
Many people are concerned that these AI models harvest their data. They see it as “stealing” because the content is used without attribution. As a writer, I can certainly understand that.
Let’s dig a little deeper
While the sources of data used in AI-generated results aren’t yet reported, I firmly believe that, over time, AI companies will list where the data in a specific response comes from.
Perhaps governments will require reporting. Perhaps there will eventually be a way for a website owner to opt in or opt out of having their data used. I suspect that soon, AI companies will volunteer the source of data used in a response.
Chatbots are the new search
No matter how it happens, being part of chat responses will become valuable, just like being at the top of search engine results is valuable today.
Two of my URLs are included in the Google C4 dataset: DavidMeermanScott.com (where my blog is hosted) and newsjacking.com. I already knew that both sites are also included in the ChatGPT dataset because when I enter specific queries, the resulting answers clearly pull from my content.
Today, companies are investing billions into surfacing content on search engines like Google via paid search ads and optimized content.
In the future, if your content is used to train AI and the chatbots include the websites accessed in their responses, that becomes a new way to generate attention for your content.
AI presents a new world with many opportunities! It’s fun to think about what’s coming next and to play around with what’s available now.
AI Automation-AAA - Writing Web3 - Business Consultant & More
1 year ago
How to design more in alignment with Web3? Had an idea last year about creating articles as NFTs with smart contracts. The best of all worlds will always be both evergreen and the long term royalties (recurring revenue) associated. (Why I built a phone company!) Everything at this moment is up for grabs as far as models and designs. Do we stick with centralized situations that drain everyone’s personal IP and creativity or get out of - design new boxes with the intent of proper distribution? Think big, think different. Everything is up for reinvention. So, let’s reinvent! We have to let go of to discover what’s possible and waiting.
I help coaches & consultants attract better clients, & build brand authority | Marketer | Consultant to Emmy Nominated TV producer | Your 9-step business growth system (DM me) | Host: Productive Insights Podcast
1 year ago
Very interesting insight, David. Makes perfect sense to me.
Helping mid-sized organizations increase sales and improve customer service since 1993 | #LinkedInLocal
1 year ago
Unless your content has been upvoted at least 3 times on Reddit (in which case OpenAI will place that content in a private dataset called WebText2), the primary source of crawled data in ChatGPT is Common Crawl, a public dataset that crawls the entire internet (like Google, Bing, etc.). If you want to opt out of new datasets created by the Common Crawl bot (called CCBot), you can do so by adding this to your robots.txt file:
User-agent: CCBot
Disallow: /
I found this info (and quite a lot more!) about the sources of data used by ChatGPT here: How to Block ChatGPT From Using Your Website Content (2023.02.02) https://www.searchenginejournal.com/how-to-block-chatgpt-from-using-your-website-content/478384/#close
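If you add the rule from the comment above, a quick way to sanity-check it is to parse the robots.txt text and ask whether a given crawler may fetch a page. The sketch below uses Python's standard `urllib.robotparser` module; the helper name `is_blocked` and the example.com URL are assumptions for illustration. Note that robots.txt is only a request, so this confirms what the file says, not what any crawler actually does.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, user_agent, url):
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# The rule quoted in the comment above
robots_txt = """User-agent: CCBot
Disallow: /
"""

# Hypothetical page URL for illustration
page = "https://example.com/blog/post"

blocked_ccbot = is_blocked(robots_txt, "CCBot", page)
blocked_other = is_blocked(robots_txt, "Googlebot", page)

print("CCBot blocked:", blocked_ccbot)
print("Googlebot blocked:", blocked_other)
```

Because the rule names only CCBot, other crawlers such as search engine bots remain allowed, so blocking the Common Crawl bot this way should not affect search visibility.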
Founder MakeMEDIA | 3 exits to public firms | I make SEO easy for B2B | Podcast Host
1 year ago
David, I found my website morebusiness.com in that list as well. I'm not sure whether to be flattered that they consider my content useful for training their AI model for responses or frustrated because they don't openly cite the source in the response (unless asked by the user).
Sales Success Strategist/ Creator of 'The Alignment Marketing Formula' - Teams Make More Sales, Enjoy More Satisfaction and Achieve More Success! ^ TEDx and Keynote Presenter ^ Author of 5 #1 Global Best-SellerSales
1 year ago
"No matter how it happens, being part of chat responses will become valuable, just like being at the top of search engine results are valuable today." - such a positive perspective. Thank you!