The data that trains AI is under the spotlight — and even I’m weirded out
VentureBeat
VB is obsessed with transformative technology — including exhaustive coverage of AI and the gaming industry.
Welcome back to VentureBeat Weekly!
In this week’s AI Beat, I addressed the growing concerns (and discomfort) around how today’s large language models (LLMs) like ChatGPT are trained. When the Washington Post published a deep dive into Google’s C4 data set last week, the public got an eye-opening peek under the hood at the 15 million websites, including proprietary, personal and offensive ones, that went into the training data of high-profile LLMs like Google’s T5 and Meta’s LLaMA.
— Sharon Goldman, Senior Writer, VentureBeat
This is The AI Beat, one of VentureBeat’s newsletter offerings. Sign up here to get more stories like this in your inbox every week.
It is widely understood that today’s AI is hungry for data and that large language models (LLMs) are trained on massive unlabeled data sets.
But last week, the general public got a revealing peek under the hood of one of them, when the Washington Post published a deep dive into Google’s C4 data set, or the English Colossal Clean Crawled Corpus.
Working with researchers from the Allen Institute for AI, the publication uncovered the 15 million websites, including proprietary, personal and offensive ones, that went into the training data used to train high-profile models like Google’s T5 and Meta’s LLaMA.
According to the article, the dataset was “dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence.”
The nonprofit CommonCrawl performed the web scrape behind C4 in April 2019. CommonCrawl told The Washington Post that it “tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.”
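For the technically curious: the corpus is mirrored publicly, so you can inspect which websites it contains yourself. Here is a minimal sketch, assuming the Hugging Face datasets library and its allenai/c4 mirror (my assumption; the Post article offers its own search box instead):

```python
# A sketch of streaming C4 records to see which websites are in the corpus.
# Assumes the Hugging Face `datasets` library and its `allenai/c4` mirror;
# streaming avoids downloading the multi-hundred-gigabyte corpus up front.
import itertools

from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Each record carries the cleaned page text plus its source URL and timestamp.
for record in itertools.islice(c4, 5):
    print(record["url"])
```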
It shouldn’t come as a surprise, then, that a quick search of the websites in the dataset (offered in the article through a simple search box) showed that VentureBeat was well represented, with 10 million tokens (small bits of text, typically a word or phrase, that models use to process disorganized information). But it was disconcerting to find that nearly every publication I’ve ever written for is, too (even the ones where I tried to sign favorable freelance contracts), and that even my personal music website is part of the dataset.
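To make that token count concrete, here is a minimal sketch of how a tokenizer chops text into those small pieces, assuming the Hugging Face transformers library and the publicly released t5-small tokenizer (the sample sentence is mine, not pulled from the dataset):

```python
# A sketch of how text becomes "tokens," using T5's released tokenizer via
# the Hugging Face `transformers` library. The sentence is illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

text = "Nearly every publication I have ever written for is in the dataset."
tokens = tokenizer.tokenize(text)

print(tokens)       # subword pieces: whole words, word fragments, punctuation
print(len(tokens))  # counts like "10 million tokens" are sums of these pieces
```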
Keep in mind, I’ve developed a thick skin when it comes to icky data-digging. I started writing about data analytics over 10 years ago for a magazine covering the direct marketing industry, a business that for decades had relied on mailing list brokers that sold or rented access to valuable datasets. I spent years covering the wild and woolly world of digital advertising technology, with its creepy “cookies” that allow brands to follow you all around the web. And it’s felt like eons since I discovered that the GPS in my car and my phone was gathering data to share with brands.
So I had to ask myself: Why did I feel so weirded out that my creative output has been sucked into the vacuum of AI datasets when so much of my life is already up for grabs?
Read more from Sharon Goldman, Senior Writer, on VentureBeat.com.