The data that trains AI is under the spotlight — and even I’m weirded out

Welcome back to VentureBeat Weekly!

In this week’s AI Beat, I addressed the growing concerns — and discomfort — around how today’s large language models (LLMs) like ChatGPT are trained. When the Washington Post published a deep dive into Google’s C4 data set last week, the public got an eye-opening peek under the hood at the 15 million websites, including proprietary, personal and offensive ones, that went into the training data of high-profile LLMs like Google’s T5 and Meta’s LLaMA.

Plus:

  • As AI risk grows, Anthropic calls for NIST funding boost: ‘This is the year to be ambitious’
  • Google consolidates AI research labs into Google DeepMind to compete with OpenAI
  • Microsoft releases Copilot for Viva, as it continues to roll out generative AI to apps
  • RedPajama replicates LLaMA dataset to build open source, state-of-the-art LLMs

— Sharon Goldman, Senior Writer, VentureBeat

This is The AI Beat, one of VentureBeat’s newsletter offerings. Sign up here to get more stories like this in your inbox every week.


It is widely understood that today’s AI is hungry for data and that large language models (LLMs) are trained on massive unlabeled data sets.

But last week, the general public got a revealing peek under the hood of one of them, when the Washington Post published a deep dive into Google’s C4 data set, or the English Colossal Clean Crawled Corpus.

Working with researchers from the Allen Institute for AI, the publication uncovered the 15 million websites, including proprietary, personal and offensive ones, that went into the training data — which was used to train high-profile models like Google’s T5 and Meta’s LLaMA.

According to the article, the dataset was “dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence.”

The nonprofit Common Crawl, whose April 2019 web scrape underlies C4, told The Washington Post that it “tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.”
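For readers who want to poke around the corpus themselves, a cleaned copy of C4 is mirrored on Hugging Face under the name allenai/c4 (an assumption about availability, not something from the Post’s article). A minimal sketch, assuming the open source datasets library is installed, streams a few documents and prints the website each one came from:

    # A minimal sketch for inspecting C4, assuming the Hugging Face
    # "datasets" library and its "allenai/c4" mirror of the corpus.
    from datasets import load_dataset

    # Stream the English split so the multi-terabyte corpus is never
    # downloaded in full.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

    # Each record carries the document text plus the URL it was
    # scraped from; print the first few.
    for i, doc in enumerate(c4):
        print(doc["url"], "->", doc["text"][:60].replace("\n", " "))
        if i >= 4:
            break

Checking whether a given domain shows up in that url field is, in essence, what the Post’s search box does at scale.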

It shouldn’t come as a surprise, then, that a quick search of the websites in the dataset (offered in the article through a simple search box) showed that VentureBeat was well represented, with 10 million tokens (small bits of text used to process disorganized information — typically a word or phrase). But it was disconcerting to find that nearly every publication I’ve ever written for is, too — even the ones where I tried to sign favorable freelance contracts — and even my personal music website is part of the dataset.
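To make that token count concrete, here is a minimal sketch, assuming the Hugging Face transformers library (plus sentencepiece) and the tokenizer for T5, one of the models trained on C4, that splits a sentence into the units models actually count:

    # A minimal tokenization sketch, assuming the Hugging Face
    # "transformers" library; t5-small is chosen only because T5
    # was trained on C4.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("t5-small")

    sentence = "VentureBeat was well represented, with 10 million tokens."
    tokens = tokenizer.tokenize(sentence)

    # Each token is a word or sub-word fragment; dataset sizes are
    # measured in these units rather than in whole pages or articles.
    print(tokens)
    print(len(tokens), "tokens")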

Keep in mind, I’ve developed a thick skin when it comes to icky data-digging. I started writing about data analytics over 10 years ago for a magazine covering the direct marketing industry — a business that for decades had relied on mailing list brokers that sold or rented access to valuable datasets. I spent years covering the wild and woolly world of digital advertising technology, with its creepy “cookies” that allow brands to follow you all around the web. And it’s felt like eons since I discovered that the GPS in my car and my phone was gathering data to share with brands.

So I had to ask myself: Why did I feel so weirded out that my creative output has been sucked into the vacuum of AI datasets when so much of my life is already up for grabs?

Read the full story.

Read more from Sharon Goldman, Senior Writer, on VentureBeat.com.
