The data that trains AI is under the spotlight — and even I’m weirded out
VentureBeat
VB is obsessed with transformative technology — including exhaustive coverage of AI and the gaming industry.
Welcome back to VentureBeat Weekly!
In this week’s AI Beat, I addressed the growing concerns (and discomfort) around how today’s large language models (LLMs) like ChatGPT are trained. When the Washington Post published a deep dive into Google’s C4 data set last week, the public got an eye-opening peek under the hood at the 15 million websites, including proprietary, personal and offensive ones, that went into the training data of high-profile LLMs like Google’s T5 and Meta’s LLaMA.
— Sharon Goldman, Senior Writer, VentureBeat
This is The AI Beat, one of VentureBeat’s newsletter offerings. Sign up here to get more stories like this in your inbox every week.
It is widely understood that today’s AI is hungry for data and that large language models (LLMs) are trained on massive unlabeled data sets.
But last week, the general public got a revealing peek under the hood of one of them, when the Washington Post published a deep dive into Google’s C4 data set, or the English Colossal Clean Crawled Corpus.
Working with researchers from the Allen Institute for AI, the publication uncovered the 15 million websites, including proprietary, personal and offensive ones, that went into the training data used to train high-profile models like Google’s T5 and Meta’s LLaMA.
According to the article, the dataset was “dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence.”
The nonprofit CommonCrawl performed the web scrape behind C4 in April 2019. CommonCrawl told The Washington Post that it “tries to prioritize the most important and reputable sites, but does not try to avoid licensed or copyrighted content.”
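For the technically curious: the corpus is mirrored publicly, so you can inspect which websites it contains yourself. Here is a minimal sketch, assuming the Hugging Face datasets library and its allenai/c4 mirror (my assumption; the Post article offers its own search box instead):

```python
# A sketch of streaming C4 records to see which websites are in the corpus.
# Assumes the Hugging Face `datasets` library and its `allenai/c4` mirror;
# streaming avoids downloading the multi-hundred-gigabyte corpus up front.
import itertools

from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Each record carries the cleaned page text plus its source URL and timestamp.
for record in itertools.islice(c4, 5):
    print(record["url"])
```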
It shouldn’t come as a surprise, then, that a quick search of the websites in the dataset (offered in the article through a simple search box) showed that VentureBeat was well represented, with 10 million tokens (small bits of text, typically a word or phrase, that models use to process disorganized information). But it was disconcerting to find that nearly every publication I’ve ever written for is, too (even the ones where I tried to sign favorable freelance contracts), and that even my personal music website is part of the dataset.
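To make that token count concrete, here is a minimal sketch of how a tokenizer chops text into those small pieces, assuming the Hugging Face transformers library and the publicly released t5-small tokenizer (the sample sentence is mine, not pulled from the dataset):

```python
# A sketch of how text becomes "tokens," using T5's released tokenizer via
# the Hugging Face `transformers` library. The sentence is illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

text = "Nearly every publication I have ever written for is in the dataset."
tokens = tokenizer.tokenize(text)

print(tokens)       # subword pieces: whole words, word fragments, punctuation
print(len(tokens))  # counts like "10 million tokens" are sums of these pieces
```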
Keep in mind, I’ve developed a thick skin when it comes to icky data-digging. I started writing about data analytics over 10 years ago for a magazine covering the direct marketing industry, a business that for decades had relied on mailing list brokers that sold or rented access to valuable datasets. I spent years covering the wild and woolly world of digital advertising technology, with its creepy “cookies” that allow brands to follow you all around the web. And it’s felt like eons since I discovered that the GPS in my car and my phone was gathering data to share with brands.
So I had to ask myself: Why did I feel so weirded out that my creative output has been sucked into the vacuum of AI datasets when so much of my life is already up for grabs?
Read more from Sharon Goldman, Senior Writer, on VentureBeat.com.