
【AI】data for training LLM v.s. Reddit (UGC)

Katherine Shih

Expertise in building legal practice eyeing IPO.

Published: March 16, 2024


Quick thoughts on Reddit (a UGC platform) going for an IPO, after reading the news (1, 2, 3):

Reddit's user data: US$0.37 per post. Prompt: "How much is the value of each post? (IPO valuation $6.4 billion - cost 'tens of millions') / 17 billion forum posts." 【Claude 3】 "...the value of each post works out to approximately $0.37."
The FTC is inquiring into how Reddit shares (licenses) user data.
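
The per-post figure can be reproduced with a short calculation. This is only a sketch: the valuation and post count come from the prompt above, while the cost values are assumptions, since "tens of millions" is not a precise number.

```python
# Figures taken from the prompt in the post.
IPO_VALUATION_USD = 6.4e9   # IPO valuation: $6.4 billion
FORUM_POSTS = 17e9          # 17 billion forum posts


def value_per_post(cost_usd: float) -> float:
    """Return (valuation - cost) / number of posts, in USD."""
    return (IPO_VALUATION_USD - cost_usd) / FORUM_POSTS


# ASSUMPTION: "tens of millions" is sampled at $10M, $50M and $90M
# purely for illustration; the source gives no exact cost.
for cost in (10e6, 50e6, 90e6):
    print(f"cost ${cost / 1e6:.0f}M -> ${value_per_post(cost):.4f} per post")
```

Across that assumed cost range the result stays between $0.37 and $0.38, so the exact cost barely moves the estimate.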
【My questions】

How large a license fee, collected from LLM developers (ChatGPT, Gemini), would be considered fair to share between Reddit and its users?
Does Reddit plan to share such license fees with its users?
In the future, will any UGC platform that wants to go for an IPO need to submit to the FTC a plan for sharing license fees with its users?
Suppose the LLM developers argue, "We only pay license fees for data of 'good quality'." Great: then define and disclose the data quality required to train an LLM.
A tweet shared a white paper (author: @georgejrjrjr) on data quality for building LLMs; the retweet by Michael Edward Johnson (@johnsonmxe) had gathered 13,000 views as of 2024-03-16.

Updates:

2024-Jun-01 TechCrunch

A few independent, not-for-profit efforts are creating massive datasets that anyone can use to train a generative AI model:

EleutherAI, a grassroots nonprofit research group, is working with the University of Toronto, AI2 and independent researchers to create The Pile v2, a set of billions of text passages primarily sourced from the public domain.
Hugging Face released FineWeb, a filtered version of the Common Crawl — the eponymous dataset maintained by the nonprofit Common Crawl, composed of billions upon billions of web pages — that Hugging Face claims improves model performance on many benchmarks.

2024-May-16 Reddit

OpenAI will access Reddit’s Data API to better understand and showcase Reddit content, especially on recent topics.
Reddit will be building on OpenAI’s platform of AI models to bring new AI-powered features to redditors and mods.
Lastly, OpenAI will become a Reddit advertising partner.

2024-Apr-25 TechCrunch

<Carv raises $10M Series A to help gamers monetize their data>

"Carv’s initial focus is on two key industries, gaming and AI, where it sees the biggest opportunity to help users control their data and monetize it. Users can choose to provide their data to Carv’s corporate customers in a way that preserves their privacy and is compliant with regulations, so that companies can use it for training AI models, market research and more."

"Carv offers three solutions: CARV Protocol, a modular data layer with cross-chain connectivity that connects web2 identities to web3 tokens; CARV Play, a cross-platform credentialing system and game distribution platform; and CARV’s AI Agent, CARA, a personalized gaming assistant that integrates with web3 wallets and can recommend games, activities and projects."

“Carv differentiates itself by putting data ownership and monetization rights in the hands of users. Any revenue generated from leveraging users’ data gets shared back with the data creators and themselves,” Yu said. “Additionally, we’ve created a unified user ID standard (ERC-7231) that bridges web2 and web3, enabling seamless data portability versus today’s siloed solutions.”

2024-Apr-13 TechCrunch

A startup, Vana, says it wants users to get paid for training data

"We think users should be able to bring their personal data from walled gardens, like Instagram, Facebook and Google, to your application, so you can create amazing personalized experience from the very first time a user interacts with your consumer AI application."

"Vana makes money by charging users a monthly subscription (starting at $3.99) and levying a “data transaction” fee on devs (e.g. for transferring data sets for AI model training)"

"This month, Vana launched what it’s calling the Reddit Data DAO (Digital Autonomous Organization), a program that pools multiple users’ Reddit data (including their karma and post history) and lets them decide together how that combined data is used."

"Then there’s the matter of how to fairly distribute payments that the DAO might receive from data buyers."

"Kazlauskas floats the idea that members of the DAO could choose to share their cross-platform and demographic data, making the DAO potentially more valuable and incentivizing sign-ups."

2024-Apr-06 Reuters

"Seattle-based Defined.ai licenses data to a range of companies including Google, Meta, Apple, Amazon and Microsoft ... $1 to $2 per image, $2 to $4 per short-form video and $100 to $300 per hour of longer films. The market rate for text is $0.001 per word. Images of nudity, which require the most sensitive handling, go for $5 to $7... Defined.ai splits those earnings with content providers."
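
To get a feel for these rates, here is a minimal sketch that turns them into a low/high estimate for a bundle of content. The rates are the ones quoted above; the function name `license_estimate`, the dictionary keys and the sample bundle are hypothetical and only for illustration.

```python
# Rates in USD as quoted from Reuters above, stored as (low, high) per unit.
RATES = {
    "image": (1.0, 2.0),
    "short_video": (2.0, 4.0),
    "long_film_hour": (100.0, 300.0),
    "text_word": (0.001, 0.001),
    "nudity_image": (5.0, 7.0),
}


def license_estimate(quantities: dict[str, float]) -> tuple[float, float]:
    """Return the (low, high) license value in USD for the given quantities."""
    low = sum(RATES[kind][0] * count for kind, count in quantities.items())
    high = sum(RATES[kind][1] * count for kind, count in quantities.items())
    return low, high


# ASSUMPTION: a hypothetical bundle of 1,000 words of text and 10 images.
low, high = license_estimate({"text_word": 1_000, "image": 10})
print(f"estimated license value: ${low:.2f} to ${high:.2f}")
```

The estimate covers gross license value only. The quote says Defined.ai splits earnings with content providers but does not give the split, so none is modeled here.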

Related articles:

【AI】web news for training a personal AI agent

【AI】data for training LLM v.s. Reddit (UGC)

【Creator Economy】Marketplace with AI tool to monetize

