【AI】data for training LLM vs. Reddit (UGC)
image generated by Stylar.ai


Quick thoughts on Reddit (a UGC platform) going public, after reading the news (1, 2, 3):


  1. Reddit's user data: US$0.37 per post. Prompt: "How much is the value of each post? (IPO valuation $6.4 billion - cost "tens of millions") / 17 billion forum posts." 【Claude 3】: "...the value of each post works out to approximately $0.37."
  2. The FTC is asking about Reddit's sharing (licensing) of user data.
  3. 【My questions】

  • What license fee, collected from LLM providers (ChatGPT, Gemini), would be considered fair, given that it is to be shared between Reddit and its users?
  • Does Reddit plan to share such license fees with its users?
  • In the future, will any UGC platform that wants to go public need to submit to the FTC a plan for sharing license fees with its users?
  • Suppose LLM providers argue, "We only pay license fees for data of 'good quality'." Great: then define and disclose the data quality required to train an LLM.
  • A tweet sharing the white paper (author: @georgejrjrjr) on data quality for building LLMs had gathered 13,000 views (as of 2024-03-16) through the retweet by Michael Edward Johnson (@johnsonmxe).
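
The per-post figure in item 1 can be checked with a few lines of arithmetic. This is only a rough sketch: the valuation and post count come from the prompt above, but the cost is given only as "tens of millions", so the cost values tried below are assumptions, and the function name `value_per_post` is made up for illustration.

```python
# Back-of-envelope check of the per-post figure in item 1.
# Valuation and post count come from the text; the cost values are
# assumptions, because the source only says "tens of millions".

IPO_VALUATION_USD = 6.4e9   # IPO valuation cited in the text
TOTAL_POSTS = 17e9          # forum posts cited in the text


def value_per_post(valuation_usd: float, cost_usd: float, posts: float) -> float:
    """Return (valuation - cost) / posts, in US dollars per post."""
    if posts <= 0:
        raise ValueError("posts must be positive")
    return (valuation_usd - cost_usd) / posts


# Hypothetical cost values spanning "tens of millions".
for assumed_cost in (10e6, 50e6, 90e6):
    per_post = value_per_post(IPO_VALUATION_USD, assumed_cost, TOTAL_POSTS)
    print(f"cost ${assumed_cost / 1e6:.0f}M -> ${per_post:.4f} per post")
```

Across that assumed cost range the result stays between about $0.37 and $0.38 per post, so the quoted figure is not very sensitive to the exact cost.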


Updates:


2024-Jun-01 TechCrunch

<AI training data has a price tag that only Big Tech can afford>

A few independent, not-for-profit efforts aim to create massive datasets that anyone can use to train a generative AI model:

  • EleutherAI, a grassroots nonprofit research group, is working with the University of Toronto, AI2 and independent researchers to create The Pile v2, a set of billions of text passages primarily sourced from the public domain.
  • Hugging Face released FineWeb, a filtered version of the Common Crawl — the eponymous dataset maintained by the nonprofit Common Crawl, composed of billions upon billions of web pages — that Hugging Face claims improves model performance on many benchmarks.


2024-May-16 Reddit

<Reddit and OpenAI Build Partnership>

  • OpenAI will access Reddit’s Data API to better understand and showcase Reddit content, especially on recent topics.
  • Reddit will be building on OpenAI’s platform of AI models to bring new AI-powered features to redditors and mods.
  • Lastly, OpenAI will become a Reddit advertising partner.


2024-Apr-25 TechCrunch

<Carv raises $10M Series A to help gamers monetize their data>

"Carv’s initial focus is on two key industries, gaming and AI, where it sees the biggest opportunity to help users control their data and monetize it. Users can choose to provide their data to Carv’s corporate customers in a way that preserves their privacy and is compliant with regulations, so that companies can use it for training AI models, market research and more."

"Carv offers three solutions: CARV Protocol, a modular data layer with cross-chain connectivity that connects web2 identities to web3 tokens; CARV Play, a cross-platform credentialing system and game distribution platform; and CARV’s AI Agent, CARA, a personalized gaming assistant that integrates with web3 wallets and can recommend games, activities and projects."

"Carv differentiates itself by putting data ownership and monetization rights in the hands of users. Any revenue generated from leveraging users’ data gets shared back with the data creators and themselves," Yu said. "Additionally, we’ve created a unified user ID standard (ERC-7231) that bridges web2 and web3, enabling seamless data portability versus today’s siloed solutions."


2024-Apr-13 TechCrunch

<Vana plans to let users rent out their Reddit data to train AI>

A startup, Vana, says it wants users to get paid for training data.

"We think users should be able to bring their personal data from walled gardens, like Instagram, Facebook and Google, to your application, so you can create amazing personalized experience from the very first time a user interacts with your consumer AI application."

"Vana makes money by charging users a monthly subscription (starting at $3.99) and levying a “data transaction” fee on devs (e.g. for transferring data sets for AI model training)"

"This month, Vana launched what it’s calling the Reddit Data DAO (Digital Autonomous Organization), a program that pools multiple users’ Reddit data (including their karma and post history) and lets them decide together how that combined data is used."

"Then there’s the matter of how to fairly distribute payments that the DAO might receive from data buyers."

"Kazlauskas floats the idea that members of the DAO could choose to share their cross-platform and demographic data, making the DAO potentially more valuable and incentivizing sign-ups."


2024-Apr-06 Reuters

<Inside Big Tech's underground race to buy AI training data>

"Seattle-based Defined.ai licenses data to a range of companies including Google, Meta, Apple, Amazon and Microsoft ... $1 to $2 per image, $2 to $4 per short-form video and $100 to $300 per hour of longer films. The market rate for text is $0.001 per word. Images of nudity, which require the most sensitive handling, go for $5 to $7... Defined.ai splits those earnings with content providers."

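To put the quoted text rate next to the per-post estimate from earlier in this post, here is a rough sketch. The $0.001 per word rate is from the Reuters quote and the $0.37 figure is the per-post estimate above; the 100-word post length is a made-up example, and the helper names are hypothetical.

```python
# Rough comparison of the quoted per-word text rate with the per-post estimate.
# The rate and the per-post figure come from the text; everything else
# (post length, function names) is an assumption for illustration.

TEXT_RATE_USD_PER_WORD = 0.001   # market rate for text quoted by Reuters
VALUE_PER_POST_USD = 0.37        # per-post estimate from earlier in this post


def text_license_value(word_count: int, rate: float = TEXT_RATE_USD_PER_WORD) -> float:
    """Value of a piece of text, in US dollars, at a flat per-word rate."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    return word_count * rate


def words_to_match(target_usd: float, rate: float = TEXT_RATE_USD_PER_WORD) -> float:
    """Number of words needed for a text to be worth target_usd at the given rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return target_usd / rate


example_words = 100  # hypothetical post length
print(f"{example_words}-word post: ${text_license_value(example_words):.2f}")
print(f"words needed to reach ${VALUE_PER_POST_USD:.2f}: "
      f"{words_to_match(VALUE_PER_POST_USD):.0f}")
```

At the quoted rate, a post would need roughly 370 words to be worth $0.37 as licensed text.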


Related articles:

【AI】web news for training a personal AI agent

【AI】data for training LLM vs. Reddit (UGC)

【Creator Economy】Marketplace with AI tool to monetize





Impressive analysis on the valuation of user-generated content in the context of Reddit's IPO – it really highlights the intricacies of digital asset valuation in today's economy.

More articles by Katherine Shih
