Reddit's Structured Approach to Unstructured Data
Why “AI Companies” are willing to pay up for Reddit’s data
Someone is paying Reddit $66 million per year for access to their data.
From Reddit’s S1 filing:
“In January 2024, we entered into certain data licensing arrangements with performance obligations in contracts with an original expected duration exceeding one year. Under such arrangements, we deliver our content over time through continuous access to our data API as well as quarterly transfers of Reddit data over the term of the arrangement. The aggregate transaction price of such arrangements totaled $203.0 million. We expect a minimum of $66.4 million of revenue to be recognized during the year ending December 31, 2024 and the remaining thereafter.”
Who was willing to pay Reddit $66 million per year for their data?
Well, turns out that so far it’s only one company: Google.
So, why is Google paying Reddit $66 million per year for their data? Maybe a bit more bluntly, why is Google willing to pay Reddit and not another social media company for their data? This week, I’ve been Nerding Out on the deal, data licensing agreements, competitive offerings, and Reddit itself to figure out why.
Hi, Eric here! Quick ask - if you've been enjoying these articles and want them delivered directly to you, you can subscribe via Substack where I publish articles a full day earlier.
Ok back to the article!
Google and Reddit’s Deal
Reading between the lines of Google’s and Reddit’s press releases, it sounds to me like there are a few benefits that users of both platforms will see as a result of the deal.
The first insight into why Google values Reddit’s data so much comes from Google’s press release itself. Google calls out the following qualities it finds valuable in Reddit’s data: real-time, structured, and unique.
Arguably, all social media is real-time and each platform is unique, so why Reddit?
Social Media Monetization
To answer that question, it helps to understand social platforms a little more.
The average American spends 2 hours and 25 minutes per day on social media. The amount of content created each minute is equally astounding. At the end of the day, all of that time and data is typically monetized by companies in three ways: advertising, premium subscriptions, and data licensing.
With advertising making up the bulk of the revenues, historically it wasn’t worth the energy (and occasional negative publicity) for the large social platforms to work on data licensing. The opportunity cost of working on data licenses was time taken away from increasing user engagement or building better advertising tools, which historically was more profitable. Simply put, there are more people who want to advertise than there are who want to buy the data for analytical purposes.
Data Licensing Historically
Similar to the big platforms that catered to the general populace, several social sites targeting niche audiences were also created in the late 2000s and early 2010s (think Twitter for financial professionals or Facebook for doctors). While the big platforms focused on maximizing the advertising business, these smaller sites found advertising to be less impactful. There’s only so much content a smaller set of users can create, limiting the opportunities to put ads in feeds without completely overrunning the site. Additionally, there are only so many people who want to advertise to specific niche audiences. Consequently, the smaller platforms had to find ways to monetize their audiences outside of advertising, turning to subscriptions and data licensing sooner than the larger platforms did. Most of the data licensing plays came from financially oriented platforms, as they had a natural buyer in quant hedge funds looking to turn data into market alpha. Let’s take a look at some:
StockTwits
StockTwits was founded in 2008 and serves as a social platform for investors. Users can search and post directly about companies they’re considering or invested in, creating a feed for each company that’s similar to X’s. They monetize through advertising, premium accounts, and API access for data licensing. Hedge funds find the data useful as a measure of sentiment toward specific stocks. As a private company, they haven’t disclosed their revenue breakdown across the three monetization methods, but they were most recently valued at $210 million during a Series B fundraise in 2021.
Estimize
Estimize, though not a social network in the traditional sense, is still powered by user-generated content. The platform allows users to contribute earnings estimates for stocks, and contributors can be anyone - not necessarily Wall Street analysts. For hedge funds, the value of the data lies in the fact that the wisdom of crowds is, on average, more accurate than the specialized analysts on Wall Street. The site was founded in 2011 and boasts 120,000 analysts contributing estimates. While they don’t run ads on the site, they do offer premium user accounts and data licenses to institutional investors. Similar to StockTwits, they do not have a publicly available revenue breakdown. Most recently, they sold to ExtractAlpha for an undisclosed amount.
X/Twitter
Of the big social media companies, X (formerly Twitter) has been in the data licensing game the longest, almost since its inception. In 2006, they launched the first public version of their API. Through various twists and turns - authorized resellers, API limit changes, and removed features - they have become one of the biggest social media data licensors today. As evidence, they generated approximately $354 million in revenue from data licensing in 2021 (1). They also have deals with some of the largest data platforms, like Bloomberg, to license their data and derive sentiment analytics.
So data licensing isn’t new. What is new is the size of the pool of buyers…
Enter LLMs
With the launch of OpenAI’s ChatGPT in November of 2022 and the subsequent launches of several other large language models (LLMs), it was clear that the number of consumers of vast amounts of text data was going up.
At the same time, there were a ton of questions about the sources of data being used by LLMs.
As with any step-change technological innovation, there was a scramble to better understand the implications of the technology. The default presumption was that LLMs were crawling the internet to source data, much like Google does for search. However, LLMs introduced a new concern: instead of providing a link and driving traffic to the original content like Google does, generated responses would just spit out text similar or even identical to the original source without citing or linking to it. Seemingly overnight, a whole new set of fair-use and licensing issues appeared, and lawsuits began to fly.
Data Sourcing
Outside of text data that’s open source or freely available in the public domain, there are essentially two and a half ways that LLM companies can get hold of the large amounts of data needed to train their models:
1. Scrape the data themselves
1.5. Use a publicly available data set that a 3rd party scraped
2. License the data directly from its owner
To be clear, these aren’t mutually exclusive. For example, a company could license news articles from specific media outlets while choosing to scrape recipe blogs.
Scraping
When it comes to scraping, it’s generally considered legal to scrape public pages that anyone can freely view in a web browser (that is, without logging in). There are measures site owners can take to combat crawlers, though. First, they can publish a file called robots.txt that tells crawlers which pages they are and aren’t allowed to crawl; however, the onus remains on the crawler to obey that file. A more active measure is to “gate” the content behind a login or paywall. Services can then declare scraping against their terms of service, which gives them backing to shut down accounts that violate the terms.
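To make the “onus on the crawler” point concrete, here’s a minimal sketch of how a well-behaved crawler checks robots.txt before fetching a page, using Python’s standard-library parser (the crawler name and URL are just placeholders):

```python
import urllib.robotparser

# Fetch and parse the site's robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.reddit.com/robots.txt")
rp.read()

# can_fetch() returns True only if robots.txt permits this user
# agent to crawl this URL. Nothing technically stops a crawler
# from ignoring the answer -- compliance is voluntary.
allowed = rp.can_fetch("MyCrawler/1.0", "https://www.reddit.com/r/boston/")
print("Allowed to crawl:", allowed)
```

Nothing in that check is enforced server-side, which is exactly why gating content behind a login is the stronger measure.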
3rd Party Scraping
While theoretically a 3rd party could build an excellent crawler, obey all of these guidelines, and make an enticing product, in practice issues are rampant. The Data Provenance Initiative found a number of licensing issues while auditing 1,800 publicly available datasets that would be prime candidates for training LLMs. Data provenance issues can have very real consequences, as evidenced by the SEC’s action against App Annie in 2021.
Licensing
While data licensing is the least risky avenue for sourcing data, it’s expensive and can dramatically cut into the profit margin of nascent firms. For the extreme case, see Spotify: they pay for all of their licensed content on a per-stream basis while subscribers pay a flat fee each month, and as a result they have never turned a yearly profit. The entire streaming industry also routinely faces regulatory scrutiny over low artist payouts. Still, deals have started to pop up: OpenAI is licensing news stories from the AP and Axel Springer, and Google struck deals with Stack Overflow and now Reddit to license their content for AI use. Licensing is expensive, and it remains to be seen whether the expense becomes prohibitive and what regulatory pressures hit platforms looking to license out their content.
Social Network Data and Platform Dynamics
Back to Google. If you’re looking for large amounts of social network data on which to train your model, you have basically 4 options: Facebook, LinkedIn, Reddit, and X (other networks - Instagram, Snapchat, YouTube, and TikTok - are not text-first, catering to images or videos, so I’m putting them aside for now).
You know how I mentioned that data licensing wasn’t worth the focus for social media companies previously? Unfortunately for some, that decision shows in their APIs and, probably more problematically, in the architecture of their underlying social graph and data.
I’m going to assume (maybe naively) that Google is going to try to source data in an above board manner - abiding by robots.txt and terms of service. Let’s take a look at how they would fare with the 4 networks.
Facebook
Facebook’s network caters to close connections. On your feed, you’ll primarily see posts from your friends or the things that your friends “like” from friends of friends. You’re highly unlikely to see something from a complete stranger in your feed.
This close-proximity model carries into the measures they have in place to control access to data. Scraping is largely out of the question: in order to see any posts you have to be logged in to the site (with few exceptions). Their robots.txt also explicitly prohibits all crawling unless authorized by Facebook, and the terms of service essentially limit that authorization to search engines.
Similarly, there isn’t a feed available via API that contains posts and comments from all, or even a major subset of, users. The API requires authorization from a single user to interact with content on that user’s feed rather than granting broad access to data on the platform. To source a broad set of data, you’d essentially need a large set of users to authorize your app to access Facebook and pull data on their behalf. Whether due to the platform structure or the other way around, Facebook does not broadly license data.
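To illustrate the access model, here’s a rough sketch of what a Graph API pull looks like under these constraints - the token is a placeholder, and the exact permissions and API version are assumptions on my part:

```python
import requests

# Hypothetical access token granted by ONE user who authorized
# your app -- there is no platform-wide equivalent to request.
USER_TOKEN = "USER_GRANTED_ACCESS_TOKEN"

resp = requests.get(
    "https://graph.facebook.com/v19.0/me/feed",
    params={"access_token": USER_TOKEN, "limit": 25},
    timeout=30,
)
resp.raise_for_status()

# Only the authorizing user's own feed comes back. Collecting
# broad training data this way would mean convincing millions of
# users to individually grant your app this permission.
for post in resp.json().get("data", []):
    print(post.get("id"), post.get("message", "")[:80])
```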
You might be thinking, “Wait, didn’t Cambridge Analytica get access to broad swathes of Facebook data?” Even in that scandal, Cambridge Analytica didn’t get broad access because Facebook provided it outright. They performed a bait and switch on users: they created a survey that required users to authorize access to their Facebook data, then used that access to pull data off the platform, store it, and resell it to political operatives.
The bad press from past access has likely left a bad taste in Facebook’s mouth for these deals. Even if they were willing, the cost of trying to entice users to share data is likely not worth the effort for Google. Facebook is likely out.
LinkedIn
Much like Facebook, LinkedIn operates a network of close connections. Albeit a bit broader than Facebook’s (you’re more likely to see posts from strangers in your feed), LinkedIn still typically requires users to be signed in to see posts. That rules out any scraping effort. Their robots.txt seems a bit more liberal, but any crawling does need to be pre-authorized by LinkedIn. What’s more, their terms of service specifically limit crawling to the purpose of including content in search indexes (no mention of training AI models).
Also similar to Facebook, LinkedIn’s API operates much the same way. Users have to authorize access on their behalf to pull data from their feeds. There isn’t broad access to data on the platform via API.
LinkedIn is likely out.
X (formerly Twitter)
With X, we’re getting somewhere. Their platform is more open in nature. As a user, you’re much more likely to see tweets (Xs?) from people you don’t know personally in your feed, especially with the introduction of the For You feed. What’s more, people don’t necessarily need to be logged in to see tweets - try opening a tweet’s direct link in an incognito window.
There is a limitation here though: replies to the tweet don’t show up unless you’re logged in.
That open network structure is not reflected in X’s robots.txt, though. With the exception of Google’s search engine bot, all crawling is disallowed. Even Google’s bot is only allowed to access certain data, likely only because of a pre-existing agreement.
X’s API is somewhat of a different story. Unlike Facebook and LinkedIn, X’s open network structure allows API users to pull a broad sample of public tweets. Depending on the subscription tier, subscribers to the API can pull varying volumes of tweets.
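For a sense of what that access looks like, here’s a minimal sketch against X’s v2 recent-search endpoint as it exists at the time of writing - the bearer token and query are placeholders, and the monthly cap depends on your tier:

```python
import requests

# Placeholder bearer token from an X developer account; your
# subscription tier caps how many posts calls like this can
# return per month.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={"query": "state of the union -is:retweet", "max_results": 100},
    timeout=30,
)
resp.raise_for_status()

# Each response page returns at most 100 tweets; pulling broad
# volume means paginating until you hit your tier's monthly cap.
for tweet in resp.json().get("data", []):
    print(tweet["id"], tweet["text"][:80])
```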
Drawback: in the grand scheme, 1,000,000 posts per month is not that many when reports suggest that 500,000,000 tweets are sent daily. That’s roughly 15 billion tweets per month, so Pro users get about 0.007% of total monthly volume (1,000,000 ÷ 15,000,000,000 ≈ 0.0067%). Enterprise presumably offers more volume, but likely still with limitations.
X has potential, but there are definite drawbacks.
Reddit
Reddit is open. Effectively a public message board, posts by any user are visible to everyone else, regardless of whether the viewer is logged in or not (2).
Robots.txt reflects this as well: though stated as limited to search engines, crawlers are allowed to index subreddits and posts. Scraping content is prohibited in the user terms of service, though, limiting its use for collecting data to feed into an LLM.
Reddit’s API also gives users the ability to pull a list of all subreddits and the posts within a given subreddit. This means that broad access to Reddit’s data is available via API. Access limits are in place for free users of the API, but Reddit does note that limits can be lifted for users paying fees to access the data.
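As a sketch of how open that is, Reddit’s listing pages are exposed as JSON - the user agent string below is a made-up placeholder, and free-tier rate limits still apply to requests like this:

```python
import requests

# Reddit asks clients to identify themselves with a descriptive
# User-Agent; this one is an invented example.
headers = {"User-Agent": "nerding-out-demo/0.1"}

# Listing endpoints like /r/<subreddit>/new.json return recent
# posts without any login or per-user authorization.
resp = requests.get(
    "https://www.reddit.com/r/boston/new.json",
    headers=headers,
    params={"limit": 25},
    timeout=30,
)
resp.raise_for_status()

for child in resp.json()["data"]["children"]:
    post = child["data"]
    print(post["subreddit"], "|", post["title"])
```

Contrast that with Facebook and LinkedIn above, where nothing comparable is available without a user granting access first.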
Reddit certainly has promise.
Who’s Even Willing to Play Ball?
It’s also worth looking at these platforms and asking who is even willing to license their data to Google. Another way of saying this: who has competitive offerings?
Facebook is working on Llama. LinkedIn is owned by Microsoft, which is heavily backing OpenAI and ChatGPT. X, though technically a separate company, is closely tied to xAI and Grok. Reddit is not known to be working on an LLM.
Given they’re working on their own (or closely related) models, Facebook, LinkedIn, and X are likely out. Reddit, again, shows promise.
The $66 million Question
We’ve covered why Reddit’s network structure, platform capabilities, and lack of competitive offerings make it both attractive to Google and willing to negotiate a deal for the data on its platform. But what about the value of the data itself? Google always has a default option: license no social media data at all (there are plenty of other large volumes of text out there). So what is it about Reddit’s data that made Google willing to pay $66 million per year?
Coming back to the press releases: real-time, structured, and unique. Those are the qualities we’re looking at.
Real-Time
Think about this - how many times has an event happened and you turned to social media to get the latest updates from folks closest to the event? How many times did you turn to Google?
For me, as major breaking news is happening, locally, regionally, nationally or internationally, I frequently get on Twitter, search the topic, and follow the stream of tweets that follow.
Now think about your behavior when searching for a news story that came out a few days ago. Rather than going to social media and wading through the mess of opinions on the topic, I know I’m much more likely to go to Google and find a news source with a concise summary of the facts.
Google would really like you to go through Google for all of that, not just the days-old stories. They know that user-generated content is the best way to keep up with stories happening in real time. Instead of waiting for news outlets to publish a story, and then waiting for those stories to gain enough traction to feed the traditional search algorithm, imagine if Google could surface real-time, user-generated posts on the search page as breaking stories happen.
Quite valuable.
Now, what if their AI-generated responses could do the same? At this point, I think most of us are familiar with ChatGPT’s standard “I'm sorry, but I don't have real-time information, including details about recent events or the most recent [insert event]. My training only includes data up until January 2022.” response. Google has a similar one: “I’m still learning how to answer this question. In the meantime, try Google Search.”
Clearly, there’s a gap there that Reddit’s data can fill.
Imagine if, during the State of the Union address, Google could surface the most recent Reddit comments on the State of the Union thread from the r/politics subreddit. Search results would be far better than background on past State of the Union addresses from Wikipedia and speculative analysis articles, published a few days prior, about what the president may talk about. AI responses would be far better than the one shown above. That may even drive a change in user behavior: rather than going to social media sites directly during real-time events, Google could remain the default.
This is why real-time data is so important to Google.
Structured
Reddit’s data structure is unique among social media companies. Rather than one giant feed of everything under the sun, posts are split into subreddits and then threads. Subreddits are effectively subjects - you could have one on history, math, or non-school subjects like life in Boston (or your own local community). Threads serve as topics within that subject, like France’s Flour War, algebra, or restaurants to try.
The structure of Reddit’s data is important to Google for both Search and AI.
Why?
Two words: tags and labels.
In both Search and AI, Reddit’s structure of subject (subreddit) and topic (thread) can act as additional metadata on each comment. In Search, this gives additional tags to match against keywords in search terms. For AI, it can help with fine-tuning: if Google wants to fine-tune their LLM on life in specific geographic locations, Reddit has a trove of culturally relevant posts - Google just needs to feed the model r/boston, r/nyc, r/london, and so on. The same goes for almost any other subject that Google would want to train Gemini on: there’s a subreddit for that.
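Here’s a hypothetical sketch of what that structure buys you - the record shape and field names are my own illustration, not anything Google or Reddit has published:

```python
# Each Reddit comment arrives pre-labeled with a subject (its
# subreddit) and a topic (its thread title), which can be kept
# as metadata when assembling a fine-tuning corpus.

def build_training_record(subreddit: str, thread_title: str, comment: str) -> dict:
    """Wrap a comment with the structural labels Reddit gives it for free."""
    return {
        "text": comment,
        "tags": {"subject": subreddit, "topic": thread_title},
    }

corpus = [
    build_training_record("boston", "Best cannoli in the North End?",
                          "Modern over Mike's, fight me."),
    build_training_record("nyc", "Best bagels in the city?",
                          "Absolute Bagels, no contest."),
]

# Filtering down to one locale is then just a tag match -- e.g.
# keep only r/boston posts for a "life in Boston" fine-tune.
boston_only = [r for r in corpus if r["tags"]["subject"] == "boston"]
print(boston_only)
```

With flat feeds like X’s, you’d have to infer those labels after the fact; on Reddit, users supply them as part of posting.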
Unique
We already talked about how the structure of Reddit’s data is unique, but so too is the content itself - in a lot of ways.
It’s Conversational
Reddit is designed to be conversational. People are allowed to write full, descriptive posts of any length. Posts have context, questions, and answers. Threads have back-and-forth interactions, follow-ups, and colloquial commentary. This stands in contrast to X (limited to 280 characters, or 4,000 for premium users), Facebook (typically personal posts), and LinkedIn (typically professional posts). On Reddit, if you have an interest in a topic, you can post about it without worrying about brevity or the topic being outside your area of expertise.
There’s Moderation
Moderation is also unique on Reddit, contributing to the uniqueness of the data. Each subreddit has its own set of rules and moderators to enforce them. Moderators mean that posts on each subreddit typically stay on topic (see the benefits of structured data above). In its way, moderation also reduces bias across the platform. How? Separating topics into their own subreddits can increase psychological safety within each community, creating more authentic posts.
As an example, take r/cigarettes. There’s authentic information there for people who choose to smoke. The moderators prohibit anti-smoking/quit-smoking posts, so there isn’t a pile-on about how unhealthy it is. At the same time, they prohibit posts on drugs, advertisements, and advice on starting smoking. Within the subreddit, there are posts on the flavors of different brands and on labeling requirements in different countries. You likely wouldn’t see these sorts of posts on other platforms because the thread would be overtaken by people talking about how unhealthy smoking is.
Mods keep irrelevant and non-compliant posts off the subreddit so deeper discussion can happen. If you want to see the other side of an argument, there’s likely a different subreddit: see r/stopsmoking.
It’s SEO Un-optimized
Reddit posts are also very organic. What do I mean by that? Have you ever Googled a recipe and gotten a blog post with a full-on novel about the author’s grandmother’s invention of said recipe, one that uses the name of the dish 6,000 times while you, dear reader, struggle to find the actual recipe on the page? Reddit is the exact opposite (there’s a picture, the ingredients, and the instructions - done). By organic, I mean the opposite of SEO-optimized: you don’t have to parse through 5 pages of BS meant to rank the post on Google Search to get to the information you’re looking for. Recipes are by far the most egregious example, but the premise applies to other topics as well.
Google’s Motivations
All of these unique qualities combine to make Reddit’s data valuable to Google. In Search, Reddit can help Google cut through a lot of the SEO noise by pointing search results to the right subreddit (e.g., the “recipe” keyword includes results from r/recipes).
Google also benefits from Reddit’s unique data qualities in their AI models. In the broadest terms, Google can fine-tune their models for conversational tone on Reddit’s data. Reddit users use slang. They aren’t limited to a certain number of characters. The platform isn’t tied to a personal or professional context. Subreddits and their moderators keep data clean, relevant to the topic at hand, and nuanced around touchy subjects, giving models a richer picture of a given topic. Lastly, Reddit’s data comes without a lot of the filler that’s typical of the SEO-optimized posts that populate the internet today.
The Bottom Line
Reddit’s data certainly provides a lot of opportunity for Google. Does that opportunity equate to $66 million per year? That remains to be seen. But the deal certainly seems to have its merits from legal, competitive, and distinctive-competency standpoints.
Thank you so much for reading this week. I’m excited to see how Google and Reddit both evolve as a result of the deal. If you enjoyed my analysis, the best way to support me is to Like the post to feed the algorithm.
If you'd rather get future posts in your inbox, you can also subscribe on Substack.
If you have an idea for something you’d like me to Nerd Out on in the future, leave a comment on this post.
(1) $354 million is derived from the $571.8 million of “Data Licensing and Other” revenue stated in their revenue detail, minus the $217.9 million generated from fees on the MoPub secondary ad market stated in the footnotes. See page 47 of Twitter’s 10-K for FY2021.
(2) Reddit does have private subreddits that require users to be logged in and be a member of that specific subreddit to view them.