What can Reddit’s Users Teach Chief Data Officers?

Aka: How can you fit Reddit into your AI world?

In a recent article in the Wall Street Journal, Steve Huffman, CEO of Reddit, talks about the importance of Reddit data as training data for AI models.

In his interview, there’s an incredibly important nugget: “Every post and comment on Reddit starts at zero points and has to earn its visibility”.

This idea maps directly to metadata—a familiar concept for data professionals who work with warehouses and governance.

However, as this relates to GenAI, the scope of metadata will need to change dramatically, and with it practices and processes in your company.

As your organizations start to incorporate LLMs and GenAI into your business practices, your models will learn by incorporating your company's best and most relevant documents. Just like Reddit's voting system, the metadata of these documents will signal what's 'best and most relevant' to your AI models.

We already have an established precedent in the way we build applications. The code review and approval processes in tools like GitHub and GitLab are models for how this can be accomplished.

Indeed, in a recent essay, I wrote about how GitHub Copilot will use the code approval process to determine “what’s best” for a company, extending beyond the codebase to inform the creation and management of documents all over the organization.

In the same way, companies need to think about the processes by which they tag these documents so that the GenAI models will know to train on the best, most recent, and most relevant versions of them. The earlier you implement these practices, the stronger your AI models' foundation will be and the sooner your ROI will materialize.

What does this document tagging look like in practice? Fortunately, we can draw from established data management practices. As Data Architects and Chief Data Officers, we already handle similar challenges with databases and data warehouses. Key metadata elements should include:

Document Quality:

  • Source of Truth: Should this be regarded as a source of truth?
  • Maturity: How complete or "finished" is this document?

Lifecycle Management:

  • Age: When was this document created or most recently updated?
  • Validity: For how long is this document valid? When does it expire?
  • Status: Is the document current, deleted, archived, or replaced?

Governance:

  • Ownership: Who is responsible for this document? (by role/title)
  • Scope: What parts of the organization does this document apply to? (e.g., workgroup vs. entire organization)
  • Confidentiality: Who should have access to this document?
  • Sensitivity: Is this document subject to regulatory or audit requirements?

That’s comprehensive, and challenging to implement at the outset. Remember that the biggest changemaker is a shift in mindset or, dare I say it, culture.
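
As a rough illustration, the metadata elements listed above could be captured in a small record type attached to each document. This is a minimal sketch, not a standard schema: the field names, the maturity values ("draft", "review", "final"), and the eligible_for_training rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    CURRENT = "current"
    ARCHIVED = "archived"
    REPLACED = "replaced"
    DELETED = "deleted"


@dataclass
class DocumentMetadata:
    # Document quality
    source_of_truth: bool
    maturity: str  # hypothetical values: "draft", "review", "final"
    # Lifecycle management
    last_updated: date
    valid_until: Optional[date]  # None means no expiry date was recorded
    status: Status
    # Governance
    owner_role: str        # a role or title, not a named individual
    scope: str             # e.g. "workgroup" or "entire organization"
    confidentiality: str   # who may access the document
    regulated: bool        # subject to regulatory or audit requirements


def eligible_for_training(meta: DocumentMetadata, today: date) -> bool:
    """Illustrative rule: keep current, finished, unexpired sources of truth."""
    not_expired = meta.valid_until is None or meta.valid_until >= today
    return (
        meta.source_of_truth
        and meta.maturity == "final"
        and meta.status is Status.CURRENT
        and not_expired
    )


# Hypothetical example document
policy = DocumentMetadata(
    source_of_truth=True, maturity="final",
    last_updated=date(2024, 1, 15), valid_until=date(2025, 1, 15),
    status=Status.CURRENT, owner_role="Head of Finance",
    scope="entire organization", confidentiality="internal", regulated=True,
)
print(eligible_for_training(policy, today=date(2024, 6, 1)))  # True
print(eligible_for_training(policy, today=date(2025, 6, 1)))  # False (expired)
```

Your own eligibility rule will likely differ. The point is that once these fields exist, the question "should a model learn from this?" becomes something you can check rather than guess.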

As you design your metadata strategy, spend some time on Reddit and pay attention to the voting.

Reddit implements “best, most relevant, and most recent” wonderfully well. It begins with a simple premise: does a given comment contribute to the discussion or not? Users upvote, as indicated below.

Example of the Reddit Voting mechanism

From this simple mechanism, Reddit determines the relevance, recency, and authority of this post. Add in the poster’s name, and you have much of the foundation you need. Importantly, it’s user-directed.

Like Reddit users, the people in your company (not some nameless algorithm or, as I’ve mentioned before, “the average of the internet”) can indicate what’s best and most up-to-date for your needs.
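
To make the idea concrete, here is a minimal sketch of colleague-driven voting on internal documents. It is not Reddit's actual ranking algorithm. The VotedDocument class, the one-vote-per-person rule, and the ordering by score and then recency are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class VotedDocument:
    title: str
    author: str
    last_updated: date
    # voter name -> +1 or -1; every document starts with no votes (score 0)
    votes: Dict[str, int] = field(default_factory=dict)

    def vote(self, voter: str, value: int) -> None:
        if value not in (1, -1):
            raise ValueError("vote must be +1 or -1")
        self.votes[voter] = value  # a repeat vote replaces the earlier one

    @property
    def score(self) -> int:
        return sum(self.votes.values())


def rank(docs: List[VotedDocument]) -> List[VotedDocument]:
    """Highest score first; ties go to the most recently updated document."""
    return sorted(docs, key=lambda d: (d.score, d.last_updated), reverse=True)


# Hypothetical documents and voters
old = VotedDocument("Pricing guide v1", "analyst", date(2023, 3, 1))
new = VotedDocument("Pricing guide v2", "analyst", date(2024, 3, 1))
new.vote("alice", 1)
new.vote("bob", 1)
old.vote("alice", 1)
old.vote("carol", -1)

print([d.title for d in rank([old, new])])
# ['Pricing guide v2', 'Pricing guide v1']
```

Even this toy version records who voted, when the document last changed, and who wrote it, which are the same signals of relevance, recency, and authority described above.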

This represents a significant change in the mindset of the folk who will be asked to do this work.

As engineers, developers, or analysts, we ordinarily expect that our work products will be used by multiple people, and possibly re-shaped beyond the original intent.

Essentially, there are three use cases for data that we create.

  • Humans, looking at a screen.
  • Extracts into Excel, Tableau, Google Sheets, or another business tool.
  • Direct or indirect feeds into application databases (Salesforce, Workday, etc.).

The person creating this data might be a customer success advocate working in a sales database, a marketing director creating a PowerPoint, or a financial analyst updating projections in a finance spreadsheet.

For each one of those use cases, there’s typically a direct customer for immediate use. Moreover, we know what that use will be. However, with GenAI, there’s no immediate user. Those business documents will be fed into the engine and used for a future unspecified purpose.

Further, when we ask "ordinary business users" to consider tagging their documents for unspecified future use, that's a significant mind-shift. Fortunately, social network tools have prepared our people for that work. People vote on content all the time on Instagram, TikTok, Reddit, and Facebook. We do it on our phones whenever we give someone a thumbs-up in a text thread.

Although you may not be ready for a full-enterprise GenAI rollout, you should start thinking about how to extend your processes to gather and manage this metadata. It seems likely that the most durable metadata will be that entered by the people who create and consume the types of documents and data that you will want to use in your GenAI models.

Ultimately, this is a cultural change in your organization: recognizing that spreadsheets, emails, and other business documents will be used as input for future applications.


No man is an island, and no author writes alone. Thanks to Jason Shaeffer, Trent Lowe, Alexander Williams, and Mike Manoske PCC for feedback on early drafts of this essay.


What do you think? Where am I wrong? Debate is where I learn the most. I'd love to hear from you; and here's my calendar.

More articles by Dave Holmes-Kinsella