
What can Reddit’s Users Teach Chief Data Officers?

Dave Holmes-Kinsella

Builder | Analytics & Data Leader: Strategy, Architecture, Build, Launch | From pre-A to post-IPO | 2 Exits | Former Synctera, Facebook

Published: October 30, 2024

AKA: How can you fit Reddit into your AI world?

In a recent article in the Wall Street Journal, Steve Huffman, CEO of Reddit, talks about the importance of Reddit data as training data for AI models.

In his interview, there’s an incredibly important nugget: “Every post and comment on Reddit starts at zero points and has to earn its visibility.”

This idea maps directly to metadata—a familiar concept for data professionals who work with warehouses and governance.

However, as this relates to GenAI, the scope of metadata will need to change dramatically, and with it practices and processes in your company.

As your organizations start to incorporate LLMs and GenAI into your business practices, your models will learn by incorporating your company's best and most relevant documents. Just like Reddit's voting system, the metadata of these documents will signal what's 'best and most relevant' to your AI models.

We already have an established precedent in the way we build applications. The code review and approval processes in tools like GitHub and GitLab are a model for how this can be accomplished.

Indeed, in a recent essay, I wrote about how GitHub Copilot will use the code approval process to determine “what’s best” for a company, extending beyond the codebase to inform the creation and management of documents all over the organization.

In the same way, companies need to think about the processes by which they tag these documents so that the GenAI models will know to train on the best, most recent, and most relevant versions of them. The earlier you implement these practices, the stronger your AI models' foundation will be and the sooner your ROI will materialize.

What does this document tagging look like in practice? Fortunately, we can draw from established data management practices. As Data Architects and Chief Data Officers, we already handle similar challenges with databases and data warehouses. Key metadata elements should include:

Document Quality:

Source of Truth: Should this be regarded as a source of truth?
Maturity: How complete or "finished" is this document?

Lifecycle Management:

Age: When was this document created or most recently updated?
Validity: For how long is this document valid? When does it expire?
Status: Is the document current, deleted, archived, or replaced?

Governance:

Ownership: Who is responsible for this document? (by role/title)
Scope: What parts of the organization does this document apply to? (e.g., workgroup vs. entire organization)
Confidentiality: Who should have access to this document?
Sensitivity: Is this document subject to regulatory or audit requirements?
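
As a rough illustration, the metadata elements above could be captured in a small record type. This is only a sketch: the field names, the allowed values for maturity and confidentiality, and the `eligible_for_genai` rule are all hypothetical choices made for this example, not a standard schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    CURRENT = "current"
    ARCHIVED = "archived"
    REPLACED = "replaced"
    DELETED = "deleted"


@dataclass
class DocumentMetadata:
    # Document quality
    source_of_truth: bool
    maturity: str                # hypothetical values: "draft", "review", "final"
    # Lifecycle management
    last_updated: date
    valid_until: Optional[date]  # None means no expiry date was set
    status: Status
    # Governance
    owner_role: str              # a role or title, not a named person
    scope: str                   # e.g. "workgroup" or "organization"
    confidentiality: str         # hypothetical values: "public", "internal", "restricted"
    regulated: bool              # subject to regulatory or audit requirements


def eligible_for_genai(meta: DocumentMetadata, today: date) -> bool:
    """Assumed rule: keep documents that are current, unexpired, final, and not restricted."""
    if meta.status is not Status.CURRENT:
        return False
    if meta.valid_until is not None and meta.valid_until < today:
        return False
    if meta.maturity != "final":
        return False
    return meta.confidentiality != "restricted"


# A made-up example document
policy = DocumentMetadata(
    source_of_truth=True, maturity="final",
    last_updated=date(2024, 9, 1), valid_until=date(2025, 9, 1),
    status=Status.CURRENT, owner_role="Head of Finance",
    scope="organization", confidentiality="internal", regulated=True,
)
print(eligible_for_genai(policy, date(2024, 10, 30)))  # prints True
```

Your own eligibility rule will differ. The point is that once the tags exist, the question "should the model see this document?" becomes a simple, auditable check.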

That’s comprehensive, and challenging to implement at the outset. Remember that the biggest changemaker is a shift in mindset or, dare I say it, culture.

As you design your metadata strategy, spend some time on Reddit and pay attention to the voting.

Reddit implements “best, most relevant, and most recent” wonderfully well. It begins with a simple premise: does a given comment contribute to the discussion or not? Users upvote the comments that do.

From this simple mechanism, Reddit determines the relevance, recency, and authority of a post. Add in the poster’s name, and you have much of the foundation you need. Importantly, it’s user-directed.
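
To make the idea concrete, here is a minimal sketch of a vote-driven score for internal documents. It is not Reddit's actual ranking algorithm. The `Item` type, the `visibility_score` function, and the 30-day half-life are assumptions invented for this illustration. Every item starts at zero points, earns points through votes, and loses weight as it ages.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Item:
    author: str        # the poster's name, kept alongside the votes
    created: datetime
    upvotes: int = 0   # every item starts at zero points
    downvotes: int = 0

    @property
    def points(self) -> int:
        return self.upvotes - self.downvotes


def visibility_score(item: Item, now: datetime, half_life_days: float = 30.0) -> float:
    """Points earned from votes, halved for every `half_life_days` of age (assumed rule)."""
    age_days = max((now - item.created).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / half_life_days)
    return item.points * decay


now = datetime(2024, 10, 30)
fresh = Item(author="analyst", created=now)
older = Item(author="director", created=now - timedelta(days=30), upvotes=10, downvotes=2)

print(visibility_score(fresh, now))  # prints 0.0: nothing earned yet
print(visibility_score(older, now))  # prints 4.0: 8 points, halved after 30 days
```

The design choice worth noticing is that people, not the scoring function, supply the signal. The function only combines their votes with recency.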

Like Reddit users, the people in your company – not some nameless algorithm or, as I’ve mentioned before, “the average of the internet” – can indicate what’s best and most up-to-date for your needs.

This represents a significant change in the mindset of the folk who will be asked to do this work.

As engineers, developers, or analysts, we ordinarily expect that our work products will be used by multiple people, and possibly re-shaped beyond the original intent.

Essentially, there are three use cases for data that we create.

Humans, looking at a screen.
Extracts into Excel, Tableau, Google Sheets, or another business tool.
Direct or indirect feeds into application databases (Salesforce, Workday, etc.).

The person creating this data might be a customer success advocate working in a sales database, a marketing director creating a PowerPoint, or a financial analyst updating projections in a finance spreadsheet.

For each one of those use cases, there’s typically a direct customer for immediate use. Moreover, we know what that use will be. However, with GenAI, there’s no immediate user. Those business documents will be fed into the engine and used for a future unspecified purpose.

Further, when we ask "ordinary business users" to consider tagging their documents for unspecified future use, that's a significant mind-shift. Fortunately, social network tools have prepared our people for that work. People vote on content all the time on Instagram, TikTok, Reddit, and Facebook. We do that today on our phones when we give someone a thumbs-up in a text thread.

Although you may not be ready for a full-enterprise GenAI rollout, you should start thinking about how to extend your processes to gather and manage this metadata. It seems likely that the most durable metadata will be that entered by the people who create and consume the types of documents and data that you will want to use in your GenAI models.

Ultimately, this is a cultural change in your organization: recognizing that spreadsheets, emails, and other business documents will be used as input for future applications.

No man is an island, and no author writes alone. Thanks to Jason Shaeffer, Trent Lowe, Alexander Williams, and Mike Manoske PCC for feedback on early drafts of this essay.

What do you think? Where am I wrong? Debate is where I learn the most. I'd love to hear from you; and here's my calendar.
