5 Reasons They're All Wrong About Monetizing Data for Generative AI Training
Credit: MIT Sloan Management Review

As society has gotten smarter about how generative AI systems like ChatGPT work, one result has been a sharp increase in scrutiny of how generative AI companies like OpenAI and Meta gain access to training data. Recently there have been high-profile lawsuits, such as the one brought by Sarah Silverman, against these and other companies, alleging that copyrighted material was effectively stolen to train these generative AI models, also known as large language models (LLMs). Some companies, such as Twitter, have taken steps to limit access to their data for model training, while others, such as Reddit, have attempted to monetize the use of their text data for that purpose. How this will all play out has become one of the most central and publicized uncertainties in this nascent generative AI space.

But this focus on model training is aimed at the wrong place, and as a result we are ignoring a potentially far cleaner and more logical way this can all play out: the use of vector databases. Below are five reasons why vector databases are a far more promising path for data owners (or managers) to monetize data that's used as knowledge for LLMs.

1. Vector databases reduce hallucination by separating knowledge from language generation. One of the most widely discussed limitations of generative AI is "hallucination," wherein a model makes up incorrect information as part of a plausible-sounding response to a prompt. But as Nick Frosst, co-founder of Cohere, has put it, hallucinating is the only thing an LLM ever does; some hallucinations simply happen to correspond with the truth. One way to cut down on hallucinations is to separate knowledge from language generation: the goal of an LLM is not actually to hold knowledge but simply to generate language, and the knowledge needs to come from elsewhere. As Harvard Business Review recently described, separating knowledge into vector databases that LLMs can query as needed for the contextual knowledge to answer specific prompts is becoming a popular way to customize models for vertical-specific use cases (a minimal sketch of this retrieve-then-generate pattern appears after this list).

2. Vector databases are more controllable and privacy-friendly than training data. Once data has been used to train an LLM, it's impossible to "untrain" the model or to pinpoint how a particular element of training data is shaping the model's responses. Vector databases, by contrast, are fully updatable: if the usage rights for a particular element of data are lost, for example because a consumer opted out, that data can simply be removed from the vector database, and it's instantly as if it never existed (see the governance sketch following this list).

3. They are more transparent. One of the big concerns with generative AI models is that they are currently black boxes: we don't know exactly how responses were generated or what training data was used. The entire purpose of a vector database, by contrast, is to pinpoint the "nearest neighbors," the specific data elements most relevant to a prompt. It is therefore far more feasible to give model users a higher level of transparency about what source information was used to create a given response (the first sketch after this list returns sources alongside retrieved passages for exactly this reason).

4. They present a better and cleaner monetization model for data owners. For one, it's simpler: a single vector database can serve any foundation model, so it obviates the need to integrate with every LLM company independently. It also isn't irrevocable: if a data owner does a deal that lets an LLM company use its data for model training, the cat is out of the bag, whereas if it instead builds vector databases with its data, it can turn access on and off for any entity at any time. And it allows for monetization at the level of the LLM user, whereas handing data over for model training means any user of that LLM can leverage it for no direct incremental fee.

5. Lastly, they expand the types of data that can be monetized. Today, LLMs are typically trained on text, such as message boards, books, and articles, or on images. But with a layer of intelligence and analytics, even tabular data can become the basis of highly valuable vector databases. It takes some expertise to execute, but so much of today's monetizable data is quantitative in form and isn't conducive to training today's foundation models; vector databases tap into that potential (a small illustration follows this list).
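
To make points 1 and 3 concrete, here is a minimal sketch of the retrieve-then-generate pattern referenced above. It uses a toy in-memory store and a deliberately crude character-frequency "embedding" as a stand-in for a real embedding model and vector database; every name in it is illustrative, not an actual product API. The point is the shape of the flow: embed the prompt, find the nearest stored passages, and hand both the passages and their sources to the LLM, so the answer is grounded and attributable.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a normalized character-frequency vector.
    Stands in for a real embedding model in this sketch."""
    vec = np.zeros(dim)
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class ToyVectorStore:
    """Tiny in-memory stand-in for a vector database."""
    def __init__(self):
        self.records = {}  # record_id -> (vector, text, source)

    def add(self, record_id, text, source):
        self.records[record_id] = (embed(text), text, source)

    def nearest(self, query, k=2):
        """Return the k most similar passages along with their sources."""
        qv = embed(query)
        scored = [
            (float(qv @ vec), text, source)
            for vec, text, source in self.records.values()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]

store = ToyVectorStore()
store.add("a1", "Q3 revenue grew 12% year over year.", source="earnings_report_2023.pdf")
store.add("a2", "The warranty covers parts for 24 months.", source="warranty_policy.txt")

# Retrieve knowledge for a prompt, then pass the passages and their sources
# to the LLM so the response can be grounded and cite where it came from.
hits = store.nearest("How long is the warranty?")
context = "\n".join(f"[{source}] {text}" for _, text, source in hits)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How long is the warranty?"
print(prompt)
```

Because each retrieved passage carries a source identifier, the application layer can show users which records informed a given response, which is exactly the transparency point 3 argues for.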
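
Points 2 and 4 come down to governance: a record can be deleted the moment rights lapse, and query access can be granted, revoked, and metered per client. The sketch below, again with purely hypothetical names and no real vector search, shows what that control surface might look like for a data owner.

```python
class GovernedDataStore:
    """Hypothetical sketch of a data owner's store with consent-driven
    deletion and per-client access control and metering."""

    def __init__(self):
        self.records = {}             # record_id -> payload (vector + text in practice)
        self.allowed_clients = set()  # LLM apps licensed to query right now
        self.query_counts = {}        # client_id -> billable query count

    def add(self, record_id, payload):
        self.records[record_id] = payload

    def remove(self, record_id):
        # A consumer opts out or rights lapse: drop the record.
        # Unlike a trained model, nothing needs to be "untrained".
        self.records.pop(record_id, None)

    def grant(self, client_id):
        self.allowed_clients.add(client_id)

    def revoke(self, client_id):
        self.allowed_clients.discard(client_id)

    def query(self, client_id, text):
        if client_id not in self.allowed_clients:
            raise PermissionError(f"{client_id} has no active license")
        self.query_counts[client_id] = self.query_counts.get(client_id, 0) + 1
        # A real system would run nearest-neighbor search on `text` here;
        # this sketch just returns whatever is still in the store.
        return list(self.records.values())


store = GovernedDataStore()
store.add("r1", "The warranty covers parts for 24 months.")
store.grant("chat_app_1")
print(store.query("chat_app_1", "warranty length"))  # metered, allowed
store.remove("r1")                                   # consumer opted out
store.revoke("chat_app_1")                           # access switched off
```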
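
And for point 5, one common approach (assumed here, not prescribed) is to render each tabular row as a short natural-language snippet that an embedding model can then index. The columns, template, and figures below are invented purely for illustration.

```python
# Hypothetical rows from a data owner's sales table.
rows = [
    {"region": "Northeast", "quarter": "Q3 2023", "units_sold": 1840, "avg_price": 74.50},
    {"region": "Southwest", "quarter": "Q3 2023", "units_sold": 2210, "avg_price": 69.90},
]

def row_to_snippet(row: dict) -> str:
    """Render one row as a sentence an embedding model (and an LLM) can use."""
    return (
        f"In {row['quarter']}, the {row['region']} region sold "
        f"{row['units_sold']} units at an average price of ${row['avg_price']:.2f}."
    )

snippets = [row_to_snippet(r) for r in rows]
for s in snippets:
    print(s)

# Each snippet would then be embedded and stored in the vector database,
# alongside provenance (table name, row key) to support attribution and removal.
```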

Vector databases are an emerging, scalable, and cost-efficient way to customize generative AI models for vertical-specific use cases. They are often associated with first-party proprietary data for internal use cases, and that's certainly a great application. But what's much less discussed is the exciting opportunity for vector databases to serve as a clean, controllable, privacy-compliant, flexible, and high-potential way for data owners to monetize data for third parties.

Ryan Scott

Payment Systems for AI Voice Agents

1y

Good article, Jake. I agree there's an opportunity for businesses to leverage their accumulated knowledge internally and possibly for external monetization.

Vas Bakopoulos

SVP | Brand Strategy, Data & Attribution, Marketing Insights | MMA, Possible, Digitas, Kantar, ARF, I-com | Instructor at NYU | Keynote speaker |

1y

Thanks for sharing, Jake. As a layman, I struggle to visualize embeddings and vectors, let alone follow the implications of vector databases, but this is good food for thought.
