5 Reasons They're All Wrong About Monetizing Data for Generative AI Training
Credit: MIT Sloan Management Review

As society has gotten smarter about how generative AI systems like ChatGPT work, one result has been a sharp increase in scrutiny of how generative AI companies like OpenAI and Meta gain access to training data. Recently there have been high-profile lawsuits, such as the one brought by Sarah Silverman, against these and other companies, alleging that copyrighted material was effectively stolen to train these generative AI models, also known as large language models (LLMs). Some companies, such as Twitter, have taken steps to limit access to their data for model training, while others, such as Reddit, have attempted to monetize the use of their text data for that purpose. How this will all play out has become one of the most central and publicized uncertainties in this nascent generative AI space.

But this focus on model training is aimed at the wrong place, and as a result we are ignoring a potentially far cleaner and more logical way this can all play out: the use of vector databases. Below are five reasons why vector databases are a far more promising path for data owners (or managers) to monetize data that's used as knowledge for LLMs.

1. Vector databases reduce hallucination by separating knowledge from language generation. One of the most widely discussed limitations of generative AI is "hallucination," wherein a model makes up incorrect information as part of a plausible-sounding response to a prompt. But as Nick Frosst, co-founder of Cohere, has put it, hallucinating is the only thing an LLM ever does; some hallucinations simply happen to correspond with the truth. One way to cut down on hallucinations is to separate knowledge from language generation: the goal of an LLM is not actually to hold knowledge but simply to generate language, and the knowledge needs to come from elsewhere. As Harvard Business Review recently described, separating knowledge into vector databases that LLMs can query as needed for the contextual knowledge to answer specific prompts is becoming a popular way to customize models for vertical-specific use cases (a minimal sketch of this retrieve-then-generate pattern appears after this list).

2. Vector databases are more controllable and privacy-friendly than training data. Once data has been used to train an LLM, it's impossible to "untrain" the model or to pinpoint how a particular element of training data is shaping the model's responses. Vector databases, by contrast, are fully updatable: if the usage rights for a particular element of data are lost, for example because a consumer opted out, that data can simply be removed from the vector database, and it's instantly as if it never existed (see the governance sketch following this list).

3. They are more transparent. One of the big concerns with generative AI models is that they are currently black boxes: we don't know exactly how responses were generated or what training data was used. The entire purpose of a vector database, by contrast, is to pinpoint the "nearest neighbors," the specific data elements most relevant to a prompt. It is therefore far more feasible to give model users a higher level of transparency about what source information was used to create a given response (the first sketch after this list returns sources alongside retrieved passages for exactly this reason).

4. They present a better and cleaner monetization model for data owners. For one, it's simpler: a single vector database can serve any foundation model, so it obviates the need to integrate with every LLM company independently. It also isn't irrevocable: if a data owner does a deal that lets an LLM company use its data for model training, the cat is out of the bag, whereas if it instead builds vector databases with its data, it can turn access on and off for any entity at any time. And it allows for monetization at the level of the LLM user, whereas handing data over for model training means any user of that LLM can leverage it for no direct incremental fee.

5. Lastly, they expand the types of data that can be monetized. Today, LLMs are typically trained on text, such as message boards, books, and articles, or on images. But with a layer of intelligence and analytics, even tabular data can become the basis of highly valuable vector databases. It takes some expertise to execute, but so much of today's monetizable data is quantitative in form and isn't conducive to training today's foundation models; vector databases tap into that potential (a small illustration follows this list).
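
To make points 1 and 3 concrete, here is a minimal sketch of the retrieve-then-generate pattern referenced above. It uses a toy in-memory store and a deliberately crude character-frequency "embedding" as a stand-in for a real embedding model and vector database; every name in it is illustrative, not an actual product API. The point is the shape of the flow: embed the prompt, find the nearest stored passages, and hand both the passages and their sources to the LLM, so the answer is grounded and attributable.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a normalized character-frequency vector.
    Stands in for a real embedding model in this sketch."""
    vec = np.zeros(dim)
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class ToyVectorStore:
    """Tiny in-memory stand-in for a vector database."""
    def __init__(self):
        self.records = {}  # record_id -> (vector, text, source)

    def add(self, record_id, text, source):
        self.records[record_id] = (embed(text), text, source)

    def nearest(self, query, k=2):
        """Return the k most similar passages along with their sources."""
        qv = embed(query)
        scored = [
            (float(qv @ vec), text, source)
            for vec, text, source in self.records.values()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]

store = ToyVectorStore()
store.add("a1", "Q3 revenue grew 12% year over year.", source="earnings_report_2023.pdf")
store.add("a2", "The warranty covers parts for 24 months.", source="warranty_policy.txt")

# Retrieve knowledge for a prompt, then pass the passages and their sources
# to the LLM so the response can be grounded and cite where it came from.
hits = store.nearest("How long is the warranty?")
context = "\n".join(f"[{source}] {text}" for _, text, source in hits)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How long is the warranty?"
print(prompt)
```

Because each retrieved passage carries a source identifier, the application layer can show users which records informed a given response, which is exactly the transparency point 3 argues for.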
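
Points 2 and 4 come down to governance: a record can be deleted the moment rights lapse, and query access can be granted, revoked, and metered per client. The sketch below, again with purely hypothetical names and no real vector search, shows what that control surface might look like for a data owner.

```python
class GovernedDataStore:
    """Hypothetical sketch of a data owner's store with consent-driven
    deletion and per-client access control and metering."""

    def __init__(self):
        self.records = {}             # record_id -> payload (vector + text in practice)
        self.allowed_clients = set()  # LLM apps licensed to query right now
        self.query_counts = {}        # client_id -> billable query count

    def add(self, record_id, payload):
        self.records[record_id] = payload

    def remove(self, record_id):
        # A consumer opts out or rights lapse: drop the record.
        # Unlike a trained model, nothing needs to be "untrained".
        self.records.pop(record_id, None)

    def grant(self, client_id):
        self.allowed_clients.add(client_id)

    def revoke(self, client_id):
        self.allowed_clients.discard(client_id)

    def query(self, client_id, text):
        if client_id not in self.allowed_clients:
            raise PermissionError(f"{client_id} has no active license")
        self.query_counts[client_id] = self.query_counts.get(client_id, 0) + 1
        # A real system would run nearest-neighbor search on `text` here;
        # this sketch just returns whatever is still in the store.
        return list(self.records.values())


store = GovernedDataStore()
store.add("r1", "The warranty covers parts for 24 months.")
store.grant("chat_app_1")
print(store.query("chat_app_1", "warranty length"))  # metered, allowed
store.remove("r1")                                   # consumer opted out
store.revoke("chat_app_1")                           # access switched off
```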
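
And for point 5, one common approach (assumed here, not prescribed) is to render each tabular row as a short natural-language snippet that an embedding model can then index. The columns, template, and figures below are invented purely for illustration.

```python
# Hypothetical rows from a data owner's sales table.
rows = [
    {"region": "Northeast", "quarter": "Q3 2023", "units_sold": 1840, "avg_price": 74.50},
    {"region": "Southwest", "quarter": "Q3 2023", "units_sold": 2210, "avg_price": 69.90},
]

def row_to_snippet(row: dict) -> str:
    """Render one row as a sentence an embedding model (and an LLM) can use."""
    return (
        f"In {row['quarter']}, the {row['region']} region sold "
        f"{row['units_sold']} units at an average price of ${row['avg_price']:.2f}."
    )

snippets = [row_to_snippet(r) for r in rows]
for s in snippets:
    print(s)

# Each snippet would then be embedded and stored in the vector database,
# alongside provenance (table name, row key) to support attribution and removal.
```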

Vector databases are an emerging, scalable, and cost-efficient way to customize generative AI models for vertical-specific use cases. They are often associated with first-party proprietary data for internal use cases, and that's certainly a great application. But what's much less discussed is the exciting opportunity for vector databases to serve as a clean, controllable, privacy-compliant, flexible, and high-potential way for data owners to monetize data for third parties.

Ryan Scott

Payment Systems for AI Voice Agents

1y

Good article, Jake. I agree there's an opportunity for businesses to leverage their accumulated knowledge internally and possibly for external monetization.

Vas Bakopoulos

SVP | Brand Strategy, Data & Attribution, Marketing Insights | MMA, Possible, Digitas, Kantar, ARF, I-com | Instructor at NYU | Keynote speaker |

1y

Thanks for sharing, Jake. As a layman, I struggle to visualize embeddings and vectors, let alone follow the implications of vector databases, but this is good food for thought.
