The AI Plateau Is Real — How We Jump To The Next Breakthrough
Emergence Capital
By Gordon Ritter and Wendy Lu
Imagine asking an LLM for advice on making the perfect pizza, only to have it suggest using glue to help your cheese stick — or watching it fumble basic arithmetic that wouldn’t trip up your average middle school student. Such are the limitations and quirks of generative AI models today.
History tells us that technological advancement never happens in a straight line. An almost undetectable buildup of knowledge and craft is met with a spark, setting off an explosion of innovation that eventually reaches a plateau. Across centuries, these innovations share a common pattern: the S-Curve.
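The S-Curve pattern described above can be made precise with the logistic function, a minimal formalization we add here as a gloss (it is not a formula from the article):

```latex
% Logistic (S-curve) growth: slow buildup, rapid explosion, eventual plateau.
S(t) = \frac{L}{1 + e^{-k\,(t - t_0)}}
```

Here $L$ is the ceiling the curve plateaus at, $k$ sets the steepness of the explosive middle phase, and $t_0$ is the inflection point where growth is fastest. Growth is near zero at both ends, which is exactly the "undetectable buildup" and "plateau" the pattern names.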
The AI Plateau
We have just witnessed this exact pattern in the AI revolution. In a 1950 paper, Alan Turing became one of the first computer science pioneers to explore how a thinking machine might actually be built.
Over seventy years later, OpenAI built on decades of sporadic progress to create a Large Language Model that arguably passes the Turing Test, answering questions in a way that is indistinguishable from a human (albeit still far from perfect).
When ChatGPT was first released in November 2022, it created a global shockwave. For a while, each subsequent release, and releases of models from other companies like Anthropic, Google and Meta, offered drastic improvements.
These days, the incremental progress of each new LLM release is limited. Consider this chart of performance increases of OpenAI’s flagship model:
Although every benchmarking system has shortcomings, clearly the pace of change is no longer setting the world on fire. What’s needed now, and what we hope is coming, is the jump to the next S-Curve.
We believe we understand what’s caused AI to plateau, and what is needed to make the next jump: Access to the next frontier of data.
The Next Curve: Proprietary Business Data
Today’s large language models are trained on public data from the internet. But public textual training data on the internet has long since been harvested (think GitHub, Reddit, WordPress, and other public web pages). This has forced AI companies to scavenge other sources. OpenAI, for example, used its Whisper speech-recognition model to transcribe a million hours of YouTube videos as training data for GPT-4. Yet another tactic is to employ low-cost, offshore human “labelers” through services like Scale AI.
Model providers can keep following this path (there are an estimated 150 million hours of video on YouTube, after all), but it won’t let them escape the flattening S-Curve. Improvements are likely to be marginal and returns diminishing. Synthetic data is another pathway, but it comes with its own limitations and weaknesses.
We believe the real breakthrough that will allow humanity to jump to the next S-Curve is data produced at work. Workplace data is of far higher quality than what’s left of the public internet for training purposes, especially compared to running the dregs of the web through the transformer mill, the results of which may be why so much AI-generated content is already being called “slop.”
A product spec, a sales deck, or a medical study produced in a work context is many times more valuable than an unverified Wikipedia page or Reddit post. Even better is when this work comes from an expert at the top of their field.
Startups that unlock the world’s business data will be poised to create multiples more value. As a proxy, we compared the average revenue per user (ARPU) of top consumer apps against the per-seat pricing of select B2B apps. Even the most “consumer-oriented” business apps, such as Notion, earn far more revenue per user than consumer tech companies:
The math is simple. The value proposition of AI for B2B is vast, and, we believe, still largely untapped.
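To make the per-user comparison concrete, here is a back-of-the-envelope sketch. The numbers below are hypothetical illustrations chosen by us, not figures from the article’s chart:

```python
# Hypothetical, illustrative numbers only (not from the article's chart):
# ad-based consumer ARPU vs. per-seat B2B subscription pricing, annualized.
consumer_arpu_annual = {"social_app": 45.0, "streaming_app": 80.0}   # $/user/yr
b2b_price_annual = {"notes_tool": 10 * 12, "crm_tool": 75 * 12}      # $/seat/yr

avg_consumer = sum(consumer_arpu_annual.values()) / len(consumer_arpu_annual)
avg_b2b = sum(b2b_price_annual.values()) / len(b2b_price_annual)

# With these assumed numbers, B2B earns roughly 8x more revenue per user.
print(f"B2B earns {avg_b2b / avg_consumer:.1f}x more per user")
```

Even granting wide error bars on the assumed prices, the gap is large enough that a per-user revenue advantage for B2B software survives.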
Meanwhile, knowledge workers continuously produce business data at an incredible cadence.
Data produced in a work context will drive the next S-Curve.
A Slippery Slope
As LLM providers start to tackle business use cases (see OpenAI’s Rockset acquisition and Anthropic’s recent launch), enterprises are right to be wary. OpenAI and Anthropic today claim that they do not train models on data from business-tier subscriptions. History tells us that the pressure of growing their businesses might force them to backtrack.
Take Facebook as an example. Meta long claimed to ignore users’ activity on partner websites while they were logged out; $725M in privacy-related settlements later, it is still gobbling up consumer data at a massive rate. Salesforce, a cloud software pioneer, originally committed that customer data would never be shared with third parties. Its current privacy policy negates this.
History repeats itself, but this time the stakes are higher. With the rise of cloud, SaaS applications were primarily used for “non-core” processes – anything absolutely core to a business was built in house. With AI, the data being fed into closed-source models could include everything from a company’s knowledge bases to its internal processes, contracts, PII, and a host of other proprietary, sensitive data.
All of this rich context constitutes a business’s sustainable competitive advantage. To protect what’s theirs, we believe businesses need to own their own proprietary models.
Just as the New York Times is fighting to protect its IP, businesses should resist the big AI companies’ appetite to harvest their proprietary data in the manner they did with public data.
To fully leverage the brilliance within their organizations, businesses should own their models. Ownership lets them continuously improve while sustaining their competitive advantage. We believe this is the right way to make the jump to the next S-Curve.
Big AI companies are rapidly becoming incumbents, but all is not lost. We have identified four areas of opportunity for new startups to solve the AI plateau in a way that is compatible with the needs and imperatives companies face.
Four Key Opportunities
These are the four key areas of opportunity emerging for new startups. We’re already seeing each of these areas experience a large market pull, making them fertile ground for new disruptors.
1. Engage Experts
There is a large opportunity to create novel ways to source AI training data. The highest-quality data will come from experts in each field, rather than from today’s services, which rely primarily on crowdsourced human labelers.
2. Leverage Latent Data
A treasure trove of data already exists in an organization’s business apps (think Salesforce, Notion, and Slack). There is an opportunity to help enterprises prepare this data for model training or inference. OpenAI’s recent acquisition of Rockset, which will power the retrieval infrastructure of ChatGPT’s enterprise products, speaks to increasing investment in this area.
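Using latent business data at inference time typically means retrieval: find the internal documents most relevant to a query and feed them to the model as context. The sketch below is a deliberately minimal stand-in, using bag-of-words cosine similarity in place of the embedding models and vector databases a real system would use; the documents and function names are our own hypothetical examples:

```python
# Minimal retrieval sketch over internal business documents ("RAG").
# Real systems use embedding models and a vector store; plain bag-of-words
# cosine similarity stands in here. Documents are hypothetical examples.
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    """Tokenize naively and count term frequencies."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    qv = vectorize(query)
    ranked = sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True)
    return ranked[:k]

docs = [
    "Q3 sales deck: enterprise pipeline grew 40 percent quarter over quarter",
    "Product spec: single sign-on support for the admin dashboard",
    "Onboarding guide: how new hires request laptop and badge access",
]
context = retrieve("how did the enterprise sales pipeline do last quarter", docs)
# `context` would then be prepended to the model prompt at inference time.
```

The point is architectural rather than algorithmic: the proprietary documents stay in the enterprise’s own store, and only the retrieved snippets reach the model at inference time.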
3. Capture in Context
Allow businesses to capture the net-new data generated each day, without interrupting employees’ normal workflows, rather than through out-of-context data-tagging initiatives.
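One way to picture in-context capture is instrumentation: record the input and the expert’s output of work that is already happening, as a side effect, rather than running separate labeling sessions. The sketch below is conceptual and all names in it are our own assumptions:

```python
# Conceptual sketch (assumed names throughout): capture (input, output)
# training pairs as a side effect of normal work, without changing it.
import functools
import io
import json

def capture(log):
    """Wrap a workflow step so each call is logged as a JSONL training pair."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            log.write(json.dumps({"input": args, "output": result}) + "\n")
            return result
        return inner
    return wrap

log = io.StringIO()  # stand-in for durable, access-controlled storage

@capture(log)
def draft_reply(ticket: str) -> str:
    # Stand-in for work an expert already does in their normal tool.
    return f"Re: {ticket} - escalating to tier 2"

draft_reply("login failure on SSO")
# One JSONL training pair now exists, captured without workflow interruption.
```

The expert’s workflow is unchanged; the training data accrues invisibly, one verified example per unit of real work.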
4. Secure the Secret Sauce
Help enterprises to create and deploy their own custom models, letting them stay in control and protect proprietary IP.
This is just the tip of the iceberg: there are myriad opportunities to solve the AI plateau and leap to the next S-Curve in AI performance. And it is only the latest chapter in the ongoing story of human technological advancement.
It’s crucial that this next wave of technology looks like those that came before it. Advancement in AI should be based on human discovery and knowledge, crafted with human-centric attributes of privacy and quality in mind. In the dance of co-creation with AI, humans need to lead.
If you’re building an enterprise-focused AI tool that is attempting to solve these problems, we’d love to hear from you. Send a note anytime to [email protected] or [email protected].