Why I think generative AI is overhyped

I think we're at the peak of the hype cycle around LLMs/transformer models/"generative AI", and the trough of disillusionment is coming.

I realize that I risk looking like a complete fool by putting this prediction out there. I'm OK with that. I have always had a contrarian streak. In a couple of years, if I'm wrong, we can look back on this post and laugh.

The TL;DR version of my argument is: Investors are giving away far too much money (part 1) pursuing business use cases which are far from proven (part 2) based on assumptions of exponential future improvement in LLM technology (part 3), which are about to run into the fundamental limitations of the approach (part 4).

I believe the transformer-based architecture is a real breakthrough technology, but there is a hype bubble being built around it based on promises that it can't fulfill.

(This is distinct from the moral argument that we should slow down AI development because of the harm it's doing to society. Those harms are real, and already happening, and I care about them, but that's not what this post is about. This post is purely about the cold mercenary question of whether it's got enough of a business use case to cover its development costs.)

1. Too much money

An insane amount of money is being pumped into data center hardware and model training. Sam Altman says he needs, what, 7 trillion, trillion-with-a-T, dollars? And right now the "product" is being given away for free, or almost free. This should remind you of the hype around companies like WeWork, Uber, and DoorDash ten years ago when they were showing exponential growth. It's easy to show exponential growth through the simple but unsustainable trick of giving money away. It can't continue. That's how we know we're in a bubble.

Investors don't do this because they love giving money away. It's because they expect to cash out at trillion-dollar valuations. At some point the bill is going to come due. OpenAI and other companies will have to start charging users for ChatGPT conversations or DALL-E images in a way that can pay back the investment.

Then we'll find out what these models actually cost to run. Will people still be excited about ChatGPT if it costs $10 per conversation?

2. Business use cases that are far from proven

I've tried using ChatGPT to write code. It initially wowed me because it produced correct code for toy problems that are similar enough to problems in its training set. But then I poked a little further and noticed that it spits out code that is plausible, well-documented, cleanly formatted -- and completely wrong.

I've seen a lot of breathless articles on Medium and Substack where people say "look at this Python code that ChatGPT generated from my simple prompt! We don't need software engineers anymore!" But then if you take their generated Python code and try running it, it's wrong. (That's right, the blogger never even bothered to check whether it worked.)

Statistical text prediction is good at predicting the grammar of a programming language. It makes syntactically valid code that will run. But there's no guaranteed connection between that and the reality of what you want to do. The code is superficially plausible, but might produce no output, or worse, plausible-but-incorrect output.

LLMs are useful as a fancy autocomplete - sort of an automated way of copy-pasting code from Stack Overflow. But you have to code review its output carefully. The further you get from the training set, the more likely the code has subtle bugs in it. And if you're working on unsolved problems, you're by definition going to be far from the training set.
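To make that concrete, here's a hypothetical example of the kind of subtle bug I mean. This is not actual ChatGPT output, just an illustration of the failure mode: clean, documented, runs without errors, and quietly wrong.

```python
def moving_average(values, window):
    """Return the moving average of `values` over a sliding window."""
    averages = []
    for i in range(len(values)):
        chunk = values[i:i + window]          # looks reasonable...
        averages.append(sum(chunk) / window)  # ...but the final chunks contain
                                              # fewer than `window` items, so the
                                              # trailing averages come out too small
    return averages

print(moving_average([2, 4, 6, 8], window=2))  # [3.0, 5.0, 7.0, 4.0] - last value is wrong
```

A code review catches this in a toy function. In a few hundred generated lines, it's much easier to miss.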

And I think generating code is one of its best use cases. For most other use cases, it's worse.

An LLM cannot be relied upon to tell the truth. If you try to use it for legal research, it makes up nonexistent cases. If you let a chatbot talk to your customers, it offers them nonexistent deals. It will not work for use cases that require factual accuracy.

It's also never going to come up with anything original. (I'm not making some nebulous argument about a machine lacking an artist's soul or whatever. Rather, it gives you something like a statistical average of its training set. It's like taking everything that humans have written about the topic on the internet, and then taking the middle of those responses.) It says the most obvious thing, the cliche, the conventional wisdom. In a word, it's mid.
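A drastically simplified sketch of why that happens, assuming a toy "model" that does nothing but count which word follows which in its training text: greedy next-word prediction always returns the most common continuation, which is by construction the most obvious one.

```python
from collections import Counter, defaultdict

# Toy stand-in for a language model: count word-to-word transitions in a tiny
# "training corpus", then always predict the continuation seen most often.
corpus = ("the cat sat on the mat . the cat sat on the sofa . "
          "the cat sat on the mat .").split()

follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(word):
    """Greedy decoding: return the most frequent continuation from training."""
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' - the most common, most obvious choice
```

Real LLMs are enormously more sophisticated than a bigram counter, but the decoding step is still "pick a highly probable continuation", and highly probable is another way of saying unsurprising.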

What does that leave? Use cases that don't require originality or correct details. That means, well, advertising for one thing. Propaganda. Generic web copy. It's good at role-playing. Generic filler dialogue between fictional characters. On the image-generation side, it's good when you just need an illustration to accompany a post. Something to replace clip art or stock photography. When a company wants a web design that just looks like every other company's web design.

It's good at vibes. If all that matters is the vibe of a piece of text or an image, then ChatGPT can replace a human worker. Copywriters and illustrators should probably start looking for new work.

For anything else, the output of an LLM needs to be supervised, fact-checked, edited, and re-written by a human, to the point that it's questionable how much time is actually being saved.

So, speeding up your coders and firing your copywriters/illustrators. Companies will pay for that. But will they pay enough for it to pay back trillion-dollar investments?

3. The assumption of exponential future improvement

Whenever I point out a limitation in generative AI, someone tries to sweep away my objection with "Well, it's still in its infancy; it's going to keep getting better". This has become a thought-stopping cliche.

Everybody seems to be abandoning their skepticism and buying into a narrative of exponential improvement. But is that supported by evidence? Or are we just looking at the difference between GPT-2 and GPT-3, and between GPT-3 and GPT-4, and extrapolating an exponential curve from three data points?

(This reminds me of the argument in 2021 that NFTs were going to exponentially increase in value. We saw how that turned out.)

Exponential growth never lasts forever. Whenever you think you're seeing exponential growth, zoom out and you'll find that you're looking at the bottom of an S-curve. The question about transformer architectures is whether we're closer to the beginning or the end of the S-curve.

Incremental improvements in LLMs haven't come from new algorithmic breakthroughs or deeper understanding, but from brute force - throwing bigger and bigger data sets at models with more and more parameters (and more and more Nvidia GPUs). The slogan is "Scale is all you need". But at a certain point, we can't increase the size of the training set by further orders of magnitude. Once we've trained a model on the entire text of the internet, it's not like we have ten more internets to train the next one on.

Signs that we're closer to the end of the S-curve would be that exponentially increasing inputs (size of training data sets, number of GPUs, energy consumption of a data center) are required to continue making linear progress. Doesn't that sound like what's happening?
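Here's a back-of-the-envelope illustration of what that looks like, assuming (with made-up constants, purely for illustration) that loss falls as a power law in compute, which is roughly the shape the scaling-law papers report:

```python
# Illustrative only: loss(C) = a * C**(-0.05), with invented constants.
a, exponent = 10.0, 0.05

def loss(compute):
    return a * compute ** -exponent

for compute in [1e3, 1e6, 1e9, 1e12]:   # each step is 1000x more compute
    print(f"compute {compute:.0e} -> loss {loss(compute):.2f}")
# Prints roughly 7.08, 5.01, 3.55, 2.51: each 1000x jump in compute buys a
# smaller absolute improvement than the last (2.07, then 1.46, then 1.04).
```

Exponentially more input, steadily shrinking returns. The constants above are invented, but the shape is the point.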

4. The fundamental limitations of the approach

Many proposed business use cases depend on the idea that fundamental problems such as "hallucinations" will be fixed any day now.

I don't think we should use the word "hallucinations", as if they were a deviation from the LLM's normal functioning. An LLM is a statistical model for predicting the next word. There's no concept of truth or falsehood. Far from being a deviation, composing fiction is what it does when working as designed. Some of its fiction happens to contain true facts, basically by accident.
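A toy way to see this, with made-up numbers: the only thing compared at generation time is how probable each continuation is given the preceding words, as estimated from training text. Nothing in that comparison consults a fact base.

```python
# Made-up scores for a single next-word decision. A real model produces these
# from its weights; the key point is that "is this true?" is never computed.
prompt = "The first person to walk on the moon was"
candidate_scores = {
    "Neil":   0.48,   # fluent and (as it happens) true
    "Buzz":   0.31,   # just as fluent, factually wrong
    "Yuri":   0.12,   # fluent, wrong
    "cheese": 0.0001, # disfluent - the only kind of error the model reliably avoids
}
print(prompt, max(candidate_scores, key=candidate_scores.get))
```

When the true continuation happens to dominate the training text, you get a fact; when it doesn't, you get fluent fiction, and the model has no way to tell the difference.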

Fine-tuning and RLHF (reinforcement learning from human feedback) paper over this problem without solving it. There's still no concept of truth or falsehood, only an increased chance of producing certain output if your question is similar to something from the RLHF set.

We still can't explain why an LLM produced a particular output, in terms of model weights or internal cause-and-effect. It's still a black box.

It's not clear that it's possible to create a model within the transformer-based architecture where the output is constrained to have some relationship to reality. If it is possible, it will probably come from RAG (retrieval-augmented generation), eventually. But that's still a research program.
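For what it's worth, the basic shape of RAG is easy to sketch; the open research questions are in making retrieval good enough and keeping the model faithful to what was retrieved. A minimal outline, with a placeholder llm_complete() standing in for whatever model you'd actually call:

```python
# Minimal RAG outline. `llm_complete` and the keyword retriever are
# placeholders; a real system would call an actual model and use vector search.
documents = [
    "Policy doc: refunds are available within 30 days of purchase.",
    "Policy doc: enterprise plans include 24/7 phone support.",
]

def llm_complete(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "[model output would go here]"

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Crude keyword-overlap retrieval, just to show where retrieval slots in."""
    words = set(question.lower().split())
    return sorted(docs, key=lambda d: len(words & set(d.lower().split())), reverse=True)[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question, documents))
    # Asking the model to answer *from the retrieved text* constrains - but
    # does not guarantee - a connection to reality.
    return llm_complete(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

print(answer("What is the refund policy within 30 days?"))
```

Grounding the prompt in retrieved documents narrows the space for invention, but the model can still ignore or misread the context.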

Meanwhile we're rushing stuff out to market that probably still belongs in a research lab.

Conclusion

The Sora demo makes people say "Wow! It generated a plausible video." It is amazing (as long as you don't look too closely at the background, or try to read any of the text). But we know that demos cherry-pick the best results achieved under controlled conditions. To move from demos to production, you don't need something that makes a video; you need something that makes the video your business needs.

ChatGPT etc. are like a coworker who's much better at "convincing the boss they've done the work" than "actually doing the work". As a text generator that always tells you exactly what you want to hear, it's the perfect investor bait -- a product that literally writes its own pitch. This has fooled a lot of people into thinking that ChatGPT is much more production-ready than it actually is. Especially if they go in with a bias towards believing it - ChatGPT is very good at confirming your existing biases.

Unlike the NFT bubble, there is a genuine new technology here. The transformer/attention model is a real advancement, a breakthrough discovery about the mathematical structures underlying language. But both things can be true. There can be a real advancement at the core with a hype bubble built around it. After we go through the valley of disappointment, this technology will eventually find an appropriate, useful niche. But it might not be any of the niches we expect today.

In the meantime, keep your wits about you, don't get caught up in FOMO or groupthink, and don't throw away your skepticism.


I think it will just add up to these things; it's not something very great.

Clea Jones

Content Marketing Specialist

8 months ago

You nailed it.

Oliver Ford

Software Engineer at DXC

8 months ago

In terms of code, has any LLM been tested for semantic understanding? Can any, say, read the 2014 version of OpenSSL and recognize Heartbleed? If not, there's no way it's replacing humans.


Disillusionment is certainly possible... everything is harder than it first seems and takes longer than it first seems. That's why the Gartner Curve keeps ringing true. But I really don't agree with the reductionist take here. LLMs aren't just autocompletion. If I gave you a ton of multiplication problems to complete, you might just memorize the completions... or you might learn how to do multiplication. Completion as a form of training is prediction, and at a certain point prediction requires making models and building understanding. LLMs got past simple statistical completion. They aren't just completion, but it's very hard to describe what they are. We honestly don't know; that's part of the struggle of developing with them. They solve tasks that we previously thought were unsolvable by computers, so no one was asking for this functionality. The things LLMs can do are things technologists would previously have self-censored as pipe dreams.
