What happens when AI has read everything?


Are you a writer scared that you might soon be without a job? Fear not: according to this article in the Atlantic, it seems we may need more books now than ever. According to a research team led by Pablo Villalobos, we may run out of high-quality language data as soon as 2023, and we're likely to run out of vision data somewhere between 2030 and 2070 (source).
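For the back-of-the-envelope readers among us: the reasoning behind such a projection is essentially "the stock of good text grows slowly, while the appetite of training runs grows fast". Here's a toy sketch of that shape of argument, with completely made-up numbers, so an illustration only and not the actual model or figures from the Villalobos paper:

```python
# Toy illustration of the "running out of data" argument.
# All numbers below are invented for the sake of the example.

stock = 10e12          # assumed stock of high-quality text, in tokens
stock_growth = 1.07    # assumed yearly growth of that stock (new books, papers, ...)
demand = 1e12          # assumed tokens consumed by the largest training run today
demand_growth = 1.5    # assumed yearly growth of training data appetite

year = 2023
while demand < stock and year < 2100:
    stock *= stock_growth
    demand *= demand_growth
    year += 1

print(f"With these made-up rates, demand overtakes supply around {year}.")
```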

LLMs perform better when trained on books

Now, whereas we're generating labeled visual data pretty much every day on our social media feeds, we're considerably slower to replenish our written language data, because writing takes time. And LLMs are picky readers: "Large language models trained on books are much better writers than those trained on huge batches of social-media posts". The Villalobos research paper itself mentions that high-quality data is typically "composed of 50% scraped user-generated content (Pile-CC, OpenWebText2, social media conversations, filtered webpages, MassiveWeb, C4), 15-20% books, 10-20% scientific papers, <10% code and <10% news. In addition, they all incorporate known small very-high-quality datasets like Wikipedia".
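To make that quote a bit more concrete: you can think of such a training corpus as a weighted recipe. The sketch below is purely illustrative; the weights are my own reading of the quoted ranges, not any lab's actual recipe.

```python
# Illustrative data mixture, loosely based on the composition quoted above.
# The exact weights are assumptions picked from the quoted ranges.
data_mixture = {
    "scraped_web_and_social": 0.50,  # Pile-CC, OpenWebText2, C4, MassiveWeb, ...
    "books": 0.18,
    "scientific_papers": 0.15,
    "code": 0.08,
    "news": 0.06,
    "wikipedia": 0.03,               # small, but very high quality
}

# Sanity check: the weights should describe the whole corpus.
assert abs(sum(data_mixture.values()) - 1.0) < 1e-9
```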

The Atlantic article then goes on to explain that we might soon run into a shortage of these high-quality data sources, especially books. They mention Google researchers estimating that of the more than 125 million books that have been published since Gutenberg brought printing to Western Europe, between 10 and 30 million have already been digitized, and may therefore already be in AI's training data. But that's just a fraction of what future LLMs will be able to ingest.

Let's put dongles around our necks and record speech acts

The speculative solutions that are mentioned in the Atlantic article sound rather ominous, to be honest:

  • AI could create synthetic training data itself, where an LLM could be "like the proverbial monkeys with typewriters, only smarter and possessed of functionally infinite energy. They could pump out billions of new novels, each of Tolstoyan length".
  • We humans could provide data to the AI: "we could all wear dongles around our necks that record our every speech act", or we could harvest our text messages or record the keystrokes of all white-collar workers.

But why?

Even though the researchers remark that these solutions are currently neither feasible nor acceptable, I keep wondering.

Suppose AI is able to create billions of novels: what's the point if they'll never be read by humans?

Sure, I get it: from a machine learning point of view, it may sound logical that if LLMs perform best on books, we should give them books. But from a humanistic point of view, I really wonder what it means to have basically no human in the loop when adding to what's possibly the most important data set out there: our collective human knowledge.

What would it mean to generate billions of novels and feed them to an AI, without any human having seen, read, or evaluated those texts? Especially when we don't know yet how AI will be used in the future? Right now, ChatGPT is clearly recognisable as an app, and you can choose to use it or not. But what if these AIs get more integrated in our lives? More ambient? Would you like to base your information consumption, your decision making, and the disclosure of your personal data on a background algorithm that's basically feeding home-grown fiction back into itself?

[Image: Fantasia - Disney (1940)]

And what does it say about us humans, when we're willing to let algorithms produce the very thing that writers struggle to earn a living from, and that fewer and fewer people read for their own pleasure, benefit or learning? Books as cheap, replaceable mass fodder for machines. Is this really something we should aspire to?

That is, if it's even possible at all, as my 15-year-old nephew remarked yesterday when we were discussing this topic. 'If AI can't generate anything new, how can it write something original? Or invent a new literary genre? Create its own style?' (Yes, I know, I've got an amazing nephew. He's teaching me to build mechanical keyboards too.)

[Image: Is this really what we want? - Source unknown]


Why are books so good for training LLMs in the first place?

That raises the question: what exactly is it about books that makes LLMs perform better? Is it the larger volume of text? The larger and more varied vocabulary? The higher chance of cohesion and coherence, rhetorical devices and stylistic choices that are inherent in creating longer texts? Undoubtedly these text-related aspects play a role.

But I wonder: might it also be something more fundamentally human? Writing a book takes effort, it takes time, and it takes someone who basically does the thinking for you, so you don't have to. A well-written text effortlessly guides you through a topic, an instruction, or the writer's mind, heart and soul.

[Image: In medieval times, books were so precious that they were chained to their shelves to prevent them from being stolen - Chained library at Hereford Cathedral - Maaike Groenewege]

That means that the writer needs to know things on a deeper level than her reader. For that, she needs to know who she's writing for. She needs to find the common ground where both writer and reader can meet, not only in terms of knowledge, but also in an emotional and often spiritual sense. In many cases, that's a result of lived experience. That, to me, is the beating heart of the writing process. Transferring, sharing and rejoicing in lived experience.

Might it be that book-trained LLMs perform better because books were written by people who feel and think?

Nick Cave: a grotesque mockery of what it is to be human

Two weeks ago, Nick Cave made the news with his blog post about a ChatGPT song "written in the style of Nick Cave", calling it "a grotesque mockery of what it is to be human".

This quote says it all, much more eloquently than I ever could:

"Songs arise out of suffering, by which I mean they are predicated upon the complex, internal human struggle of creation and, well, as far as I know, algorithms don’t feel. Data doesn’t suffer. ChatGPT has no inner being, it has been nowhere, it has endured nothing, it has not had the audacity to reach beyond its limitations, and hence it doesn’t have the capacity for a shared transcendent experience, as it has no limitations from which to transcend. ChatGPT’s melancholy role is that it is destined to imitate and can never have an authentic human experience, no matter how devalued and inconsequential the human experience may in time become."

[...]

"What makes a great song great is not its close resemblance to a recognizable work. Writing a good song is not mimicry, or replication, or pastiche, it is the opposite. It is an act of self-murder that destroys all one has strived to produce in the past."

So now what?

It seems hard to stop this movement. Then again, we've had AI winters before. But imagine... just an idea... if LLMs benefit from books, if ML needs us to be language-savvy, and if we feel this is important, can't we make this a shared opportunity for both ML and humanity and increase the creation of human-written books? What if tech were to reinvent itself as the patron of literature? With funds for authors of all genres, initiatives to get people reading again (hey, we'll be out of jobs soon anyway, right? I'd be totally fine reading books 40 hours per week!), and rankings of which books should make it into the ML canon for algorithms? I guess I'm hopelessly old-fashioned and idealistic, but one can dream.

[Image: Cemetery at Wharram Percy, deserted medieval village in Yorkshire - Maaike Groenewege]




About Maaike Groenewege

Maaike is a conversation design lead, linguist and DesignOps coach with Convocat conversational expertise. She helps both starting and more experienced conversational teams in optimising their conversation design practice, NLU analyses and team communication. Her main focus right now is on how LLMs can benefit enterprise conversational AI.

  • Join Convoclub, my boutique community for conversation designers, linguists and language lovers. We meet every other Monday at 18:00 CET, find the invites in the Convoclub Events section, or keep an eye out on LinkedIn!
  • Find your next book on conversational AI in my Convocat library

Julie Daniel Davis, CETL

Educational Technologist


THIS! We, as humanity, have to value the beauty of originality and creativity. This is the first post I’ve seen on the existential crisis that could be looming ahead for many. Is anything sacred to us? Is anything worthy of the human touch? While I am all for advancement of mankind, I can’t help but wonder “at what cost?” when I consider posts like yours.

Michael Novak

Responsible AI Business Exec | Leveraging ChatGPT AI, Digital Identity & Web3 to drive value.


Maaike: Hoihoi! This is one of my favorite posts. As a fellow bibliophile, your post raises an issue that I’ve just begun pondering, and you raise several questions about the nature of the relationship between “us” humans and “them” AI tools. Will we need to create a standard clause, a “machineright”, to indicate that more than x percent (10? 30?) of a original published work was created by an AI tool? Not used as a reference source such as Wikipedia, but generated. Would a machineright change how we trust primary sources? If indeed AI “reads everything”, does that make it an “expert”? History is full of social and scientific advances that were the product of serendipity by uneducated amateurs. Forward Convocat convoclub ! (this post self-certified as being written by a human)
