What happens when AI has read everything?


Are you a writer scared that you might soon be without a job? Fear not: according to this article in the Atlantic, it seems we may need more books now than ever. According to a research team led by Pablo Villalobos, we may run out of high-quality language data as soon as 2023, and we're likely to run out of vision data somewhere between 2030 and 2070 (source).
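For the back-of-the-envelope readers among us: the reasoning behind such a projection is essentially "the stock of good text grows slowly, while the appetite of training runs grows fast". Here's a toy sketch of that shape of argument, with completely made-up numbers, so an illustration only and not the actual model or figures from the Villalobos paper:

```python
# Toy illustration of the "running out of data" argument.
# All numbers below are invented for the sake of the example.

stock = 10e12          # assumed stock of high-quality text, in tokens
stock_growth = 1.07    # assumed yearly growth of that stock (new books, papers, ...)
demand = 1e12          # assumed tokens consumed by the largest training run today
demand_growth = 1.5    # assumed yearly growth of training data appetite

year = 2023
while demand < stock and year < 2100:
    stock *= stock_growth
    demand *= demand_growth
    year += 1

print(f"With these made-up rates, demand overtakes supply around {year}.")
```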

LLMs perform better when trained on books

Now, whereas we're generating labeled visual data pretty much every day on our social media feeds, we're considerably slower to replenish our written language data, because writing takes time. And LLMs are picky readers: "Large language models trained on books are much better writers than those trained on huge batches of social-media posts". The Villalobos research paper itself mentions that high-quality data is typically "composed of 50% scraped user-generated content (Pile-CC, OpenWebText2, social media conversations, filtered webpages, MassiveWeb, C4), 15-20% books, 10-20% scientific papers, <10% code and <10% news. In addition, they all incorporate known small very-high-quality datasets like Wikipedia".
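To make that quote a bit more concrete: you can think of such a training corpus as a weighted recipe. The sketch below is purely illustrative; the weights are my own reading of the quoted ranges, not any lab's actual recipe.

```python
# Illustrative data mixture, loosely based on the composition quoted above.
# The exact weights are assumptions picked from the quoted ranges.
data_mixture = {
    "scraped_web_and_social": 0.50,  # Pile-CC, OpenWebText2, C4, MassiveWeb, ...
    "books": 0.18,
    "scientific_papers": 0.15,
    "code": 0.08,
    "news": 0.06,
    "wikipedia": 0.03,               # small, but very high quality
}

# Sanity check: the weights should describe the whole corpus.
assert abs(sum(data_mixture.values()) - 1.0) < 1e-9
```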

The Atlantic article then goes on to explain that we might soon run into a shortage of these high-quality data sources, especially books. They mention Google researchers estimating that of the more than 125 million books that have been published since Gutenberg brought printing to Western Europe, between 10 and 30 million have already been digitized, and may therefore already be in AI's training data. But that's just a fraction of what future LLMs will be able to ingest.

Let's put dongles around our necks and record speech acts

The speculative solutions that are mentioned in the Atlantic article sound rather ominous, to be honest:

  • AI could create synthetic training data itself, where an LLM could be "like the proverbial monkeys with typewriters, only smarter and possessed of functionally infinite energy. They could pump out billions of new novels, each of Tolstoyan length".
  • We humans could provide data to the AI: "we could all wear dongles around our necks that record our every speech act", or we could harvest our text messages or record the keystrokes of all white-collar workers.

But why?

Even though the researchers remark that these solutions are currently neither feasible nor acceptable, I keep wondering.

Suppose AI is able to create billions of novels: what's the point if they'll never be read by humans?

Sure, I get it: from a machine learning point of view, it may sound logical that if LLMs perform best on books, we should give them books. But from a humanistic point of view, I really wonder what it means to have basically no human in the loop when adding to what's possibly the most important data set out there: our collective human knowledge.

What would it mean to generate billions of novels and feed them to an AI, without any human having seen, read, or evaluated those texts? Especially when we don't know yet how AI will be used in the future? Right now, ChatGPT is clearly recognisable as an app, and you can choose to use it or not. But what if these AIs get more integrated in our lives? More ambient? Would you like to base your information consumption, your decision making, and the disclosure of your personal data on a background algorithm that's basically feeding home-grown fiction back into itself?

[Image: Fantasia - Disney (1940)]

And what does it say about us humans, when we're willing to let algorithms produce the very thing that writers struggle to earn a living from, and that fewer and fewer people read for their own pleasure, benefit or learning? Books as cheap, replaceable mass fodder for machines. Is this really something we should aspire to?

That is, if it's even possible at all, as my 15-year-old nephew remarked yesterday when we were discussing this topic. 'If AI can't generate anything new, how can it write something original? Or invent a new literary genre? Create its own style?' (Yes, I know, I've got an amazing nephew. He's teaching me to build mechanical keyboards too.)

[Image: Is this really what we want? - Source unknown]


Why are books so good for training LLMs in the first place?

That raises the question: what exactly is it about books that makes LLMs perform better? Is it the larger volume of text? The larger and more varied vocabulary? The higher chance of cohesion and coherence, rhetorical devices and stylistic choices that are inherent in creating longer texts? Undoubtedly these text-related aspects play a role.

But I wonder: might it also be something more fundamentally human? Writing a book takes effort, it takes time, and it takes someone who basically does the thinking for you, so you don't have to. A well-written text effortlessly guides you through a topic, an instruction, or the writer's mind, heart and soul.

[Image: In medieval times, books were so precious that they were chained to their shelves to prevent them from being stolen - Chained library at Hereford Cathedral - Maaike Groenewege]

That means that the writer needs to know things on a deeper level than her reader. For that, she needs to know who she's writing for. She needs to find the common ground where both writer and reader can meet, not only in terms of knowledge, but also in an emotional and often spiritual sense. In many cases, that's a result of lived experience. That, to me, is the beating heart of the writing process. Transferring, sharing and rejoicing in lived experience.

Might it be that book-trained LLMs perform better because books were written by people who feel and think?

Nick Cave: a grotesque mockery of what it is to be human

Two weeks ago, Nick Cave made the news with his blog post about a ChatGPT song "written in the style of Nick Cave", calling it "a grotesque mockery of what it is to be human".

This quote says it all, much more eloquently than I ever could:

"Songs arise out of suffering, by which I mean they are predicated upon the complex, internal human struggle of creation and, well, as far as I know, algorithms don’t feel. Data doesn’t suffer. ChatGPT has no inner being, it has been nowhere, it has endured nothing, it has not had the audacity to reach beyond its limitations, and hence it doesn’t have the capacity for a shared transcendent experience, as it has no limitations from which to transcend. ChatGPT’s melancholy role is that it is destined to imitate and can never have an authentic human experience, no matter how devalued and inconsequential the human experience may in time become."

[...]

"What makes a great song great is not its close resemblance to a recognizable work. Writing a good song is not mimicry, or replication, or pastiche, it is the opposite. It is an act of self-murder that destroys all one has strived to produce in the past."

So now what?

It seems hard to stop this movement. Then again, we've had AI winters before. But imagine... just an idea... if LLMs benefit from books, if ML needs us to be language-savvy, and if we feel this is important, can't we make this a shared opportunity for both ML and humanity and increase the creation of human-written books? What if tech were to reinvent itself as the patron of literature? With funds for authors of all genres, initiatives to get people reading again (hey, we'll be out of jobs soon anyway, right? I'd be totally fine reading books 40 hours per week!), and rankings of which books should make it into the ML canon for algorithms? I guess I'm hopelessly old-fashioned and idealistic, but one can dream.

[Image: Cemetery at Wharram Percy, deserted medieval village in Yorkshire - Maaike Groenewege]




About Maaike Groenewege

Maaike is a conversation design lead, linguist and DesignOps coach with Convocat conversational expertise. She helps both starting and more experienced conversational teams in optimising their conversation design practice, NLU analyses and team communication. Her main focus right now is on how LLMs can benefit enterprise conversational AI.

  • Join Convoclub, my boutique community for conversation designers, linguists and language lovers. We meet every other Monday at 18:00 CET, find the invites in the Convoclub Events section, or keep an eye out on LinkedIn!
  • Find your next book on conversational AI in my Convocat library

Julie Daniel Davis, CETL

Educational Technologist


THIS! We, as humanity, have to value the beauty of originality and creativity. This is the first post I’ve seen on the existential crisis that could be looming ahead for many. Is anything sacred to us? Is anything worthy of the human touch? While I am all for advancement of mankind, I can’t help but wonder “at what cost?” when I consider posts like yours.

Michael Novak

Responsible AI Business Exec | Leveraging ChatGPT AI, Digital Identity & Web3 to drive value.


Maaike: Hoihoi! This is one of my favorite posts. As a fellow bibliophile, your post raises an issue that I’ve just begun pondering, and you raise several questions about the nature of the relationship between “us” humans and “them” AI tools. Will we need to create a standard clause, a “machineright”, to indicate that more than x percent (10? 30?) of a original published work was created by an AI tool? Not used as a reference source such as Wikipedia, but generated. Would a machineright change how we trust primary sources? If indeed AI “reads everything”, does that make it an “expert”? History is full of social and scientific advances that were the product of serendipity by uneducated amateurs. Forward Convocat convoclub ! (this post self-certified as being written by a human)
