2025 Funday Friday #10: The Future of LLMs is Song


tl;dr - Generative music (and AI more broadly) will get more human-like when models learn from what listeners actually do with the music, not from text descriptions of it.


How do humans understand information?

Text - written language - has become ubiquitous because it is the most efficient method. It doesn't require the specialist knowledge that other forms of communication may demand. It doesn't require the synchronicity that live communication needs in order to function. It's also relatively easy to store. The more I think about it, the more it makes sense that text has taken over as the primary way we understand each other and preserve those insights for the future.

But conversion from a concept to text is not a lossless compression algorithm. To boil a song, for instance - with all of its history, meaning, and even the contours of the sound itself across an infinite range of frequencies - into a set of words is to strip away a great deal of the information that shapes how someone experiences and reacts to music. Often, that reaction to a musical piece is highly dependent on the individual. Accordingly, current approaches to generative music suffer a 'double cut' of data loss - one from the idiosyncratic way a particular listener describes a piece in text, and two from the innate loss of information in converting any abstract object into text.

Solving for the second cut above - the idea that text is not a lossless compression of an underlying idea - is, philosophically, what I believe will enable AI to create more 'human-like' outputs. Conceptually, it's quite simple: by designing how a synthetic model 'understands' information - intake, contextualization, storage - to be as human-like as possible, we are likely to create outputs that are, consequently, more human-like than if the model processes information in a more 'artificial' way. The fact that current LLMs have achieved their level of performance by feeding off of language - a very 'human-like' way of communicating - lends credence to this idea.

It also means that at some point, AI will have to feed off of information that isn't in textual form - and I can't think of a better place to start exploring that dynamic than generative music. How music serves as a communication channel from person to person is completely independent of text; it is a process in which auditory soundwaves across different frequencies and patterns elicit a chemical reaction that invokes some kind of biological response, all without 'the spoken word' ever being used. In fact, introducing the dynamic of 'paired text' likely overcomplicates this process, because people, frankly, aren't experiencing the music 'in textual form.' Yet LLMs currently can't function without paired text, because the way an LLM 'understands' success - the reward function, RLHF, and so on - is based entirely on textual inputs.

What other measure could you use? Well, I like the one you see in this video with award-winning genius musician Jon Batiste: you measure what someone does with the music. Do they listen to it again? Do they riff off of it? Do they incorporate it? Do they take a meaningful action? The action 'fuzzily' implies the music has changed something in them.

So now, instead of having an LLM generate music and be rewarded because it associates the generated piece with the word 'cool' - and that matches a human also associating that piece with the word 'cool' - you change it up. The LLM generates music and is rewarded because it associates that piece with the action 'listen to it again' - and that matches the human actually listening to it again. Seems quite a bit like designing a feedback loop for a video game, doesn't it?
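
To make that loop concrete, here's a minimal Python sketch of the two reward signals - my illustration, not anyone's actual training system. `text_label_reward` scores today's 'the words match' setup, while `behavioral_reward` scores whether the model's predicted listener actions match what the listener actually did. Every name and weight below is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class ListenerFeedback:
    """Behavioral signals a listener produced (or the model predicts they will produce)."""
    replayed: bool          # did they listen to it again?
    riffed_on_it: bool      # did they riff off of it?
    incorporated_it: bool   # did they incorporate it into something of their own?


def text_label_reward(model_label: str, human_label: str) -> float:
    """Today's text-anchored reward: the model 'wins' when its word for the piece
    matches the word a human used for the same piece (e.g. both say 'cool')."""
    return 1.0 if model_label.strip().lower() == human_label.strip().lower() else 0.0


def behavioral_reward(predicted: ListenerFeedback, observed: ListenerFeedback) -> float:
    """The alternative sketched above: reward the model for correctly anticipating
    what the listener actually does with the piece. Weights are arbitrary placeholders."""
    signals = [
        (predicted.replayed, observed.replayed, 0.5),
        (predicted.riffed_on_it, observed.riffed_on_it, 0.3),
        (predicted.incorporated_it, observed.incorporated_it, 0.2),
    ]
    return sum(weight for pred, obs, weight in signals if pred == obs)


# Example: the model correctly called the replay and the non-incorporation,
# but missed the riff, so it earns 0.5 + 0.2 = 0.7.
predicted = ListenerFeedback(replayed=True, riffed_on_it=False, incorporated_it=False)
observed = ListenerFeedback(replayed=True, riffed_on_it=True, incorporated_it=False)
print(behavioral_reward(predicted, observed))  # 0.7
```

The point of the sketch is that nothing in the reward path is a description of the music; the only thing being compared is a prediction about behavior against the behavior itself.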

Take it a step further - we know that music can result in physiological changes in the human brain. Imagine the reward system for an AI you can build when, instead of being limited to text and language, you are training it to understand feedback in a whole other dimension.
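
Purely as a thought experiment in that direction, the same matching idea could run on a physiological reading instead of a click. Everything below - `PhysioReading`, the thresholds - is invented for illustration, not a description of any real sensor or training setup.

```python
from dataclasses import dataclass


@dataclass
class PhysioReading:
    """Invented illustration of a non-text feedback channel; not a real sensor API."""
    heart_rate_delta: float        # change in bpm vs. baseline while listening
    skin_conductance_delta: float  # rough arousal proxy, also vs. baseline


def physio_reward(predicted_to_move_them: bool, reading: PhysioReading,
                  hr_threshold: float = 5.0, sc_threshold: float = 0.05) -> float:
    """Reward the model when its 'this piece will move them' prediction matches a
    measurable physiological change - no text label anywhere in the loop."""
    observed_response = (reading.heart_rate_delta > hr_threshold
                         or reading.skin_conductance_delta > sc_threshold)
    return 1.0 if predicted_to_move_them == observed_response else 0.0
```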

Maybe then, you train it to understand information across multiple other dimensions entirely, without parameterizing it all through text! Because, after all, we humans don't just take in feedback through our eyes or ears, right?

I wonder if that future will be more a Sabrina Carpenter song, or a 'dun dun dun'? Perhaps I may even live to find out.

Talk soon.

-WY
