Will GPT 4o kick off the voice interface revolution? I remain skeptical.
Prof. Peter Kabel
Digital from the outset. Futurist, Consultant, thought leader, speaker, investor and teaching professor. Interested in everything creativeAI. Co-founder of cogniWerk.ai
I have been a member of the human-computer interaction (HCI) community for decades. As a professor of design, I have researched this topic intensively and repeatedly, and I even published a book on conversational user interfaces in 2019.
In the distant past, machines were fed with punch cards, and later with complicated, incomprehensible command lines. Everything changed with graphical user interfaces: the triumphant advance of the PC began and eventually led to the touch interfaces we take for granted on smartphones today. Human input continued to evolve, while the machines' responses remained essentially visual. Computer visuals became ever more diverse, and screen resolutions made true wonders possible - we know that.
Around 2013-14, natural language processing developed rapidly. Machines became able to reliably convert spoken language from audio to text and, after processing that text, to turn the response back into realistic-sounding speech with a synthetic voice.
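To make that architecture concrete, here is a minimal sketch of the classic three-stage cascade (speech-to-text, text processing, text-to-speech) that assistants of that generation were built on. The function names are placeholders rather than any real library's API; each stage stands for a separate model or service, and every hand-off adds latency.

```python
# Minimal sketch of the classic cascade behind pre-GPT-4o voice assistants.
# All functions are illustrative stubs, not a real speech or LLM API.

def transcribe(audio: bytes) -> str:
    """Speech-to-text: turn recorded audio into plain text (stub)."""
    return "When does the next train to Hamburg leave?"

def generate_reply(text: str) -> str:
    """Text processing: a dialog engine or language model produces the answer (stub)."""
    return "The next train to Hamburg leaves at 14:32 from platform 7."

def synthesize(text: str) -> bytes:
    """Text-to-speech: render the answer as synthetic audio (stub)."""
    return text.encode("utf-8")  # placeholder for an audio buffer

def voice_turn(audio_in: bytes) -> bytes:
    # Three sequential hops; each adds latency, which is why cascaded
    # assistants feel sluggish compared to an end-to-end audio model.
    return synthesize(generate_reply(transcribe(audio_in)))

if __name__ == "__main__":
    print(voice_turn(b"<recorded audio>"))
```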
The presentation of GPT 4o last Monday marked a preliminary high point. Wow: you can simply talk to the AI model, it processes the voice input in the demo without any noticeable delay, and it responds in any desired pitch. The model can even sing the words.
If the reality is actually as good as the demo, it will be sensational!
The Alexa smart speakers - just six years old - seem stone-age by comparison. The low latency is a key factor, because a real-time response enables something that is part of every human conversation: interrupting each other and thereby steering the conversation in a different direction. Combined with the impressive speech melody and prosody of the demo and the semantic capabilities of LLMs, this comes very close to the science-fiction model of voice interaction à la "Her".
Most structural problems of speech interfaces are still unsolved, though. I will mention just a few:
1. People of different ages and at different stages of life feel differently about talking to machines. Young people may be more at ease (as may people in Asia, whose non-Latin scripts are often difficult to enter on the small on-screen keyboards of mobile devices). But talking to a smart assistant about a potentially confidential topic in a crowded subway is something else, even for the younger generations. Research shows that reservations exist even when sitting in a room with only one or a few other people present.
2. As soon as both user input and computer output are delivered by voice alone, without visuals, many other problems come into play: the answers can only ever be very short and offer at most 2-3 (!) choices, if only because perceptual psychology has shown that people cannot remember more options without a visual representation. "Which ice cream do you want: strawberry, lemon or chocolate?" "Wait, can you repeat the second one..." This alone severely limits the number of Voice2Voice use cases. A list or a complex answer is not possible.
The machine's answer can only be a so-called "oracle answer": the human asks the Oracle of Delphi, and the answer is simple and without alternative. Take it - or leave it.
This in turn requires a considerable degree of trust that the user must place in the provider of the Voice2Voice interface. I have to trust that the oracle has crafted its answer in my best interests as the user.
The question that arises for me at this point is whether people really place such trust in a single company - OpenAI, of all companies - and whether, if they do extend such blind trust (which, judging by experience in comparable cases, is overwhelmingly likely), that trust would not be illegitimate, and extremely problematic and dangerous both for the individual and for society as a whole.
This is a discussion we need to have seriously now at the latest, because, as we all know, nothing is more difficult to correct than dependence on something that is harmful but convenient.
Finally, the sensational performance of GPT 4o has not changed one of the most fundamental problems of voice interfaces: as long as it is not possible to trigger real transactions by voice - not just asking for the best train connection, but also buying the right ticket right away - voice will not take off.
This in turn requires the integration of booking and payment systems, which can be very cumbersome and time-consuming. Amazon Alexa and its competitors ultimately failed not because their voice skills weren't good enough, but because in the end you couldn't do much more with them than ask a question or set a timer. I don't currently see how GPT 4o alone could resolve this dilemma.
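To illustrate what such an integration would involve, here is a hypothetical sketch in the style of tool/function calling: the model can only propose a structured booking request, and an external backend (stubbed out below) has to actually reserve the seat and charge the payment method. The tool name, parameters and backend are invented for illustration and are not taken from any real railway or payment API.

```python
# Hypothetical sketch: the glue between a voice assistant and real booking/payment
# systems. Tool schema, parameters and backend are invented for illustration only.
from dataclasses import dataclass

# Tool definition the assistant could expose to the language model
# (an OpenAI-style function-calling setup is assumed here).
BOOK_TICKET_TOOL = {
    "name": "book_train_ticket",
    "description": "Buy a train ticket on behalf of the user",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "departure": {"type": "string", "description": "ISO 8601 departure time"},
            "payment_method_id": {"type": "string"},
        },
        "required": ["origin", "destination", "departure", "payment_method_id"],
    },
}

@dataclass
class BookingResult:
    confirmation_id: str
    price_eur: float

def book_train_ticket(origin: str, destination: str,
                      departure: str, payment_method_id: str) -> BookingResult:
    """Stub for the hard part: calling the railway's booking system and a payment
    provider, and handling failures, refunds, receipts and liability."""
    return BookingResult(confirmation_id="DEMO-1234", price_eur=49.90)

def handle_tool_call(name: str, arguments: dict) -> BookingResult:
    # The assistant's runtime must route the model's structured request to real
    # systems - this glue, not speech quality, is what Alexa-era skills lacked.
    if name == "book_train_ticket":
        return book_train_ticket(**arguments)
    raise ValueError(f"Unknown tool: {name}")

if __name__ == "__main__":
    print(handle_tool_call("book_train_ticket", {
        "origin": "Hamburg", "destination": "Berlin",
        "departure": "2024-05-20T14:32:00", "payment_method_id": "pm_demo",
    }))
```

Even in this toy form, the sketch shows where the effort sits: in the backend contracts and payment handling, not in the voice layer itself.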
The operators of the Asian super apps are clearly one step ahead here: they have already integrated payment, and they have already implemented much of this dialogic business logic with their suppliers and customers via typed chat interfaces.
Even the most intelligent AI will find it difficult to implement this kind of meaningful transaction integration in a timely manner. That's why I think the GPT demo is great, of course, but I'm skeptical that it alone will trigger a landslide towards voice interfaces, and I'm very skeptical about the looming dependence on a very one-dimensional form of access to content and knowledge that is opaquely controlled by a single company and gatekeeper.
Reader comment (Product Strategy and Concept Development and Training, 10 months ago):
I would agree with you and expand upon that thought. Most of us who have an Alexa device (or a comparable competing product) would agree with you. If, however, audio were just one of several forms of input, it would become much more interesting. On my laptop, I have a keyboard, trackpad, camera, microphone and data as input methods. Knowing how to use audio as an augmentation of existing methods of interaction becomes much more interesting. Back in the '90s, IBM created a research effort called "Put that there": a gesture-recognition system that saw where your hand was pointing in relation to a projected image and then used voice commands as a complementary tool. With it, the system understood which object you were referring to and what you wanted done with it. That being said, I have found myself using the voice interface of ChatGPT on longer car drives. It allows me to explore topics that I am trying to understand or formulate. The conversational interface is a wonderful hands-free way of doing web research and taking notes. I just wish there were a better way of taking real notes. For example, I also listen to a lot of audiobooks in the car on long drives, and I would like to capture key ideas as notes.