How to make machines actually talk like humans
Image source: Aideal Hwa, Unsplash

One of the promises of AI is that, while it can answer any question we can think of, or carry out any task on our behalf, it can be instructed through the most intuitive interface yet: natural language.

Conversational interfaces are, and will remain, one of the most common ways to access AI systems. The front end of AI is conversational.

And while there has been great progress throughout the conversational AI stack, from speech recognition to natural language understanding and speech synthesis, it's the speech synthesis that users of voice-based AI systems actually experience.

For most users, the synthesised speech is the interface. This is where work remains to be done, and where the opportunities for creating truly human-like AI experiences lie.

Human expectations of conversations

Humans are hard-wired for conversation. It's how we've exchanged information for as far back as human history goes.

It's only natural, then, for us to have high expectations of how a conversation should be conducted, whether it's with other humans or with computers. Teaching kids to say please and thank you to Alexa is a prime example of this.

And those expectations are mapped onto AI interfaces, as long as there's a conversational component involved.

Human-like AI experiences

For AI to be as easy to use as it could and ought to be, it's often said that conversational interfaces should be human-like. As human-like as possible, stopping just shy of being indistinguishable from human-to-human interaction.

However, for conversational interfaces to be truly human-like, we still have a long way to go.

The journey to human-like conversations with AI

Right now, the best examples of AI systems that can hold human-like conversations are, arguably, voice assistants such as Google Assistant and Amazon Alexa, or some of the robots we see, such as Hanson's Sophia, Moxie, or those coming out of Furhat or Intuition.

The companies creating these advancements in artificial intelligence are certainly pushing the boundaries of natural language processing, but we are still miles away from artificial intelligence being able to communicate fully 'like a human'.

One of the biggest limitations is in the spoken voices produced by AI systems.

Understanding the human voice

The human voice carries a huge amount of data. Data that tells you a great deal about the speaker. Data that, according to the authors of Wired for Speech, can be used to make all kinds of assumptions about the speaker. Gender, social status, age, place of origin and a host of other attributes are automatically processed by the human brain to make decisions and form opinions about the person speaking.

A human can tell when another human is sad, happy, anxious, scared or angry, all from the sound of their voice.

Not only can we pick up those signals and understand the emotional state of our conversational partner, we can also mirror that tone in our own responses to build rapport.
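To give a concrete sense of what that data looks like to a machine, here is a minimal sketch of pulling a couple of prosodic features, pitch and loudness, out of a recording. It assumes the open-source librosa library; the file name is a placeholder, and real emotion or speaker-profiling systems would use far richer feature sets with trained models on top.

```python
# Minimal sketch: extracting basic prosodic features from a voice recording.
# Assumes the open-source librosa library; "call.wav" is a placeholder path.
import numpy as np
import librosa

y, sr = librosa.load("call.wav", sr=16000)  # load mono audio at 16 kHz

# Fundamental frequency (pitch) contour: a strong cue for emotion, age and gender
f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)

# Short-term energy (loudness) contour: another cue for arousal and emphasis
rms = librosa.feature.rms(y=y)[0]

print(f"mean pitch: {np.mean(f0):.1f} Hz, pitch variability: {np.std(f0):.1f} Hz")
print(f"mean energy: {np.mean(rms):.4f}, energy variability: {np.std(rms):.4f}")

# Real systems feed dozens or hundreds of features like these into trained
# classifiers to estimate emotional state and other speaker attributes.
```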

Machines have a hard time doing the same

That's not to say there haven't been great advancements in human-computer interaction and natural language understanding. On the contrary, some of them are mind-blowing.

Canary Speech can analyse the 1,500 data points that exist within somebody's voice and diagnose early-onset dementia.

Voca AI (before it was acquired by Snap Inc.) piloted a solution that could recognise the sound of a cough and use it to profile whether somebody has Covid.

Pindrop create voiceprints of speakers and use those prints, unique to each individual, to authenticate people in call centres and prevent fraud.

There have been great advancements in speech recognition technology, which increase the likelihood of a machine recognising what somebody has said.

There have also been great advancements in natural language understanding systems, which increase the chance of a machine grasping the meaning behind what someone has said.

And there have been great advancements in synthesised voice technology, which enable machines to produce realistic-sounding voices in response to user queries.

All of the component parts of AI systems are advancing and improving, but all still have gaps to plug and ways to improve.
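To make the division of labour concrete, here's a toy sketch of how those components chain together in a single voice-assistant turn. Every function is a deliberately naive stand-in, not a real SDK; in practice each stage would wrap a vendor service like the ones mentioned above.

```python
# Toy sketch of one conversational turn through the stack.
# Each stage is a naive stand-in for a real ASR, NLU, dialogue or TTS service.

def recognise_speech(audio: bytes) -> str:
    # Speech recognition: audio in, transcript out. Here we pretend the
    # audio decodes straight to text.
    return audio.decode("utf-8")

def understand(transcript: str) -> str:
    # Natural language understanding: transcript in, intent out.
    return "get_weather" if "weather" in transcript.lower() else "fallback"

def decide_response(intent: str) -> str:
    # Dialogue management: pick what to say back.
    return {
        "get_weather": "It looks sunny today.",
        "fallback": "Sorry, I didn't catch that.",
    }[intent]

def synthesise(text: str) -> bytes:
    # Speech synthesis: response text in, audio out. The encoded text
    # stands in for rendered speech.
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    return synthesise(decide_response(understand(recognise_speech(audio))))

print(handle_turn(b"What's the weather like?"))  # b'It looks sunny today.'
```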

The gaps in synthesised voices

The synthetic voice is arguably one of the most important, yet most often overlooked, parts of the interface. It's the only part users actually interact with. It is their experience.

And while today's synthesised voices have become more human-like, there's a long way to go in how those voices are applied.

How humans speak in context

You see, people speak differently depending on who they're speaking to, what they're speaking about, where they are when they're speaking, and a whole host of other contextual variables.

Think about how a customer service operative speaks to you on the phone: chances are they sound happy. They're helpful, chirpy and engaging. They use cues to signal that they're listening, confirmation words like "yes", "mm hmm" and "I see".

The way they deliver those words, tonally, is unique to that situation as well.

Compare that to a news reader delivering the headlines on the 6 o'clock news. Two completely different ways of speaking.

Compare that to the small talk you might have with a friend or family member over lunch.

Compare that to an emotionally charged conversation you might have with your manager at work.

And compare that to a conversation you might have with a friend in a library.

They all require completely different tonal variations, intonation, delivery patterns, cadences, levels of animation, volume and velocity. And humans don't even think about it. We just do it.

Applications of synthesised speech styles

Most machines that use synthesised speech to speak to customers, such as Amazon Alexa and Sophia, use text-to-speech systems and voices that are created for specific use cases.

Alexa's standard voice is good for quick, sharp bursts of interactivity: telling you the weather, announcing a timer or reading the steps in a recipe. But there's a reason Amazon released a newscaster style for Polly, and that's because the delivery of a news reader is completely different from what's required for short, sharp interactions.
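As a rough illustration, here's how that newscaster style can be requested from Polly via SSML. This is a minimal sketch assuming the boto3 SDK with AWS credentials already configured, and a neural voice that supports the news domain (Matthew and Joanna did at the time of writing).

```python
# Sketch: asking Amazon Polly to read text in its newscaster speaking style.
# Assumes boto3 is installed and AWS credentials are configured locally.
import boto3

polly = boto3.client("polly")

ssml = """
<speak>
  <amazon:domain name="news">
    The central bank held interest rates steady today,
    citing a gradual cooling in inflation.
  </amazon:domain>
</speak>
"""

response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",
    VoiceId="Matthew",   # a neural voice that supports the news domain
    Engine="neural",     # the newscaster style requires the neural engine
    OutputFormat="mp3",
)

with open("headline.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```

The words are identical either way; it's the delivery that changes, which is exactly the point.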

Creating a synthesised voice for a specific use case

To create a news reader voice, or any synthetic voice for that matter, you need training data from a voice actor speaking in that particular style.

Hours and hours of recordings of a speaker talking in that style are used to train a synthetic voice that can then read content in the same style.

Therefore, for a voice assistant, a robot, or any AI system with a conversational interface to respond and communicate truly like a human, it needs to be able to generate voices that are fit for every type of conversation it's likely to have.

It needs voices that are just as good at speaking like a customer service operative as they are at reading the news, or at conveying the level of empathy you'd expect from a confiding conversation with a friend.
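Some cloud TTS platforms already hint at what this could look like by exposing a small palette of speaking styles on a single voice. The sketch below is a rough illustration using Azure's Speech SDK and its express-as SSML extension; the style names are ones offered by voices such as en-US-AriaNeural at the time of writing (availability varies by voice and region), and YOUR_KEY and YOUR_REGION are placeholders.

```python
# Sketch: rendering the same sentence in different speaking styles with one voice.
# Assumes the azure-cognitiveservices-speech package; key and region are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

TEXT = "Your parcel should arrive by Thursday."

# Style names offered by some neural voices at the time of writing;
# the exact list varies by voice and region.
for style in ("customerservice", "newscast-formal", "empathetic"):
    ssml = f"""
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-US-AriaNeural">
        <mstts:express-as style="{style}">{TEXT}</mstts:express-as>
      </voice>
    </speak>
    """
    result = synthesizer.speak_ssml_async(ssml).get()  # plays through the default output device
    print(style, result.reason)
```

A handful of pre-baked styles is a long way from a voice that can match any conversational context, but it shows the direction of travel.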

The complexities of programming human-like dialogue

Deciding when a computer should respond with empathy, when it should respond with chirpiness, and when it should respond with long-form dialogue requires an understanding of a lot of variables.

You need to understand the intent of the user, and which intents call for which speaking style. You might also need to match the sentiment of the user, or their conversational style.

That means that, as well as having a whole range of voices built for different use cases available (a limitation in its own right), you also need to draw on the contextual information you can gather within a conversation (intent, sentiment, speaker profile, authentication status and so on) and use that data to choose the right speaking style for your response.
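A hypothetical sketch of that decision layer might look something like this: take whatever context the platform can supply and map it to a speaking style for the TTS layer. The context fields and style labels here are illustrative, not tied to any real platform.

```python
# Hypothetical sketch: choosing a speaking style from conversational context.
# The context fields and style labels are illustrative only.
from dataclasses import dataclass

@dataclass
class TurnContext:
    intent: str          # e.g. "file_complaint", "get_news", "small_talk"
    sentiment: float     # -1.0 (very negative) .. 1.0 (very positive)
    authenticated: bool  # whether the caller has been verified

def choose_speaking_style(ctx: TurnContext) -> str:
    """Map context signals to a speaking style for the TTS layer."""
    if ctx.sentiment < -0.4 or ctx.intent == "file_complaint":
        return "empathetic"        # mirror a frustrated or upset caller
    if ctx.intent == "get_news":
        return "newscast"          # long-form, measured delivery
    if ctx.intent == "small_talk":
        return "chat"              # relaxed and conversational
    return "customerservice"       # chirpy, helpful default

ctx = TurnContext(intent="file_complaint", sentiment=-0.7, authenticated=True)
print(choose_speaking_style(ctx))  # -> "empathetic"
```

The mapping itself is the easy part; the hard part is reliably extracting those signals in real time, and having voices that can actually deliver each of those styles.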

This is where the limitations exist today.

That's not because it's impossible to do, but because it is very complex.

It's easy to pick off the shelf

It's hard enough to build a simple conversational interface that works well in a narrowly defined use case, before you add the complications of multimodality and all of the highly complex variables and nuances of human conversation.

This is why most organisations investing in conversational AI solutions pick an off-the-shelf TTS voice. They've put so much effort into the NLU and dialogue management that they don't have the time or budget left to figure out how to put the icing on the cake.

Hardly any brands have gotten to the point where they've invested in their own brand voice, let alone variations of that voice that can respond to users with varying speaking styles.

Today, it would be cost-prohibitive. However, if we want to create truly human-like conversations, this problem (or a version of it) will need to be tackled at some point.

The future

The future of human-computer interaction is only going to become more humanised and more natural. Fusing differently styled speaking voices with the intent, sentiment and speaking style of the user could lead to some really natural, engaging interactions and bring about the true promise of human-like conversations with AI.

But we’re not quite there yet.

---

Presented by Symbl.ai and Deepgram

Learn more about Symbl.ai here.

Learn more about Deepgram here.

---

VUX World is the front door to the world of AI-powered customer experience, helping business leaders and teams understand why voice, conversational AI and NLP technologies are revolutionising how we live, work and get things done.

We help executives formulate the future of customer experience strategies and guide teams in designing, building and implementing revolutionary products and services built on emerging AI and NLP technologies.

Find out more about how VUX World can help your brand succeed with AI.

---


Jan Görgen

Speech Service Program Manager & Conversational AI Enthusiast @ Microsoft

2y

Interesting read. Regarding the different TTS styles there is some progress, when you look at the upcoming more and more voice styles, even in standard TTS voices. The context and complexity issue on the other hand is indeed still unsolved and requires strong investment in the self-learning/testing&training systems. But I agree there is still a lot to learn and some way to go to create a general human-like conversational system. But let’s not forget that regardless of where we are on the road to a broader vision, there are already tons of use cases which can be tackled with conversational AI systems, which create a lot of value to users and businesses.

Tony Ramos

Conversation Design Leader | Full-Stack Expert in UX & NLU for Voice AI & Chatbots | Radical Empath

2y

Jason F. Gilbert's Moxie robot video kind of blew my mind. As voice synthesis & conversational AI continue to improve, I find myself slightly creeped out by the potential for bad actors. HUMAN is doing some really interesting work in human verification.

Jason Pautard

Make your content more inclusive and accessible through voice

2y

Totally agree, this is definitely the next step in conversational AI, to be able to adapt the voice to the context and with the right emotion!

Michael Novak

Responsible AI Business Exec | Leveraging ChatGPT AI, Digital Identity & Web3 to drive value.

2y

Kane You're so right (sneer). Always (guffaw). Humans still communicate using archaic 1870's QWERTY machine code that efficiently transmits data values ("100", "I'm fine") - but lacks ability to capture speaker's context values (mood, age, gender, race, in a hurry) without the use of artifacts (emojis or clever keyboard hieroglyphics). It takes human children a decade or more to interpret context values correctly. And even then, not always correct all the time. (insert sitcom plot here). #Conversational #AI technology still improving as you noted. But as business demands improvements from technology, it's just as important they allocate budget, and incorporate it into their #digitaltransformation and #metaverse #strategies as well as daily workflows. cc: Open Voice Network American Council for Technology - Industry Advisory Council (ACT-IAC) Paul Cutsinger Mary Bennion Ronald van Loon Nicholas Evans Ahmed El Adl, Ph.D. Comp. Sci. (Artificial Intelligence) Muhammad Jawwad Paracha Samarjit Das Sanjay Chauhan

Colleen Fahey

Author & US Managing Director, Sixième Son, Audio Branding & Sound Design, CHIEF member

2y

We humans could often use a tune-up on this ability, too. Who hasn't had a co-worker whose tone always sounds demeaning? Or one who seems stuck in an overly enthusiastic, high-pitched setting? Maybe when we learn to train AI voices, we can turn the training back on ourselves.
