"Voice First Sucks!"
Ahmed Bouzid
Partnering with Activities Directors and Coordinators in Senior Living Communities to leverage cutting-edge communication technology and Generative AI and to deliver on their Digital Inclusion mission.
Yup, those were the exact words that I recently heard uttered in a mild miff of refreshing biting pique by someone who seemed to have had it with Voice First.
I know it was biting pique because the person who spoke those words was, is, has been, and I know will continue to be, someone who is heavily invested in Voice First, loves Voice First, has had, and will continue to have, in spite of the long, cold winter of discontent that Voice First is enduring, high hopes for Voice First. And I say that it was refreshing because hearing someone deviate from the oppressively relentless cheerleading and instead engage earnestly to get to the truth is a rare occasion worth noting and celebrating.
But reality is reality, and the reality that this Voice First believer was facing was a harsh one: Voice First was not delivering on the promise everyone thought it held when it declared itself a revolution a few years ago, and they had had enough. It was time to roll up our sleeves, get real, sober up and find a path to the bottom of things.
‘Beyond the weather, time, and the occasional timer and alarm,’ they mused out loud, ‘am I myself really using my Amazon Alexa and Google Assistant that much in my life? I mean, really and honestly, am I? No, actually, not really…. So, if I am not using them that much and yet I am such a believer in Voice First, what hope is there for the rest of the world?’
I, a veteran of sorts in the space (we veterans call the space “Speech”), felt for this young person. Yes, whether you want to call it “Voice First” or “Speech” or something else, the space is and has always been a tricky bitch. Nothing comes easy with Voice First. Nothing will fall into your lap. No one will embrace anything you build just because. Unlike in most other fields, you have to work really, really, really hard to get to value, let alone monetizable value. Sure, you need to work really hard to deliver value in anything that is worthwhile; that’s a basic reality. But here is another basic reality: Voice First is a beast of an altogether different kind. You have to work really, really, really hard to get to value.
This reality about the space and its technologies was as much a basic fact in 1991, 2001, and 2011 as it is in 2021, notwithstanding Apple Siri, Amazon Alexa, Google Assistant, and Samsung Bixby and their respective tantalizing human language technologies.
Fact: unlike other interactive modalities (visual, tactile, textual, olfactory, haptic, multimodal), Conversational Voice First is spartan and unforgiving. It’s sound made, sound received, sound consumed, and sound made again in a time laced, unrelenting back and forth.
Fact: people usually don’t use Conversational Voice First to enjoy the experience or to admire color schemes or to get a warm-and-fuzzy. There is nothing to look at, nothing to behold, nothing to gawk at, nothing persisting that you can engage and feel. Instead, Voice First is an anxiety-inducing race, a race against time and patience. It’s ephemeral, serial; it needs you to be cognitively engaged, focused, listening; it is not interested in letting you multitask, pause, take a break, wander off in your thoughts; it messes around with your breathing, taxes your memory, forces you to use your vocal cords, maybe even to shatter a beautiful silence. Voice First is greedy, possessive, narcissistic: it certainly won’t let you say something to someone else when you are engaged with it. Voice First wants to control your breath, it demands that you enunciate, articulate, that you come alive and speak loud enough; and it insists that you listen attentively, that you be patient, that you repeat yourself when you are asked to repeat yourself.
Voice First needs you fully invested. Otherwise, Voice First will gladly and pitilessly swallow your time, and it will happily do it over and over again.
And so, if Voice First must be used, the human user shoots back, Voice First had better deliver something concrete, tangible; something worth the tedious while. ‘You want me to talk to the thing? Ok. Sure. No problem. I will talk to it. But only if you give me something back in return that makes my life easier. Otherwise, thank you very much, because I do have a life to live.’
Unlike other modalities, Voice First has no room for getting you to buy because the thing looks cool, or smells sweet, or is heavenly to the touch, or makes you look High Church by association.
Unlike other modalities, Voice First has no room for bamboozlement. Voice First demands the simple truth, the bottom line; it needs you to cut to the chase and it gives you no room to hide and no chance to throw sand in people’s eyes. (Quite the interface for the general Zeitgeist of the times.)
So: does Voice First suck?
No. Voice First does not suck — and has never sucked.
Voice First is powerful because it is ephemeral, temporal, invisible; because it demands your attention, requires that you speak up, that you be constantly present and not wander off, that you engage in a focused way. You want an interface that lets you space out once in a while? Sure, no problem, and no one is judging. You have your pick of interfaces. Just stay away from Voice First. And if you do use Voice First and then you wander off during the interaction, don’t blame the interface, and don’t blame yourself either. Blame whoever decided that it was OK to have you engage a Voice First interface in a use case where you were likely to wander off.
Here’s a concrete example of working with rather than against Voice First. Voice First is ephemeral, temporal, invisible, demands the user’s attention, requires them to speak up and to focus, insists that they be present and not wander off, that they remain engaged in a focused way? Ok. What use case demands all of that?
Let’s watch:
Can any interface other than Conversational Voice First deliver the above experience? The answer is no. No other interface is as temporal and invisible, demands your complete attention, requires you to speak up and to focus, and insists that you not wander off and that you remain engaged in a focused way, the way that Conversational Voice First does. Sure, you can try to emulate the above via visual/textual/tactile, but you will end up with a poor man’s version of the elegant, back-and-forth, urgent, let’s-keep-moving-and-learning conversational interface above.
In the spirit of getting real, I call on the Voice First community to do two things:
First and foremost, I call on them to take the time to really understand the Voice First interface. I truly and sincerely fear that many people (though certainly not all) in the Voice First space still do not fully understand the interface. Oftentimes, in my conversations, it is clear to me that many really do believe that Voice First is a stripped-down version of other, “richer” interfaces; that Voice First is lacking, poor, constrained and constraining, meager: a poor man’s version of something much more. I am referring for instance to those who seem to believe that multi-modality is “the next step” (someone actually used that exact phrase!) for Voice First, an upgrade, rather than a completely separate interface with its own pluses and minuses, and one to which Voice First can be superior, hands down, given the right use case. And who do I blame for this misguided notion? Well, to be frank, I will squarely lay the blame at the feet of the folks at Amazon and Google who clearly have come to look down on the voice-centric experience and to declare it inherently lacking. As in: ‘Yeah, so, we now have visual devices, you see, so we will need you to, like, show something visual when the skill/action launches, and we need what you show to look good. This you will have to do or we can’t certify your skill/action.’ (All true stories here.)
And second, let’s begin compiling examples of actual use cases where the value of the Conversational Voice First interface is clear and compelling. Let’s do this for at least three reasons: (1) so that we get the point across that the power and the value of an interface are not inherent in the interface itself but are a function of the fit between the interface and the use case; (2) so that we can learn that Step One to success is pinpointing use cases where the “weaknesses” of Voice First are in fact the exact features that we need for the given use cases; and (3) so that we keep hope alive and lift the community’s morale by pointing to actual built experiences and not mere yearnings or seemingly unfulfilled and unfulfillable aspirations.
So here’s the document. Feel free to email me your use cases and I will add them.
Ahmed Bouzid, previously Head of Product at Amazon Alexa, is Founder and CEO of Witlingo, Inc., a McLean, VA-based B2B SaaS company that helps brands launch Voice First solutions and experiences on platforms such as Amazon Alexa, Google Assistant, Samsung Bixby, and beyond.
Software Development Engineer at Workday
3 years ago: Awesome blog! Really honest look at how we are using voice as an interface! From the voice spark time and excited to see the blog up on our website!
Ahmed Bouzid I absolutely love your turns of phrase here. I am not surprised that you were the one to say these things. So true! "Conversational Voice First is spartan and unforgiving. It’s sound made, sound received, sound consumed, and sound made again in a time laced, unrelenting back and forth...Voice First is greedy, possessive, narcissistic: it certainly won’t let you say something to someone else when you are engaged with it. Voice First wants to control your breath, it demands that you enunciate, articulate, that you come alive and speak loud enough; and it insists that you listen attentively, that you be patient, that you repeat yourself when you are asked to repeat yourself."
Frontend Engineer
3 years ago: Great article, I love the emphasis on "voice first is really hard". I agree it is. Main voice app platform providers make development of voice apps look easy - for obvious reasons - to engage masses in creative process - but following tutorials is only the tip of the iceberg when it comes to really good applications. In other areas of programming you often hear about a common methodology called "Progressive Enhancement". This is important to understand and adopt when dealing with voice, both in app voice-only improvements, as well as expanding to screen capable surfaces.
Voice Automation and Conversational AI veteran
3 years ago: I hear you Ahmed! The way I've been thinking about the stalled(?) #voicefirst revolution is that there are a lot of '20% solutions' out there. (Gratuitous abuse of the Pareto principle follows...). They are 20% implemented in reality, achieving 80% of the benefit, and everyone is very happy, patting themselves on the back that that was pretty easy. But 80% performance is not conducive to re-use. That's 1 fail per 5 tries. If your car randomly failed to start one day a week, would you keep happily using that car? Humans typically won't foster a reuse habit until 90-95% performance in my estimation. I think this is where we are at with Voice-First - Pareto's 80% is easy, but it's not good enough, and no-one wants to hear that they need to invest 5x the effort (or a much better toolset) to get that last 15-20% performance, and then high re-use rates. So I align with your view that it is really hard to achieve value - about 5x harder than anyone wants to admit! Also concur with you on use cases - choose carefully!
Founder - Pinch of Growth
3 years ago: I really wish I could have joined this #voicelunch and join the fun around #voicefirst. Disclaimer: I am a 100% advocate of Voice-First, but reading this article, I have the feeling that we have a different idea of what Voice-First is. For me, the article describes mainly Voice-Only. If that was what the discussion was about, I totally agree that there are many sucking and frustrating parts to it. And that it is very hard to do. But Voice-First... Man, that is powerful stuff! The fun thing about the Voice-First concept is that it creates so much better experiences for non-voice interactions as well. I could go on for hours. :) Looking forward to the follow-up about this with Allen. Karol, are you planning this session already?