To ask or not to ask; that is the question.
Photo by StorresJayrMx on Unsplash


Considerations for choosing a command or conversational interaction model when building for voice technology

If you’ve ever wanted to try using a voice technology feature, you may have wondered how it would respond to you. For those who build them, the consideration can also be if it responds — at all. If it doesn’t, it’s generally known as a command interface. If it does, it’s seen as a conversational interface.

Spoiler alert: if you talk to it and it responds in any way, it’s a conversational interface.

Let’s take some examples. Apple’s virtual assistant Siri exhibits a command model when you give a specific enough command like, “open Wordle”:

Siri: (says nothing; just executes the action — opens the Wordle app)

Obvious or not, no verbal response is needed. However, a command that involves a possibly irreversible action, such as “Send a message to Simone with WhatsApp saying ‘I love you’”, kicks off a short, efficient, but very much necessary conversation:

Siri: Send it?
(with the visual preview of the message compiled and the buttons to send it manually available if you decide to do so)
Me: Yes.
Siri: Done.

The reason? I might have more than one Simone in my contacts, and it could be more than awkward if I sent the message to that colleague named Simone I coordinated an Uber with two years ago.


In another example from Siri, a straightforward command like “set a 9 minute timer and call it pasta” or “wake me up at 8” also garners direct but natural responses. For the timer, it may or may not need any back and forth. Yet, there’s still a verbal response:

Siri: 9 minutes. Counting down.

Now, if Siri needs more input from you to continue, it behaves similarly to the WhatsApp message scenario:

Siri: There’s already a 3 minute and 30 second timer. Replace it?
Me: No, keep it.
Siri: Okay, I won’t.

Short and sweet. Effective, albeit a bit terse compared to a human. The alarm scenario works the same way, with context changing how many turns the conversation takes:

Siri: I set your alarm for 8 p.m.

It could end there. But I sometimes feel the need to engage more:

Me: Thank you.
Siri: You’re welcome.

End of conversation. Going back to the message example, the context may not even be that there are two Simones requiring confirmation; it may be that I simply didn’t give enough information for the command to execute:

Me: Send a message.
Siri: Who do you want to message?
(a variation; in other instances, it asks, “To who?”)
Me: To Simone on WhatsApp.
(note that if I hadn’t specified the app it would have assumed Messages)
Siri: What do you want to say?
Me: I love you.
Siri: Send it?
(same as earlier with the visual preview of the message and the buttons)
Me: Yes.
Siri: Done.

That’s a decently long conversation for an everyday activity assisted by a robot. Each time there is ambiguity that prevents it from continuing, the conversation requires additional turns beyond a single command.
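The flow above can be sketched as a tiny dialog manager. This is purely illustrative — the slot names, the prompts, and the Messages default are my assumptions based on the transcripts above, not Siri’s actual implementation:

```python
# Illustrative dialog-manager sketch for the "send a message" flow above.
# Slot names, prompts, and the Messages default are hypothetical.

PROMPTS = {
    "recipient": "Who do you want to message?",
    "body": "What do you want to say?",
}

def next_turn(slots, confirmed=False):
    """Return the assistant's next utterance for a send-message intent."""
    filled = dict(slots)
    filled.setdefault("app", "Messages")  # unspecified app falls back to a default
    for slot in ("recipient", "body"):
        if slot not in filled:
            return PROMPTS[slot]          # each missing piece costs one more turn
    if not confirmed:
        return "Send it?"                 # irreversible action: confirm before acting
    return "Done."

# A fully specified command collapses to two turns (confirm, then done);
# a bare "send a message" takes four.
print(next_turn({}))                                                   # Who do you want to message?
print(next_turn({"recipient": "Simone"}))                              # What do you want to say?
print(next_turn({"recipient": "Simone", "body": "I love you"}))        # Send it?
print(next_turn({"recipient": "Simone", "body": "I love you"}, True))  # Done.
```

The point of the sketch is that the number of turns is not a design choice made up front; it falls out of how much information the user supplies and how risky the action is.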

Think about alternatives to the above-mentioned “open Wordle”. If you say something like “show me photos of me in India last Spring,” it simply opens Photos and leaves you there to flounder for yourself. A bit passive aggressive perhaps.


Siri isn’t the only one that speaks back. If you use the microphone in the Google Search box to put in your query, it will speak the answer or the beginning of the top result back to you:

Me: What does equity mean?
Google: Here’s the definition of equity — the quality of being fair and impartial.

And the answer can sometimes be much longer. Bear with me through this next, very nuanced pattern of content design, which varies according to how the user phrases their natural language question. Let’s start with a typed search string.

Me (typed): where do lemurs come from

Here is the answer to that text search.

Me (typed): where do lemurs come from?
(note that I only added a question mark)
Google: Where do ring tailed lemurs live? Ring tailed lemurs come from Madagascar, the island off the East Coast of Africa. They live in the rainforest and spend most of their time on the forest floor, although they do climb but tend to spend more time in the mid canopy than the very top of the trees like ruffed lemurs.

It’s probably a red herring and I’m not sure what rules apply here, but somehow the form of a question makes the answer include the question from the source — though it’s not always the case. I couldn’t replicate it even after a dozen tries. Hmm. Interesting though.

The power of a single word

Now let’s switch to voice.

Me (voice): Where do lemurs come from?
Google: According to folly-farm*, ring tailed lemurs come from Madagascar, the island off the East Coast of Africa. They live in the rainforest and spend most of their time on the forest floor, although they do climb but tend to spend more time in the mid canopy than the very top of the trees like ruffed lemurs.
(*source was www.folly-farm.co.uk)

Now, take a very simple difference of just one word and see how the result changes.

Me: Where did lemurs come from?
Google: According to BBC, the conventional view is that lemurs arrived in Madagascar 40–50 million years ago, long after it became an island. It’s thought they floated over from the African continent on rafts of vegetation.

Okay, another lemur search result for fun and cuteness.


Clearly, the relevant part of the source is extracted to construct a sensible spoken answer, complete with an introduction. As to the answer itself, “did” refers to the past, so this answer arguably makes more appropriate and logical references to just how far in the past to go, as opposed to the answer to “do”.

Let’s look at one more example of how answering back is different.

Me: What’s the difference between affect and effect?
Google: According to touro dot e d u, generally, we use affect as a verb and effect as a noun.

Consider, again, the text comparison.


Again, the relevant portion of the source is extracted and prefaced with an introduction to form a complete spoken answer.

Interestingly, Google doesn’t always give a verbal introduction to results and my guess at the reason is that it depends on the quality of the answer source — whether it can be formed into a complete statement and so on. As we saw from the above examples, there is a lot to consider.

Where to focus

Between the two (Siri and Google), verbal answers can also differ completely in sentence structure — sounding more or less like natural spoken English, depending on which one you use.

Me: What is the capital of Iceland?
Google: Its capital is Reykjavik.

Ideal spoken English that adheres to the End Focus Principle. Siri has a couple of ways it responds:

Me: What is the capital of Iceland?
Siri: Reykjavik is the capital of Iceland.
(used most often but there are longer variations similar to Google’s, such as “Here is what I found from Wikipedia. Reykjavik is the capital city of the island country of Iceland.”)

The latter pattern is used only sometimes and I couldn’t figure out what conditions apply.
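The two patterns can be captured as response templates. A toy sketch, purely to make the contrast concrete — the function names are mine, not anything either assistant exposes:

```python
# Two hypothetical answer templates for a "capital of X" question.
# End focus places the new information (the answer) at the end of the
# sentence, which tends to sound more natural in spoken English.

def end_focus(answer):
    return f"Its capital is {answer}."                 # Google's pattern above

def front_focus(subject, answer):
    return f"{answer} is the capital of {subject}."    # Siri's most common pattern

print(end_focus("Reykjavik"))               # Its capital is Reykjavik.
print(front_focus("Iceland", "Reykjavik"))  # Reykjavik is the capital of Iceland.
```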

Voice tech at a crossroads

These few examples illustrate the challenge of responding to voice commands or queries in an intelligent, natural way.

Every 5 years or so the speech and voice technology industry goes through a kind of existential examination of what direction it’s going and where to invest the time and resources needed to reach its potential. Or at least the potential we sci-fi dreamers have always envisioned for it.

So, as the adoption of conversational AI continues to grow, it’s not always clear what types of interactions work best for a specific domain or set of use cases. Does everything need to be a back-and-forth conversation, or can simple commands from the user work better? When users interact by voice, their mental model of conversation does not necessarily expect a voiced response in return, but they won’t necessarily balk at one either.

Let’s explore the considerations for choosing one of the two interaction models over the other, or even when a hybrid approach might be the best option.

Insights from the land of language

I’m currently looking after the content design and conversation design disciplines at Flipkart, India’s leading e-commerce marketplace. With over 80 million products, Flipkart is poised to set the standard for the future design patterns of this type of technology — that’s my audacious point of view, at least.

As we have grown, our technology has also expanded to serve a diverse set of shoppers, sellers, and even the reseller communities that are redefining the landscape of e-commerce. Now, as we embrace the challenge of enabling the next wave of e-commerce users who are new to the digital marketplace, we are expanding our approach by taking cues from how they interact in the physical realm.

And that’s where conversational commerce comes in.


If there ever was an interaction model that is timeless and has a predictable structure, it would be commerce. For millennia we humans have developed patterns for how consumers interact with merchants for mutual benefit. At Flipkart, we want to bring that familiarity to how Indians shop online.

Yet the definition of what constitutes a conversation with a machine doesn’t always look like a conversation that we would all recognize at face value. For example, what feels more like giving orders (a command-driven interface) is sometimes all that’s needed for the user to get their task done (“show me photos from Bangalore in 2019” or “set an alarm for 6 hours from now” or “earthquakes near me” — I’m in California…).

In other cases, some back and forth (sometimes called “multi-turn”) works better, based on certain considerations (like what happens when you are not specific enough with Siri — “set an alarm” or “send a text”, as illustrated earlier).

Both are conversations, but the former is what some would call a single turn due to temporal and social-psychological context. Forgive the jargon. I’ll get to that shortly.

Conversation is the organizing principle for the interaction model, regardless.

Any method of interaction to exchange or retrieve information has merit. Depending on both the use case and the user type you are serving, you might find you simply need voice as input for power users or to make input faster for an otherwise straightforward task. Or, you could have situations where there is ambiguity and a quick clarifying follow-up question could take the user further or save them steps. Then there’s the need for something much more interactive (I prefer to call it guided or immersive or engaging conversation instead of “hand holding”).


This back-and-forth kind of interaction model works best for service use cases, where a lot of clarification is required, or for when users are not familiar with the type of interface or platform itself. For example, shopping online for newer users that just don’t have the frame of reference for how e-commerce works can present its own set of challenges. Not only do they not know how big the store is, but even the iconography can be unfamiliar. Most importantly, they are just starting to build their trust with such a system.

That’s why the interaction model matters so much.

We have to honor innate expectations about how conversations work while simultaneously protecting against the limitations of attempts to automate the same.


Between humans the possibilities for constructing conversations are endless. So it’s important to respect the problem space and arrive at an interaction model based on what you truly need to accomplish to serve users and not on what others have done or examples from other industries that may not apply.

Also, the medium matters more than ever. For better or worse, natural language interfaces of today have strong roots in telephony systems (“IVR”, or Interactive Voice Response), where voice-enabling interactions became essential after touch-tone ones had outgrown their utility — a transition that happened a mere 25 years or so ago. That’s not to take away from those use cases, some of which worked well, but unfortunately some of the interaction patterns from those early applications have been perpetuated since. Such patterns just don’t work in other interfaces.

In a phone call, for instance, the user becomes a captive audience, and the interface is not just a conversation between caller and machine. There’s a third conversation partner — time — unlike most other mediums, which have some element of asynchrony. It is ever present as a call progresses toward an eventual, unavoidable end, when the caller will either have to hang up or be transferred. This lends more urgency to the interaction, particularly when something goes wrong. Your choices for how to repair any misalignment during the conversation (such as background noise or when the caller says something that’s not expected by the recognition engine) are more limited.

Then there’s the social-psychological aspect: people use IVRs out of necessity — usually to solve a problem with some service. Now we have devices that people pay for just to interact in this new medium. And when visual elements are introduced, it brings further cognitive load challenges. Before, designers had to manage only a user’s memory and auditory cognition.

Now we have to consider all of that along with how human brains factor in visual cues — without undermining either process.

The user and their task at center

Given all of these considerations of temporal context, social-psychological factors, and cognitive challenges, the final decision about an interaction model should be based on two things: use case (scoped to the task at hand, not an entire domain) and user cohort.


In terms of use cases, you’d first need to look at those primarily suited for single turns initiated by the user — things like questions and answers, calculations, setting reminders, timers, alarms, navigation, making phone calls or texting, transferring funds, and yes, searching for products on an e-commerce platform.

The system may or may not need to respond verbally when serving up the result. In such a model, the conversation design is limited to the technical underpinnings and whatever structure or form the response comes in. However, the minute you have cases where you have to mitigate ambiguity, particularly for an irreversible action (like sending a message to the wrong one of two similarly named contacts — remember Simone?), you need to consider supporting more turns.
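That two-Simones case is a classic disambiguation check. A minimal sketch, with made-up contacts — the data and prompt wording are my assumptions, not any real assistant’s behavior:

```python
# Disambiguation sketch: one spoken name matching several contacts forces
# an extra turn before an irreversible send. Contact names are invented.

CONTACTS = ["Simone Weil", "Simone Ray", "Vittorio Banfi"]

def resolve_recipient(spoken_name, contacts=CONTACTS):
    """Return a unique contact name, or a clarifying question."""
    matches = [c for c in contacts if spoken_name.lower() in c.lower()]
    if len(matches) == 1:
        return matches[0]                                   # unambiguous: proceed
    if matches:
        return "Which one? " + " or ".join(matches) + "?"   # disambiguation turn
    return "I couldn't find that contact."

print(resolve_recipient("Vittorio"))  # Vittorio Banfi
print(resolve_recipient("Simone"))    # Which one? Simone Weil or Simone Ray?
```

Notice that the extra turn only appears when the data demands it; a unique match costs the user nothing.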

The more negotiation of meaning required, the more turns required.

Another factor is that people speak in so many ways: sometimes you might recognize what they want to do (their intent) but just not have the information needed to complete the so-called command. For example, if I don’t say who to send a text to, the command itself can’t be satisfied, so a quick follow-up question will suffice to carry it forward — again, my conversation with Siri about messaging Simone.
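The intent-versus-slots distinction can be shown with a toy parser. The regular expression and the names are illustrative only, not any real assistant’s grammar:

```python
import re

# Toy intent parser: the intent can be recognized even when the slots
# needed to complete it are empty, which is exactly when a follow-up
# question carries the conversation forward.
def parse(utterance):
    m = re.match(r"send a (?:message|text)(?: to (?P<recipient>\w+))?", utterance, re.I)
    if not m:
        return None
    slots = {k: v for k, v in m.groupdict().items() if v}
    return {"intent": "send_message", "slots": slots}

print(parse("Send a message"))         # {'intent': 'send_message', 'slots': {}}
print(parse("Send a text to Simone"))  # {'intent': 'send_message', 'slots': {'recipient': 'Simone'}}
```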

Other use cases rely so obviously on conversational exchange that it goes without saying — things like games, form filling, and most support and service interactions. These simply can’t progress without the receiver of each turn being able to ask clarifying questions (in the biz, we like to call it “disambiguation”).

So when evaluating what will be needed for your use cases, consider that it’s not just what domain you’re operating in, but what issues you’ll have to contend with during the eventual interactions to satisfy those use cases.


As for making a decision about an interaction model based on a particular user type, this is naturally the most important one of all. No decision about a use case can be made without considering who your system’s conversation partners are.

When we talk about user type here, it’s not really about demographics, though the concepts of defining a user persona and making some decisions based on users’ daily living situations and environments are important. Here it’s more about user familiarity with your interaction model. Do they already have expectations about what the conversation for their particular task might be like? Even in real life, conversations are framed by what each participant knows about the other’s world knowledge.

That’s how meaning is negotiated and context is established.

So, while one user cohort can give a long direct command to make a flight reservation or search and filter for products, another might not understand what they can say to do the same thing, and still another may never have flown before or may have no frame of reference for the constructs that make up most apps or websites. Any user’s familiarity with technology will vary based on access and length of usage, something compounded by geographic location and culture. We ought to remember this in order to prevent technology from leaving anyone behind. Call it building digital confidence.

In fact, a persistent problem that plagues voice interfaces (and even chatbots to a degree) is discoverability. Even when some users learn and prefer a command-driven interaction model, a few conversation turns might help them discover even more that they can accomplish. It can provide guardrails that help them feel oriented and in control of the interaction. A voice interface that doesn’t consider this factor can feel like a big room where you can’t tell where the walls are, making it hard to walk in with confidence.

The conversation itself is not the goal

Serving user needs is the goal. Sometimes simply offering voice as an alternative input method is enough to do that (most mobile keyboards and search bars now have this capability — Flipkart included), but stopping there could limit users from reaching all you have to offer them. Voice can provide an incomparable shortcut to otherwise hard-to-find parts of the experience.

For example, in the Amazon app, I have no clue where I can set my language preference and I don’t feel like spelunking through the profile settings to figure it out. Cue conversation with Alexa, like so:

Me: Change my language.
Alexa: Here you go.

And voila! Shortcut. Digital life hack.


Because technology is required for machines to interact with humans, there’s the potential to get stuck trying to bring users on board with that technology and get them familiar with it. But for the most part, that will only ever work organically. We don’t need to be (and shouldn’t be) training users. Those who are new to these platforms know how to talk and, in our case, know how to shop. They just need help doing both online.

That’s why the conversation paradigm is so powerful. Negotiating meaning and exchanging information in the physical realm is done through conversation. It’s familiar. So tapping into that innate understanding is what will make all the interfaces we build more useful and meaningful.

Tapping into the familiar while trying to advance strange new technology is an essential part of designing for humans. One might even say that anthropology demands it.

Slight tangent: in effect, calling something a multi-turn conversation is redundant. In real life, one-way commands are only parts of an evolved macro conversation that is still multi-turn in nature. Consider any cooperative environment, like an expeditor shouting orders at a line cook (“alfredo all day!”) or medical professionals doing the same as part of a protocol arrived at through rigorous training and chain of command. These are magnified parts of a broader ongoing conversation the interlocutors (a fancy word for people talking to each other) would have had earlier, with far more turn taking, to establish the trust that led to that interaction model.

A snapshot in time

At the end of the day though, for our context of those who are just building their digital confidence, when we strive to emulate real-life conversations from the physical marketplace, I urge everyone to concede one truth: there is no replacement for the unique and beautiful experience that comes from shopping in India.


As long as we remember that this is not a race to be first, but a race to do it right, we can keep perspective on how we approach trying to bring conversations closer to what we have been accustomed to doing for thousands of years. For conversational commerce, that means things like taking into account anthropology and a healthy respect for language as essential parts of our toolkit.

In the meantime, try to design for the interaction model that fits the context best. And even if that means choosing a one-way command to start, we still have to answer the unasked question by delivering the result, however silently. It’s still a two-way exchange. It’s comparable to real-life one-way conversations where you might merely nod in response to a query.

When you’re communicating, you can’t NOT have a conversation

Whether it’s across a counter, at a table, or any other place where two minds can meet — even over a digital platform — our expression of language remains one of the most significant triumphs of humankind.

As for the machine? We’ll work on helping it distinguish between “do” and “did” while you keep diligently asking questions. The answers will come — sometimes accompanied by silence, and sometimes out loud.

P.S.?Simone, I love you.

~

Special thanks to?Vittorio Banfi?for the lively conversation that was the inspiration for this article.

~

Nandini Stocker is currently head of Content & Conversational Commerce Design at Flipkart, India’s leading e-commerce company. Her charter is to empower the hundreds of millions of Indians known as the “digitally young” in their online shopping experiences. She believes conversation is the interface of future computing and has focused her 25-year career on designing voice technology. Making it accessible is her life’s work, and being able to realize her dream for this in India, her birthplace, has unlocked her purpose.
