Anthropomorphism vs. Robotomorphism: A Running Conversation...
Ahmed Bouzid
Partnering with Activities Directors and Coordinators in Senior Living Communities to leverage cutting-edge communication technology and Generative AI and to deliver on their Digital Inclusion mission.
The discussion below is based on a LinkedIn thread started on December 21, 2020, and provides a running dialog from the many contributors who have pitched in with their thoughts, suggestions, criticisms, and observations. Please feel free to post your own comments. Please note that if you do post a comment, we may add it to the running conversation below.
_________________________________________________________________________
Ahmed Bouzid: Having arrived at the near conclusion that, at least with present-day technology and our current knowledge and understanding of the physics of conversational interaction, it is almost impossible to deliver a voice assistant that can be as conversationally sophisticated as a human being, I am becoming more and more convinced that in order to deliver highly performing voice UX (voice UX that successfully completes tasks and does so without frustrating the user), new rules of engagement need to be established. We need something akin to the ATM for banks. ATMs don’t try to emulate human agents, but they are great at what they do. One ATM is more or less the same as the next, so that one doesn’t need to learn much to be able to use any ATM that they come across…. I have more thoughts, but I will stop here for now.
Karen Kaushansky: Here’s an article I wrote a while back about being “human enough.” I am all for human-machine interactions.
Roger Moore: This is not only an important practical issue, but it’s also a fundamental question about how language works: “Is Spoken Language All-or-Nothing? Implications for Future Speech-Based Human-Machine Interaction.” Available here as well.
Ariane Nabeth-Halber: I can’t recommend enough Bruce Balentine’s classic “It’s Better to Be a Good Machine Than a Bad Person: Speech Recognition and Other Exotic User Interfaces at the Twilight of the Jetsonian Age” (March 2007, illustrated by Leslie Degler). Most of it is still valid today, I think.
It goes along the same lines as Karen Kaushansky’s article mentioned above (love it, Karen!).
Still, back in the 2000s, I must admit I did my best to stick to the human-human dialogue metaphor; I just wanted to be consistent and leverage spontaneous responses. I noticed that some minor prompts like “Got it!” (in French, “Je l’ai !”) when looking for a file or info, or very short and direct questions in some dialogue turns, had positive impacts. I think this is still true today.
Ahmed Bouzid: Yes, Bruce Balentine’s book is a wonderful, inspirational read. (And I love the format of the book: short think pieces, one after the other.) I think he is going to be proved right — once again — after all. The initial wave of exuberance when Alexa came out in 2014 is, I believe, turning out to have been on the irrational side (Alexa as a conversational partner has made no needle-moving strides), so that Bruce’s grounding philosophy is, I think, right on the money. (He writes on page 4 of the book: “My goal is to take an industry that has been adrift for years — an industry with which I myself have been personally involved and I know intimately — and to set it back onto a course that will enable it to deliver tangible results.”) I re-read the book over the summer with the new lens of far field vs. telephony as the interface, and a lot of what he says is refreshingly actionable and insightful, and cries out for a new design strategy to be adopted, now that voice is going to be everywhere….
Giorgio Robino: I like the ATM example you mentioned. It’s a real-life design pattern that we all use every day. Consider a conversational interface for a real ATM (I admit that’s extreme, privacy issues and all), but that’s also a good example of a possible task-oriented, real-time H2M dialog. In this kind of situation (interaction with a software/hardware robot, where immediacy and quickness are essential requirements), I think that the much-maligned “command-oriented” approach is honestly not so bad.
BTW, I’m currently experimenting with a voice cobot assistant prototype for workers/operators performing difficult operational tasks in industry, and I designed the cobot “conversation” with the mantra to “be essential” and thin. This could appear to be a conversational “anti-pattern,” right?
But my experiment comes after my experience with CPIAbot, a multimodal chatbot helping immigrants study the Italian language. Students and especially teachers (?!) have stated that they prefer simple COMMANDS over an approach (which I call “language-first”) based on discursive, conversational Italian… weird, isn’t it?
Deborah Dahl: We’re oversimplifying things by putting computer behavior into just two possible buckets — human-like and not human-like. There are a lot of different purposes for interactions between two humans, and similarly there are a lot of different purposes for human-machine interaction, all of which have implications for how the computer should behave. With respect to quick service interactions, our expectations are actually pretty low for human-human interaction. Think about a task like a typical call center conversation with a human agent. It’s not personal at all. The agent just has to understand our requests, prompt us for any missing slots, and confirm the transaction. The only human-like capability we really need or want in this situation is competent speech recognition, whether we’re talking to a human or a machine. On the other hand, different behavior might be better for other purposes. If the point of the interaction is just to pass time, without trying to accomplish a particular task, users might enjoy talking to a machine that asks them proactive questions about their favorite movies and sports teams, and acts like it’s interested in what they’re saying.
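The quick-service interaction Deborah describes is, at bottom, the classic slot-filling pattern: understand the request, prompt for missing slots, confirm. Here is a minimal sketch in Python (the slot names and prompt wording are invented for illustration, not taken from any particular product):

```python
# Minimal slot-filling loop: understand the request, prompt for any
# missing slots, then confirm the transaction.
REQUIRED_SLOTS = {
    "amount": "How much would you like to transfer?",
    "recipient": "Who should receive it?",
}

def next_prompt(filled_slots: dict) -> str:
    """Return the next system utterance given what the user has provided."""
    for slot, prompt in REQUIRED_SLOTS.items():
        if slot not in filled_slots:
            return prompt  # ask for the first missing slot
    # All slots filled: confirm before executing.
    return (f"Transferring {filled_slots['amount']} to "
            f"{filled_slots['recipient']}. Is that right?")

print(next_prompt({}))                                     # asks for amount
print(next_prompt({"amount": "$50"}))                      # asks for recipient
print(next_prompt({"amount": "$50", "recipient": "Sam"}))  # confirms
```

Note that nothing in this loop is human-like beyond the words themselves; the “competent speech recognition” Deborah mentions sits entirely upstream of it.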
Ahmed Bouzid: True, but one of the cardinal rules that we human beings follow — at least those of us who have an emotional IQ — is to never treat a human being as a simple tool, but always recognize their humanity and act accordingly, even in exchanges with strangers we are not likely to encounter again and even when the exchange is highly structured and goal-oriented (e.g., speaking over the phone with an agent to take care of some banking issue). So, we greet, we apologize, we don’t interrupt, etc. A lot of this is observed by the designers of voice interfaces as well as by the users when the interaction is with a robot. My question is: can we move away from this sort of human-mimicking dialog to something different? Or do those ‘soft’ rules and guidelines serve a functional purpose, even in the context of h2m? Either way, what would such an interface sound like, what would be the guidelines that the designer would follow, and, crucially, what would be the new rules that the user would observe (so that, for instance, they don’t take offense where offense was not meant — as in: they can interrupt whenever they want, they should NOT expect delicate phrasings and don’t need to engage in delicate phrasing themselves, etc.)…
Deborah Dahl: I think this speaks to my point that there is a gradation of human-human interactions, ranging from very simple and transactional to rich and nuanced conversations between friends. Your examples of behaviors — greet, apologize, don’t interrupt — are very minimal on the scale of human-human interactions. Maybe another step up would be some kind of small talk, for example about the weather, on the part of the system in an attempt to build rapport, and on up to Alexa Prize types of capabilities. But I don’t think you’re focusing on anything as human-like as that, but rather on where the minimum is, or could be, as users are exposed to bare-bones interactions.
An orthogonal variable: I wonder if slightly longer utterances have value for comprehension, especially if the output is TTS. For example, a user might miss “PIN?” but not miss “Can you give me your PIN?”.
Ahmed Bouzid: Yes, on the last point I agree — very short may not be advisable, as it would be missed…. On the larger issue: my point really is that a lot of dissatisfaction happens because users’ expectations are not managed, or are very hard to manage, when it comes to conversation, so that delivering gradations of human behavior may be a huge challenge. I’m proposing not a stripped-down version of h2h conversation as such but a truly m2h one that focuses on delivering on the function, with an accepted protocol that takes care of the expectation-management work and also enables new behavior (interruptions, raising one’s voice with irritation when one is far from the speaker). Examples of an m2h protocol that is non-h2h would be opening with a chime, double-beeping when the system didn’t hear you (vs. saying, “Sorry, I didn’t hear you”), beeping to give the user a turn, beeping to signal that listening has completed, percolation sounds when not talking but retaining the turn, etc….
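A protocol like the one Ahmed lists can be sketched as a simple mapping from dialog events to earcons, falling back to terse speech only when no sound fits. (A toy illustration; the event names and audio file names are invented, not taken from any real platform.)

```python
# Hypothetical mapping from dialog events to earcons instead of spoken
# apologies and greetings. Event and file names are illustrative only.
EARCONS = {
    "session_open": "chime.wav",        # replaces "Welcome to X"
    "no_input": "double_beep.wav",      # replaces "Sorry, I didn't hear you"
    "your_turn": "beep_up.wav",         # signals the user may speak
    "listening_done": "beep_down.wav",  # signals capture has ended
    "thinking": "percolate.wav",        # system keeps the turn while busy
}

def respond(event: str) -> str:
    """Pick an earcon for the event; fall back to terse speech if none fits."""
    return EARCONS.get(event, "speech:I didn't hear you. Speak again.")

for e in ["session_open", "no_input", "thinking"]:
    print(e, "->", respond(e))
```

The design choice worth noting: every earcon replaces a politeness formula, not a piece of task content, so the functional channel stays fully verbal.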
Jonathan Eisenzopf: Is this even the right question? This question has been addressed, and we know the varying opinions on anthropomorphism. Here’s a different question: should conversational systems follow the social norms of conversation for human speakers, and if so, to what extent? If not, what are the social norms for a conversational system? I suggest that there are social norms for talking to bots, and they are mostly negative. Perhaps, as Roger Moore has suggested, we should start spending more time working on pragmatics. I also suggest that we reconsider examining the social anthropology of conversation. Even though the latter topic can be challenging and unpopular, how can we talk about conversational systems that use human language but not consider these topics?
I consider what you’re asking to be abstract because some designers consider the social anthropology of conversation a dead end. They’re not wrong, because the topic is not fully thought out and is still more theory than practice. What you’re asking could be related, but I am using a more precise term. In both cases there are no agreed-upon written rules, which is why I’m trying to ask a more particular question. When I say that there are social norms for talking to bots, what I mean is that the expectation of many people is that bots don’t understand them and are therefore bad. When I speak to non-practitioners, I often ask them what they think about IVRs and chatbots. I get more negative feedback than positive (“Oh, I hate those things”). When I ask why their view is negative, a common response is, “The system doesn’t understand me.” When I dig into that more, I find that the user expectation is that the system won’t understand them. If that is the case more often than not, and it likely is, it is a social norm. Even though several groups of smart people have suggested “standard behavior” for dialogue systems, the patterns have never been widely adopted, other than perhaps “operator” and a greeting.
Ahmed Bouzid: I see. Yes, that makes sense. Just to make sure I am communicating my aim here clearly: my aim is precisely to come up with very specific, concrete, actionable best practices, guidelines, conventions, etc., that have been found (through scientifically grounded experiments) to be effective in the task of delivering H2M dialogs that result in satisfied users. In other words, I want us to go from “bots don’t understand them and are therefore bad” to something along the lines of “they love bots because bots help them do what they want quickly.” I go back to the example of the ATM. The ATM does use language (written), with menus (non-linear and visual, though still language-based), and it is highly effective. In the last 20 years, I’d say I have used a human agent maybe a handful of times to get cash (almost always because the sum was significant and I also had something else to do in the bank anyway). The ATM, in other words, is a great example of automation that people love.
What would be the equivalent of the ATM in the world of voice assistants, and what are the guidelines that we can infer/translate/come up with for voice assistants so that people feel that they are being understood and served by the bot?
Phillip Hunter: There is only human conversation. And it’s already extraordinarily performative and adaptive. All languages used by humans are human-created languages. And they are already incredibly varied and powerful. We invented them. We continue to invent and evolve them. We are REALLY good at it.
Automated systems (designed by humans!) are slooooooooowly being taught to participate better in human-like conversation. We invented languages to use with machines (some of these use sound for I/O), but they are still human languages, because they represent human-defined concepts, meanings, and intentions. Machines, like some animals, can be trained to interpret and act on the input. BUT, without the training and subsequent input, the machines will do nothing. Nada. Animals will animal. Machines will crumble and rust.
Can we invent more human-machine languages? Sure. There’s no reason to think that this list of hundreds won’t keep growing and changing.
The real question you are asking is whether current and future human-machine language will become nearly indistinguishable from human-human conversation. And we have an answer there, too. We have already been, and will continue to be, on that path. Look at the plainer programming languages and no-code tools. The idea in them is to translate well-known, standard human behaviors directly into instructions the machine can follow. Sure, most of those don’t enable complexity right now, but that trajectory is set.
That leads me to what is left to address: why does this question continue to come up? Three factors: 1) lack of understanding; 2) lack of will; and 3) inadequate tools. All topics for another time.
Ahmed Bouzid: My question is more modest: when we design, say, an Alexa skill or a Google action, should we strive to design the assistant to mimic human conversational behavior, or should we follow new rules? For instance, should the conversation start with a greeting (“Welcome to X…” vs. “<earcon> X…”)? Should the robot say please? Should it use blunt language (“Your pin.” vs. “What’s your pin?” vs. “Can you give me your pin?”)? My thinking is that if we go for the blunt robot, the human will lean less on the human-to-human rules to read meaning into behavior, take less offense at short language (“I didn’t hear you. Speak again.”), and focus on the functional task at hand….
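The choice Ahmed describes could be prototyped as a single “register” switch over the prompt set, so the same dialog acts can be A/B-tested in both styles. A minimal sketch, with invented prompt text (not an actual Alexa or Google Assistant API):

```python
# Two prompt registers for the same dialog acts: "polite" mimics h2h
# conventions, "robotic" strips them. All strings are invented examples.
PROMPTS = {
    "greeting": {"polite": "Welcome to Acme Bank! How can I help?",
                 "robotic": "<earcon> Acme."},
    "ask_pin":  {"polite": "Can you give me your PIN, please?",
                 "robotic": "Your PIN."},
    "no_input": {"polite": "Sorry, I didn't quite catch that.",
                 "robotic": "I didn't hear you. Speak again."},
}

def prompt(act: str, register: str = "robotic") -> str:
    """Render a dialog act in the chosen register."""
    return PROMPTS[act][register]

print(prompt("ask_pin", "polite"))   # Can you give me your PIN, please?
print(prompt("ask_pin", "robotic"))  # Your PIN.
```

Keeping the dialog acts themselves register-neutral is what makes the experiment cheap: only the surface strings change between conditions.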
Phillip Hunter: What I’m saying is that for general language interfaces, there is virtually no possibility that new rules will take. Remember from mobile 8–10 years ago that many apps experimented with multiple additional gestures: spirals, multi-finger, etc. Except for pinch and spread for zooming, every gesture we use today is one we already used. Next to nothing new stuck. And that was within the extremely limited parameters of mobile I/O.
Now, if a company wanted to create a more specialized UI to achieve a specific experience, that could probably work for that context. Success would depend on a detailed and well-crafted experience that had obvious payoffs that were not possible any other way.
Trying to change how people do any spontaneous action is just really, really, really hard.
Diana Deibel: I’m overall of a similar opinion to Phillip Hunter — if you’re going to use conversation as an interface, use conversation. One of the main benefits from a product perspective is that it’s easier for people to use conversation than, say, learn a screen interface, so if you change the conversational rules on them, it takes away that major benefit.
The idea of efficiency that Brielle Nickoloff is talking about is definitely worth considering when it’s a task-related conversation (instead of a relationship-building conversation). Task-related or not, though, the interaction still relies on the same rules of conversation (which, it should be mentioned, differ from culture to culture) so it all circles back to “make it human.”
[Only caveat I’d add there is to ensure the humanity of the conversational interactions doesn’t trick the human (i.e., explicitly state or show that the mechanical speaker is non-human).]
Anja de Castro: I agree with you, Diana Deibel! Using conversation the way we use it h2h is great, as long as we don’t trick the user into thinking he/she is interacting with a human being. I also think it depends on the goal of the user. I myself have worked for a big financial institution and am now working for a theme park. The tone of voice is totally different, as is to be expected, but the same conversation rules need to be followed. One other thing I think is key is the use of empathy and emotion (I just wrote a post on that, asking the community for input). If we are going to design differently for h2m conversations, we need to make sure there’s no lack of empathy and emotion. But of course we have to test a lot, to see what actually works for the user.
Ahmed Bouzid: In my opinion, using straight-up H2H rules when it’s a human talking to a machine will almost always lead to user dissatisfaction, because customer expectations will just go unfulfilled. To be sure, there is the allure of an interface where people need to learn nothing and where the machine is expected to do all of the work, and maybe one day we will get there — but in the meantime, is it worthwhile to come up with a protocol that humans will unfortunately have to learn, but one that will enable us to leverage the eyes-free/hands-free value of VUIs? In other words, I’m trying to decouple the bundle of values that voice provides: naturalness (nothing to learn) and eyes-free/hands-free use, so that we can buy some time to work on the former while we deliver interfaces of the latter kind….
Ryan Hollander: I think I’d get frustrated with a VUI that was so curt. Audio cues might be better for short responses with a high level of meaning encoded in them (based on time and context).
Roger Moore: Recall that there are already human sub-languages (e.g. for air traffic control), but they require extensive learning by the users. So, it is necessary to distinguish between naive versus familiar users, and that depends on particular use-cases for the deployed technology. If the use-cases aren’t specified, then the requirements are ambiguous.
Ahmed Bouzid: Indeed. I think the challenge is at many levels: the h2h baggage smuggled in by the user; the ambition being that the audience is ANYONE competent in language; the limitations of voice-only (or voice-first) interfaces; and more. I think specifying the use case and roles (a bank customer calling a bank agent to tackle account issues) is crucial (the parallel of, say, the ATM). But above use-case-specific rules, are there higher-level rules that we can adopt to deliver consistency in expectations and efficiency in performance? For instance, should we drop the “please” and the “sorry” and the long-winded formulations that are delivered to respect politeness protocols (“Can you please give me your PIN number?” vs. “What is your PIN?”)?
Bobbie Wood: Because the conversational design space is so nascent, we have yet to see human users develop the affordances (sorry, yes, I did just use that worn term!) that can allow us to move to a new standard human-machine voice interaction model. The amazing growth of this space means we are architecting the new patterns. Questions like this are excellent prompts to align convo designers — not unlike in the early days of GUI where the best interaction models (menus, forms, error handling, icons, gestures) eventually became industry best practices that we couldn’t live without. I love thinking about what those patterns might look like for voice and how they’re emerging now! When I proposed the answer pattern for Google Home, I based it on progressive disclosure patterns and adapted it. We’re always building on the metaphors in use today, whether those are human-to-human or human-to-machine. Great question!
Ahmed Bouzid: Great points! Observation: conversational design for far-field devices is indeed nascent, but not for voice in general. Some people on this thread designed for telephony voice in the 90s, and I think the design challenge is more or less the same. (Maybe that’s another topic — how similar are they?) Anyway, probably what we can start with is: what are the basic tenets that should guide the elaboration of a standard? For instance: (1) Fewer words is better (as in, skip “please” and “Welcome to X”). (2) Some behavior will be expected, so that the interface doesn’t need to explain it (as in, “You can interrupt me anytime,” or “Just speak naturally,” etc.)…. I think an elaboration of such tenets can help us start articulating the architecture of the human-machine standard….
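Tenets like these could even be checked mechanically during prompt reviews. A toy checker, under the assumption of an arbitrary eight-word budget and a small list of banned politeness fillers (both the budget and the phrases are invented for illustration):

```python
# Toy style checker for the tenets above: fewer words, no politeness
# boilerplate. The word budget and banned phrases are arbitrary assumptions.
BANNED = ("please", "welcome to", "sorry")
MAX_WORDS = 8

def check_prompt(text: str) -> list:
    """Return a list of tenet violations for a candidate prompt."""
    issues = []
    lowered = text.lower()
    for phrase in BANNED:
        if phrase in lowered:
            issues.append(f"drop politeness filler: {phrase!r}")
    words = len(text.split())
    if words > MAX_WORDS:
        issues.append(f"over word budget ({words} > {MAX_WORDS})")
    return issues

print(check_prompt("Welcome to Acme! Can you please give me your PIN number?"))
print(check_prompt("Your PIN?"))  # passes: no issues
```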
Roger Moore: As an aside, I’m editing a Multimodal Technologies and Interaction special issue on “Speech-Based Interaction” and I’d be happy to receive submissions addressing the issues being discussed in this thread.
Brielle Nickoloff: More and more I think the goal of a conversational UI is to leverage any elements of communication (whether that’s speech, gesture, reading text, a greeting, a pause, an emoji, etc.) that make the interaction “efficient.” And if some of those things resemble human-human communication, so be it. If some of those things resemble a new human-computer protocol, so be it. It’ll likely be a blend.
Ahmed Bouzid: The challenge, after all is said and done, is how to control and set the user’s expectations as to the sophistication of the interlocutor, because such expectations are what determine the meaning perceived. For instance, does a pause mean something, or is it just processing latency? If it’s a human, then a pause can mean something. If it’s a machine, then it doesn’t mean much. Same with interruptions. The danger is this: the more an assistant strives to sound human, the heavier the expectations and the greater the chance of reading meaning where there isn’t any….
PS: My focus in this write-up is on pure voice interactions, btw…
Michael Greenberg: This is a great question! I continue to be of the opinion that the best way to accomplish low-cognitive-load voice interactions is to respect and reflect human language patterns and tendencies. That’s everything from using conjunctions to using embodied metaphors. Doing this increases user comprehension, because it meets the user where they are — it’s how they, and the rest of humanity, communicate every day!
For that reason, I’m still a proponent of reflecting natural language patterns in conversation design. I think there’s room to grow these interaction patterns, but in such a way that they’re an extension of our language paradigm, not a reinvention. Identifying conceptual parallels between forward-facing designs and existing language structures is likely where we’ll find opportunities to grow the space the most: things like a confirmation being implicit in a real-world state change vs. a natural-language “Ok.”
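Michael’s closing example of implicit confirmation via a real-world state change is easy to make concrete. A sketch under the assumption of a hypothetical smart-light command (set_light and the prompt strings are invented):

```python
# Implicit vs. explicit confirmation of the same action. The device call
# (set_light) is hypothetical; in the implicit style, the light turning
# on IS the confirmation, so the system says nothing.
def set_light(on: bool) -> bool:
    """Pretend to toggle a light; return whether the command succeeded."""
    return True

def handle_command(confirm_style: str = "implicit"):
    ok = set_light(on=True)
    if not ok:
        return "I couldn't reach the light."  # errors still need words
    if confirm_style == "explicit":
        return "Ok, the light is on."         # natural-language confirmation
    return None  # implicit: silence, the visible state change speaks

print(handle_command("explicit"))  # Ok, the light is on.
print(handle_command("implicit"))  # None
```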
Ahmed Bouzid: What are your thoughts on the cost of doing that — i.e., user expectations are inflated, and one of the cardinal rules of good UX is to try to at least meet expectations? One finding from the good old days is that users were almost always more satisfied with a simple touchtone IVR than they were with a speech one — even a good one. Which, for me, indicates that the user wants to just move along and do their thing as quickly as possible, and often, pushing buttons is the fastest way to get through a flow…. But I ask in earnest: what would your answer be to the concern that the more human-sounding/behaving a voice UX is, the higher the chances that the user will step out of bounds?
Nick Myers: Ahmed, this is a question that I have been pondering quite a bit this year. No matter what, humans will always deploy natural conversation tactics when it comes to interacting with voice assistants. That, I don’t think, will ever change. However, I don’t think we should be striving to 100% replicate human-to-human interactions with voice assistants. When we humans talk to one another, it isn’t just verbal communication; it is non-verbal as well, and this, I believe, is an additional layer that we won’t soon be able to replicate. This is why right now I am such an advocate of blending human with machine for voice. Yes, we can deploy the same conversation tactics when designing experiences; however, when people use today’s voice assistants, they do so knowing that what they are interacting with is not human, and I think that always needs to be kept in mind on the VUI side of things. Our CTO and I have been experimenting a lot this year with custom voice, using some Microsoft Azure tools to train our own synthetic voice models as opposed to just using what is baked into Alexa and Google Assistant, or pre-recorded voice.
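For readers curious what the Azure side of that experimentation looks like, here is a minimal synthesis sketch using the Azure Speech SDK. (The key, region, and voice name are placeholders; a trained custom neural voice would substitute its own deployed voice name and endpoint ID. This is not Nick’s setup, just the standard SDK entry point.)

```python
# Minimal Azure Speech SDK synthesis sketch.
# Requires: pip install azure-cognitiveservices-speech
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
# A stock neural voice; a custom voice model would set its deployed voice
# name here and speech_config.endpoint_id to the deployment ID instead.
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# Default audio config plays the result through the machine's speakers.
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Your PIN?").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis finished.")
```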
Shyamala Prayaga: The more we try to mimic human-to-human conversations, the more users’ expectations of these bots will level up, to the point where we will reach the uncanny valley very soon. I feel our focus should be on human-to-machine conversations, picking the traits from human-to-human conversations that are meaningful.
Heidi Culbertson: I am in the human-machine camp with designed language nuances.
_________________________________________________________________________
Reading items mentioned
- “Human Enough,” by Karen Kaushansky.
- “Is Spoken Language All-or-Nothing? Implications for Future Speech-Based Human-Machine Interaction,” by Roger Moore.
- “It’s Better to Be a Good Machine Than a Bad Person: Speech Recognition and Other Exotic User Interfaces at the Twilight of the Jetsonian Age,” by Bruce Balentine.