A conversation with Amazon’s head of Alexa
At Web Summit, I had the pleasure of interviewing Rohit Prasad, Amazon's head of Alexa. Here is a transcript of our conversation. You can watch the video here.
Photo credit: Flickr
Transcript:
NT: Rohit, you are a very important man, and I'll explain why. It's now 6am in New York, and in about an hour and a half, my youngest son is going to wake up with your device, ask it whether he should wear pants, and then put on inappropriate hip hop. So, your device has a huge emotional responsibility in my life very soon.
RP: Yes, I'm excited about that option that Alexa has. We want Alexa to be the indispensable, trusted AI assistant, advisor, and companion. And what you just explained is exactly the kind of adoption that we are seeing. And we have hundreds of millions of Alexa customers using many of these devices in their homes. And this momentum is just growing. Our total interactions with Alexa have grown by 30% over the past 12 months.
NT: Well, you know, it doesn't always work because, the other day on Monday, he said, “Alexa, can you make it Sunday?”
RP: Ha ha, there are certain things within the AI's competencies. That's not one of them, actually.
NT: Well that’s the future, and that’s what we are going to get to. Alright, this conversation, I want us to go a little bit through some of the interesting problems in AI that you addressed and solved. Some of the interesting problems in AI that you are facing right now. And then the last part, we will get to some of the interesting problems in AI and applications of Alexa as we move into a world of much more ubiquitous ambient computing. So, let’s start with the past. You come in, and you come in 2013, 2014?
RP: 2013.
NT: And the first problem you have to solve is how to identify which voice is saying what. Explain over the years to come, what is the most interesting AI problem you had to solve and how you solved it.
RP: Yeah, first, it's important to recap how we arrived here. And the first challenge, as you said, was, "How could you speak to a device at a distance?" And it should understand you, what you said, and then do something contextually relevant at that time. Which meant it had to solve what we call far-field speech recognition, which means, "Can you interpret what the customer is saying from far away versus holding a phone?" And then it had to respond back in a pretty compact fashion and in an audio format. So that was challenge one, which meant you had to really invent technology that did not exist in 2014, when we launched Alexa for the first time, about eight years back. And that was challenge number one.
Challenge number two was that we very quickly figured out we couldn't do everything ourselves as Amazon. What we needed to do was foster invention with Alexa the service. And the way to do that was to get Alexa into all kinds of endpoints, not just ones made by Amazon, through a self-service API called the Alexa Voice Service, so that Alexa can be integrated on any device. And with that, we now have Alexa integrated into more than 140,000 Alexa-compatible products, as well as more than 300 million smart home devices connected with Alexa. That's a huge number, and we know the power of self-service is working here.
And the second thing is, when Alexa is on these devices or endpoints, everybody wants a different experience. You as a developer or as a service provider want to reach your customer through Alexa. So we wanted to build what is called the Alexa Skills Kit, which totally democratizes how you build experiences with AI. And that was the second big challenge, which has really leveled the field, so you don't have to be an AI expert to build an experience on Alexa. You just need to bring your data and your API, and everything follows very quickly after that for making conversational skills. And that's why Alexa has more than 130,000 skills that were not built by us. It's incredible momentum there as well.
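The skill model Prasad describes, where a developer brings only their data and API while the platform handles the voice AI, boils down to routing a parsed intent to a developer-supplied handler. Here is a minimal sketch in Python; all intent names, handler names, and responses are invented for illustration and this is not the actual Alexa Skills Kit SDK:

```python
# Toy intent-dispatch model of a voice skill platform.
# The platform parses speech into {"intent": ..., "slots": ...};
# the developer only supplies handlers like the ones below.

def hike_handler(slots):
    # A developer's own data or API call would go here; we fake a lookup.
    trails = {"easy": "Lakeside Loop", "hard": "Ridge Traverse"}
    choice = trails.get(slots.get("difficulty"), "Lakeside Loop")
    return {"speech": f"Try the {choice}."}

def scores_handler(slots):
    return {"speech": f"Looking up scores for {slots.get('team', 'your team')}."}

# Registry mapping intent names to handlers (illustrative names).
HANDLERS = {"RecommendHikeIntent": hike_handler, "TeamScoreIntent": scores_handler}

def dispatch(request):
    """Route a parsed voice request to the matching intent handler."""
    handler = HANDLERS.get(request["intent"])
    if handler is None:
        return {"speech": "Sorry, I can't help with that yet."}
    return handler(request.get("slots", {}))
```

The point of the pattern is the division of labor: speech recognition and intent parsing stay on the platform side, and the developer's code never touches audio.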
NT: So let's talk about the challenges today. One of the things you noticed during the pandemic, which I've heard you talk about before, is sort of the emotional responsibility of Alexa, right? Dealing with older people, dealing with younger people. Like my eight-year-old. Explain how emotional recognition works in Alexa right now.
RP: Yeah, I would say first, before we think about what part emotional recognition plays: when you have something like ambient intelligence, the way we describe it is, the AI is there when you need it, it recedes into the background when you don't, and it even anticipates your needs. Why we want to do ambient intelligence and foster that is because Alexa is not just an AI assistant. I want to dispel that belief; it's not just a voice assistant. It's also a trusted advisor and a companion. And when you go into these roles, think about what an advisor is. If I ask Alexa, "What hike should I do?" then just like my friend, Alexa should tell me, "You should do this hike," right, because it knows me. Or when it tells you that your thermostat is set at a much higher temperature when you're leaving, you should probably lower the temperature. It's giving advice to save energy. Similarly, certain customers, especially kids and elderly adults, are using Alexa in many different ways where it's really a companion. And for that, the AI has to engage in much longer-form conversations versus single-turn requests. That's already happening, and this is where affect, in terms of understanding the emotional state and responding in the appropriate fashion, is super important.
NT: So how, how do you do that? How do you understand? You and I are talking…
RP: Yep.?
NT: And if I talk like this, you have a pretty good sense. And if I talk like this, you've got kind of a different sense. How do you train Alexa on that? Because of course it's also my body posture when I make those remarks, and whether I lean forward. How is Alexa interpreting all of that?
RP: So that depends on what kind of sensors are available on the device. If it's only a microphone and no cameras, then all the cues for how you're responding to the AI, or talking to the AI, depend on audio cues: your prosody changes, your intonation changes, how you emphasize words changes. Let me give you a very concrete example. Say I ask Alexa for scores for my favorite teams. If my team is winning, it should come back and tell me in a very enthusiastic voice that this team is winning. For instance, the New England Patriots for American football. I know I'm in Europe and you follow soccer a lot more, but it's the same thing. And if my team's losing, it should tell me the scores in a not-so-upbeat fashion. That's a simple example of where we are already getting to experiment with not just how you spoke, but what you are asking about, or what you are talking to Alexa about. And that's very important.
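The audio-only cues Prasad lists (prosody, intonation, emphasis) can be caricatured with two simple features: loudness and pitch variation. A toy sketch follows; the thresholds are invented for illustration, and a real system would learn such a classifier from labeled data rather than hand-set rules:

```python
# Toy affect classifier from audio-only cues.
# samples: amplitude values in [-1, 1]; pitches_hz: estimated pitch per frame.
import math

def rms(samples):
    """Root-mean-square energy, a crude loudness measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pitch_variation(pitches_hz):
    """Standard deviation of pitch, a crude 'liveliness' measure."""
    mean = sum(pitches_hz) / len(pitches_hz)
    return math.sqrt(sum((p - mean) ** 2 for p in pitches_hz) / len(pitches_hz))

def classify_affect(samples, pitches_hz):
    """Very rough proxy: loud speech with varied pitch reads as enthusiastic."""
    if rms(samples) > 0.3 and pitch_variation(pitches_hz) > 20.0:
        return "enthusiastic"
    return "neutral"
```

The sketch only illustrates the kind of signal involved; it says nothing about the models Amazon actually uses.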
NT: Did Alexa tell you that I'm a Patriots fan? Is that like your subtle emotional manipulation? What do you KPI it on? Right? So when you're running a test and you're saying, we're gonna experiment with this kind of answer versus this kind of answer, are you KPI-ing it on total number of engagements? Are you KPI-ing it on some metric of satisfaction? Are you KPI-ing it on, like, how much stuff people buy from Amazon that week? What is the KPI?
RP: Yeah, there's several. There's no one-size-fits-all for this because every particular experience is a little different. You talked about entertainment in terms of listening to music. If you like the song Alexa played, you would listen to it for longer.
NT: Yep.
RP: So that's a great cue, and if you're interrupting, then definitely Alexa did something wrong. So we have a fairly sophisticated machine learning-based metric, which tells us whether what we did for the customer was relevant or not. Think of it as customer-perceived satisfaction, or a defect rate. So that's an important KPI, to use your terminology.
NT: Mhm.
RP: But that's not the only one. The amount of time you listen to music can be another one.
NT: Yep.
RP: And if you were ordering a product, did you actually buy it? Was the suggestion that Amazon made through Alexa actually a good one for you? All of those are signals that are incredibly important for Alexa to self-learn and make itself better for our customers.
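The implicit-feedback signals mentioned here (listen time, interruptions, purchase follow-through) can be combined into a rough per-interaction defect estimate. A sketch with invented weights; as Prasad says, the production metric is a learned model, not a hand-tuned formula like this one:

```python
# Toy per-interaction defect estimate from implicit feedback signals.
# Weights are illustrative; a real system learns them from labeled data.

def defect_score(listened_sec, track_sec, interrupted, purchased=None):
    """Return a 0..1 defect estimate (higher = likely bad experience)."""
    score = 0.0
    if interrupted:
        score += 0.5                               # barge-in strongly suggests a miss
    score += 0.4 * (1.0 - min(listened_sec / track_sec, 1.0))
    if purchased is False:                         # suggestion made but not bought
        score += 0.1
    return round(min(score, 1.0), 2)
```

Aggregating this score across interactions gives the kind of "defect rate" KPI described above.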
NT: Got it. Alright, let's talk about the sensors you have. You mentioned the voice sensor, but in the future, and even in the present, Alexa's not just measuring voice. It's looking at sound waves, it's looking at vision. In fact, it's developing sensors that are superhuman. So, what sensors do we have now that might surprise the audience? And what sensors will we have in the future that will definitely surprise the audience?
RP: Yes, sensors are incredibly important for our vision of ambient intelligence, because it's all about sensing what's in the environment and then responding appropriately. And I want to go back a bit: the most important sensor in this case is the microphone. The first incarnation of ambient intelligence was the Amazon Echo. You could suddenly speak to the environment and the environment spoke back. So that's the most obvious one. But then the camera became a delighter. When I come home and there's a note from my family for me, if I have visually enrolled with Alexa, it shows me a sticky note relevant to me. So that's a pretty delightful experience, where it says, like, my son has gone for tennis and you have to pick him up. That sticky note is, again, the power of ambient intelligence. There are a few other things, like ultrasound, which can help with just a tap gesture: shut off your alarm if you don't wanna speak to it. These are the kinds of things that I believe will keep getting better and better as the number of sensors grows around our world. And Alexa will be an important AI that works in the background, connecting all these sensors, even anticipating your needs to complete actions on your behalf.
NT: Okay, let's talk about another thing that's going on right now. This summer you announced that Alexa would be able to take just a minute of voice from a relative or any person, and then read in that voice. And you had a very emotional ad of a child listening to a story from their grandmother who'd recently passed away. Some people adore this feature, and it scared the bejeebers out of others. So tell me about the feedback you've gotten and how you've adapted this feature.
RP: First of all, that was a scientific breakthrough we were describing, and the grandmother was alive in that one. She was just not in the same vicinity. And it's an incredible technology.
NT: It's crazy.
RP: Well, I would say it's a scientific breakthrough that is helping us do many things faster. For instance, if Alexa needs to be a sportscaster and talk like how a sportscaster would speak, then that is essentially a very good way to do it. The same technology is helping us develop many different kinds of voices. And in terms of the personal voice, lots of people have said that they would want that feature, but we have to think about what's the best use case. But that was a good demonstration of how you could take, just as you said, a minute of speech and produce a high-quality voice.
NT: Okay, well my children want to turn their Alexa into Snoop Dogg, so we'll see whether that works. Let's talk a little bit about what's coming in the future. What is the most complicated problem you're trying to solve right now that you're not certain you're gonna be able to solve in the next two years?
RP: Yeah, I think, as I mentioned, it's the roles that customers expect Alexa to be great at. The expectations keep growing, and the only way to meet them and achieve this vision I described of the trusted assistant, advisor, and companion is for Alexa to be great at many tasks without bespoke, custom development of AI, which is a huge challenge.
I just mentioned the huge number of skills Alexa has. You can't do this through bespoke AI, so we have to invest a lot in generalized intelligence, by which I mean that Alexa needs to learn multiple tasks it needs to get done for customers simultaneously. And second, it has to self-learn and adapt with minimal input to a changing environment.
NT: Mhm.
RP: And these are incredibly hard challenges. And lastly, as I mentioned, with the ambient intelligence vision, Alexa needs to anticipate a lot more and automate a lot more of what you do on a routine basis. If you wake up in the morning, turn off your alarm, ask for traffic, play your music, like you were describing, can Alexa automate all those actions so that you just say, "Alexa, good morning," and it does them automatically on your behalf? That's an incredibly hard challenge. Those are the kinds of challenges we have to solve, and it's already working, because more than 30% of interactions in smart home control are now Alexa-initiated, without an explicit voice request.
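The "Alexa, good morning" automation can be pictured as learning a routine once the same action sequence has repeatedly followed a trigger phrase. A toy sketch; the repeat threshold and action names are illustrative, not how Alexa routines are actually implemented:

```python
# Toy routine learner: bundle repeated action sequences under a trigger phrase.
from collections import Counter

class RoutineLearner:
    def __init__(self, min_repeats=3):
        self.min_repeats = min_repeats
        self.history = Counter()          # (trigger, actions) -> times observed
        self.routines = {}                # trigger -> learned action list

    def observe(self, trigger, actions):
        """Record that `actions` followed `trigger`; promote to a routine if frequent."""
        key = (trigger, tuple(actions))
        self.history[key] += 1
        if self.history[key] >= self.min_repeats:
            self.routines[trigger] = list(actions)

    def run(self, trigger):
        """Return the learned routine's actions, or nothing if none exists yet."""
        return self.routines.get(trigger, [])
```

Once promoted, a single utterance replays the whole sequence, which is the anticipation/automation behavior described above.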
NT: So it's not actually following a flow chart. It's not, "Please do this." It's more that Alexa gets a question it has not heard before.
RP: Right.
NT: That Alexa is using generalized AI to respond to.
RP: Yeah so, there are two aspects to it. So when you ask something that Alexa knows about, but it's coming out of sequence, as you said, it should maintain that context to answer it. But then there are other issues that happen where you refer to, let's say in your home, you have a thermostat or a light, and you call your light the reading light, but you've never said that, right? That this is my reading light. And if you said, “Alexa, turn on the reading light.” Alexa should come back and ask you, “What's the reading light?”
NT: Right.
RP: So this is where it should also engage in a dialogue with you when it doesn't know exactly what you wanted. And this is how we as humans learn, so these humanlike learning and interaction skills are super important.
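The clarifying-dialogue behavior ("What's the reading light?") amounts to asking when a spoken device name resolves to nothing, then remembering the user's answer as an alias. A minimal sketch with invented device names; not the actual Alexa smart home stack:

```python
# Toy name resolution with a clarifying question for unknown device names.

class SmartHome:
    def __init__(self, devices):
        self.devices = set(devices)       # canonical device ids
        self.aliases = {}                 # user phrase -> device id

    def handle(self, phrase):
        """Turn on a device if the phrase resolves; otherwise ask instead of guessing."""
        device = self.aliases.get(phrase, phrase if phrase in self.devices else None)
        if device is None:
            return f"What's the {phrase}?"
        return f"Turning on {device}."

    def teach(self, phrase, device):
        """Remember the user's answer so the question is never asked again."""
        self.aliases[phrase] = device
```

The key design point is in `handle`: an unresolved name produces a question, not a guess, mirroring the humanlike learning described above.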
NT: So as Alexa becomes more intelligent and as it becomes more embedded in our life, and as we get sensors in every device, privacy becomes more important. The best way that I would feel less concerned about Alexa and privacy is if all the action was done locally, right? If instead of sending my voice command to an Amazon, you know, server somewhere, it actually just did it on the device. How much can we do locally now, and in the future, how much will we be able to do locally?
RP: Yeah, thinking about privacy and security, first of all, is paramount. I think we have to design experiences with customer trust in mind, and that's what we are doing. I don't think it's a tradeoff. I don't think it's a limitation. It's actually a huge invention opportunity, and local processing, as you said, is one of them. And many use cases necessitate it: if you are in a car, you can lose your connectivity often, and then Alexa should still function very well. And this is why we invested from the very beginning in what we call the Alexa hybrid engine, where the bulk of processing happens locally, so that you can recognize speech on the device itself.
Another use case, which is again driven by both the seamlessness of conversations and privacy, as you mentioned, is you can say, "Alexa, join the conversation," where you don't have to say the wake word "Alexa" on every turn. And for that, you can imagine that, of course, we have to process locally. This is a delightful experience that is available on our premium devices, for instance the Echo Show 10, where Alexa figures out who's talking to it and then also decides when to respond, just like we as humans do. Sometimes I'm referring to you, sometimes to someone else in the same house. So that is the kind of capability that Alexa already has, and that is where all the processing is happening locally, for speech and even for the visuals if you have a camera on the device. Only the information that Alexa needs to process in the cloud goes to the cloud, and it brings back an answer instantaneously.
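The hybrid-engine idea, local processing first with the cloud used only when needed and reachable, can be sketched as a simple routing policy. The capability set and request kinds here are invented for illustration:

```python
# Toy local-first ("hybrid") request routing.
# Requests the device can handle stay on-device; others go to the cloud
# only when connectivity allows.

LOCAL_CAPABILITIES = {"stop_alarm", "volume", "lights"}   # illustrative set

def route(request_kind, cloud_available):
    """Return where a request is processed under a local-first policy."""
    if request_kind in LOCAL_CAPABILITIES:
        return "local"
    if cloud_available:
        return "cloud"
    return "unavailable"        # e.g. a car that has lost connectivity
```

The privacy and latency benefits follow from the ordering: the cloud is a fallback, never the first stop.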
NT: Got it. Alright, so you've been working on Alexa for eight years now. Eight years into the future. Where will it be and what will it do? Like, will I have a sensor in my shoe, and I'll tell it to tie my shoe and I won't have to bend down? How deeply ingrained will it be in our lives in your ideal vision?
RP: Yeah, as I mentioned, we do want this AI to be available everywhere and for everyone. And I'm incredibly optimistic about that future. The way I think about ambient intelligence is that it should simplify your life and reduce your cognitive burden, so that it frees up time for you to spend with the people you want to spend time with and the experiences where you need to spend your time. This is incredibly powerful, in my opinion, and I'm very optimistic that as the sensory world changes, where we have more sensors around us, and on us as well in some cases, the AI will be there everywhere for us.
And I'm also humbled by the fact that it's not a technology just for the privileged. I hear stories from India where Alexa devices are in remote rural villages, where kids are interacting with them in schools that don't have a computer, but they have an Echo Dot, and they can just learn about any topic with Alexa. That gives me a ton of optimism that this will truly work for everyone, everywhere.
NT: And very quickly, put the people at ease who are worried that Alexa is gonna turn into a generalized intelligence that's going to turn them all into paperclips and use them as batteries.
RP: So to me, that's not the dystopian AI image that I have. As I mentioned, we have a very pragmatic view of AI and what it needs to deliver. And if you put customers first and put the right guardrails in place, I think we have a very optimistic future where the AI will be beneficial for everyone.
NT: Wonderful. Well, Alexa told me to end this interview on time, and we are right on time, to the second. Thank you, Rohit. That was absolutely fascinating and wonderful to share the stage with you.
RP: Thank you. Thank you very much.
NT: Thank you, Rohit.
RP: Thank you, thank you.