The Now and Future of Speech Interfaces
We have been dreaming about speech as an interface for a long time. It’s not an easy problem, but we are finally close to making it a reality.
To build a product with a kickass speech interface, you need two things:
- Audio Hardware: A good microphone and speaker. This could be a device for the home, car or office. It could also be a smart headset or earbuds for use while commuting, working out or walking the dog.
- Speech Recognition Software: A speech interface needs to understand what users are talking about. The first step is a high-quality speech-to-text engine. The second step is extracting intent from the resulting text. The third step is enabling developers to build, on top of this platform, a new world where we talk to objects and, within a few years, can't imagine living without doing so.
Over the last couple of weeks, I have explored how far along we are in this journey and what possibilities exist for building a speech interface today. I built an iPhone app using Google's recently released Speech Recognition API that lets users issue voice commands to remember their thoughts.
As a shortcut to intent recognition, I assume users want to remember a book, a movie/TV show or a place when they issue commands starting with read, watch or visit respectively. Given an intent, I use the corresponding search API from Goodreads, the Open Movie Database (OMDB) or Yelp to fetch content. For example, “watch The Big Short” will fetch the movie “The Big Short” using the OMDB API and save it locally.
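Here is a minimal Swift sketch of that routing, assuming one prefix-matching rule per command word. The Intent enum and detectIntent function are hypothetical names of my choosing; only the read/watch/visit prefixes come from the app described above.

```swift
import Foundation

// Hypothetical intent type; each case carries the free-text query
// that follows the command word.
enum Intent {
    case book(query: String)   // routed to the Goodreads search API
    case video(query: String)  // routed to the OMDB search API
    case place(query: String)  // routed to the Yelp search API
}

func detectIntent(from transcript: String) -> Intent? {
    let text = transcript.trimmingCharacters(in: .whitespacesAndNewlines)
    let lowered = text.lowercased()
    // One rule per command prefix; the rest of the utterance is the query.
    let rules: [(prefix: String, make: (String) -> Intent)] = [
        ("read ",  { .book(query: $0) }),
        ("watch ", { .video(query: $0) }),
        ("visit ", { .place(query: $0) }),
    ]
    for rule in rules where lowered.hasPrefix(rule.prefix) {
        let query = String(text.dropFirst(rule.prefix.count))
        return rule.make(query)
    }
    return nil // no rule matched; the app would ask the user to rephrase
}

// detectIntent(from: "watch The Big Short")
// → .video(query: "The Big Short")
```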
Speech to Text is pretty much solved. Text to Intent is still a hard problem.
Speech to Text
Due to massive investments by Google, Amazon, Microsoft and Apple, I think this first step, a high-quality speech-to-text engine, is pretty much solved. Alexa transcribes my speech very well.
Google’s Speech API (while still in beta!) is VERY accurate, and it is trained to transcribe speech even in noisy environments. It expects developers to provide raw, unprocessed audio and gives them a lot of flexibility in how they use the API. For example, you can ask it to detect single or multiple statements, provide streaming audio or upload individual files, and have it signal the end of a single utterance, the end of speech or the end of the audio. The only time this API did poorly was when I was speaking with a TV show playing in the background.
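For a sense of the developer surface, here is a sketch of a one-shot (non-streaming) request against the API's REST endpoint. The URL and JSON shape follow Google's public v1 documentation (my app used the original beta); the API-key handling and audio parameters are illustrative assumptions, not the app's exact code.

```swift
import Foundation

// Sends raw PCM audio to Google's Speech REST API and returns the top
// transcript. LINEAR16 at 16 kHz is one of the documented encodings;
// the raw audio bytes travel base64-encoded in the JSON body.
func recognize(audio: Data, apiKey: String,
               completion: @escaping (String?) -> Void) {
    let url = URL(string:
        "https://speech.googleapis.com/v1/speech:recognize?key=\(apiKey)")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let body: [String: Any] = [
        "config": [
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US"
        ],
        "audio": ["content": audio.base64EncodedString()]
    ]
    request.httpBody = try? JSONSerialization.data(withJSONObject: body)

    URLSession.shared.dataTask(with: request) { data, _, _ in
        // The best transcript lives at results[0].alternatives[0].transcript.
        guard let data = data,
              let json = (try? JSONSerialization.jsonObject(with: data))
                  as? [String: Any],
              let results = json["results"] as? [[String: Any]],
              let alternatives = results.first?["alternatives"]
                  as? [[String: Any]],
              let transcript = alternatives.first?["transcript"] as? String
        else { return completion(nil) }
        completion(transcript)
    }.resume()
}
```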
Text to Intent
Now that speech-to-text accuracy is high, I expect a lot of investment in improving intent detection from text in the near future. You can get quite far with a simple rule-based intent detection system. From the pattern of commands my Amazon Echo expects, I'm guessing Alexa uses many such rules in its intent detector. Building a highly versatile machine learning solution for intent detection is going to be difficult, simply because in many cases the same text can have very different 'correct' meanings.
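To make the rule-based approach concrete, here is a sketch of pattern-plus-slot rules of the kind an Alexa-style detector might use. This is a guess from observed behavior, not Amazon's actual implementation; the intent names and regexes are invented for illustration.

```swift
import Foundation

// A rule is a regex whose capture groups fill named slots.
struct IntentRule {
    let name: String
    let pattern: NSRegularExpression
    let slots: [String]
}

func makeRule(name: String, regex: String, slots: [String]) -> IntentRule {
    let pattern = try! NSRegularExpression(pattern: regex,
                                           options: [.caseInsensitive])
    return IntentRule(name: name, pattern: pattern, slots: slots)
}

let rules = [
    makeRule(name: "PlayMusic",
             regex: "^play (.+) by (.+)$", slots: ["song", "artist"]),
    makeRule(name: "SetTimer",
             regex: "^set a timer for (\\d+) minutes$", slots: ["minutes"]),
]

func match(_ utterance: String) -> (intent: String, slots: [String: String])? {
    let range = NSRange(utterance.startIndex..., in: utterance)
    for rule in rules {
        guard let m = rule.pattern.firstMatch(in: utterance, options: [],
                                              range: range) else { continue }
        var slots: [String: String] = [:]
        // Capture group i + 1 fills slot i.
        for (i, slot) in rule.slots.enumerated() {
            if let r = Range(m.range(at: i + 1), in: utterance) {
                slots[slot] = String(utterance[r])
            }
        }
        return (rule.name, slots)
    }
    return nil // unseen or ambiguous phrasing: exactly where rules break down
}

// match("play Hotel California by Eagles")
// → ("PlayMusic", ["song": "Hotel California", "artist": "Eagles"])
```

The fall-through return is the interesting part: rules are cheap and predictable, but any phrasing the rule author didn't anticipate falls out the bottom, which is why the same text having different 'correct' meanings makes this approach hit a ceiling.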
One of the main problems that makes [text] parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences (say 20 or 30 words in length) to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context … Humans do a remarkable job of dealing with ambiguity, almost to the point where the problem is unnoticeable; the challenge is for computers to do the same.
(via Google Research Blog)
There are several ongoing efforts in this area: Wit.ai and Api.ai let you train Natural Language Processing (NLP) models based on your own rules and training data. The founders of Siri are working on a new company called Viv that looks like a much more context-aware version of Siri. Google has been launching new NLP APIs and open-source initiatives over the past few months.
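As one example of what these services look like to a developer, here is a sketch of handing a transcript to Wit.ai's documented GET /message endpoint. The response parsing assumes the current JSON shape (a top-level "intents" array, which has changed across API versions), and the server access token is a placeholder.

```swift
import Foundation

// Asks a hosted NLP service for the most likely intent of an utterance.
func extractIntent(from transcript: String, token: String,
                   completion: @escaping (String?) -> Void) {
    var components = URLComponents(string: "https://api.wit.ai/message")!
    components.queryItems = [URLQueryItem(name: "q", value: transcript)]
    var request = URLRequest(url: components.url!)
    request.setValue("Bearer \(token)", forHTTPHeaderField: "Authorization")

    URLSession.shared.dataTask(with: request) { data, _, _ in
        // Take the highest-confidence intent, e.g. "watch_movie".
        guard let data = data,
              let json = (try? JSONSerialization.jsonObject(with: data))
                  as? [String: Any],
              let intents = json["intents"] as? [[String: Any]],
              let top = intents.first?["name"] as? String
        else { return completion(nil) }
        completion(top)
    }.resume()
}
```

The appeal over hand-written rules is that the training data, not a regex list, defines which phrasings map to which intent.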
The Future Of Speech Interfaces
A new interface modality is a massive driver in creating new experiences or significantly improving existing clunky-yet-functional ones, enabling use cases that would otherwise have been too hard. For example, the direct manipulation afforded by smartphone touch screens made existing experiences like communication, maps and self-expression much easier and more useful. It also led to entirely new experiences, like the musical instrument app Ocarina by Smule.
Similarly, adoption of speech interfaces could lead to improved experiences of two kinds:
- A speech interface for existing apps: All existing apps could add a speech interface that accepts quick commands. A music app could play a song, a shopping app could buy things that come to mind, or a communication app could send a friend a message on your behalf while you are cooking dinner. In fact, many quick tasks that currently require me to take my phone out of my pocket, fiddle with a passcode screen and finally use the app could easily be handled by a speech interface for that app. This would help people be more present and less distracted.
- Unlock new experiences: Speech is a great interface for issuing commands to a computer while doing other things. Speech could be a fantastic interface in an immersive virtual reality environment. Speech interfaces could also help us cross language barriers at work and while traveling. Speech could even help us formulate complex search queries far better than typing does, making it easier to retrieve information.
A speech interface needs to be quick, non-interruptive and accurate: a fire-and-forget experience that lets users issue complex commands, guesses the right intent and executes the command, all within milliseconds.
It is an exciting future and I can’t wait to live in it.
Thanks to Gaurav Dosi, Chinmay Jain and Prasang Upadhyaya for reading drafts and providing feedback.