The machines that learned to listen
Katia Moskvitch, MPhil
Ensuring that the world is ready for quantum computing | Physicist | Author & Public Speaker
Voice recognition technology makes many aspects of modern life easier. The seeds were sown a lot further back than you might think.
A toddler meanders unsteadily through the living room, pausing by a sleek black cylinder in the corner. “Alexa,” he says in a high-pitched voice. “Play children music.” The cylinder acknowledges the request, despite the muffled pronunciation, and the music starts.
Alexa, Amazon’s cloud-based speech recognition software and the brain of its black cylindrical loudspeaker Echo, has been a big hit around the world – except with the youngest users, who simply take it for granted. Children will grow up alongside it, just as Alexa will evolve, as the AI powering it learns to answer more and more questions and – perhaps – one day even to converse freely with people.
But anyone older than 10 will know that it hasn’t always been like that. Speech recognition software has come a long, long way to where we are today. Echo is slimmer than a beer glass, but the first speech recognition machines – developed during the middle of the 20th Century – nearly took up an entire room.
Humans have long wanted to speak to machines – or at least make them talk to us. “Voice enables unbelievably simple interaction with technology – the most natural and convenient user interface, and the one we all use every day,” says Jorrit Van der Meulen, VP at Amazon Devices and Alexa EU. “Voice is the future.”
Back in 1773, German-born scientist Christian Kratzenstein, a professor of physiology in Copenhagen, seemed to be thinking along the same lines. He built a peculiar device that produced sounds similar to human vowels using resonance tubes connected to organ pipes. Just over a decade later, Wolfgang von Kempelen in Vienna created a similar Acoustic-Mechanical Speech Machine. And in the early 19th Century, English inventor Charles Wheatstone improved on von Kempelen's system with resonators made out of leather, whose configuration could be changed by hand to produce different speech-like sounds.
Audrey could recognise the sound of a spoken digit – zero to nine – with more than 90% accuracy
Then in 1881, Alexander Graham Bell, his cousin Chichester Bell and Charles Sumner Tainter built a rotating wax-coated cylinder on which a stylus cut vertical grooves in response to incoming sound pressure. The invention paved the way for the first recording machine, the "Dictaphone", patented in 1907. The idea was to do away with stenographers: the machine would record dictated notes and letters so that a secretary could type them up later. The invention took off, and more and more offices around the globe soon sported a secretary with a clunky earpiece, listening to the recordings and transcribing them.
But all those baby steps kept machines passive – until “Audrey”, the Automatic Digit Recognition machine, came along in 1952. Made by Bell Labs, the huge machine occupied a six-foot-high relay rack, consumed substantial power and trailed streams of cables. It could recognise the fundamental units of speech, known as phonemes.
Back then, computing systems were extremely expensive and inflexible, with limited memory and computational speed. Even so, Audrey could recognise the sound of a spoken digit – zero to nine – with more than 90% accuracy, at least when uttered by its developer HK Davis. It worked with 70-80% accuracy for a few other designated speakers, but far less well with unfamiliar voices. “This was an amazing achievement for the time, but the system required a room full of electronics, with specialised circuitry to recognise each digit,” says Charlie Bahr of Bell Labs Information Analytics.
Because Audrey could recognise only the voices of designated speakers, its use was limited: it could, for instance, have offered voice dialling for toll operators, but that was hardly a necessity, since in most cases manual push-button dialling was cheaper and easier. Audrey was an early bird – it arrived before general-purpose computers were widely available – and although it was never used in production systems, “it showed that speech recognition could be made practical”, says Bahr.
But there was another goal. “I believe Audrey was initially developed to reduce bandwidth, the volume of data travelling over the wires,” says Bahr’s colleague Larry O’Gorman of Nokia Bell Labs. Recognised speech would require far less bandwidth than the original sound waves. As telephone switches went digital in the 1970s and 80s, they enabled faster and cheaper call routing, yet still depended on an operator recognising a person’s request to dial a number. So a huge part of Bell Labs’ speech research in those decades went into a seemingly simple task: recognising the digits zero to nine, plus ‘yes’ and ‘no’. “With recognition of these 12 words, the telephone system was able to complete the transition to machine-only telephony,” says O’Gorman.
Audrey was not the only kid on the block, though. In the 1960s, several Japanese teams worked on speech recognition, the most notable efforts being a vowel recogniser from the Radio Research Lab in Tokyo, a phoneme recogniser from Kyoto University, and a spoken-digit recogniser from NEC Laboratories.
We don’t want to look things up in dictionaries – so I wanted to build a machine to translate speech – Alexander Waibel
At the 1962 World's Fair, IBM showcased its "Shoebox" machine, able to understand 16 spoken English words. There were other efforts in the US, UK and the Soviet Union, with Soviet researchers inventing the dynamic time-warping (DTW) algorithm and using it to build a recogniser that could handle a 200-word vocabulary. But all these systems were mostly based on template matching, in which an incoming word is compared against stored voice patterns.
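To make the template-matching idea concrete, here is a minimal, hypothetical sketch in Python of how a DTW-based recogniser of that era might compare an utterance against stored word templates. The feature arrays and function names are invented for illustration; the real systems worked on analogue or early digital acoustic measurements, not toy vectors like these.

```python
import numpy as np

def dtw_distance(template, utterance):
    """Classic dynamic time warping: align two feature sequences of
    possibly different lengths and return the cost of the best alignment."""
    n, m = len(template), len(utterance)
    # cost[i, j] = best alignment cost of template[:i] against utterance[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - utterance[j - 1])  # local distance
            # stretch or compress time by choosing the cheapest predecessor
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognise(utterance, templates):
    """Template matching: pick the stored word whose template warps
    onto the utterance with the lowest total cost."""
    return min(templates, key=lambda word: dtw_distance(templates[word], utterance))

# Toy example with made-up one-dimensional "acoustic features" for two words
templates = {
    "yes": np.array([[0.1], [0.9], [0.8], [0.2]]),
    "no":  np.array([[0.7], [0.3], [0.3], [0.6]]),
}
spoken = np.array([[0.1], [0.85], [0.9], [0.75], [0.25]])  # a slightly slower "yes"
print(recognise(spoken, templates))  # -> "yes"
```

The point of the warping step is that a word spoken faster or slower than its template can still line up with it, which is what made the approach workable for small, fixed vocabularies such as the 200-word Soviet recogniser mentioned above.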
The most significant leap forward of the era came in 1971, when the US Department of Defense’s research agency Darpa funded a five-year Speech Understanding Research programme, aiming to reach a minimum vocabulary of 1,000 words. A number of companies and universities, including IBM, Carnegie Mellon University (CMU) and Stanford Research Institute, took part in the programme. That’s how Harpy, built at CMU, was born.
Unlike its predecessors, Harpy could recognise entire sentences. “We don’t want to look things up in dictionaries – so I wanted to build a machine to translate speech, so that when you speak in one language, it would convert what you say into text, then do machine translation to synthesise the text, all in one,” says Alexander Waibel, a computer science professor at Carnegie Mellon who worked on Harpy and another CMU machine, Hearsay-II.
Moving from single words to phrases wasn’t easy. “With sentences, you get words flowing into each other, you get a lot of confusion and don’t know where the words end and where they begin. So you have things like ‘euthanasia’, which could be ‘youth in Asia’,” says Waibel. “Or if you say ‘Give me a new display’ it could be understood as ‘give me a nudist play’.”
All in all, Harpy recognised 1,011 words – approximately the vocabulary of an average three-year-old – with reasonable accuracy, thus achieving Darpa’s original goal. It “became a true progenitor to more modern systems”, says Jaime Carbonell, director of the Language Technologies Institute at CMU, being “the first system that successfully used a language model to determine which sequences of words made sense together, and thus reduce speech recognition errors”.
In the years that followed, speech recognition systems evolved further. In the mid-1980s, IBM built a voice-activated typewriter dubbed Tangora, capable of handling a 20,000-word vocabulary. IBM’s approach was based on a hidden Markov model, which adds a statistical layer to digital signal processing techniques, making it possible to predict the most likely phonemes to follow a given phoneme.
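As a rough illustration of the hidden-Markov idea – not IBM’s actual Tangora implementation – the sketch below uses the classic Viterbi algorithm to pick the most probable sequence of phonemes given some crudely quantised acoustic frames. All the phoneme labels, observation symbols and probabilities are made up for the example.

```python
import numpy as np

# Hidden states are phonemes; observations are crude acoustic frame labels.
phonemes = ["t", "ae", "ng"]            # hidden states
frames   = ["burst", "vowel", "nasal"]  # observation symbols

start = np.array([0.8, 0.1, 0.1])       # P(first phoneme) - invented numbers
trans = np.array([[0.1, 0.8, 0.1],      # P(next phoneme | current phoneme)
                  [0.1, 0.3, 0.6],
                  [0.3, 0.3, 0.4]])
emit  = np.array([[0.7, 0.2, 0.1],      # P(observed frame | phoneme)
                  [0.1, 0.8, 0.1],
                  [0.1, 0.2, 0.7]])

def viterbi(observations):
    """Return the most likely sequence of phonemes for the observed frames."""
    obs = [frames.index(o) for o in observations]
    n_states, n_obs = len(phonemes), len(obs)
    prob = np.zeros((n_obs, n_states))            # best path probability so far
    back = np.zeros((n_obs, n_states), dtype=int) # which previous state was best
    prob[0] = start * emit[:, obs[0]]
    for t in range(1, n_obs):
        for s in range(n_states):
            scores = prob[t - 1] * trans[:, s] * emit[s, obs[t]]
            back[t, s] = np.argmax(scores)
            prob[t, s] = scores[back[t, s]]
    # trace back the most probable path
    path = [int(np.argmax(prob[-1]))]
    for t in range(n_obs - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [phonemes[s] for s in reversed(path)]

print(viterbi(["burst", "vowel", "nasal"]))       # -> ['t', 'ae', 'ng']
```

In a real system the transition and emission probabilities are learned from large amounts of recorded speech, which is what lets the model judge which phoneme is most likely to follow another rather than relying on hand-set values like these.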
Google’s trick was to use cloud computing to process the data received by its app
IBM’s competitor Dragon Systems came up with its own approach, and technological advances finally pushed speech recognition far enough that it could find its first applications – such as dolls that kids could train to speak. But still, despite these successes, all the programs at the time used discrete dictation, meaning the user had to pause… after… every… word. In 1990, Dragon released the first consumer speech recognition product, Dragon Dictate, for a whopping $9,000. Then in 1997 Dragon NaturallySpeaking appeared – the first continuous speech recognition product.
“Before that time, speech recognition products were limited to discrete speech, meaning that they could only recognise one word at a time,” says Peter Mahoney, senior vice president and general manager of Dragon, Nuance Communications. “By pioneering continuous speech recognition, Dragon made it practical for the first time to use speech recognition for document creation.” Dragon NaturallySpeaking recognised speech at about 100 words per minute – and it is still used today, for instance, by many doctors in the US and the UK to document their medical records.
In the last 10 years or so, machine learning techniques loosely based on the workings of the human brain – artificial neural networks – have allowed computers to be trained on huge datasets of speech, enabling accurate recognition of many speakers with many different accents.
Still, the technology stalled until Google released its Google Voice Search app for the iPhone in 2008. Google’s trick was to use cloud computing to process the data received by its app. Suddenly, publicly available voice recognition had massive amounts of computing power at its disposal. Google could run large-scale analysis to match a user's words against the huge number of human-speech examples it had amassed from billions of search queries. In 2010, Google added "personalised recognition" to Voice Search on Android phones, and brought Voice Search to its Chrome browser in mid-2011. Apple quickly offered its own version, called Siri, while Microsoft called its AI Cortana, named after a character in the popular Halo video game franchise.
Automatic speech recognition is still far less successful than the human ear in many situations – Larry O’Gorman, Nokia Bell Labs
So what’s next? “Within speech processing, the most mature technology is speech synthesis,” says O’Gorman. “Machine voices now are largely indistinguishable from a human’s. But automatic speech recognition is still far less successful than the human ear in many situations.” While a machine can recognise the speech of a person talking clearly in a quiet environment, the so-called “cocktail-party effect” – where humans can understand a single speaker amid the din of a party – is still beyond any state-of-the-art technology. Even with Alexa, in a noisy room you have to make sure you’re right next to the black cylinder and speak to it clearly and loudly.
Amazon’s attempt at voice recognition was inspired by the Star Trek computer, says Van der Meulen, with the aim of creating a computer in the cloud that’s controlled entirely by your voice – so that you could converse with it in a natural way. Sure, the magic of Hollywood still has the edge on today’s technology, but, says Van der Meulen, “we’re in a golden age of Machine Learning and AI. We’re still a long way from being able to do things the way humans do things, but we’re solving unbelievably complex problems every day.”
This article was first published on BBC Future, part of my column Hidden Histories