Speech to Text (STT), or Automatic Speech Recognition (ASR), is needed for virtual agents or voicebots to understand voice input from users. In recent years, machine learning based approaches have replaced older ways of transcribing speech, though significant differences between products and solutions remain. These days STT can be as good as or better than humans at understanding spoken language. Yet many virtual assistant and virtual agent interactions via voice continue to disappoint. Why is that?
Key aspects and challenges of using STT for Conversational AI (CAI) are:
- Audio quality: Input ranges from high-quality audio captured by a smart speaker like Amazon Echo or by a smartphone app, down to lower phone-call quality with 8 kHz sampling, and to noisy input in a kiosk setting or from a person speaking in a car. Lower audio quality still has a big impact on transcription accuracy.
- General English vs. specific language and names: Leading STT products are very good at transcribing general English and other prominent languages – potentially beating humans at it. Domain-specific language and names remain an ongoing challenge, though – as they do for humans. On a personal note: even native German speakers frequently do not spell (transcribe) my last name correctly.
- Whether specific audio models can – or must – be trained: Some products allow training custom audio models, while other products only produce good results with a trained model – additional effort to build and maintain.
- Whether STT can be steered dynamically to better transcribe specific responses: At times a virtual agent will know what the user might say – e.g. the answer to a security question. Most STT products allow steering speech recognition – from dynamic speech grammars to providing hints. This can go as far as a flow that collects the parts of a US address step by step, starting with the town/ZIP code and then steering STT toward the relevant street names (see the sketch after this list).
- Other relevant capabilities include number recognition and transcription (123 instead of “one two three”) and a built-in capability for users to spell names – key when STT fails to transcribe a name directly.
- Other challenges include transcribing audio from different locales, from speakers with an accent, and providing an accurate transcription when multiple languages are mixed – often a topic in Asian countries but also common among non-native speakers.
- Understanding when a user is finished speaking: That is key because a response needs to be fast – yet a user will be annoyed if input given after a short pause is ignored.
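To make the steering point more concrete, here is a minimal sketch using the Google Cloud Speech-to-Text Python client as one example – most vendors offer a similar phrase-hint or grammar mechanism. The street names, file name, and boost value are illustrative assumptions, not taken from any specific product flow.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Bias recognition toward the answers the virtual agent expects next,
# e.g. street names looked up for the ZIP code the caller just provided.
speech_context = speech.SpeechContext(
    phrases=["Maple Avenue", "Elm Street", "Oakwood Drive"],  # assumed example values
    boost=15.0,  # optional weighting; supported ranges vary by vendor and API version
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,  # typical telephony audio
    language_code="en-US",
    speech_contexts=[speech_context],
)

# caller_response.wav is a placeholder for the captured utterance
with open("caller_response.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```

Because the hints are just part of the request, they can be swapped out on every dialog turn, which is what makes the step-by-step address flow described above possible.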
What does that mean for CAI?
- Parts of flows that rely on general English and other prominent languages – not including names of any kind – can be supported very well.
- Use cases involving domain-specific language work best with STT solutions that can train a specific model and often struggle otherwise – e.g. supporting healthcare-specific language for areas like dermatology or transcribing doctors’ notes.
- Transcribing names is challenging for uncommon names and works best when the virtual agent knows upfront what name to expect. Often this is possible with a flow that uses the inbound phone number, asks for an identifying number, or uses a similar mechanism to look up what the user might say, and then steers STT to have a better chance of properly transcribing the name (see the sketch below).
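A sketch of that lookup-then-steer pattern, assuming a CRM lookup by inbound phone number is available – the lookup table, phone number, names, and function names here are hypothetical placeholders – and reusing the same phrase-hint mechanism shown above:

```python
from google.cloud import speech


def expected_names_for_caller(phone_number: str) -> list[str]:
    """Hypothetical CRM lookup: names on file for this caller's number."""
    crm = {"+14155550123": ["Alex Schmidt", "Priya Venkataraman"]}
    return crm.get(phone_number, [])


def transcribe_name(audio_bytes: bytes, phone_number: str) -> str:
    """Transcribe a spoken name, biased toward the names we expect for this caller."""
    hints = expected_names_for_caller(phone_number)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
        speech_contexts=[speech.SpeechContext(phrases=hints)] if hints else [],
    )
    client = speech.SpeechClient()
    response = client.recognize(
        config=config,
        audio=speech.RecognitionAudio(content=audio_bytes),
    )
    return response.results[0].alternatives[0].transcript if response.results else ""
```

If the lookup returns nothing, the request simply falls back to unsteered recognition – which is exactly the situation described next.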
What does not work well is a flow that asks for a name the system does not expect. Consequently, this should be avoided. Unfortunately, that also means getting a new address or email address into a system over the phone remains a challenge. What can be done about that?
Many times this will only work if users spell the name, possibly over multiple attempts. Often the call needs to be transferred to a human agent. A better alternative is a multi-modal flow – either switching to chat or gathering the input via text message, chat, or a form.
Despite the remaining challenges, let us not forget that STT is now often better than the average human at understanding spoken language!
On top of that, there is a lot of ongoing research. Promising areas include:
- Isolating the target speaker’s speech from background noise and other speakers. This promises to push STT accuracy in noisy environments up significantly, and vendors have started to roll out this capability.
- Finding methods to produce very good STT models for languages where less transcribed audio (“labeled data”) is available. That is key to covering the diversity of languages globally.
- Building models that can cope with more than one language. This will be key for regions where speakers mix languages – but it will also help in an increasingly global world, where every additional supported language helps provide a better experience.