Speech to Text (STT), or Automatic Speech Recognition (ASR), is needed for virtual agents or voicebots to understand voice input from users. In recent years, machine learning based approaches have replaced older ways of transcribing speech, though significant differences between products and solutions remain. These days STT can be as good as or better than humans at understanding spoken language. Yet many virtual assistant and virtual agent interactions via voice continue to disappoint. Why is that?
Key aspects and challenges of using STT for Conversational AI (CAI) are:
- Audio quality: Input ranges from high-quality audio captured by a smart speaker like Amazon Echo or by a smartphone app, down to lower phone-call quality with 8 kHz sampling, and to noisy input in a kiosk setting or from a person speaking in a car. Lower audio quality still has a big impact on transcription accuracy.
- General English vs. specific language and names: Leading STT products are very good at transcribing general English and other prominent languages – potentially beating humans at it. Domain-specific language and names remain an ongoing challenge, though – as they do for humans. On a personal note: even native German speakers frequently do not spell (transcribe) my last name correctly.
- Whether specific audio models can – or must – be trained: Some products allow training custom audio models, while other products only produce good results with a trained model – additional effort to build and maintain.
- Whether STT can be steered dynamically to better transcribe specific responses: At times a virtual agent will know what the user might say – e.g. the answer to a security question. Most STT products allow steering speech recognition – from dynamic speech grammars to providing hints. This can go as far as a flow that collects the parts of a US address step by step, starting with the town/ZIP code and then steering STT toward the relevant street names (see the sketch after this list).
- Other relevant capabilities include number recognition and transcription (123 instead of “one two three”) and a built-in capability for users to spell names – key when STT fails to transcribe a name directly.
- Other challenges include transcribing audio from different locales, from speakers with an accent, and providing an accurate transcription when multiple languages are mixed – often a topic in Asian countries but also common among non-native speakers.
- Understanding when a user is finished speaking: That is key because a response needs to be fast – yet a user will be annoyed if input given after a short pause is ignored.
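To make the steering point more concrete, here is a minimal sketch using the Google Cloud Speech-to-Text Python client as one example – most vendors offer a similar phrase-hint or grammar mechanism. The street names, file name, and boost value are illustrative assumptions, not taken from any specific product flow.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Bias recognition toward the answers the virtual agent expects next,
# e.g. street names looked up for the ZIP code the caller just provided.
speech_context = speech.SpeechContext(
    phrases=["Maple Avenue", "Elm Street", "Oakwood Drive"],  # assumed example values
    boost=15.0,  # optional weighting; supported ranges vary by vendor and API version
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,  # typical telephony audio
    language_code="en-US",
    speech_contexts=[speech_context],
)

# caller_response.wav is a placeholder for the captured utterance
with open("caller_response.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```

Because the hints are just part of the request, they can be swapped out on every dialog turn, which is what makes the step-by-step address flow described above possible.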
What does that mean for CAI?
- Parts of flows that rely on general English and other prominent languages – not including names of any kind – can be supported very well.
- Use cases involving domain-specific language work best with STT solutions that can train a specific model and often struggle otherwise – e.g. supporting healthcare-specific language for areas like dermatology or transcribing doctors’ notes.
- Transcribing names is challenging for uncommon names and works best when the virtual agent knows upfront what name to expect. Often this is possible with a flow that uses the inbound phone number, asks for an identifying number, or uses a similar mechanism to look up what the user might say, and then steers STT to have a better chance of properly transcribing the name (see the sketch below).
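A sketch of that lookup-then-steer pattern, assuming a CRM lookup by inbound phone number is available – the lookup table, phone number, names, and function names here are hypothetical placeholders – and reusing the same phrase-hint mechanism shown above:

```python
from google.cloud import speech


def expected_names_for_caller(phone_number: str) -> list[str]:
    """Hypothetical CRM lookup: names on file for this caller's number."""
    crm = {"+14155550123": ["Alex Schmidt", "Priya Venkataraman"]}
    return crm.get(phone_number, [])


def transcribe_name(audio_bytes: bytes, phone_number: str) -> str:
    """Transcribe a spoken name, biased toward the names we expect for this caller."""
    hints = expected_names_for_caller(phone_number)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
        speech_contexts=[speech.SpeechContext(phrases=hints)] if hints else [],
    )
    client = speech.SpeechClient()
    response = client.recognize(
        config=config,
        audio=speech.RecognitionAudio(content=audio_bytes),
    )
    return response.results[0].alternatives[0].transcript if response.results else ""
```

If the lookup returns nothing, the request simply falls back to unsteered recognition – which is exactly the situation described next.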
What does not work well is a flow that asks for a name the system does not expect. Consequently, this should be avoided. Unfortunately, that also means getting a new address or email address into a system over the phone remains a challenge. What can be done about that?
Many times this will only work if users spell the name, possibly over multiple attempts. Often the call needs to be transferred to a human agent. A better alternative is a multi-modal flow – either switching to chat or gathering the input via text message, chat, or a form.
Despite the remaining challenges, let us not forget that STT is now often better than the average human at understanding spoken language!
On top of that, there is a lot of ongoing research. Promising areas include:
- Isolating the target speaker’s speech from background noise and other speakers. This promises to push STT accuracy in noisy environments up significantly, and vendors have started to roll out this capability.
- Finding methods to produce very good STT models for languages where less transcribed audio (“labeled data”) is available. That is key to covering the diversity of languages globally.
- Building models that can cope with more than one language. This will be key for regions where speakers mix languages – but it will also help in an increasingly global world, where every additional supported language helps provide a better experience.