Automatic Speech Recognition on the Edge - A chimera or a reality?
Automatic Speech Recognition (ASR) is knocking at the door of industrial applications. While Alexa and similar smart speakers have captivated us with consumer-grade ASR for over four years, industrial applications have different requirements for scalable and secure deployment models, and accelerated adoption depends on meeting them. This is especially true in healthcare, where privacy is paramount: ASR adoption is a non-starter unless secure edge inference is possible in an affordable and robust manner.
ASR, NLU, and NLP technologies are changing the UI game: speech is poised to replace many of the interfaces through which humans interact with machines today. The obliteration of touch screens and buttons is going to be a seismic shift in the industrial as well as consumer electronics landscape well before 2025. But one thing holds the key to this tectonic shift: the ability to run inference on the edge, in real time, on extremely small form factors (e.g., a single-core MCU with 1 GB of memory).
When we at Biocliq interviewed our clients about ASR applications for medical instruments and clinical settings, they were unanimous that the response time for voice commands should be under 200 ms, and that cloud connectivity may not be available at all times, especially when instruments are deployed in environments like operating theaters or remote clinics. Most instrument makers wanted a roadmap toward ASR running on small-form-factor hardware embedded in the devices themselves. They also expected a strong need for non-English ASR models in clinical settings.
This means the models must be a few MB in size and must have a very low WER (Word Error Rate), which is extremely challenging for researchers. Examples of cutting-edge work in this area include MatchboxNet (typically for wake-word or command recognition) and NVIDIA's Jarvis and NeMo ASR libraries. Mozilla's DeepSpeech has released an English model that is just under 50 MB in size and returns a reasonably low WER while running on devices like the Raspberry Pi. The real challenge is how these can be trained on domain-specific vocabulary as well as non-English languages. Academic institutions such as IIT Madras are working on training Mozilla DeepSpeech in Indic languages like Tamil and Hindi. For all the excitement around this research, the road to a reliable deployment of ASR on the edge looks to be a tough one.
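Since WER is the metric that will make or break these small models, it is worth spelling out how it is computed: it is the word-level edit distance (substitutions, insertions, deletions) between the recognizer's hypothesis and the reference transcript, divided by the number of reference words. Below is a minimal sketch in Python; the example command phrase is an illustrative assumption, not data from any real evaluation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# A recognizer dropping one word of a 4-word clinical command costs 25% WER.
print(wer("start the infusion pump", "start infusion pump"))  # 0.25
```

Note how unforgiving the metric is for short voice commands: a single missed word in a four-word utterance already means 25% WER, which is why command-style edge ASR needs far lower error rates than dictation-style transcription can tolerate.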
But technologists have always found a way to get what they wanted, haven't they? If you are involved in ASR research or applications, please join hands with us to push the frontiers of this domain.