10 Factors to Consider When Evaluating Speech-To-Text Engines
Photo by Vladislav Klapin on Unsplash

Are you evaluating adding speech-to-text, also known as automatic speech recognition (ASR), to your existing products? The ASR buying decision is quite complex because the outputs are generally visible to everyone. Errors make the news, but when it works well the reaction is often just a 'meh'.

A well-designed and well-implemented ASR is clearly a valuable addition to many products: WebRTC-based products, communication & collaboration tools, meetings, events, and webinar platforms, healthcare, workplace management, revenue intelligence, contact centers, and more.

Choosing a speech-to-text engine is a critical decision for your business because of the complexity of human speech, the variety of accents, the infrastructure involved, and how well acoustic models adapt. With so many different engines to choose from, it can be difficult to know where to start. In this blog post, I will outline 10 factors you should consider when choosing a speech-to-text engine. By considering these factors, you will be able to make an informed decision that meets your needs and hopefully exceeds your end users' expectations.

(1) Use case and requirement clarity: It is very important to clearly define the use case and requirements. Most ASRs offer a cloud API, but is on-prem or on-device deployment a critical requirement? Finalize whether this is an IVR scenario, telephony-based conversations, or a recorded batch-mode use case, the expected length of the conversations (especially for recordings), and any other aspects that are critical.

(2) Core feature set included: Most ASRs go beyond offering a simple speech-to-text API. The most commonly offered key features, which any candidate should include, are: speaker separation (a.k.a. speaker diarization or speaker labels), automatic punctuation & casing, custom vocabulary support, languages & accents supported, speech analytics (such as words per minute, listen/talk ratio, etc.), formatting (markdown, paragraph, etc.), and many more. There are roughly 75 features available in the market (I will write another blog about that), but most vendors offer only about 15 to 25 at best.
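To make the speech analytics features above concrete, here is a minimal Python sketch of how words per minute and the listen/talk ratio could be derived from diarized output. The `Segment` structure and both function names are hypothetical and do not correspond to any particular vendor's API; real engines return their own response formats.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized utterance (hypothetical shape, not a vendor format)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def words_per_minute(segments, speaker):
    """Words spoken by `speaker` divided by that speaker's talk time in minutes."""
    own = [s for s in segments if s.speaker == speaker]
    talk_seconds = sum(s.end - s.start for s in own)
    if talk_seconds == 0:
        return 0.0
    words = sum(len(s.text.split()) for s in own)
    return words / (talk_seconds / 60)


def listen_talk_ratio(segments, speaker):
    """Time `speaker` spent listening to others divided by time spent talking."""
    talk = sum(s.end - s.start for s in segments if s.speaker == speaker)
    listen = sum(s.end - s.start for s in segments if s.speaker != speaker)
    if talk == 0:
        return float("inf")
    return listen / talk


# Illustrative data only: an agent talks for 30 s, then a customer for 60 s.
call = [
    Segment("agent", 0.0, 30.0, " ".join(["word"] * 60)),
    Segment("customer", 30.0, 90.0, " ".join(["word"] * 90)),
]
agent_wpm = words_per_minute(call, "agent")
agent_ratio = listen_talk_ratio(call, "agent")
```

Both metrics depend entirely on the quality of speaker separation, which is one reason to evaluate diarization and analytics together.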

(3) Privacy & Security: Who owns the data and where is it stored? It is important to understand how the API uses your data. Do they keep a copy of your audio/video files and transcriptions to improve their API? Do they give you the ability to permanently delete the audio/video files you send to the API and the transcript that is generated?

(4) Enterprise readiness: There are plenty of amateur-built, low-quality, and unsafe speech-to-text variants online, so it is important to make sure that the speech-to-text platform is built for corporate content & scale. A lot of vendors just use Kaldi or an equivalent open-source ASR and train on whatever labeled audio/video data is available, which doesn't work in enterprise scenarios.

(5) Integrations & QA testing can be complex and costly, especially if you have real-time use cases, high-concurrency use cases, or use additional media mixing or broadcasting infrastructure. An important factor here is the availability of, and access to, SDKs, community Slack channels, developer documentation, and support.

(6) Accuracy - Establish target & baseline accuracy: Correct benchmarking methodologies require time & effort. Machine-generated speech-to-text output is compared against a human-generated transcript (called "Ground Truth") of the same test audio/video content. Word Error Rate (WER) generally provides a good idea of the accuracy of the machine-generated transcripts. WER is one of the metrics used to determine the performance of a speech recognition system, calculated as the number of errors (insertions + substitutions + deletions) divided by the total number of words in the Ground Truth transcript. The lower the WER, the more closely the transcript matches the original audio/video content and the more accurate it is.
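The WER calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a benchmarking tool: it assumes simple lowercasing and whitespace tokenization, whereas a real evaluation would also normalize punctuation, numbers, and spelling variants before comparing. The function name `word_error_rate` is my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference.

    The error count is the word-level edit (Levenshtein) distance between
    the Ground Truth transcript and the machine-generated transcript.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps")
```

Because insertions count as errors, WER can exceed 1.0 when the machine output contains many extra words, so it is better read as an error ratio than as a percentage of words that were wrong.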

(7) Cost is always a factor to consider when making any purchase, and speech-to-text technology is no different. You will want to compare the cost of the various vendors you are considering. In addition to the initial cost, you should also consider the recurring costs, such as monthly or annual fees. Most ASR providers (if they are not reselling) should be able to offer a true per-second pricing option if they have the ability to detect silence. Per-minute x per-channel models tend to get expensive in the long run and such options should be avoided if possible or negotiated early on.
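To see why the pricing model matters, the two billing approaches can be compared with a small Python sketch. Everything here is an assumption for illustration: the rate, the call length, the amount of detected speech, and the rule that per-minute billing rounds each call up to whole minutes are all hypothetical, so substitute the terms from the actual vendor quotes.

```python
import math


def per_second_cost(speech_seconds: float, rate_per_minute: float) -> float:
    """Bill only the detected speech, prorated by the second."""
    return speech_seconds * rate_per_minute / 60


def per_minute_per_channel_cost(
    duration_seconds: float, channels: int, rate_per_minute: float
) -> float:
    """Bill every channel for the full call, rounded up to whole minutes.

    Rounding up is an assumption; check each vendor's billing increments.
    """
    billed_minutes = math.ceil(duration_seconds / 60)
    return billed_minutes * channels * rate_per_minute


# Hypothetical call: 605 s long, 2 channels, 400 s of actual speech in total,
# and an invented rate of 0.03 currency units per minute.
RATE = 0.03
cost_per_second_model = per_second_cost(400, RATE)
cost_per_minute_model = per_minute_per_channel_cost(605, 2, RATE)
```

Running the same comparison over a representative sample of your own recordings, with real silence ratios and channel counts, gives a far better basis for negotiation than list prices alone.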

(8) Support & Service levels should also be considered when choosing a speech-to-text vendor. Make sure to find out what kind of support is available and what the response times are like.

(9) Advanced ML features are another important factor to consider. Speech-to-text technology has come a long way in recent years, and there are now many different features available. You will want to make sure that the vendor you choose offers the advanced ML features that you need such as Topic detection, PII Identification or redaction, Profanity filtering, Editability & Machine Learning from transcription edits, Abstractive or extractive summaries, or any domain-specific classification, and more.

(10) Finally, ease of working together is an important factor to consider. You will want to make sure that the speech-to-text technology you choose is easy to use and implement, and that you have enough influence to get support, collaboration, and a solid relationship with the provider's customer success, API support, account, and DevRel teams.

With so many options available, it can be difficult to know where to start. By taking these factors into consideration, you will be able to make a decision that is right for you and your business. Speech-to-text is a critical technology for many businesses, and making the right decision is essential. I hope this blog post has been helpful in your search for the perfect speech-to-text engine. Thanks for reading!

Do you have any other factors you would add to this list? Let me know in the comments below!
