10 Factors to Consider When Evaluating Speech-To-Text Engines

Vikram Modgil

CX Product Growth Acceleration at Amazon Connect | AWS Solutions | Mentor, Advisor, 7x Startups | All views & opinions my own

Published: April 19, 2022

Are you evaluating the addition of speech-to-text (also known as automatic speech recognition, or ASR) to your existing products? The ASR buying decision is quite complex because the output is generally visible to everyone. Errors make the news, but when it works well the reaction may be no more than a 'meh'.

A well-designed and well-implemented ASR is clearly a valuable addition to many WebRTC-based products, communication & collaboration products, meetings, events, and webinar products, healthcare, workplace management, revenue intelligence, contact center, and more.

Speech-to-text is hard to get right: human speech is complex, accents vary, infrastructure differs, and acoustic models do not adapt equally well to every environment. That makes choosing the right engine critical for your business. With so many different engines to choose from, it can be difficult to know where to start. In this blog post, I will outline 10 factors you should consider when choosing a speech-to-text engine. By considering these factors, you will be able to make an informed decision that meets your needs and hopefully exceeds your end users' expectations.

(1) Use case and requirement clarity: It is very important to clearly define the use case and requirements up front. Most ASRs offer a cloud API, but is on-prem or on-device deployment a critical requirement? Is it an IVR scenario, telephony-based conversations, or a recorded batch-mode use case? How long are the conversations, especially for recordings? These and any other critical aspects need to be finalized before you compare vendors.

(2) Core feature set included: Most ASRs go beyond offering a simple speech-to-text API. The key features to look for include: speaker separation (a.k.a. speaker diarization or speaker labels), automatic punctuation & casing, custom vocabulary support, languages & accents supported, speech analytics (such as words per minute, listen/talk ratio, etc.), formatting (markdown, paragraph, etc.), and many more. There are about 75 features available in the market (I will write another blog about that), but most vendors offer only about 15 to 25 at best.
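To make the speech analytics examples concrete, here is a minimal sketch of how words per minute and a talk/listen ratio could be derived from speaker-labeled output. The `Segment` structure, the function names, and the sample conversation are all hypothetical; real engines return diarized results in their own formats, so treat this as an illustration rather than any vendor's API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One speaker-labeled chunk of a transcript (hypothetical format)."""
    speaker: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def talk_time(segments: List[Segment]) -> Dict[str, float]:
    """Total speaking time in seconds for each speaker."""
    totals: Dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


def words_per_minute(segments: List[Segment], speaker: str) -> float:
    """Words spoken by `speaker` divided by that speaker's talk time in minutes."""
    words = sum(len(s.text.split()) for s in segments if s.speaker == speaker)
    seconds = sum(s.end - s.start for s in segments if s.speaker == speaker)
    return words / (seconds / 60) if seconds > 0 else 0.0


def talk_listen_ratio(segments: List[Segment], speaker: str) -> float:
    """Talk time of `speaker` divided by the talk time of everyone else."""
    totals = talk_time(segments)
    own = totals.get(speaker, 0.0)
    others = sum(t for name, t in totals.items() if name != speaker)
    return own / others if others > 0 else float("inf")


# Made-up conversation, for illustration only.
call = [
    Segment("agent", 0.0, 6.0, "thanks for calling how can I help you today"),
    Segment("customer", 6.0, 21.0,
            "I was charged twice on my last invoice and need a refund"),
    Segment("agent", 21.0, 27.0, "I am sorry about that let me check"),
]

print(f"agent WPM: {words_per_minute(call, 'agent'):.1f}")
print(f"agent talk/listen ratio: {talk_listen_ratio(call, 'agent'):.2f}")
```

Metrics like these depend on accurate speaker separation and timestamps, which is one reason to test those core features before relying on the analytics built on top of them.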

(3) Privacy & Security: Who owns the data and where is it stored? It is important to understand how the API uses your data. Do they keep a copy of your audio/video files and transcriptions to improve their API? Do they give you the ability to permanently delete the audio/video files you send to the API and the transcript that is generated?

(4) Enterprise readiness: There are plenty of amateur-built, low-quality, and unsafe speech-to-text variants online, so it is important to make sure that the speech-to-text platform is built for corporate content and scale. A lot of vendors just take Kaldi or an equivalent open-source ASR and train it on whatever audio/video data they can find and label, which doesn't work in enterprise scenarios.

(5) Integrations & QA testing can be complex and costly, especially if you have real-time use cases, high-concurrency use cases, or use additional media mixing or broadcasting infrastructure. An important factor here is the availability of, and access to, SDKs, community Slack channels, developer documentation, and support.

(6) Accuracy: Establish a target and a baseline accuracy. Correct benchmarking methodologies require time and effort. Machine-generated speech-to-text output is compared against a human-generated transcript (called "Ground Truth") of the same test audio/video content. Word Error Rate (WER) generally provides a good idea of the accuracy of machine-generated transcripts. WER is one of the metrics used to determine the performance of a speech recognition system. It is calculated as the number of errors (insertions + substitutions + deletions) divided by the total number of words in the ground-truth transcript. The lower the WER, the more closely the transcript matches the original audio/video content.
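The WER calculation described above can be sketched in a few lines. This is a minimal illustration that uses a word-level edit distance, divides by the number of ground-truth words, and normalizes only by lowercasing. The function name and the sample sentences are made up for the example, and a real benchmark would usually also normalize punctuation, numbers, and spelling variants before scoring.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one deleted word ("the") and one substitution
# ("department" -> "apartment") against a 7-word ground truth.
ground_truth = "please transfer me to the billing department"
machine_output = "please transfer me to billing apartment"
print(f"WER: {word_error_rate(ground_truth, machine_output):.3f}")  # WER: 0.286
```

Because insertions count as errors, WER can exceed 1.0 when the engine outputs far more words than were actually spoken.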

(7) Cost is always a factor to consider when making any purchase, and speech-to-text technology is no different. You will want to compare the cost of the various vendors you are considering. In addition to the initial cost, you should also consider the recurring costs, such as monthly or annual fees. Most ASR providers (if they are not reselling) should be able to offer a true per-second pricing option if they have the ability to detect silence. Per-minute x per-channel models tend to get expensive in the long run and such options should be avoided if possible or negotiated early on.
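A quick back-of-the-envelope comparison shows why the pricing model matters. The sketch below assumes a hypothetical rate of $0.024 per minute, assumes that the per-minute model rounds each call up to a whole minute, and assumes that the per-second model bills only detected speech. None of these numbers come from a real vendor, so substitute the terms from your own quotes.

```python
import math


def per_second_cost(speech_seconds: float, rate_per_minute: float) -> float:
    """Bill only the seconds of detected speech (silence is not charged)."""
    return speech_seconds * rate_per_minute / 60


def per_minute_per_channel_cost(
    call_seconds: float, channels: int, rate_per_minute: float
) -> float:
    """Bill the full call length, rounded up to whole minutes, on every channel."""
    billed_minutes = math.ceil(call_seconds / 60)
    return billed_minutes * channels * rate_per_minute


# Hypothetical call: 3 min 20 s long, two channels (agent and customer),
# each channel containing 80 seconds of actual speech.
RATE = 0.024             # assumed price in dollars per minute
call_seconds = 200
channels = 2
speech_seconds = 80 * channels

a = per_second_cost(speech_seconds, RATE)
b = per_minute_per_channel_cost(call_seconds, channels, RATE)
print(f"per-second, speech only: ${a:.3f}")
print(f"per-minute x per-channel: ${b:.3f}")
```

Under these assumptions the per-minute x per-channel bill comes to three times the per-second bill for the same call, and the gap grows with the share of silence and the number of channels.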

(8) Support & Service levels should also be considered when choosing a speech-to-text vendor. Make sure to find out what kind of support is available and what the response times are like.

(9) Advanced ML features are another important factor to consider. Speech-to-text technology has come a long way in recent years, and there are now many different features available. You will want to make sure that the vendor you choose offers the advanced ML features that you need such as Topic detection, PII Identification or redaction, Profanity filtering, Editability & Machine Learning from transcription edits, Abstractive or extractive summaries, or any domain-specific classification, and more.

(10) Finally, Ease of working is an important factor to consider. You will want to make sure that the speech-to-text technology you choose is easy to use and implement and that you have the right level of influence in getting support, collaboration, and a solid relationship with the provider's customer success, API support, Account, and Dev-Rel teams.

With so many options available, it can be difficult to know where to start. By taking these factors into consideration, you will be able to make a decision that is right for you and your business. Speech-to-text is a critical technology for many businesses, and making the right decision is essential. I hope this blog post has been helpful in your search for the right speech-to-text engine. Thanks for reading!

Do you have any other factors you would add to this list? Let me know in the comments below!

eDist Canada

2 years ago

I would encourage engaging Nuance Products.

