Facebook AI: The First End-to-End Many-to-One Multilingual Model for Spoken Language Translation

Speech-to-text translation (ST) has attracted a lot of attention recently, but progress in the field is still held back by a lack of labeled data. This shortage has become a major obstacle to closing the performance gap between end-to-end models and cascaded systems.

Several datasets have been developed in recent years, but each is limited in some way: some cover only English language pairs, while others are restricted to specific domains, very low-resource languages, or a small set of language pairs. These restrictions limit the scope of research in the field.

A Diverse Multilingual Speech-To-Text Translation Corpus

The Facebook AI research team recently released CoVoST, a diverse multilingual speech-to-text translation dataset built on Common Voice (2019-06-12 release). The dataset includes speech in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese), along with transcripts and English translations.

The research team also provides an additional out-of-domain evaluation set from Tatoeba for 5 languages (French, German, Dutch, Russian and Spanish) under CC licenses. CoVoST itself is released under a CC0 license and is free for anyone to use.
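To make the "many-to-one" layout concrete, here is a minimal Python sketch of how one example from such a corpus could be represented. The language lists come from the description above; the class name, field names, and helper function are hypothetical and chosen only for illustration. They are not the actual CoVoST file format or API.

```python
from dataclasses import dataclass

# Source languages as listed in the article; every pair targets English.
COVOST_SOURCE_LANGUAGES = [
    "French", "German", "Dutch", "Russian", "Spanish", "Italian",
    "Turkish", "Persian", "Swedish", "Mongolian", "Chinese",
]

# Languages that also have an out-of-domain Tatoeba evaluation set.
TATOEBA_EVAL_LANGUAGES = ["French", "German", "Dutch", "Russian", "Spanish"]

TARGET_LANGUAGE = "English"


@dataclass(frozen=True)
class SpeechTranslationExample:
    """Hypothetical record layout, not the real CoVoST schema."""
    audio_path: str        # path to the audio clip (assumed field)
    source_language: str   # one of COVOST_SOURCE_LANGUAGES
    transcript: str        # text in the source language
    translation: str       # text in TARGET_LANGUAGE


def has_out_of_domain_eval(language: str) -> bool:
    """Return True if the language has a Tatoeba evaluation set."""
    return language in TATOEBA_EVAL_LANGUAGES
```

In this layout, "many-to-one" simply means that `source_language` varies across examples while the translation is always English.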

Potential Uses and Effects

As the growth of the internet connects the world, translation services are more crucial than ever, and speech-to-text translation systems need to be able to handle multiple languages. CoVoST is a many-to-one multilingual ST corpus. It is broadly similar to, and concurrent with, a corpus released by Iranzo-Sanchez and his team, who created a multilingual ST corpus from the European Parliament proceedings.

However, CoVoST offers longer speech durations, more translation tokens, and greater diversity. It has around 27 hours of Russian speech, 37 hours of Italian speech and 67 hours of Persian speech, which are 1.8 times, 2.5 times and 13.3 times the size of the previous largest public corpus (Black, 2019). Most of the sentences (transcripts) in CoVoST are covered by multiple speakers with potentially different accents, resulting in a rich diversity of speech. For example, there are over 1,000 speakers and over 10 accents in the French and German development/test sets. This enables good coverage of speech variations in both model training and evaluation.
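The size comparison above can be turned into a quick back-of-the-envelope calculation. The sketch below divides each quoted CoVoST duration by its quoted ratio to estimate the size of the previous largest public corpus. The inputs are the rounded figures from the text ("around 27 hours" and so on), so the outputs are rough estimates, not numbers reported by the authors; the variable and function names are illustrative.

```python
# Approximate CoVoST speech durations (hours), as quoted in the article.
covost_hours = {"Russian": 27, "Italian": 37, "Persian": 67}

# Quoted ratio of CoVoST size to the previous largest public corpus.
size_ratio = {"Russian": 1.8, "Italian": 2.5, "Persian": 13.3}


def implied_previous_hours(hours, ratios):
    """Estimate the earlier corpus size as hours / ratio, rounded to 0.1 h."""
    return {lang: round(hours[lang] / ratios[lang], 1) for lang in hours}


implied = implied_previous_hours(covost_hours, size_ratio)
for lang, prev in implied.items():
    print(f"{lang}: about {prev} hours in the previous largest corpus")
```

Under these rounded inputs, the estimates come out to roughly 15 hours for Russian, 15 hours for Italian and 5 hours for Persian, which shows why the Persian portion stands out as the largest relative gain.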

All the data can be accessed on GitHub.

Read more: A Diverse Multilingual Speech-To-Text Translation Corpus


