Facebook AI: The First End-to-End Many-to-One Multilingual Model for Spoken Language Translation
Speech-to-text translation (ST) has attracted a lot of attention recently, but advances in the field are still held back by a lack of labeled data. This shortage has become a major obstacle to closing the performance gap between end-to-end models and cascaded systems.
Several datasets have been developed in recent years, but some are restricted to English-only language pairs, specific domains, very low-resource languages, or a limited set of language pairs, which limits the scope of studies in the field.
A Diverse Multilingual Speech-To-Text Translation Corpus
The Facebook AI research team recently released CoVoST, a diverse multilingual speech-to-text translation dataset. CoVoST is built on Common Voice (2019-06-12 release). The dataset includes speech in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese), along with transcripts and English translations.
The research team also provides an additional out-of-domain evaluation set from Tatoeba for 5 languages (French, German, Dutch, Russian and Spanish) under CC licenses. CoVoST is released under the CC0 license and is free for anyone to use.
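To make the "many-to-one" layout concrete, here is a minimal Python sketch of how one example from such a corpus could be represented. The `STExample` record, its field names and the two-letter language codes are assumptions chosen for illustration; they are not CoVoST's actual file schema.

```python
from dataclasses import dataclass

# Two-letter codes picked for illustration (an assumption, not the
# identifiers used in the released files).
COVOST_SOURCE_LANGS = {
    "fr", "de", "nl", "ru", "es", "it", "tr", "fa", "sv", "mn", "zh",
}

# The five languages that also have a Tatoeba out-of-domain evaluation set.
TATOEBA_EVAL_LANGS = {"fr", "de", "nl", "ru", "es"}


@dataclass(frozen=True)
class STExample:
    """Hypothetical record: one spoken sentence and its English translation."""
    audio_path: str      # path to the speech clip
    src_lang: str        # source language code, e.g. "fr"
    transcript: str      # transcript in the source language
    translation_en: str  # English translation (the single target language)


def has_tatoeba_eval(lang: str) -> bool:
    """Return True if the language has the extra Tatoeba evaluation set."""
    return lang in TATOEBA_EVAL_LANGS


# Made-up example; the path and sentences are placeholders.
example = STExample(
    audio_path="clips/example_0001.mp3",
    src_lang="fr",
    transcript="Bonjour tout le monde",
    translation_en="Hello everyone",
)
```

Every record carries a different source language but the same target language, which is what "many-to-one" means here.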
Potential Uses and Effects
As the growth of the internet connects the world, translation services are more crucial than ever, and speech-to-text translation systems need to handle multiple languages. CoVoST is a many-to-one multilingual ST corpus. It is broadly similar to, and concurrent with, a corpus released by Iranzo-Sanchez and his team, who created a multilingual ST corpus from European Parliament proceedings.
However, CoVoST offers longer speech durations and more translation tokens, and it is more diverse. It has around 27 hours of Russian speech, 37 hours of Italian speech and 67 hours of Persian speech, which is 1.8 times, 2.5 times and 13.3 times the size of the previous largest public corpus, respectively (Black, 2019). Most of the sentences (transcripts) in CoVoST are read by multiple speakers with potentially different accents, resulting in rich diversity in the speech. For example, there are over 1,000 speakers and over 10 accents in the French and German development/test sets. This enables good coverage of speech variation in both model training and evaluation.
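As a rough illustration of what "multiple speakers per sentence" means in practice, the sketch below counts distinct speakers and accents for each sentence. The rows are made up, and the `(sentence, speaker_id, accent)` layout is an assumption for illustration rather than the corpus's real metadata format.

```python
from collections import defaultdict

# Made-up metadata rows: (sentence, speaker_id, accent).
rows = [
    ("Bonjour tout le monde", "spk1", "france"),
    ("Bonjour tout le monde", "spk2", "canada"),
    ("Bonjour tout le monde", "spk2", "canada"),  # same speaker, second clip
    ("Merci beaucoup", "spk3", "france"),
]


def diversity_by_sentence(rows):
    """Map each sentence to (number of distinct speakers, number of distinct accents)."""
    speakers = defaultdict(set)
    accents = defaultdict(set)
    for sentence, speaker, accent in rows:
        speakers[sentence].add(speaker)
        if accent:  # skip rows with no accent label
            accents[sentence].add(accent)
    return {s: (len(speakers[s]), len(accents[s])) for s in speakers}


stats = diversity_by_sentence(rows)
for sentence, (n_speakers, n_accents) in stats.items():
    print(f"{sentence!r}: {n_speakers} speaker(s), {n_accents} accent(s)")
```

Sets are used so that repeated clips from the same speaker are counted once, which matches the idea of coverage by distinct voices rather than by clip count.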
All the data can be accessed on GitHub.
Read more: A Diverse Multilingual Speech-To-Text Translation Corpus