Tackling Medical Misinformation using Machine Translation: NLLB-200, Why should we care, Part 2

In an era of viral vitriol, superstition, and conspiracy theories about vaccines and epidemics, “freedom of speech” on digital platforms has amplified the influence of misinformation, sometimes with fatal real-life consequences for vulnerable populations that have little access to credible health information.

In Part 1, I discussed the looming decline of minority languages and highlighted how high-quality machine translation can reverse that trend. Part 2 focuses on healthcare, showing how models like Meta’s recently open-sourced NLLB-200, a single multilingual model that translates between 200 languages, 55 of them African (including Yoruba, my mother tongue), can serve as a powerful public health tool.

As highlighted in the NLLB paper, during the Covid-19 outbreak, seniors in communities where science-backed information was sparse, owing to a lack of trustworthy formal institutions, depended on their more tech-savvy networks and family members for timely, translated health information from international organizations. Democratizing high-quality machine translation could close this gap, reducing the dependence on intermediaries and opening up quicker access to credible healthcare information.

While machine translation primarily helps those from more advantaged backgrounds learn new languages or travel more effectively, its presence in low-resource language communities could be instrumental for social mobility, economic survival, and even longevity. Access to credible healthcare information is another line of defence against egregious superstitions about the origins, etiology, and pathogenesis of well-understood diseases. Vulnerable populations can independently verify diagnostic, therapeutic, and preventative claims that directly impact their quality of life and health outcomes.

The chances are pretty low that scientific breakthroughs, clinical trials, drug discoveries, and systematic reviews will be published in minority languages in any timely manner, if at all. This creates a massive information vacuum, with nearly half the world cut off from information that could impact their longevity. Thankfully, broadcasters like the BBC create and publish global and local content in multiple African languages. High-quality, open-source machine translation could extend that reach to the rest of the information on the web. Powerful!!

Why NLLB Works: The Gory Technical Details

Meta’s NLLB project is one of the best examples I’ve seen of data-centric and value-based design applied to machine learning. The paper shows the team was deliberate about the quality of translations AS PERCEIVED BY native speakers, not just about pulling off another massive PR stunt. Over more than a year, the team painstakingly optimized for data quality, spending over 50% of the effort gathering the right data (37 petabytes!) to solve the problem. Here are a few highlights showing WHY their approach worked:

  • Dirty data cleaning at scale: 37 petabytes of web data, monolingual and parallel, must have been a nightmare to preprocess and clean. Storing, moving, and processing that much data across clusters was no mean feat. For the scale of the problem, 40,000+ translation directions, it was definitely worth it.
  • NLLB-Seed: a few thousand Wikipedia-domain sentences professionally translated into 39 very low-resource languages. This core set of parallel data was the foundation for many of the experiments and for bootstrapping models in the lowest-resource languages.
  • NLLB-MD: professional translations across four domains (news, health, scripted formal speech, and unscripted informal speech), used to check that models generalize beyond the Wikipedia domain.
  • Flores-200: an evaluation dataset covering 204 languages, building on and expanding the FLORES-101 benchmark.
  • Toxicity-200: wordlists for detecting toxic terms in 200 languages. This investment was necessary to keep the resulting model safe and sane, given the amount of vitriol in web text, and to catch translations that add toxicity not present in the source.
  • Bitext mining: finding candidate translation pairs by scouring the web for sentences in different languages that mean the same thing. But how?
  • Language identification (LID): correctly identifying the language of monolingual web text for more than 200 languages, using a dedicated LID classifier (a toy sketch follows this list).
  • LASER3: training a sentence encoder for each language (or group of similar languages) as a student of a massive multilingual teacher, anchoring the students to the teacher’s vector space by minimizing a cosine loss against the teacher’s embeddings. The result is a family of sentence encoders, covering 148 languages, for identifying aligned bitext (a toy sketch of the distillation objective follows this list).
  • Creating bitexts: using the LASER3 encoders to match sentences from monolingual corpora in different languages that land close together in the shared embedding space, yielding mined translation pairs (see the mining sketch after this list).
  • Sparsely gated mixture of experts (MoE): a type of conditional-compute model that activates only a subset of its parameters per input, as opposed to dense models that activate every parameter for every input. The feed-forward network (FFN) in the Transformer architecture is replaced with multiple expert FFNs, and a learned gating mechanism routes each token through only a few of them, with dropout-style masking used as regularization (a minimal PyTorch sketch follows this list).
  • Curriculum learning: training begins on high-resource languages, with low-resource pairs introduced at roughly 60% of training and very low-resource pairs at roughly 80%, the introduction points being chosen from the number of updates after which each translation direction starts to overfit (a simple schedule sketch follows this list).
  • Data augmentation with back-translation: translating monolingual target-language text back into the source language with an existing model to manufacture extra synthetic training pairs (a short sketch follows this list).
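
To make the LID step concrete: my understanding is that the NLLB LID model is a fasttext-style classifier, so a miniature version of the train-and-filter loop could look like the sketch below. The file name, hyperparameters, and confidence threshold are my own illustrative choices, not the team's actual setup.

    # Toy sketch: a fasttext-style language-identification classifier.
    # Training file, hyperparameters, and threshold are illustrative only.
    import fasttext

    # lid_train.txt: one sentence per line, prefixed with its language label,
    # e.g. "__label__yor <a Yoruba sentence>"
    model = fasttext.train_supervised(
        input="lid_train.txt",
        lr=0.5,
        epoch=5,
        wordNgrams=2,
    )

    # Label a candidate sentence before adding it to a monolingual corpus;
    # low-confidence predictions get discarded rather than polluting the data.
    labels, probs = model.predict("a candidate sentence scraped from the web", k=1)
    if probs[0] > 0.9:
        print("keep, predicted language:", labels[0])
    else:
        print("discard: language is uncertain")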
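
The teacher-student idea behind LASER3, reduced to its core: freeze a multilingual teacher, then train a small student encoder for the new language so that its sentence embeddings land next to the teacher's embeddings of the translations, using a cosine loss. The mean-pooling encoders below are toy stand-ins I made up; only the training objective is the point.

    # Toy sketch of LASER3-style teacher-student alignment (assumed encoders).
    import torch
    import torch.nn as nn

    class MeanPoolEncoder(nn.Module):
        """Stand-in sentence encoder: embedding lookup plus mean pooling."""
        def __init__(self, vocab_size, dim):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)

        def forward(self, token_ids):                  # (batch, seq_len)
            return self.emb(token_ids).mean(dim=1)     # (batch, dim)

    teacher = MeanPoolEncoder(vocab_size=32000, dim=256)  # frozen multilingual teacher
    student = MeanPoolEncoder(vocab_size=8000, dim=256)   # new low-resource language
    for p in teacher.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    cosine = nn.CosineSimilarity(dim=-1)

    def train_step(src_ids, tgt_ids):
        """src_ids: low-resource sentence; tgt_ids: its high-resource translation."""
        with torch.no_grad():
            anchor = teacher(tgt_ids)                   # fixed point in teacher space
        pred = student(src_ids)
        loss = (1.0 - cosine(pred, anchor)).mean()      # pull student toward teacher
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()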
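
Once two languages share an embedding space, mining bitext becomes a nearest-neighbour search. The real pipeline runs at web scale with FAISS indexes and margin-based scoring; this is a deliberately tiny NumPy version of the same idea, with the neighbourhood size and acceptance threshold picked arbitrarily.

    # Toy margin-based bitext mining over sentence embeddings (NumPy only).
    import numpy as np

    def normalize(mat):
        return mat / np.linalg.norm(mat, axis=1, keepdims=True)

    def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.06):
        """src_emb: (n, d) and tgt_emb: (m, d) sentence embeddings."""
        sim = normalize(src_emb) @ normalize(tgt_emb).T       # (n, m) cosine matrix
        # Average similarity to each sentence's k nearest neighbours,
        # used to discount "hub" sentences that sit close to everything.
        src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # (n,)
        tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # (m,)
        margin = sim / ((src_knn[:, None] + tgt_knn[None, :]) / 2)
        pairs = []
        for i in range(sim.shape[0]):
            j = int(margin[i].argmax())
            if margin[i, j] >= threshold:                     # keep confident matches only
                pairs.append((i, j, float(margin[i, j])))
        return pairs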
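
Here is a minimal sketch of a sparsely gated mixture-of-experts feed-forward layer with top-2 routing, in PyTorch. The sizes, expert count, and routing details are illustrative, and far smaller and simpler than NLLB-200's actual MoE configuration.

    # Toy sparsely gated mixture-of-experts FFN with top-2 token routing.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFeedForward(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, num_experts=4, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.gate = nn.Linear(d_model, num_experts)       # learned router
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):                       # x: (batch, seq, d_model)
            scores = self.gate(x)                   # (batch, seq, num_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)    # renormalise over chosen experts
            out = torch.zeros_like(x)
            # Each token passes through only its top-k experts; the rest of the
            # parameters stay inactive for that token (conditional compute).
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[..., slot] == e                  # (batch, seq) bool
                    if mask.any():
                        out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
            return out

    # Drop-in replacement for the dense FFN sublayer of a Transformer block.
    layer = MoEFeedForward()
    print(layer(torch.randn(2, 10, 512)).shape)     # torch.Size([2, 10, 512])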
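
The curriculum can be expressed as a simple schedule keyed on the training step. The buckets and thresholds below just mirror the rough percentages mentioned above; the real schedule is derived per translation direction from when it starts to overfit.

    # Toy step-based curriculum: high-resource pairs from the start,
    # low-resource pairs after ~60% of updates, very low-resource after ~80%.
    def active_buckets(step, total_steps):
        fraction = step / total_steps
        buckets = ["high_resource"]
        if fraction >= 0.6:
            buckets.append("low_resource")
        if fraction >= 0.8:
            buckets.append("very_low_resource")
        return buckets

    # At 70% of training, batches are sampled from high- and low-resource pairs only.
    print(active_buckets(step=70_000, total_steps=100_000))
    # -> ['high_resource', 'low_resource']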
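
Back-translation in one picture: pair machine-translated source text with genuine, human-written target text so the model still learns to produce clean output. The translate argument below is a hypothetical stand-in for any existing reverse-direction model, not a real NLLB API.

    # Toy back-translation for data augmentation; `translate` is a hypothetical
    # stand-in for an existing target-to-source translation model.
    def back_translate(target_monolingual, translate):
        """Build synthetic (source, target) training pairs from monolingual
        target-language sentences."""
        pairs = []
        for tgt_sentence in target_monolingual:
            synthetic_src = translate(tgt_sentence)    # model output, possibly noisy
            # Noisy synthetic source, clean human-written target.
            pairs.append((synthetic_src, tgt_sentence))
        return pairs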

Let me know in the comments if this was helpful or if I left out any important details.

The End!
