Tackling Medical Misinformation using Machine Translation: NLLB-200, Why should we care, Part 2

In an era of viral vitriol, superstition, and conspiracy theories about vaccines and epidemics, “freedom of speech” on digital platforms has amplified the influence of misinformation, sometimes with fatal real-life consequences for vulnerable populations that have little access to credible health information.

In Part 1, I discussed the looming decline of minority languages and highlighted how high-quality machine translation can reverse that trend. Part 2 focuses on healthcare, showing how models like Meta’s recently open-sourced NLLB-200, a single multilingual model that translates between 200 languages, 55 of them African (including Yoruba, my mother tongue), can serve as a powerful public health tool.

As highlighted in the NLLB paper, during the Covid-19 outbreak, seniors in communities where science-backed information was sparse, owing to a lack of trustworthy formal institutions, depended on their more tech-savvy networks and family members for timely, translated health information from international organizations. Democratizing high-quality machine translation could close this gap, reducing the dependence on intermediaries and opening up quicker access to credible healthcare information.

While machine translation primarily helps those from more advantaged backgrounds learn new languages or travel more effectively, its presence in low-resource language communities could be instrumental for social mobility, economic survival, and even longevity. Access to credible healthcare information is another line of defence against egregious superstitions about the origins, etiology, and pathogenesis of well-understood diseases. Vulnerable populations can independently verify diagnostic, therapeutic, and preventative claims that directly impact their quality of life and health outcomes.

The chances are pretty low that scientific breakthroughs, clinical trials, drug discoveries, and systematic reviews will be published in minority languages in any timely manner, if at all. This creates a massive information vacuum, with nearly half the world cut off from information that could impact their longevity. Thankfully, broadcasters like the BBC create and publish global and local content in multiple African languages. High-quality, open-source machine translation could extend that reach to the rest of the information on the web. Powerful!!

Why NLLB Works: The Gory Technical Details

Meta’s NLLB project is one of the best examples I’ve seen of data-centric and value-based design applied to machine learning. The paper shows the team was deliberate about the quality of translations AS PERCEIVED BY native speakers, not just about pulling off another massive PR stunt. Over more than a year, the team painstakingly optimized for data quality, spending over 50% of the effort gathering the right data (37 petabytes!) to solve the problem. Here are a few highlights showing WHY their approach worked:

  • Dirty data cleaning at scale: 37 petabytes of web data, monolingual and parallel, must have been a nightmare to preprocess and clean. Storing, moving, and processing that much data across clusters was no mean feat. For the scale of the problem, 40,000+ translation directions, it was definitely worth it.
  • NLLB-Seed: a few thousand Wikipedia-domain sentences professionally translated into 39 very low-resource languages. This core set of parallel data was the foundation for many of the experiments and for bootstrapping models in the lowest-resource languages.
  • NLLB-MD: professional translations across four domains (news, health, scripted formal speech, and unscripted informal speech), used to check that models generalize beyond the Wikipedia domain.
  • Flores-200: an evaluation dataset covering 204 languages, building on and expanding the FLORES-101 benchmark.
  • Toxicity-200: wordlists for detecting toxic terms in 200 languages. This investment was necessary to keep the resulting model safe and sane, given the amount of vitriol in web text, and to catch translations that add toxicity not present in the source.
  • Bitext mining: finding candidate translation pairs by scouring the web for sentences in different languages that mean the same thing. But how?
  • Language identification (LID): correctly identifying the language of monolingual web text for more than 200 languages, using a dedicated LID classifier (a toy sketch follows this list).
  • LASER3: training a sentence encoder for each language (or group of similar languages) as a student of a massive multilingual teacher, anchoring the students to the teacher’s vector space by minimizing a cosine loss against the teacher’s embeddings. The result is a family of sentence encoders, covering 148 languages, for identifying aligned bitext (a toy sketch of the distillation objective follows this list).
  • Creating bitexts: using the LASER3 encoders to match sentences from monolingual corpora in different languages that land close together in the shared embedding space, yielding mined translation pairs (see the mining sketch after this list).
  • Sparsely gated mixture of experts (MoE): a type of conditional-compute model that activates only a subset of its parameters per input, as opposed to dense models that activate every parameter for every input. The feed-forward network (FFN) in the Transformer architecture is replaced with multiple expert FFNs, and a learned gating mechanism routes each token through only a few of them, with dropout-style masking used as regularization (a minimal PyTorch sketch follows this list).
  • Curriculum learning: training begins on high-resource languages, with low-resource pairs introduced at roughly 60% of training and very low-resource pairs at roughly 80%, the introduction points being chosen from the number of updates after which each translation direction starts to overfit (a simple schedule sketch follows this list).
  • Data augmentation with back-translation: translating monolingual target-language text back into the source language with an existing model to manufacture extra synthetic training pairs (a short sketch follows this list).
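
To make the LID step concrete: my understanding is that the NLLB LID model is a fasttext-style classifier, so a miniature version of the train-and-filter loop could look like the sketch below. The file name, hyperparameters, and confidence threshold are my own illustrative choices, not the team's actual setup.

    # Toy sketch: a fasttext-style language-identification classifier.
    # Training file, hyperparameters, and threshold are illustrative only.
    import fasttext

    # lid_train.txt: one sentence per line, prefixed with its language label,
    # e.g. "__label__yor <a Yoruba sentence>"
    model = fasttext.train_supervised(
        input="lid_train.txt",
        lr=0.5,
        epoch=5,
        wordNgrams=2,
    )

    # Label a candidate sentence before adding it to a monolingual corpus;
    # low-confidence predictions get discarded rather than polluting the data.
    labels, probs = model.predict("a candidate sentence scraped from the web", k=1)
    if probs[0] > 0.9:
        print("keep, predicted language:", labels[0])
    else:
        print("discard: language is uncertain")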
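
The teacher-student idea behind LASER3, reduced to its core: freeze a multilingual teacher, then train a small student encoder for the new language so that its sentence embeddings land next to the teacher's embeddings of the translations, using a cosine loss. The mean-pooling encoders below are toy stand-ins I made up; only the training objective is the point.

    # Toy sketch of LASER3-style teacher-student alignment (assumed encoders).
    import torch
    import torch.nn as nn

    class MeanPoolEncoder(nn.Module):
        """Stand-in sentence encoder: embedding lookup plus mean pooling."""
        def __init__(self, vocab_size, dim):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)

        def forward(self, token_ids):                  # (batch, seq_len)
            return self.emb(token_ids).mean(dim=1)     # (batch, dim)

    teacher = MeanPoolEncoder(vocab_size=32000, dim=256)  # frozen multilingual teacher
    student = MeanPoolEncoder(vocab_size=8000, dim=256)   # new low-resource language
    for p in teacher.parameters():
        p.requires_grad = False

    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    cosine = nn.CosineSimilarity(dim=-1)

    def train_step(src_ids, tgt_ids):
        """src_ids: low-resource sentence; tgt_ids: its high-resource translation."""
        with torch.no_grad():
            anchor = teacher(tgt_ids)                   # fixed point in teacher space
        pred = student(src_ids)
        loss = (1.0 - cosine(pred, anchor)).mean()      # pull student toward teacher
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()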
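
Once two languages share an embedding space, mining bitext becomes a nearest-neighbour search. The real pipeline runs at web scale with FAISS indexes and margin-based scoring; this is a deliberately tiny NumPy version of the same idea, with the neighbourhood size and acceptance threshold picked arbitrarily.

    # Toy margin-based bitext mining over sentence embeddings (NumPy only).
    import numpy as np

    def normalize(mat):
        return mat / np.linalg.norm(mat, axis=1, keepdims=True)

    def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.06):
        """src_emb: (n, d) and tgt_emb: (m, d) sentence embeddings."""
        sim = normalize(src_emb) @ normalize(tgt_emb).T       # (n, m) cosine matrix
        # Average similarity to each sentence's k nearest neighbours,
        # used to discount "hub" sentences that sit close to everything.
        src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # (n,)
        tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # (m,)
        margin = sim / ((src_knn[:, None] + tgt_knn[None, :]) / 2)
        pairs = []
        for i in range(sim.shape[0]):
            j = int(margin[i].argmax())
            if margin[i, j] >= threshold:                     # keep confident matches only
                pairs.append((i, j, float(margin[i, j])))
        return pairs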
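
Here is a minimal sketch of a sparsely gated mixture-of-experts feed-forward layer with top-2 routing, in PyTorch. The sizes, expert count, and routing details are illustrative, and far smaller and simpler than NLLB-200's actual MoE configuration.

    # Toy sparsely gated mixture-of-experts FFN with top-2 token routing.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoEFeedForward(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, num_experts=4, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.gate = nn.Linear(d_model, num_experts)       # learned router
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x):                       # x: (batch, seq, d_model)
            scores = self.gate(x)                   # (batch, seq, num_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)    # renormalise over chosen experts
            out = torch.zeros_like(x)
            # Each token passes through only its top-k experts; the rest of the
            # parameters stay inactive for that token (conditional compute).
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[..., slot] == e                  # (batch, seq) bool
                    if mask.any():
                        out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
            return out

    # Drop-in replacement for the dense FFN sublayer of a Transformer block.
    layer = MoEFeedForward()
    print(layer(torch.randn(2, 10, 512)).shape)     # torch.Size([2, 10, 512])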
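
The curriculum can be expressed as a simple schedule keyed on the training step. The buckets and thresholds below just mirror the rough percentages mentioned above; the real schedule is derived per translation direction from when it starts to overfit.

    # Toy step-based curriculum: high-resource pairs from the start,
    # low-resource pairs after ~60% of updates, very low-resource after ~80%.
    def active_buckets(step, total_steps):
        fraction = step / total_steps
        buckets = ["high_resource"]
        if fraction >= 0.6:
            buckets.append("low_resource")
        if fraction >= 0.8:
            buckets.append("very_low_resource")
        return buckets

    # At 70% of training, batches are sampled from high- and low-resource pairs only.
    print(active_buckets(step=70_000, total_steps=100_000))
    # -> ['high_resource', 'low_resource']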
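
Back-translation in one picture: pair machine-translated source text with genuine, human-written target text so the model still learns to produce clean output. The translate argument below is a hypothetical stand-in for any existing reverse-direction model, not a real NLLB API.

    # Toy back-translation for data augmentation; `translate` is a hypothetical
    # stand-in for an existing target-to-source translation model.
    def back_translate(target_monolingual, translate):
        """Build synthetic (source, target) training pairs from monolingual
        target-language sentences."""
        pairs = []
        for tgt_sentence in target_monolingual:
            synthetic_src = translate(tgt_sentence)    # model output, possibly noisy
            # Noisy synthetic source, clean human-written target.
            pairs.append((synthetic_src, tgt_sentence))
        return pairs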

Let me know in the comments if this was helpful or if I left out any important details.

The End!
