How multilingual is Multilingual BERT?
Ibrahim Sobh - PhD
This article is an extractive summary of the paper "How multilingual is Multilingual BERT?" by Pires, Schlinger, and Garrette (Google Research).
Previously, we introduced Transformers and BERT.
What is M-BERT?
M-BERT: a single language model pre-trained on the concatenation of monolingual Wikipedia corpora from 104 languages.
Cross-lingual generalization
Surprisingly, M-BERT is good at zero-shot cross-lingual model transfer.
We fine-tune the model using task-specific supervised training data from one language and evaluate it on the same task in a different language, which lets us observe how the model generalizes information across languages (sketched in code below).
M-BERT’s pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization.
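To make this setup concrete, here is a minimal sketch of zero-shot cross-lingual transfer using the Hugging Face transformers library and the public bert-base-multilingual-cased checkpoint. The tiny English POS example, the toy tag set, and the German evaluation sentence are hypothetical placeholders, not the data used in the paper, which fine-tunes on full treebanks.

```python
# Minimal sketch of zero-shot cross-lingual transfer with M-BERT:
# fine-tune a token classifier on (toy) English POS data, then run it
# unchanged on a German sentence. The toy data is illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["DET", "NOUN", "VERB"]                      # toy tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

# One toy English training sentence with word-level POS tags.
words = ["The", "dog", "barks"]
word_tags = [0, 1, 2]                                 # DET NOUN VERB

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to word pieces; ignore special tokens and
# continuation pieces with the -100 label.
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(-100 if wid is None or wid == prev else word_tags[wid])
    prev = wid
labels_t = torch.tensor([aligned])

# A single (illustrative) fine-tuning step on English only.
optim = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(**enc, labels=labels_t).loss
loss.backward()
optim.step()

# Zero-shot evaluation: apply the English-fine-tuned tagger to German.
model.eval()
de = tokenizer(["Der", "Hund", "bellt"], is_split_into_words=True,
               return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**de).logits.argmax(-1)[0]
print([labels[int(i)] for i in pred_ids[1:-1]])       # tags per word piece
```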
Generalization across scripts
While high lexical overlap between languages improves transfer, M-BERT is also able to transfer between languages written in different scripts, and thus with effectively zero lexical overlap. This surprising ability indicates that it captures genuinely multilingual representations.
An M-BERT model fine-tuned using only POS-labeled Urdu (written in Arabic script) achieves 91% accuracy on Hindi (written in Devanagari script), even though it has never seen a single POS-tagged Devanagari word.
This provides clear evidence of M-BERT’s multilingual representation ability, mapping structures onto new vocabularies based on a shared representation induced solely from monolingual language model training data.
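To see what "zero lexical overlap" means in practice, here is a quick check with the M-BERT tokenizer (a sketch; the Hindi and Urdu sentences below are illustrative and not taken from the paper). Two sentences with the same meaning but different scripts share essentially no word pieces:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

hindi = "यह एक किताब है"        # "This is a book" in Devanagari script
urdu = "یہ ایک کتاب ہے"         # the same sentence in Arabic script

hi_pieces = set(tok.tokenize(hindi))
ur_pieces = set(tok.tokenize(urdu))
print(hi_pieces & ur_pieces)    # expected: an empty (or near-empty) set
```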
Cross-script transfer is less accurate for other pairs, such as English and Japanese, indicating that M-BERT’s multilingual representation is not able to generalize equally well in all cases.
A possible explanation for this is typological similarity: English and Japanese have a different order of subject, verb, and object, while English and Bulgarian share the same order, and M-BERT may have trouble generalizing across different orderings.
Subject, Object, Verb order
Performance is best when transferring between languages that share word order features.
While M-BERT’s multilingual representation is able to map learned structures onto new vocabularies, it does not seem to learn systematic transformations of those structures to accommodate a target language with different word order.
Multilingual characterization of the feature space
- We sample 5000 pairs of sentences from WMT16 (Bojar et al., 2016) and feed each sentence separately to M-BERT with no fine-tuning.
- We then extract the hidden feature activations at each layer for each of the sentences and average the representations for the input tokens except [CLS] and [SEP], to get a vector for each sentence, at each layer.
- For each pair of sentences, we compute the vector pointing from one to the other and average it over all pairs.
- Finally, we "translate" each English sentence vector, EN→DE, by adding the average vector, find the closest German sentence vector, and measure the fraction of times the nearest neighbor is the correct pair, which we call the "nearest neighbor accuracy" (a code sketch of the whole procedure follows this list).
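Here is a minimal sketch of this probe for a single hidden layer, assuming hypothetical parallel lists en_sentences and de_sentences (e.g. sampled from WMT16) and the public bert-base-multilingual-cased checkpoint; the layer index is a free parameter:

```python
# Sketch of the nearest-neighbor-accuracy probe for one hidden layer.
# `en_sentences` and `de_sentences` are hypothetical parallel lists
# (sentence i in one list is the translation of sentence i in the other).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

def sentence_vector(sentence, layer):
    """Average the layer's token vectors, excluding [CLS] and [SEP]."""
    enc = tokenizer(sentence, return_tensors="pt",
                    return_special_tokens_mask=True)
    special = enc.pop("special_tokens_mask").bool()[0]
    with torch.no_grad():
        # hidden_states[0] is the embedding output; 1..12 are the layers.
        hidden = model(**enc).hidden_states[layer][0]   # (seq_len, dim)
    return hidden[~special].mean(dim=0)

def nearest_neighbor_accuracy(en_sentences, de_sentences, layer=8):
    en = torch.stack([sentence_vector(s, layer) for s in en_sentences])
    de = torch.stack([sentence_vector(s, layer) for s in de_sentences])
    offset = (de - en).mean(dim=0)          # average EN -> DE vector
    translated = en + offset                # "translate" each EN vector
    dists = torch.cdist(translated, de)     # distances to all DE vectors
    nearest = dists.argmin(dim=1)
    correct = (nearest == torch.arange(len(en_sentences))).float()
    return correct.mean().item()
```

Calling nearest_neighbor_accuracy once per layer gives the kind of layer-wise curve described next.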
The nearest neighbor accuracy for EN-DE (the solid line in the paper's plot) is over 50% for all but the bottom layers, which seems to imply that the hidden representations share a common subspace that encodes useful linguistic information in a language-agnostic way. Similar curves are obtained for EN-RU and UR-HI (an in-house dataset), showing that this holds for multiple language pairs.
Why does the accuracy go down in the last few layers?
One possible explanation is that, since the model was pre-trained for language modeling, it might need more language-specific information in the top layers to correctly predict the missing word.
Why does M-BERT generalize across languages?
The hypothesis is that word pieces used in all languages (numbers, URLs, etc.) have to be mapped to a shared space, which forces the pieces that co-occur with them to be mapped to that shared space as well, spreading the effect to other word pieces until the different languages end up close to a shared space.
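One quick way to see such shared anchors is to tokenize two sentences in different languages that mention the same number and URL (a sketch; the sentences are made up for illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

en = "The shared task data was released in 2016 at http://www.statmt.org"
de = "Die Daten wurden 2016 unter http://www.statmt.org veröffentlicht"

# Word pieces that appear in both sentences: the year, the URL pieces, etc.
shared = set(tok.tokenize(en)) & set(tok.tokenize(de))
print(sorted(shared))
```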
Conclusions
- M-BERT’s robust, often surprising, ability to generalize cross-lingually is underpinned by a multilingual representation, without the model being explicitly trained for it.
- The model handles transfer across scripts and to code-switching fairly well.
- It is our hope that these kinds of probing experiments will help steer researchers toward the most promising lines of inquiry by encouraging them to focus on the places where current contextualized word representation approaches fall short.
Can we do better? Yes, we can! Wait for the next article.
Regards