How multilingual is Multilingual BERT?
Ibrahim Sobh - PhD
This article is an extractive summary of the paper "How multilingual is Multilingual BERT?" by Pires, Schlinger, and Garrette (Google Research).
Previously, we introduced Transformers and BERT.
What is M-BERT?
M-BERT: a single language model pre-trained on the concatenation of monolingual Wikipedia corpora from 104 languages.
Cross-lingual generalization
Surprisingly, M-BERT is good at zero-shot cross-lingual model transfer.
We fine-tune the model using task-specific supervised training data from one language and evaluate it on the same task in a different language, which lets us observe how the model generalizes information across languages (sketched in code below).
M-BERT’s pretraining on multiple languages has enabled a representational capacity deeper than simple vocabulary memorization.
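To make this setup concrete, here is a minimal sketch of zero-shot cross-lingual transfer using the Hugging Face transformers library and the public bert-base-multilingual-cased checkpoint. The tiny English POS example, the toy tag set, and the German evaluation sentence are hypothetical placeholders, not the data used in the paper, which fine-tunes on full treebanks.

```python
# Minimal sketch of zero-shot cross-lingual transfer with M-BERT:
# fine-tune a token classifier on (toy) English POS data, then run it
# unchanged on a German sentence. The toy data is illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["DET", "NOUN", "VERB"]                      # toy tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

# One toy English training sentence with word-level POS tags.
words = ["The", "dog", "barks"]
word_tags = [0, 1, 2]                                 # DET NOUN VERB

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to word pieces; ignore special tokens and
# continuation pieces with the -100 label.
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(-100 if wid is None or wid == prev else word_tags[wid])
    prev = wid
labels_t = torch.tensor([aligned])

# A single (illustrative) fine-tuning step on English only.
optim = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(**enc, labels=labels_t).loss
loss.backward()
optim.step()

# Zero-shot evaluation: apply the English-fine-tuned tagger to German.
model.eval()
de = tokenizer(["Der", "Hund", "bellt"], is_split_into_words=True,
               return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**de).logits.argmax(-1)[0]
print([labels[int(i)] for i in pred_ids[1:-1]])       # tags per word piece
```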
Generalization across scripts
While high lexical overlap between languages improves transfer, M-BERT is also able to transfer between languages written in different scripts, and thus with effectively zero lexical overlap. This surprising ability indicates that it captures genuinely multilingual representations.
An M-BERT model fine-tuned using only POS-labeled Urdu (written in Arabic script) achieves 91% accuracy on Hindi (written in Devanagari script), even though it has never seen a single POS-tagged Devanagari word.
This provides clear evidence of M-BERT’s multilingual representation ability, mapping structures onto new vocabularies based on a shared representation induced solely from monolingual language model training data.
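To see what "zero lexical overlap" means in practice, here is a quick check with the M-BERT tokenizer (a sketch; the Hindi and Urdu sentences below are illustrative and not taken from the paper). Two sentences with the same meaning but different scripts share essentially no word pieces:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

hindi = "यह एक किताब है"        # "This is a book" in Devanagari script
urdu = "یہ ایک کتاب ہے"         # the same sentence in Arabic script

hi_pieces = set(tok.tokenize(hindi))
ur_pieces = set(tok.tokenize(urdu))
print(hi_pieces & ur_pieces)    # expected: an empty (or near-empty) set
```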
Cross-script transfer is less accurate for other pairs, such as English and Japanese, indicating that M-BERT’s multilingual representation is not able to generalize equally well in all cases.
A possible explanation for this is typological similarity: English and Japanese have a different order of subject, verb, and object, while English and Bulgarian share the same order, and M-BERT may have trouble generalizing across different orderings.
Subject, Object, Verb order
Performance is best when transferring between languages that share word order features.
While M-BERT’s multilingual representation is able to map learned structures onto new vocabularies, it does not seem to learn systematic transformations of those structures to accommodate a target language with different word order.
Multilingual characterization of the feature space
- We sample 5000 pairs of sentences from WMT16 (Bojar et al., 2016) and feed each sentence separately to M-BERT with no fine-tuning.
- We then extract the hidden feature activations at each layer for each of the sentences and average the representations for the input tokens except [CLS] and [SEP], to get a vector for each sentence, at each layer.
- For each pair of sentences, we compute the vector pointing from one to the other and average it over all pairs.
- Finally, we "translate" each English sentence vector, EN→DE, by adding the average vector, find the closest German sentence vector, and measure the fraction of times the nearest neighbor is the correct pair, which we call the "nearest neighbor accuracy" (a code sketch of the whole procedure follows this list).
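Here is a minimal sketch of this probe for a single hidden layer, assuming hypothetical parallel lists en_sentences and de_sentences (e.g. sampled from WMT16) and the public bert-base-multilingual-cased checkpoint; the layer index is a free parameter:

```python
# Sketch of the nearest-neighbor-accuracy probe for one hidden layer.
# `en_sentences` and `de_sentences` are hypothetical parallel lists
# (sentence i in one list is the translation of sentence i in the other).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)
model.eval()

def sentence_vector(sentence, layer):
    """Average the layer's token vectors, excluding [CLS] and [SEP]."""
    enc = tokenizer(sentence, return_tensors="pt",
                    return_special_tokens_mask=True)
    special = enc.pop("special_tokens_mask").bool()[0]
    with torch.no_grad():
        # hidden_states[0] is the embedding output; 1..12 are the layers.
        hidden = model(**enc).hidden_states[layer][0]   # (seq_len, dim)
    return hidden[~special].mean(dim=0)

def nearest_neighbor_accuracy(en_sentences, de_sentences, layer=8):
    en = torch.stack([sentence_vector(s, layer) for s in en_sentences])
    de = torch.stack([sentence_vector(s, layer) for s in de_sentences])
    offset = (de - en).mean(dim=0)          # average EN -> DE vector
    translated = en + offset                # "translate" each EN vector
    dists = torch.cdist(translated, de)     # distances to all DE vectors
    nearest = dists.argmin(dim=1)
    correct = (nearest == torch.arange(len(en_sentences))).float()
    return correct.mean().item()
```

Calling nearest_neighbor_accuracy once per layer gives the kind of layer-wise curve described next.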
The nearest neighbor accuracy for EN-DE (the solid line in the paper's plot) is over 50% for all but the bottom layers, which seems to imply that the hidden representations share a common subspace that encodes useful linguistic information in a language-agnostic way. Similar curves are obtained for EN-RU and UR-HI (an in-house dataset), showing that this holds for multiple language pairs.
Why does the accuracy go down in the last few layers?
One possible explanation is that, since the model was pre-trained for language modeling, it might need more language-specific information in the top layers to correctly predict the missing word.
Why does M-BERT generalize across languages?
The hypothesis is that word pieces used in all languages (numbers, URLs, etc.) have to be mapped to a shared space, which forces the pieces that co-occur with them to be mapped to that shared space as well, spreading the effect to other word pieces until the different languages end up close to a shared space.
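One quick way to see such shared anchors is to tokenize two sentences in different languages that mention the same number and URL (a sketch; the sentences are made up for illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

en = "The shared task data was released in 2016 at http://www.statmt.org"
de = "Die Daten wurden 2016 unter http://www.statmt.org veröffentlicht"

# Word pieces that appear in both sentences: the year, the URL pieces, etc.
shared = set(tok.tokenize(en)) & set(tok.tokenize(de))
print(sorted(shared))
```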
Conclusions
- M-BERT’s robust, often surprising, ability to generalize cross-lingually is underpinned by a multilingual representation, without the model being explicitly trained for it.
- The model handles transfer across scripts and to code-switching fairly well.
- It is our hope that these kinds of probing experiments will help steer researchers toward the most promising lines of inquiry by encouraging them to focus on the places where current contextualized word representation approaches fall short.
Can we do better? Yes, we can! Wait for the next article.
Regards