Language Models' Factuality Depends on the Language of Inquiry
Credit: https://arxiv.org/pdf/2502.17955

Today's paper investigates an interesting limitation in multilingual language models (LMs): their inconsistency in recalling factual knowledge across different languages. The authors discover that LMs may correctly recall a fact when asked in one language but fail to do so in another, even when they possess the correct information. This inconsistency reveals that factual knowledge in LMs is often stored in language-specific "silos" rather than being universally accessible.

Overview

The paper introduces a comprehensive benchmark to evaluate how consistently language models recall factual knowledge across different languages. The authors created a dataset of 10,000 country-related facts translated into 13 languages, categorized into high-resource (English, Chinese, French, Japanese), medium-resource (Hindi, Russian, Arabic, Greek), and low-resource (Nepali, Ukrainian, Turkish, Swahili, Thai) languages.

The benchmark evaluates three key capabilities: Factual Recall, In-context Recall, and Counter-Factual Context Adherence. For Factual Recall, models are asked simple factual questions about entities associated with specific countries in different languages. In-context Recall tests whether models can use contextual information to answer questions without being influenced by their internal knowledge. Counter-Factual Context Adherence examines whether models adhere to provided context even when it contradicts their internal factual knowledge.
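To make the three settings concrete, here is a minimal sketch of how such evaluation prompts might be constructed. The function names and templates are hypothetical illustrations, not the paper's actual prompts:

```python
# Illustrative prompt builders for the three evaluation settings.
# Templates are assumptions for this sketch, not the paper's exact wording.

def factual_recall_prompt(question: str) -> str:
    # Factual Recall: the model must answer from internal knowledge alone.
    return f"Answer the question.\nQ: {question}\nA:"

def in_context_recall_prompt(context: str, question: str) -> str:
    # In-context Recall: the correct answer is supplied in the context,
    # and the model should use it rather than its stored knowledge.
    return f"Context: {context}\nQ: {question}\nA:"

def counterfactual_prompt(false_context: str, question: str) -> str:
    # Counter-Factual Context Adherence: the context deliberately
    # contradicts world knowledge; the model should still follow it.
    return f"Context: {false_context}\nAnswer using ONLY the context.\nQ: {question}\nA:"

q = "What is the national sport of Japan?"
print(factual_recall_prompt(q))
print(in_context_recall_prompt("The national sport of Japan is sumo.", q))
print(counterfactual_prompt("The national sport of Japan is cricket.", q))
```

In practice each prompt would be rendered in all 13 languages, which is what exposes the cross-lingual inconsistency.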

To quantify performance, the paper uses three metrics. The Factual Recall Score (FRS) measures how accurately a model recalls facts in a given language. The Knowledge Transferability Score (KTS) quantifies how well factual knowledge is transferred across languages. The Cross-Lingual Factual Knowledge Transferability (X-FaKT) Score combines both measures to provide a comprehensive evaluation of a model's ability to maintain consistent factual knowledge across languages.
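The exact formulas are defined in the paper; as a rough illustration, FRS can be viewed as per-language recall accuracy, KTS as cross-language agreement on the same fact, and X-FaKT as a combination of the two. The sketch below uses a harmonic mean for the combination, which is an assumption for this illustration:

```python
# Illustrative (not the paper's exact) formulations of the three metrics.
from itertools import combinations

def factual_recall_score(results: dict[str, list[bool]]) -> float:
    # FRS sketch: overall accuracy of factual recall across languages.
    # results maps language -> per-fact correctness in that language.
    all_answers = [ok for answers in results.values() for ok in answers]
    return sum(all_answers) / len(all_answers)

def knowledge_transferability_score(results: dict[str, list[bool]]) -> float:
    # KTS sketch: fraction of language pairs that agree (both correct
    # or both incorrect) on each fact.
    langs = list(results)
    n_facts = len(results[langs[0]])
    agree = total = 0
    for i in range(n_facts):
        for a, b in combinations(langs, 2):
            agree += results[a][i] == results[b][i]
            total += 1
    return agree / total

def x_fakt(frs: float, kts: float) -> float:
    # Harmonic mean rewards models that do well on BOTH axes;
    # a model strong in recall but weak in transfer scores low.
    return 2 * frs * kts / (frs + kts) if frs + kts else 0.0

results = {
    "en": [True, True, False],
    "hi": [True, False, False],
}
frs = factual_recall_score(results)   # 0.5
kts = knowledge_transferability_score(results)  # 2/3
print(round(x_fakt(frs, kts), 3))
```

The harmonic mean is a natural choice here because it penalizes imbalance: a model cannot compensate for poor transferability with high raw recall.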

The authors distinguish between "associative knowledge" (facts asked in the language associated with the country) and "non-associative knowledge" (facts asked in languages not associated with the country). This distinction helps identify whether models perform better when recalling facts in languages culturally connected to those facts.
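The split between the two knowledge types can be sketched as a simple partition over (country, query-language) pairs, assuming a country-to-associated-language mapping (the mapping and helper below are illustrative, not from the paper):

```python
# Sketch of partitioning queries into associative vs. non-associative.
# ASSOCIATED_LANGUAGE is a hypothetical country -> language mapping.

ASSOCIATED_LANGUAGE = {"Japan": "ja", "France": "fr", "India": "hi"}

def is_associative(fact_country: str, query_language: str) -> bool:
    # A query is "associative" when asked in the language culturally
    # tied to the fact's country (e.g. a fact about Japan in Japanese).
    return ASSOCIATED_LANGUAGE.get(fact_country) == query_language

queries = [("Japan", "ja"), ("Japan", "en"), ("France", "fr")]
associative = [q for q in queries if is_associative(*q)]
non_associative = [q for q in queries if not is_associative(*q)]
print(len(associative), len(non_associative))  # → 2 1
```

Scoring the two partitions separately is what lets the authors show the gap between culturally connected and unconnected recall.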

Results

The paper reveals several important findings about multilingual language models:

  1. All tested models demonstrate significantly better performance on "associative knowledge" compared to "non-associative knowledge," indicating that facts are often stored in language-specific silos rather than being universally accessible.
  2. Larger models generally perform better at both factual recall and knowledge transferability. Llama-3-70B achieved the highest X-FaKT score of 0.848, demonstrating superior balanced performance in both factual recall and knowledge transfer.
  3. High-resource languages like English and French show consistently lower error rates (around 3.83% for associative knowledge) compared to medium and low-resource languages (26.73% and 29.53% respectively).
  4. Languages that share similar scripts (like Hindi-Nepali and Russian-Ukrainian pairs) show correlated performance patterns, suggesting that script similarity plays a crucial role in knowledge transfer.


Conclusion

The paper reveals a key limitation in current multilingual language models: their inability to consistently transfer factual knowledge across languages. This inconsistency suggests that factual information is stored in language-specific silos rather than being universally accessible. The authors emphasize the need for "calibrated multilingualism," where models can autonomously leverage their most reliable internal representations for any given multilingual query. For more details, please consult the full paper.

Congrats to the authors for their work!

Aggarwal, Tushar, et al. "Language Models' Factuality Depends on the Language of Inquiry." arXiv preprint arXiv:2502.17955 (2025).
