Language Models' Factuality Depends on the Language of Inquiry
Today's paper investigates an interesting limitation in multilingual language models (LMs): their inconsistency in recalling factual knowledge across different languages. The authors discover that LMs may correctly recall a fact when asked in one language but fail to do so in another, even when they possess the correct information. This inconsistency reveals that factual knowledge in LMs is often stored in language-specific "silos" rather than being universally accessible.
Overview
The paper introduces a comprehensive benchmark to evaluate how consistently language models recall factual knowledge across different languages. The authors created a dataset of 10,000 country-related facts translated into 13 languages, categorized into high-resource (English, Chinese, French, Japanese), medium-resource (Hindi, Russian, Arabic, Greek), and low-resource (Nepali, Ukrainian, Turkish, Swahili, Thai) languages.
The benchmark evaluates three key capabilities: Factual Recall, In-context Recall, and Counter-Factual Context Adherence. For Factual Recall, models are asked simple factual questions about entities associated with specific countries in different languages. In-context Recall tests whether models can use contextual information to answer questions without being influenced by their internal knowledge. Counter-Factual Context Adherence examines whether models adhere to provided context even when it contradicts their internal factual knowledge.
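The three settings differ only in what the prompt provides. A minimal sketch of how such prompts might be constructed; the templates and the example fact are illustrative, not the paper's actual data:

```python
# Hypothetical prompt templates for the three evaluation settings.
# The fact below is a made-up example, not drawn from the benchmark.
FACT = {"country": "Japan", "attribute": "capital", "answer": "Tokyo"}

def factual_recall_prompt(fact):
    # Factual Recall: ask directly, relying on the model's internal knowledge.
    return f"What is the {fact['attribute']} of {fact['country']}?"

def in_context_recall_prompt(fact):
    # In-context Recall: supply the true fact; the model should read it off
    # the context rather than depend on what it has memorized.
    context = f"The {fact['attribute']} of {fact['country']} is {fact['answer']}."
    return f"{context}\nQuestion: What is the {fact['attribute']} of {fact['country']}?"

def counterfactual_prompt(fact, false_answer="Osaka"):
    # Counter-Factual Context Adherence: supply a contradicting fact; a
    # context-adherent model should answer "Osaka" here, not "Tokyo".
    context = f"The {fact['attribute']} of {fact['country']} is {false_answer}."
    return f"{context}\nQuestion: What is the {fact['attribute']} of {fact['country']}?"
```

In the benchmark, each such prompt would be posed in all 13 languages, so the same underlying fact is probed once per language per setting.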
To quantify performance, the paper uses three metrics. The Factual Recall Score (FRS) measures how accurately a model recalls facts in a given language. The Knowledge Transferability Score (KTS) quantifies how well factual knowledge is transferred across languages. The Cross-Lingual Factual Knowledge Transferability (X-FaKT) Score combines both measures to provide a comprehensive evaluation of a model's ability to maintain consistent factual knowledge across languages.
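To make the interplay of the three metrics concrete, here is a deliberately simplified sketch. The scoring functions below are assumptions for illustration only (the paper's exact definitions differ; consult the full text): FRS is approximated as mean per-language accuracy, KTS as a penalty on the accuracy spread across languages, and X-FaKT as a harmonic-mean combination, so a model must score well on both recall and consistency to score well overall.

```python
from statistics import harmonic_mean

# Illustrative per-language accuracies for one model (made-up numbers).
accuracy = {"en": 0.92, "fr": 0.88, "hi": 0.61, "sw": 0.43}

def factual_recall_score(acc):
    # Simplified FRS: average recall accuracy across languages.
    return sum(acc.values()) / len(acc)

def knowledge_transferability_score(acc):
    # Simplified KTS: penalize the gap between the best and worst
    # language; identical accuracy everywhere yields a perfect 1.0.
    return 1 - (max(acc.values()) - min(acc.values()))

def x_fakt(acc):
    # Harmonic-mean combination: dragged down if either component is low,
    # so strong English-only recall cannot mask poor transferability.
    return harmonic_mean([factual_recall_score(acc),
                          knowledge_transferability_score(acc)])
```

Under this toy scoring, a model that is accurate only in high-resource languages earns a high FRS but a low KTS, and the harmonic mean keeps its X-FaKT low, which is the behavior the combined metric is designed to expose.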
The authors distinguish between "associative knowledge" (facts asked in the language associated with the country) and "non-associative knowledge" (facts asked in languages not associated with the country). For example, asking about a Japanese entity in Japanese probes associative knowledge, while asking the same question in Swahili probes non-associative knowledge. This distinction helps identify whether models perform better when recalling facts in languages culturally connected to those facts.
Results
The experiments support the paper's central claim: models often answer a factual question correctly in one language yet fail on the same question in another, performing better on associative than non-associative knowledge, which indicates that factual knowledge is siloed by language rather than transferred across languages.
Conclusion
The paper reveals a limitation in current multilingual language models: their inability to consistently transfer factual knowledge across languages. This inconsistency suggests that factual information is stored in language-specific silos rather than being universally accessible. The authors emphasize the need for "calibrated multilingualism", where models can autonomously leverage their most reliable internal representations for any given multilingual query. For more information, please consult the full paper.
Congrats to the authors for their work!
Aggarwal, Tushar, et al. "Language Models' Factuality Depends on the Language of Inquiry." arXiv preprint arXiv:2502.17955 (2025).