Language Models' Factuality Depends on the Language of Inquiry
Credit: https://arxiv.org/pdf/2502.17955

Today's paper investigates an interesting limitation in multilingual language models (LMs): their inconsistency in recalling factual knowledge across different languages. The authors discover that LMs may correctly recall a fact when asked in one language but fail to do so in another, even when they possess the correct information. This inconsistency reveals that factual knowledge in LMs is often stored in language-specific "silos" rather than being universally accessible.

Overview

The paper introduces a comprehensive benchmark to evaluate how consistently language models recall factual knowledge across different languages. The authors created a dataset of 10,000 country-related facts translated into 13 languages, categorized into high-resource (English, Chinese, French, Japanese), medium-resource (Hindi, Russian, Arabic, Greek), and low-resource (Nepali, Ukrainian, Turkish, Swahili, Thai) languages.

The benchmark evaluates three key capabilities: Factual Recall, In-context Recall, and Counter-Factual Context Adherence. For Factual Recall, models are asked simple factual questions about entities associated with specific countries in different languages. In-context Recall tests whether models can use contextual information to answer questions without being influenced by their internal knowledge. Counter-Factual Context Adherence examines whether models adhere to provided context even when it contradicts their internal factual knowledge.
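To make the three settings concrete, here is a minimal sketch of how such evaluation prompts might be constructed. The function names and templates are hypothetical illustrations, not the paper's actual prompts:

```python
# Illustrative prompt builders for the three evaluation settings.
# Templates are assumptions for this sketch, not the paper's exact wording.

def factual_recall_prompt(question: str) -> str:
    # Factual Recall: the model must answer from internal knowledge alone.
    return f"Answer the question.\nQ: {question}\nA:"

def in_context_recall_prompt(context: str, question: str) -> str:
    # In-context Recall: the correct answer is supplied in the context,
    # and the model should use it rather than its stored knowledge.
    return f"Context: {context}\nQ: {question}\nA:"

def counterfactual_prompt(false_context: str, question: str) -> str:
    # Counter-Factual Context Adherence: the context deliberately
    # contradicts world knowledge; the model should still follow it.
    return f"Context: {false_context}\nAnswer using ONLY the context.\nQ: {question}\nA:"

q = "What is the national sport of Japan?"
print(factual_recall_prompt(q))
print(in_context_recall_prompt("The national sport of Japan is sumo.", q))
print(counterfactual_prompt("The national sport of Japan is cricket.", q))
```

In practice each prompt would be rendered in all 13 languages, which is what exposes the cross-lingual inconsistency.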

To quantify performance, the paper uses three metrics. The Factual Recall Score (FRS) measures how accurately a model recalls facts in a given language. The Knowledge Transferability Score (KTS) quantifies how well factual knowledge is transferred across languages. The Cross-Lingual Factual Knowledge Transferability (X-FaKT) Score combines both measures to provide a comprehensive evaluation of a model's ability to maintain consistent factual knowledge across languages.
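The exact formulas are defined in the paper; as a rough illustration, FRS can be viewed as per-language recall accuracy, KTS as cross-language agreement on the same fact, and X-FaKT as a combination of the two. The sketch below uses a harmonic mean for the combination, which is an assumption for this illustration:

```python
# Illustrative (not the paper's exact) formulations of the three metrics.
from itertools import combinations

def factual_recall_score(results: dict[str, list[bool]]) -> float:
    # FRS sketch: overall accuracy of factual recall across languages.
    # results maps language -> per-fact correctness in that language.
    all_answers = [ok for answers in results.values() for ok in answers]
    return sum(all_answers) / len(all_answers)

def knowledge_transferability_score(results: dict[str, list[bool]]) -> float:
    # KTS sketch: fraction of language pairs that agree (both correct
    # or both incorrect) on each fact.
    langs = list(results)
    n_facts = len(results[langs[0]])
    agree = total = 0
    for i in range(n_facts):
        for a, b in combinations(langs, 2):
            agree += results[a][i] == results[b][i]
            total += 1
    return agree / total

def x_fakt(frs: float, kts: float) -> float:
    # Harmonic mean rewards models that do well on BOTH axes;
    # a model strong in recall but weak in transfer scores low.
    return 2 * frs * kts / (frs + kts) if frs + kts else 0.0

results = {
    "en": [True, True, False],
    "hi": [True, False, False],
}
frs = factual_recall_score(results)   # 0.5
kts = knowledge_transferability_score(results)  # 2/3
print(round(x_fakt(frs, kts), 3))
```

The harmonic mean is a natural choice here because it penalizes imbalance: a model cannot compensate for poor transferability with high raw recall.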

The authors distinguish between "associative knowledge" (facts asked in the language associated with the country) and "non-associative knowledge" (facts asked in languages not associated with the country). This distinction helps identify whether models perform better when recalling facts in languages culturally connected to those facts.
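The split between the two knowledge types can be sketched as a simple partition over (country, query-language) pairs, assuming a country-to-associated-language mapping (the mapping and helper below are illustrative, not from the paper):

```python
# Sketch of partitioning queries into associative vs. non-associative.
# ASSOCIATED_LANGUAGE is a hypothetical country -> language mapping.

ASSOCIATED_LANGUAGE = {"Japan": "ja", "France": "fr", "India": "hi"}

def is_associative(fact_country: str, query_language: str) -> bool:
    # A query is "associative" when asked in the language culturally
    # tied to the fact's country (e.g. a fact about Japan in Japanese).
    return ASSOCIATED_LANGUAGE.get(fact_country) == query_language

queries = [("Japan", "ja"), ("Japan", "en"), ("France", "fr")]
associative = [q for q in queries if is_associative(*q)]
non_associative = [q for q in queries if not is_associative(*q)]
print(len(associative), len(non_associative))  # → 2 1
```

Scoring the two partitions separately is what lets the authors show the gap between culturally connected and unconnected recall.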

Results

The paper reveals several important findings about multilingual language models:

  1. All tested models demonstrate significantly better performance on "associative knowledge" compared to "non-associative knowledge," indicating that facts are often stored in language-specific silos rather than being universally accessible.
  2. Larger models generally perform better at both factual recall and knowledge transferability. Llama-3-70B achieved the highest X-FaKT score of 0.848, demonstrating superior balanced performance in both factual recall and knowledge transfer.
  3. High-resource languages like English and French show consistently lower error rates (around 3.83% for associative knowledge) compared to medium and low-resource languages (26.73% and 29.53% respectively).
  4. Languages that share similar scripts (like Hindi-Nepali and Russian-Ukrainian pairs) show correlated performance patterns, suggesting that script similarity plays a crucial role in knowledge transfer.


Conclusion

The paper reveals a key limitation in current multilingual language models: their inability to consistently transfer factual knowledge across languages. This inconsistency suggests that factual information is stored in language-specific silos rather than being universally accessible. The authors emphasize the need for "calibrated multilingualism," where models can autonomously leverage their most reliable internal representations for any given multilingual query. For more details, please consult the full paper.

Congrats to the authors for their work!

Aggarwal, Tushar, et al. "Language Models' Factuality Depends on the Language of Inquiry." arXiv preprint arXiv:2502.17955 (2025).
