The Superpower of “en-US”: “en” vs. the under-represented languages

“en-US”, once merely a language identifier for American English, now carries a new connotation: English/USA as a superpower for language AI.

Welcome to the first part of a two-blog series exploring the dominance of “en-US” (English/USA) as a superpower in language AI. In this blog, we delve into the impact of English as a dominant force compared to under-represented languages. In the next blog, we analyze how the USA invests in language AI with respect to other under-invested countries.

I am a computational linguist who specializes in building language AI applications using Large Language Models (LLMs). My work involves prompt engineering across diverse languages and analyzing language outputs. This article combines my practical experience with quantitative analysis to examine how LLMs such as ChatGPT support various languages.

TLDR:

  • LLMs can interpret all 161 language scripts in Unicode.
  • English constitutes over 90% of the training data in GPT-3.
  • English is the most efficient language for prompting LLMs—1.3x more efficient than Spanish, 1.5x more efficient than French, and 2x more efficient than CJK (Chinese, Japanese, Korean) languages.
  • Approximately 10 high-resource languages are adequately supported by LLMs.
  • The bottom 150 languages are both low-resource and under-represented.
  • Nearly 7,000 languages worldwide lack representation in LLMs.

The World’s Languages: High Resource vs. Low Resource

In the realm of traditional Natural Language Processing (NLP) research, there's a distinction between high-resource and low-resource languages. High-resource languages encompass approximately 20 languages, including English, Chinese, Spanish, French, German, Japanese, Russian, Portuguese, Arabic, Hindi, Italian, Korean, Dutch, Turkish, Persian, Swedish, Polish, Indonesian, Vietnamese, and Hebrew.

These high-resource languages benefit from rich linguistic resources such as extensive monolingual text, parallel corpora for machine translation, comprehensive lexical dictionaries, syntactic annotations, and labeled corpora for supervised learning.

Interestingly, some high-resource languages, like Dutch, may not have a vast number of speakers, but their robust linguistic research community has led to the development of significant linguistic corpora and tools, elevating them to high-resource status. Conversely, certain low-resource languages, such as Nigerian Pidgin, are spoken by over 100 million people but lack substantial research and development, relegating them to low-resource status.

Speakers of low-resource languages often face barriers due to limited funding for language analysis, annotation, and preservation. However, the advent of Large Language Models represents a transformative shift. Unlike traditional NLP methods that heavily rely on expensive labeled data to elevate a language's status, LLMs can be pre-trained on vast amounts of unlabeled text. This makes LLMs a cost-effective solution for enabling meaningful language understanding and utilization even for low-resource languages.

The exciting development of LLMs raises the question: will LLMs help save low-resource languages?

Low-resource Languages are Still Under-represented in LLMs

Despite the transformative potential of LLMs, such as ChatGPT, the reality remains that most LLMs predominantly cater to English and a handful of other high-resource languages. A closer examination of the training corpus used for models like GPT-3 reveals a stark imbalance:

  • English Dominance: The training corpus of GPT-3 is overwhelmingly English, accounting for 92.6% of the data. This trend continues with subsequent models like ChatGPT (based on GPT-3.5).
  • Limited Representation:
      ◦ Only a few languages make up more than 1% of GPT-3’s corpus, namely French (1.8%) and German (1.5%).
      ◦ A further 14 languages fall within the range of 0.1% to 1%, including Spanish, Italian, Portuguese, Dutch, Russian, Romanian, Polish, Finnish, Danish, Swedish, Japanese, and Norwegian.
      ◦ Notably, languages like Chinese and Hindi, spoken by over 2 billion people combined, do not even meet the 0.1% threshold.

  • Concentration of Training Data: The top 16 languages in GPT-3's training corpus constitute a staggering 99.24% of the data.
  • Limited Word Coverage: Only 65 languages have more than 1 million words in GPT-3's training corpus, with the 65th language being Khmer, spoken by 17 million people in Cambodia.

ChatGPT's bias towards English and select high-resource languages is not unique to OpenAI; it reflects a broader challenge in NLP where language representation is skewed by the availability of textual resources online. LLMs largely ignore the majority of the world’s 7,000 living languages.

For instance, the following languages, each with a significant speaker population, contribute less than 1% of the internet's textual content, making it challenging to gather sufficient data for training LLMs:

  1. Hindi: 602M speakers
  2. Arabic: 274M speakers
  3. Bengali: 273M speakers
  4. Urdu: 321M speakers

The discrepancy between language speakers and available textual data underscores a critical issue. While it's tempting to attribute blame, the imbalance between internet text and linguistic diversity poses a fundamental challenge for LLMs aiming to support a broader array of languages.

Here’s how I categorize the world’s languages with respect to their representation in ChatGPT:

How each language is represented by ChatGPT according to its available pre-training data

In conclusion, the under-representation of low-resource languages in ChatGPT and similar LLMs is a consequence of both intentional training data biases and the mismatch between internet content and language speaker populations.

Besides English, only approximately 10 high-resource languages are adequately supported by LLMs.

English is the Most Efficient “Programming” Language for LLMs

In the evolving landscape of language models like GPT-3.5-turbo and GPT-4-turbo as of May 2024, context lengths have expanded significantly, with GPT-4-turbo now supporting up to 128K tokens. Prompt engineering—a technique aimed at maximizing the effectiveness of prompts within these constraints—has become an art form.

GPT turbo models and their context window tokens as of May 2024

The question arises: which written language is the most efficient for prompting LLMs? When considering text compression and informativeness, the choice of language plays a pivotal role. An intriguing observation can be drawn from comparing different language versions of texts like the Bible, where translations often vary in length. For example, the Chinese version is typically much more concise compared to the English version.

Does this imply that one could utilize Chinese as a more space-efficient prompting language for ChatGPT? Furthermore, is there a language even denser than Chinese?

The answer involves a nuanced understanding of language structure and complexity. While certain languages like Chinese can convey information more succinctly due to their character-based nature, the efficiency of a language for LLM prompt engineering extends beyond mere brevity. English, despite its morphological complexity, remains a favored "programming" language for LLMs due to several key factors:

  • Vocabulary Dominance: LLMs like ChatGPT are predominantly trained on English text, equipping them with a robust English vocabulary and linguistic nuances.
  • Prompting Efficiency: English prompts often yield more effective responses from LLMs, given their extensive training on English-language data.
  • Cultural and Semantic Richness: English serves as a lingua franca in many domains, offering a broad spectrum of cultural references and semantic depth that enriches LLM interactions.

While languages like Chinese may exhibit a higher degree of textual compression, the broader linguistic context and training data favor English as the optimal "programming" language for LLMs like ChatGPT.

In the pursuit of efficient prompt engineering, leveraging the strengths of languages like English remains paramount, emphasizing the intricate interplay between linguistic structure, training data, and model performance.

Stay tuned as we delve deeper into the nuances of language efficiency and its implications for LLM development and application.

Writing Efficiency Ranked: from Chinese to English to Spanish to Japanese

In the quest to determine the world's most dense language—measured by the efficiency of character usage to convey meaning—various studies offer intriguing insights into linguistic density.

Character Efficiency

One study applied a similar idea, evaluating different languages’ translations of Google’s Privacy Policy. Here’s a sample of languages ranked by the total number of characters:

  1. Traditional Chinese: 101 chars
  2. Simplified Chinese: 124 chars
  3. Japanese: 215 chars
  4. English: 345 chars
  5. Spanish: 376 chars
  6. Vietnamese: 403 chars
  7. French: 417 chars
  8. Hindi: 500 chars

What’s the most efficient language? Visualization of space and characters taken for the same translation of Google’s Privacy Policy snippet.

These figures highlight the textual compactness of certain languages like Chinese, especially in its traditional form, compared to more verbose languages such as English and Hindi.
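If you want to run a rough comparison like this yourself, a few lines of Python are enough. Below is a minimal sketch; the sample sentences are my own illustrative translations of a single sentence, not the Privacy Policy excerpt used in the study above.

    # A minimal sketch: compare character counts of parallel translations.
    # The sentences below are illustrative translations, not the study's data.
    samples = {
        "English": "All cats like to sleep in warm places.",
        "Spanish": "A todos los gatos les gusta dormir en lugares cálidos.",
        "Simplified Chinese": "所有的猫都喜欢在温暖的地方睡觉。",
        "Japanese": "すべての猫は暖かい場所で寝るのが好きです。",
    }

    # Rank from most to least compact, mirroring the table above
    for lang, text in sorted(samples.items(), key=lambda kv: len(kv[1])):
        print(f"{lang:20s} {len(text):3d} characters")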

Speech Speed and Density

Another study measured speech rate, on the assumption that “less dense languages sound faster.” The finding was that Spanish and Japanese speakers talk fast, while Chinese and Vietnamese speakers take it slowly. The following is a ranking of language density:

  1. Vietnamese: 1
  2. Chinese: 0.94
  3. English: 0.91
  4. Spanish: 0.63
  5. Japanese: 0.49

Interestingly, languages like Spanish and Japanese are spoken rapidly, while Chinese and Vietnamese exhibit a slower speech pace.

Does this mean we should just write our prompts in Chinese? No, because:

ChatGPT’s Vocabulary is Dominantly English

English is the most efficient prompting language, at least for the GPT models and other English-dominant models. This is due to how OpenAI “tokenizes” each language. The general conclusions are:

  1. Native Support for English: English is considered a "first-class" language in ChatGPT, optimized for seamless integration and efficient prompt processing.
  2. Unicode Tokenization: Languages written in one of Unicode's 161 supported scripts are tokenized byte by byte, ensuring compatibility with ChatGPT's processing framework.
  3. Non-Unicode Language Limitations: Unfortunately, non-Unicode languages are not supported by ChatGPT, underscoring the model's current limitations in accommodating linguistic diversity.

Have you heard of the ChatGPT vocabulary? It contains 100,261 tokens, mostly derived from English, and it provides a fascinating glimpse into the model’s linguistic framework. Notable aspects of this vocabulary include:

  • Tokenized Elements:
      ◦ token 0 is the exclamation mark “!”
      ◦ tokens 32 to 57 are the capital letters A … Z
      ◦ token 67853 is the word piece “-ish”
      ◦ token 75459 is “battery”
      ◦ the word “GPT” is, unfortunately, not in the vocabulary
  • Variants and Synonyms: Various spellings of the second month: “ February” (token 7552), “ Feb” (13806), “February” (33877), “Feb” (41691), “feb” (78471), and “-Feb” (94871). Note that some tokens have a leading space.
  • English Dominance: The sheer share of English-centric tokens in the vocabulary underscores the model's prioritization of English language patterns and expressions.

A glimpse of the ChatGPT vocabulary

The ChatGPT vocabulary is so dedicated to English that it has 9 tokens just for “Twitter”! Other languages, unfortunately, don’t get their fair share of the 100K vocabulary. This shows how dominant English is, at least for the GPT models.

9 tokens representing “Twitter” in the ChatGPT vocabulary
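If you're curious, you can poke around the vocabulary yourself. The sketch below assumes OpenAI's open-source tiktoken library and its cl100k_base encoding (used by GPT-3.5/GPT-4); the token IDs are the ones quoted above and may differ across tokenizer versions.

    # A sketch of inspecting the ChatGPT (cl100k_base) vocabulary with tiktoken.
    # Token IDs below are the ones cited in this article.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # the GPT-3.5/GPT-4 encoding
    print("vocabulary size:", enc.n_vocab)       # roughly 100K tokens

    # Decode a few single-token IDs back into their text
    for token_id in [0, 32, 57, 67853, 75459]:
        print(token_id, repr(enc.decode([token_id])))

    # The many spellings of the second month (note the leading spaces)
    for token_id in [7552, 13806, 33877, 41691, 78471, 94871]:
        print(token_id, repr(enc.decode([token_id])))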

Writing Efficiency != Prompting Efficiency

The representation and recognition of languages within ChatGPT highlight intriguing disparities in token efficiency. For instance, the character 猫 (cat) in Chinese is encoded as three UTF-8 bytes (\xe7, \x8c, \xab) and therefore three tokens, whereas the English word "cat" constitutes a single token.

How a Unicode character is broken down into bytes and converted to ChatGPT tokens
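You can reproduce the byte breakdown with plain Python; UTF-8 bytes are what a byte-level tokenizer falls back to when a character has no dedicated token:

    # 猫 (U+732B) occupies three UTF-8 bytes, each becoming a byte-level token.
    print([hex(b) for b in "猫".encode("utf-8")])   # ['0xe7', '0x8c', '0xab']
    print("cat".encode("utf-8"))                    # b'cat': 3 bytes, but a single ChatGPT token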

This tokenization discrepancy underscores a crucial distinction between writing efficiency and prompting efficiency in ChatGPT. When confronted with token limits—such as the 16,385 token capacity of GPT-3.5-turbo—English emerges as a notably more efficient prompting language than Chinese or Korean.

  • Token Efficiency Comparison:
      ◦ English: “cat” = 1 token
      ◦ Chinese: 猫 (cat) = 3 tokens
      ◦ Korean: 고양이 (cat) = 4 tokens

In the narrow case of expressing “cat” to ChatGPT, English is 3x more efficient than Chinese and 4x more efficient than Korean.
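You can check these counts yourself with the tiktoken library. This is only a sketch: the exact numbers depend on the encoding version, and the 1/3/4 figures above refer to the ChatGPT tokenizer.

    # A sketch verifying the "cat" token counts with tiktoken.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for lang, word in [("English", "cat"), ("Chinese", "猫"), ("Korean", "고양이")]:
        tokens = enc.encode(word)
        print(f"{lang:8s} {word!r:12s} -> {len(tokens)} token(s): {tokens}")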

The underlying technical intricacies further emphasize this disparity. In UTF-8 encoding, characters range from 1 to 4 bytes, with most non-Latin characters occupying 2 to 3 bytes. Consequently, text in non-English scripts tends to average 2 to 3 tokens per character, diminishing prompting efficiency compared to English.

Considering the expanded context length of GPT-4-turbo, which supports up to 128,000 tokens, the disparity in language efficiency becomes even more pronounced. How many words are 128k tokens? You can write:

  • English: Approximately 96,000 words
  • Simplified Chinese: Approximately 54,000 characters
  • Korean: Approximately 41,000 characters

English is about 1.8x more efficient than Chinese and 2.3x more efficient than Korean in prompting ChatGPT.
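The multipliers follow directly from the capacity estimates above; here is the arithmetic spelled out (a sketch of the calculation, not a new measurement):

    # How the efficiency multipliers fall out of the 128K-token capacity estimates.
    capacity = {
        "English": 96_000,             # words per 128K tokens
        "Simplified Chinese": 54_000,  # characters per 128K tokens
        "Korean": 41_000,              # characters per 128K tokens
    }

    for lang, units in capacity.items():
        if lang == "English":
            continue
        print(f"English is {capacity['English'] / units:.1f}x more efficient than {lang}")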

In summary, English emerges as the most efficient language for prompting ChatGPT, boasting a prompting efficiency approximately 2 times greater than that of CJK (Chinese, Japanese, Korean) languages.

As we navigate the evolving landscape of language AI, understanding these efficiency dynamics is crucial for optimizing prompt engineering and maximizing the effectiveness of interactions with language models like ChatGPT.


A special word for Klingon and Javanese

The compatibility of languages with LLMs like ChatGPT hinges on their inclusion within Unicode, the standard character encoding system. Unicode enables LLMs to recognize and process words encoded in its supported scripts. However, the absence of a language in Unicode presents a significant barrier to LLM comprehension.

Here is a sample of languages whose scripts have historically lacked Unicode support, or gained it only recently, and are therefore poorly served by LLMs:

  • Tangsa - A language used by the Tangsa community in India and Myanmar.
  • Toto - Spoken by the Toto tribe in West Bengal, India.
  • Ainu - Used by the Ainu people of Japan, it has limited support with some characters in the Katakana block.
  • Pahawh Hmong - A script for writing the Hmong language, created in the mid-20th century.
  • Chakma - Used by the Chakma people in India and Bangladesh.
  • Kpelle - Used by the Kpelle people in Liberia and Guinea.
  • Vai - A syllabary used by the Vai people in Liberia.
  • Bassa Vah - A script used to write the Bassa language of Liberia.

Klingon

Klingon, a constructed language from the Star Trek universe, is notably absent from Unicode. Consequently, LLMs like ChatGPT cannot read or process Klingon scripts due to this lack of Unicode support.

Klingon script is not part of Unicode, thus not supported by LLMs

Javanese

Javanese, spoken by 68 million people on the island of Java in Indonesia, holds a unique historical connection with the programming language Java. Despite Java's pivotal role in advancing Unicode adoption within programming languages, Javanese itself was not officially supported in Unicode until 2009 (Unicode version 5.2). This delayed inclusion underscores the challenges faced by non-Western languages in gaining recognition within global standards like Unicode.
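A quick way to see whether a script made it into Unicode is to look up its characters in Python's built-in unicodedata tables. This is a sketch; U+A984 is assumed here to be a letter from the Javanese block, which was added in Unicode 5.2, while an unencoded script like Klingon's pIqaD simply has no code points to look up.

    # A sketch: check whether a script's characters exist in Unicode.
    import unicodedata

    javanese_letter = "\uA984"                  # assumed: a letter in the Javanese block (Unicode 5.2, 2009)
    print(unicodedata.name(javanese_letter))    # e.g. 'JAVANESE LETTER A'

    # Klingon pIqaD has no Unicode code points at all, so there is nothing to
    # look up here, and nothing for an LLM's byte-level tokenizer to read.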

The scenario highlights a broader issue: if a language is not part of Unicode, it remains inaccessible to LLMs and other language technologies, posing significant barriers to linguistic inclusivity and preservation.

As of Unicode version 15.1, which encompasses 161 scripts and nearly 150,000 characters, the discrepancy between supported scripts and the world's 7,000 languages is evident. Without adequate representation within Unicode, the prospects for understanding and preserving these languages in the digital age become increasingly challenging.

Looking ahead, ensuring the inclusion of diverse languages within Unicode and related standards is crucial for fostering linguistic diversity and enabling comprehensive language support within emerging technologies like LLMs.

As we navigate the intersection of language, technology, and cultural heritage, addressing these challenges will be pivotal in preserving linguistic richness for future generations. It will be far harder to understand these languages in the future if no speakers remain and no LLM was ever trained to make sense of them.

Even a US Senator Recognized the Imbalance of “en” vs. other languages

On 5/16/2023, US Senator Padilla expressed his concern at the Senate artificial intelligence hearing with OpenAI CEO Sam Altman (video at 1:49:38, transcript):

Sen. Alex Padilla (D-CA):

“Now, with language models becoming increasingly ubiquitous, I wanna make sure that there’s a focus on ensuring equitable treatment of diverse demographic groups. My understanding is that most research into evaluating and mitigating fairness harms has been concentrated on the English language, while non-English languages have received comparably little attention or investment. And we’ve seen this problem before. I’ll tell you why I raised this. Social media companies, for example, have not adequately invested in content moderation, tools and resources for their non-English in, in non-English language. And I share this not just out of concern for non-US based users, but so many US based users prefer a language other than English in their communication. So I’m deeply concerned about repeating social media’s failure in AI tools and applications. Question Mr. Altman and Ms. Montgomery, how are OpenAI and IBM ensuring language and cultural inclusivity that they’re in their large language models and it’s even an area focused in the development of your products”

(It was a little disappointing that Senator Padilla approached this from the standpoint of moderating non-English content, which is what led him to ask about ChatGPT’s support for other languages.)

Sam Altman:

We think this is really important. One example is that we worked with the government of Iceland which is a language with fewer speakers than many of the languages that are well represented on the internet to ensure that their language was included in our model. And we’ve had many similar conversations. And I look forward to many similar partnerships with lower resource languages to get them into our models. GPT-4 is unlike previous models of ours, which were good at English and not very good at other languages. Now pretty good at a large number of languages. You can go pretty far down the list ranked by number of speakers and, and still get good performance. But for these very small languages, we’re excited about custom partnerships to include that language into our model run. And the part of the question you asked about values and making sure that cultures are included, we’re equally focused on that.

(Did you hear the news about OpenAI opening a Japan office? Maybe that’s part of the custom partnerships.)

Takeaways

Reflecting on the exploration of language representation and efficiency within LLMs like ChatGPT, several key takeaways emerge:

1. English Dominance: English remains the most efficient language for prompting LLMs like ChatGPT due to its extensive token coverage within the model's vocabulary. This dominance underscores the practical advantages of leveraging English in prompt engineering.

2. Token Efficiency: The tokenization process in LLMs reveals significant disparities in efficiency across languages. While English prompts often require fewer tokens, languages like Chinese and Korean may necessitate multiple tokens for simple expressions, impacting overall prompt efficiency.

3. Unicode and Language Support: The reliance of LLMs on Unicode for language recognition highlights the critical importance of standardization in enabling linguistic inclusivity. Languages absent from Unicode, such as Klingon, face significant barriers in accessing LLMs.

4. Challenges in Linguistic Diversity: Despite Unicode's comprehensive coverage of scripts, a substantial gap remains between supported scripts and the world's diverse languages. The limited representation of languages within Unicode poses challenges for preserving and understanding linguistic diversity in digital contexts.

5. Future Prospects: As technologies like LLMs continue to evolve, addressing the imbalance in language representation and efficiency becomes paramount. Efforts to enhance Unicode's inclusivity and expand language support within LLM architectures will be essential for fostering linguistic equity and cultural preservation.

In summary, navigating the complexities of language efficiency and representation in LLMs reveals both challenges and opportunities for advancing linguistic diversity and inclusive language technologies. Addressing these issues requires concerted efforts to bridge gaps in standardization and promote comprehensive language support within the digital landscape.

Stay tuned for Blog 2, where I explore the implications of “US” versus under-invested countries in the context of language AI. Blog 2 will be The Superpower of “en-US”: “US” vs. the under-invested countries.
