Study Shows That ChatGPT Can Identify 100 Languages Almost Perfectly

There are about 7,000 languages spoken worldwide. Researchers at the University of British Columbia and Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) found that ChatGPT can identify 100 languages almost perfectly, but cannot identify 384 languages at all and performs very poorly on African languages. Wei-Rui Chen, Muhammad Abdul-Mageed, and their colleagues presented their findings at the annual conference of the North American Chapter of the Association for Computational Linguistics. Their paper, "Fumbling in Babel: An Investigation into ChatGPT’s Language Identification Ability," is available on arXiv.

Although ChatGPT has demonstrated strong capabilities in many languages, including English, Afrikaans, Arabic, Indonesian, Italian, and Mandarin Chinese, it is unclear which languages ChatGPT actually ‘knows’. Language identification is a fundamental NLP task that plays a critical role in ensuring accurate processing of multilingual data by identifying the language of text or speech. Social media provides researchers with a massive amount of multilingual text. The objective of this study was to evaluate ChatGPT's performance in language identification and provide insights into its strengths and limitations.

In this study, ChatGPT was able to identify these languages almost perfectly.

Data Collection

For this study, researchers used 3 datasets to curate Babel-670. The 3 datasets cover a total of 670 languages from 24 language families written in 30 different scripts.

  1. AmericasNLP2022: this dataset includes five low-resource South American Indigenous languages.
  2. AfroLID: this manually curated dataset covers 517 African languages and language varieties. This dataset is multi-domain and multi-script.
  3. FLORES-200: this dataset is specifically designed for addressing low-resource machine translation covering over 200 languages.

Study Highlights

  • Researchers compared performance between GPT-3.5, GPT-4, and other language identification tools.
  • GPT-4 consistently exhibited stronger performance than GPT-3.5 across all settings.
  • In easy and medium difficulty levels, GPT-4 doubles the performance of GPT-3.5.
  • In the hard level, GPT-4 outperforms GPT-3.5 by smaller margins.
  • The narrow performance gap is thought to be due to GPT-4’s slightly broader range of supported languages compared to GPT-3.5.
  • If the number of supported languages were to increase significantly, the researchers expect a larger performance gap in the hard level.

The researchers conclude that large language models would benefit from further development before they can serve diverse communities.

Languages That ChatGPT Can Identify

Languages organized by F1 scores. Languages with >90% F1 shown with beige bar.

  • The study found that ChatGPT is able to identify 100 languages nearly perfectly (>90% F1), but cannot identify 384 languages at all (0% F1).
  • These languages have an F1 score above 90% (beige): Afrikaans, Albanian, Arabic, Armenian, Asturian, Aymara, Azerbaijani, Bashkir, Basque, Belarusian, Binisaya, Bulgarian, Burmese, Catalan, Chinese, Coptic, Creole, Czech, Danish, Dutch, Esperanto, Estonian, Faroese, Fijian, Finnish, Friulian, Gaelic, Galician, Georgian, German, Greek, Guarani, Gujarati, Hebrew, Hima, Hungarian, Icelandic, Ilocano, Indonesian, Italian, Japanese, Javanese, Jingpho, Kannada, Kazakh, Khmer, Korean, Kreol, Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malayalam, Maltese, Maori, Marathi, Mizo, Mongolian, Nepali, Occitan, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Sanskrit, Santali, Sardinian, Serbo-Croatian.
  • English has an F1 score of 77% (turquoise). The researchers found that while all English examples in the test data are correctly labeled, numerous examples of other languages are incorrectly classified as English, including English-based creoles like Nigerian Pidgin and Cameroonian Pidgin, as well as languages like Somali, Swahili, Harari, and Kinyarwanda, which feature some code-mixing in their data.
  • French has an F1 score of 56% (purple). The researchers found that all French examples are correctly labeled as French, but several other languages are mistakenly classified as French. These misclassified languages are those spoken in Francophone Africa, which exhibit some degree of code-mixing with French.
  • 384 languages have 0% F1 score. Languages with lower F1 scores typically have less representation in the digital domain. The limited online presence affects not only the size of the training data but also the quality of tokenization, as LLMs predominantly utilize subword tokenization techniques. The scarcity of high-quality data further compounds these issues. Collectively, these factors can significantly restrict the performance of LLMs on low-resource languages.
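
The English and French results show how a language can have every one of its own examples labeled correctly (perfect recall) and still score well below 100% F1, because examples from other languages are wrongly assigned to it (low precision). The sketch below computes per-language precision, recall, and F1 from a list of gold labels and predictions. The function name `per_language_f1` and the toy data are invented for illustration; they are not the paper's code or its actual counts.

```python
from collections import Counter


def per_language_f1(gold, predicted):
    """Return {language: (precision, recall, f1)} for single-label predictions."""
    gold_count = Counter(gold)        # how many examples truly belong to each language
    pred_count = Counter(predicted)   # how many examples were assigned to each language
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)

    scores = {}
    for lang in set(gold) | set(predicted):
        tp = true_pos[lang]
        precision = tp / pred_count[lang] if pred_count[lang] else 0.0
        recall = tp / gold_count[lang] if gold_count[lang] else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[lang] = (precision, recall, f1)
    return scores


# Toy data (invented): every English example is labeled correctly,
# but every Nigerian Pidgin example is also labeled "English".
gold = ["English"] * 5 + ["Nigerian Pidgin"] * 3
predicted = ["English"] * 8

scores = per_language_f1(gold, predicted)
for lang, (p, r, f1) in sorted(scores.items()):
    print(f"{lang}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case English has a recall of 1.0 but a precision of 5/8, so its F1 falls to roughly 0.77, while Nigerian Pidgin, which is never predicted, scores 0.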

A Geographical Analysis

A geographical analysis of languages spoken in each region. The darker the color, the stronger the abilities of ChatGPT at identifying languages spoken in that region.

  • Researchers conducted an analysis from a geographical perspective and visualized the model performance with a choropleth map.
  • Africa demonstrates the lightest colors.
  • This highlights ChatGPT’s limited support for African languages.
  • This underscores the importance of including languages with fewer digital resources and less representation.
  • It also indicates that ChatGPT has not reached the state of serving diverse communities.
  • Arab countries have different colors, from dark green in Gulf countries to light green in Libya and Egypt. The darker color in the Gulf region is due to the included languages being mainly Arabic, which has a very high F1 score. In contrast, for Libya and Egypt, Arabic as well as Nobiin, Kenzi, and Tamahaq are included. Since ChatGPT is less capable of identifying these languages, the average F1 score is lower and therefore the color is lighter. For a more detailed discussion of design decisions and the algorithm, please see Appendix C of the paper, and Appendix D for a comprehensive list of the dialects of Arabic included in the study.
  • The researchers identified several cases of very low-resource languages that achieved unexpectedly high F1 scores. Languages like Gaelic, Guarani, Jingpho, and Kurdish fall into this category. It is plausible that the data used in the test set may have been included in the training data for the GPT models, resulting in these high F1 scores.
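
The averaging effect described for the Arab countries can be illustrated with a few lines of code. This is a minimal sketch that assumes a plain unweighted mean of per-language F1 scores for each country; the algorithm actually used for the map is described in Appendix C of the paper and may differ. The country names, the function `country_average_f1`, and all F1 values are invented placeholders.

```python
from statistics import mean


def country_average_f1(languages_by_country, f1_by_language):
    """Average the F1 scores of the languages assigned to each country."""
    return {
        country: mean(f1_by_language[lang] for lang in languages)
        for country, languages in languages_by_country.items()
    }


# Invented F1 values, for illustration only.
f1_by_language = {"Arabic": 0.95, "Nobiin": 0.0, "Kenzi": 0.0, "Tamahaq": 0.0}

# Hypothetical assignment of languages to two placeholder countries.
languages_by_country = {
    "Country A": ["Arabic"],
    "Country B": ["Arabic", "Nobiin", "Kenzi", "Tamahaq"],
}

averages = country_average_f1(languages_by_country, f1_by_language)
for country, score in averages.items():
    print(f"{country}: mean F1 = {score:.4f}")
```

Under these assumptions, a country whose only included language is Arabic keeps the high Arabic score, while a country that also includes poorly identified languages is pulled toward a much lower average and would be shaded lighter on the map.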

Test Examples: French, Spanish, Southwestern Dinka

An overview of different experimental settings with exemplified predictions and test examples in French, Spanish, and Southwestern Dinka.

  • To explore ChatGPT’s ability to identify languages, researchers designed two types of prompts: language name prompt (LNP) and language code prompt (LCP).
  • Each prompt has 3 difficulty levels.
  • As the overview of the data pipeline shows, LNP asks ChatGPT to predict language names, while LCP asks it to produce language codes.
  • Although most language identification research uses language codes as labels, in this study, researchers decided to also prompt ChatGPT to predict language names.

Researchers observed that ChatGPT predicted language names better than language codes given the same set of test examples.
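
To make the difference between the two prompt types concrete, here is a minimal sketch of how such prompts might be assembled. The prompt wording, the `PROMPT_TEMPLATES` dictionary, and the `build_prompt` function are hypothetical; the paper's exact prompts are not reproduced here, and the three difficulty levels are not modeled.

```python
# Hypothetical prompt wording, written for illustration only.
PROMPT_TEMPLATES = {
    # Language name prompt (LNP): the model answers with a language name.
    "LNP": (
        "Identify the language of the following text. "
        "Answer with the language name only.\n\nText: {text}"
    ),
    # Language code prompt (LCP): the model answers with a language code.
    "LCP": (
        "Identify the language of the following text. "
        "Answer with the language code only.\n\nText: {text}"
    ),
}


def build_prompt(text: str, prompt_type: str) -> str:
    """Fill the chosen template ("LNP" or "LCP") with the test example."""
    if prompt_type not in PROMPT_TEMPLATES:
        raise ValueError(f"Unknown prompt type: {prompt_type!r}")
    return PROMPT_TEMPLATES[prompt_type].format(text=text)


example = "Bonjour tout le monde"
lnp = build_prompt(example, "LNP")
lcp = build_prompt(example, "LCP")
print(lnp)
print(lcp)
```

The same test example is sent with both templates, so any difference in accuracy comes from the kind of label requested, not from the text itself.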

Authors

Wei-Rui Chen is a PhD student focusing on NLP at University of British Columbia. His research interests are the applications and evaluations of large language models, focusing on multilingual abilities.

Muhammad Abdul-Mageed is Canada Research Chair in Natural Language Processing and Machine Learning and Director of the University of British Columbia Deep Learning & NLP Group.

Ife Adebara is a researcher and teaching assistant at University of British Columbia. She is using artificial intelligence to build technology that understands African languages and has covered 517 African languages so far.

Khai Doan is an AI specialist pursuing a master's in Natural Language Processing at MBZUAI.

Qisheng Liao is a master's student in Natural Language Processing at MBZUAI.

Wei-Rui Chen, Ife Adebara, Khai Doan, Qisheng Liao, Muhammad Abdul-Mageed, Deep Learning & Natural Language Processing Group, The University of British Columbia, Department of Natural Language Processing & Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Invertible AI.

Copyright © 2024 Margaretta Colangelo. All Rights Reserved.

This article was written by Margaretta Colangelo. Margaretta is a leading AI analyst who tracks significant milestones in AI in healthcare. She consults with AI healthcare companies and writes about some of the companies she consults with. Margaretta serves on the advisory board of the AI Precision Health Institute at the University of Hawaiʻi Cancer Center. @realmargaretta
