New Benchmark Evaluates LLMs On 100 Languages

One major limitation of current LLMs is their limited capacity to understand and respond accurately to cultural and linguistic diversity. Although current LLMs perform well in widely spoken languages, they struggle with many others. To help improve the next generation of LLMs, scientists have developed the All Languages Matter Benchmark (ALM-bench), which tests the ability of LLMs to understand culturally diverse images paired with text. It is the largest and most comprehensive effort to date for evaluating LLMs across 100 languages. The project was a collaboration between scientists at Mohamed bin Zayed University of AI (MBZUAI), the University of Central Florida, Aalto University, the Australian National University, Linköping University, and Amazon. The preprint is available on arXiv.
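To make this concrete, the sketch below shows what a single image–question–answer sample in a benchmark like ALM-bench might look like. The field names and placeholder values are illustrative assumptions for exposition, not the dataset's actual schema.

```python
# Illustrative sketch of one multimodal benchmark sample.
# Field names are assumptions for exposition, not ALM-bench's released schema.
sample = {
    "language": "Amharic",                    # one of the 100 languages
    "country": "Ethiopia",                    # cultural context of the image
    "image_path": "images/amharic_0001.jpg",  # culturally grounded image (hypothetical path)
    "question": "<question text written in Amharic>",
    "question_type": "open_ended",            # also: multiple_choice, true_false
    "answer": "<ground-truth answer verified by a native speaker>",
}
```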

1) Data Annotation and Curation

  • ALM-bench was curated and verified by native-language speakers in 50 countries
  • Human annotators were given detailed instructions to ensure a high-quality dataset
  • Human annotators verified over 22k samples spanning 73 countries and 24 scripts
  • ALM-bench represents over 800 hours of human annotation

2) 16 LLMs

The authors evaluated the performance of the following 16 LLMs:

  1. GPT-4o
  2. Gemini-1.5-Pro
  3. GLM-4V-9B
  4. LLaVA-OneVision
  5. LLaVA-v1.6
  6. LLaVA-1.5-7B
  7. Qwen-VL
  8. Qwen2-VL
  9. MiniCPM-V-2
  10. Eagle-X5-13B-Chat
  11. PALO
  12. Pixtral-12B
  13. Phi3-Vision-128K
  14. mBLIP-mT0
  15. Molmo
  16. InternVL2
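
As a rough illustration of how one of the open-source models on this list can be queried with an image and a question, here is a minimal sketch using the public Hugging Face LLaVA-1.5-7B checkpoint. The checkpoint name and prompt format are assumptions about tooling; this is not the authors' actual evaluation harness.

```python
# Minimal sketch: querying LLaVA-1.5-7B on a single image + question.
# Uses the public Hugging Face checkpoint; NOT the paper's evaluation code.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("sample.jpg")  # a culturally grounded test image
prompt = "USER: <image>\nWhat cultural landmark is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```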

3) 100 Languages

The 100 languages with their associated countries, language scripts, families, subgroupings, and resource specifications.
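
To make those fields concrete, here is a hypothetical metadata record for one language. The field names are illustrative assumptions, while the values for Amharic (a low-resource language discussed in the results below) are well-established facts.

```python
from dataclasses import dataclass

# Hypothetical per-language metadata record mirroring the fields above.
@dataclass
class LanguageInfo:
    name: str
    country: str
    script: str
    family: str
    subgrouping: str
    resource_level: str  # e.g. "high", "medium", or "low"

# Amharic: a low-resource, Ge'ez-script language featured in the results below.
amharic = LanguageInfo(
    name="Amharic",
    country="Ethiopia",
    script="Ge'ez",
    family="Afro-Asiatic",
    subgrouping="Semitic",
    resource_level="low",
)
```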

4) Types of Questions

ALM-bench pairs each image with several question formats, including multiple-choice, true/false, and open-ended visual questions, so models are tested on both constrained and free-form answers (a scoring sketch follows below).
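
A minimal scoring sketch for these formats: exact match suits multiple-choice and true/false answers, while open-ended answers need a softer comparison. The `judge_score` helper is a hypothetical stand-in for a real semantic or LLM-based judge, not ALM-bench's actual scoring code.

```python
# Sketch of per-format scoring. `judge_score` is a hypothetical stand-in
# for a semantic or LLM-based judge; this is not ALM-bench's scoring code.
def judge_score(prediction: str, reference: str) -> float:
    # Placeholder: token-overlap proxy standing in for a real judge model.
    pred_tokens, ref_tokens = set(prediction.split()), set(reference.split())
    return len(pred_tokens & ref_tokens) / len(ref_tokens) if ref_tokens else 0.0

def score_answer(question_type: str, prediction: str, reference: str) -> float:
    pred, ref = prediction.strip().lower(), reference.strip().lower()
    if question_type in ("multiple_choice", "true_false"):
        return 1.0 if pred == ref else 0.0  # constrained formats: exact match
    return judge_score(pred, ref)           # open-ended: softer comparison
```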

5) Performance

  • The closed-source proprietary models GPT-4o and Gemini-1.5-Pro performed better across the 100 languages than the open-source models.
  • GPT-4o was the best-performing closed-source model (80% accuracy)
  • GPT-4o achieved 84% on Education and Heritage questions but dropped to 73% on Notable Key Figures
  • GLM-4V-9B was the best-performing open-source model (52% accuracy)
  • The gap between the best-performing open-source model, GLM-4V-9B, and the best proprietary model, GPT-4o, was 27 percentage points.
  • The performance of GPT-4o dropped from 88% for English to 51% for Amharic.
  • The performance of GLM-4V-9B dropped from 80% for English to 16% for Amharic.
  • LLMs performed better on predominant language scripts such as Latin, Cyrillic, and Devanagari than on underrepresented scripts such as Ge'ez, Lao, Sinhalese, and Oriya.
  • LLMs demonstrated a better cultural understanding of prominent language families such as Indo-European, Austronesian, and Afro-Asiatic than of Atlantic-Congo and Turkic languages.
  • LLMs performed better on multiple-choice and true/false questions than on open-ended questions.
  • All LLMs performed worse when prompted with textual questions alone and showed a significant gain when images were included (a sketch of this kind of per-group accuracy aggregation follows this list).
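
The per-language and per-category numbers above are simple grouped accuracies. Here is a minimal sketch of that aggregation over per-sample results; the record layout is an assumption for illustration, and the toy scores are made-up placeholders, not the paper's data.

```python
# Sketch: aggregating per-sample scores by language, script, or question type.
# The record layout and the toy scores are illustrative, not the paper's data.
from collections import defaultdict

def accuracy_by(results: list[dict], key: str) -> dict[str, float]:
    """Mean score per group, e.g. key="language" or key="question_type"."""
    groups: dict[str, list[float]] = defaultdict(list)
    for record in results:
        groups[record[key]].append(record["score"])
    return {name: sum(scores) / len(scores) for name, scores in groups.items()}

toy_results = [
    {"language": "English", "question_type": "multiple_choice", "score": 1.0},
    {"language": "Amharic", "question_type": "multiple_choice", "score": 1.0},
    {"language": "Amharic", "question_type": "open_ended", "score": 0.0},
]
print(accuracy_by(toy_results, "language"))       # {'English': 1.0, 'Amharic': 0.5}
print(accuracy_by(toy_results, "question_type"))  # {'multiple_choice': 1.0, 'open_ended': 0.0}
```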

Performance of the top-performing LLMs across different question types.
Performance comparison of LLMs; higher accuracy is shown with higher color intensity.

6) Successes and Failures

Examples of successes and failures of GPT-4o. Success cases are shown in the first row and failure cases in the second row.

References

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

Author Affiliations: University of Central Florida, Mohamed bin Zayed University of AI, Amazon, Aalto University, Australian National University, Linköping University

Authors: Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan.

Subscribe, Comment, Join Group

I'm interested in your feedback - please leave your comments.

To subscribe to the AI in Healthcare Milestones newsletter click here.

To join the AI in Healthcare Milestones Group click here.

Copyright © 2024 Margaretta Colangelo. All Rights Reserved.

This article was written by Margaretta Colangelo. Margaretta is a leading AI analyst who tracks significant milestones in AI in healthcare. She consults with AI healthcare companies and writes about some of the companies she consults with. Margaretta serves on the advisory board of the AI Precision Health Institute at the University of Hawaiʻi Cancer Center @realmargaretta
