Issue 14: The power of Multi-lingual LLMs
Image created by DALL-E: note all the representative languages.


Introduction

I had not planned to publish today, but I enjoy deep-diving into current developments, and I ended up with enough material for a new edition. I woke up to press coverage of the Microsoft and Sarvam partnership (Microsoft Backs Sarvam AI to Scale Indic LLMs) and wanted to learn more about multi-lingual LLMs.

We will approach this topic with two questions in mind:

  • Why do we need Multi-lingual LLMs?
  • What is the state of innovation?

Why do we need Multi-lingual LLMs

Let us approach this from a business value perspective first, then the technical considerations.

Business Value:

As we saw in my previous edition, there are a large number of LLMs in the market. There are also many powerful and popular applications, most of which offer human-like conversational interfaces, but they are far less fluent in non-English languages. As a result, application development and the impact on global commerce, search, and innovation will be biased in favor of English, creating a wider gap over time. Usage, however, cannot be expected to scale proportionately, since less than 25% of the global population speaks and does business in English.

Here are a few use cases and applications of top multilingual LLMs:

  1. Machine Translation: Machine translation is roughly a $1B market, and multilingual LLMs enable translation of text from one language to another. The opportunity expands beyond machine translation itself into other markets that depend on translation, such as voice assistants, content creation, testing, and communications across linguistic boundaries.
  2. Cross-lingual Information Retrieval: This is especially powerful - multilingual models can take a query in one language, search and retrieve information in another, and return the response in the original input language, making it possible to access knowledge and data beyond the query's own language (a short retrieval sketch follows this list).
  3. Sentiment Analysis: Businesses can use multilingual LLMs to understand customer sentiment, reviews, and feedback across different languages and improve global customer experiences.
  4. Content Moderation: Multi-lingual LLMs make it possible to moderate content on social media platforms and forums, detecting and filtering inappropriate or harmful content in many languages to maintain safe online environments.
  5. Question Answering Systems, language learning apps, legal and medical document translation tools, and content summarization tools all need multi-lingual LLMs to be available in multiple languages.
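To make the cross-lingual retrieval idea concrete, here is a minimal sketch assuming the sentence-transformers library and its multilingual embedding model "paraphrase-multilingual-MiniLM-L12-v2" (both are my example choices, not part of any product named above): the documents and the query are embedded into one shared vector space, so a Hindi query can rank documents written in other languages.

```python
# A minimal sketch of cross-lingual retrieval (use case 2), assuming the
# sentence-transformers library and its multilingual embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Documents in several languages; the query is in Hindi.
documents = [
    "The shipment was delayed because of a customs inspection.",   # English
    "El envío se retrasó por una inspección de aduanas.",          # Spanish
    "La livraison est arrivée à l'heure et en parfait état.",      # French
]
query = "शिपमेंट में देरी क्यों हुई?"  # "Why was the shipment delayed?" (Hindi)

# Embed everything into one shared multilingual vector space.
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity ranks documents regardless of the language they are written in.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(f"Best match (score {scores[best].item():.2f}): {documents[best]}")
```

The same pattern scales to a real search application by swapping the in-memory list for a vector database.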

Top 10 popular languages

This made me wonder: what is the size of this market? How many consumers or enterprises need to worry about having support in multiple languages? A quick search will tell you there are over 7,000 languages in the world: India has 780 languages, China about 300, and even Papua New Guinea has 480 (which surprised me). But clearly we do not need, and will never have, applications (for translation services, educational content, customer service, and the other use cases listed above) in all of these languages.

The logical next step was to look at the top 10 most spoken languages and see where we are on LLMs in those languages. Though the actual number of speakers varies depending on the source and estimation methodology, here is a fairly representative top 10 list of the most spoken languages:

  1. English: Approximately 1.5 billion
  2. Mandarin Chinese: About 1.1 billion
  3. Hindi: Roughly 600 million
  4. Spanish: Around 550 million
  5. French: Estimated 300 million
  6. Arabic: Approximately 274 million
  7. Bengali (Bangla): About 273 million
  8. Portuguese: Roughly 258 million
  9. Russian: Around 258 million
  10. Urdu: Estimated 230 million


Technical considerations

As we know, LLMs are trained on a massive dataset of text from various sources: books, articles, websites, social media, and so on. Let us focus on text for now, as we are talking about multi-lingual LLMs. The LLM learns the statistical relationships between words, phrases, and sentences in the dataset, which allows it to generate text similar to the text it was trained on. Naturally, the training dataset is the key factor in the effectiveness, accuracy, and bias of the output.

In a highly simplified representation, this is the workflow of training LLMs (I cover more details in the newsletter edition dedicated to LLMs; stay tuned). The raw dataset is at the top left, and the deployment-ready LLM is at the bottom right.

As you can imagine, the volume of training data available in English is far superior to that of any other language. This is known as the ‘resourcedness’ gap.

As a reference, GPT-3.5 had the following distribution of source data:

Per the above, roughly 80% of the training data comes from a general filtered web crawl and targeted website data. It is easy to imagine that these pages are predominantly in English and hosted in the US, and we have a study that shows the real data. A research paper from Cornell says: "51.3% of pages are hosted in the United States. The countries with the estimated 2nd, 3rd, and 4th largest English-speaking populations—India, Pakistan, Nigeria, and The Philippines—have only 3.4%, 0.06%, 0.03%, 0.1% the URLs of the United States, despite having many tens of millions of English speakers."

GPT-3.5 is also reported to have been trained on up to 10% non-English data. It is easy to guess that even that portion skews toward European languages and cultures rather than Asian or Latin American ones. The result:

  • ChatGPT lacks the ability to understand and generate sentences in low-resource languages.
  • ChatGPT is weak at translating sentences in non-Latin-script languages, even when those languages are considered high-resource.
  • ChatGPT performs better with English prompts, even when the task targets other languages.
  • ChatGPT performs substantially worse at answering factual questions or summarizing complex text in non-English languages.
  • ChatGPT's performance on complex reasoning is generally better in English, as it depends on embedding and tokenization.
  • ChatGPT performs better on tasks going from Language X to English than from English to Language X - again because the corpus of training data in Language X is smaller.
  • The asymmetry may be accentuated in specialized tasks or vertical applications such as Healthcare and Education, because research content, higher-education content, and scientific content are all predominantly in English.
  • Last but not least, the dependence on a corpus of English-language text will amplify cultural bias in favor of English and English-speaking cultures.

Here is a report from June 2023 that shows the dataset distribution.

Source: CSA Research


What needs to happen:

Let us again approach this from a Business Value and Technical perspective.

Business Value:

We need acute awareness that the "resourcedness gap" can produce responses that are statistically plausible yet incorrect, and can allow bias to persist and grow. This results in ineffective applications, suboptimal business outcomes, poor customer experiences, and eventually a negative impact on growth.

It is also clear that, to address the English bias in the data, we need to focus on acquiring large amounts of non-English training data to add to the core training data and on creating more language-specific LLMs. A large corpus of non-English and non-European text exists, but it is not digitized or accessible. This requires alignment of intent and resources at a higher level within those countries - government agencies, non-profit organizations, academic and research institutions, testing organizations, and public libraries, to name a few.

While such efforts are already under way, the attention on Sarvam AI this week, as well as its $700M raise in 3 years, shows that there is tremendous work yet to be done.


Technical Considerations:

Beyond the training datasets, changes are needed in the foundation model as well. Multi-lingual LLMs work by leveraging shared linguistic features across languages and advanced NLP techniques: when they are trained on vast datasets in numerous languages, they capture the nuances, grammar, and vocabulary of each language included in the training set. The underlying mechanisms and techniques that enable these models to function across languages include:

1. Shared Subword Tokenization:

  • Multi-lingual models often use a technique called subword tokenization (e.g., Byte-Pair Encoding or SentencePiece) that breaks words down into smaller units (subwords or tokens) that are common across many languages. This approach helps in handling vocabulary overlap and sharing representations between languages, thereby facilitating cross-lingual transfer learning.
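As a quick illustration, here is a minimal sketch of subword tokenization using Hugging Face transformers and the multilingual "xlm-roberta-base" checkpoint (my choice of model for the example; any multilingual SentencePiece or BPE tokenizer behaves similarly):

```python
# A minimal sketch of shared subword tokenization, assuming Hugging Face
# transformers and the multilingual "xlm-roberta-base" checkpoint, whose
# SentencePiece vocabulary is shared across roughly 100 languages.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

for text in ["internationalization", "internationalisation", "अंतरराष्ट्रीयकरण"]:
    pieces = tokenizer.tokenize(text)
    print(text, "->", pieces)

# Rare words in any language break into smaller pieces drawn from one shared
# vocabulary, so related words in different languages can reuse subwords and
# their learned representations - the basis of cross-lingual transfer.
```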

2. Large-Scale Multilingual Corpora:

  • These models are trained on large-scale multilingual text corpora that include a wide range of languages. This training approach enables the model to learn from the context and semantics of multiple languages simultaneously. The diversity of the training data helps the model to understand and generate text in languages not explicitly trained on, a phenomenon known as zero-shot learning.

3. Cross-Lingual Transfer Learning:

  • Multi-lingual LLMs leverage cross-lingual transfer learning, where knowledge learned from one language is applied to understand or generate text in another language. This is particularly effective for languages with limited training data, as the model can transfer knowledge from high-resource languages to low-resource ones.
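Below is an illustrative, toy-scale sketch of cross-lingual transfer, assuming Hugging Face transformers and PyTorch: a classifier head on a multilingual encoder ("xlm-roberta-base", my example choice) is tuned only on a couple of English examples and then applied to Spanish text. The dataset is obviously far too small to train anything real; the point is the shape of the workflow, in which no Spanish example is ever seen during fine-tuning.

```python
# An illustrative (toy-scale) sketch of cross-lingual transfer learning,
# assuming Hugging Face transformers and PyTorch.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # 0 = negative, 1 = positive
)

# Hypothetical English-only training data (far too small to train for real).
train_texts = ["I loved this product.", "Terrible experience, avoid it."]
train_labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few illustrative optimization steps
    batch = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
    loss = model(**batch, labels=train_labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Zero-shot application to Spanish: no Spanish example was seen above.
model.eval()
spanish = tokenizer("El servicio fue excelente.", return_tensors="pt")
with torch.no_grad():
    prediction = model(**spanish).logits.argmax(dim=-1).item()
print("Predicted label for the Spanish review:", prediction)
```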

4. Language-Agnostic Model Architecture:

  • The architecture of multi-lingual models is designed to be language-agnostic, meaning it does not favor one language over another. Layers and parameters within the model are shared across all languages, which allows the model to generalize across languages and perform tasks in a language-independent manner.

5. Attention Mechanisms:

  • Multi-lingual LLMs, especially those based on the Transformer architecture, use attention mechanisms to weigh the importance of different words in a sentence. This helps the model to better understand context and relationships between words, regardless of the language.
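For readers who want to see the mechanism itself, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the Transformer; production models add learned projection matrices, multiple heads, and masking on top of this core operation.

```python
# A minimal sketch of scaled dot-product attention, written with NumPy only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (sequence_length, d_k) matrices of queries, keys, values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                          # weighted mix of value vectors

# Three toy token vectors; the attention weights decide how much each token
# "looks at" the others, independent of which language the tokens came from.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(tokens, tokens, tokens))
```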

6. Fine-Tuning and Contextual Embeddings:

  • After pre-training on multilingual data, these models can be fine-tuned on specific tasks (like translation, question-answering, etc.) in specific languages or language pairs. During fine-tuning, the model adjusts its parameters to better perform the task at hand, using contextual embeddings to capture the meaning of words in their specific linguistic context.
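Here is a small sketch of what "contextual" means in practice, assuming Hugging Face transformers, PyTorch, and the "xlm-roberta-base" checkpoint (example choices, not a claim about any specific product above): the same surface word receives different vectors depending on the sentence it appears in.

```python
# A minimal sketch of contextual embeddings: the word "bank" gets different
# vectors in different contexts. Assumes transformers, PyTorch, xlm-roberta-base.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

sentences = [
    "She sat by the river bank to read.",
    "He opened a savings account at the bank.",
]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # shape: (2, seq_len, hidden_size)

# Locate the token for "bank" in each sentence and compare its two vectors.
vectors = []
for i, sentence in enumerate(sentences):
    tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][i].tolist())
    position = next(j for j, t in enumerate(tokens) if "bank" in t.lower())
    vectors.append(hidden[i, position])

similarity = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"Cosine similarity of the two 'bank' embeddings: {similarity.item():.2f}")
```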

7. Zero-Shot and Few-Shot Learning Capabilities:

  • Multi-lingual models can often perform tasks in languages they were not explicitly trained on (zero-shot learning) or with very few examples in the target language (few-shot learning). This is possible because of the shared structure and parameters across languages, allowing the model to apply knowledge from one language to another.
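A minimal sketch of zero-shot behavior, assuming the transformers pipeline API and an XNLI-fine-tuned multilingual checkpoint such as "joeddav/xlm-roberta-large-xnli" (an example checkpoint chosen for illustration): a Bengali sentence is classified against English candidate labels the model was never explicitly trained to predict.

```python
# A minimal sketch of zero-shot classification across languages, assuming the
# transformers pipeline API and an XNLI-fine-tuned multilingual checkpoint.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification", model="joeddav/xlm-roberta-large-xnli"
)

# A Bengali sentence classified against English candidate labels.
text = "খেলাটি শেষ মুহূর্তে খুবই উত্তেজনাপূর্ণ ছিল।"  # "The match was very exciting at the end."
labels = ["sports", "politics", "technology"]

result = classifier(text, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # highest-scoring label first
```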

Current state of multi-lingual LLMs:

It is safe to say there is substantial awareness of the need for non-English training data and for fine-tuning non-English LLMs. However, aside from finding massive amounts of non-English text and re-training the base generative AI model from scratch, researchers are recommending a few other approaches:

  • Creating new datasets of non-English text by generating synthetic data.
  • Being intentional about which datasets go into which model training - optimized for specific use cases rather than feeding the model everything - and understanding the incremental impact on results.
  • Applying better controls, thresholds, and guardrails on the training data so that it is better balanced linguistically, which could improve performance and reduce bias (a small sketch of auditing the language mix follows this list).
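As a small illustration of the last point, here is a hedged sketch of auditing the language mix of a corpus before it goes into training, assuming the langdetect package (production pipelines typically use stronger language identifiers, such as fastText's language-ID models):

```python
# A minimal sketch of auditing the language balance of a training corpus,
# assuming the langdetect package is installed.
from collections import Counter
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect deterministic

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "El rápido zorro marrón salta sobre el perro perezoso.",
    "मशीन अनुवाद से भाषाओं के बीच की दूरी कम होती है।",
    "Les grands modèles de langue apprennent à partir de textes.",
]

counts = Counter(detect(document) for document in corpus)
total = sum(counts.values())
for language, count in counts.most_common():
    print(f"{language}: {count / total:.0%}")
# A heavily skewed distribution here is a warning sign for downstream bias.
```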

Below I share a list of notable multi-lingual LLMs, as well as LLMs optimized for each of the top 10 languages (the top 9, leaving out English). The popular generic LLMs like BERT or GPT are available in multiple languages, but there are also very localized ones, like OpenHathi from Sarvam AI. This is a rapidly evolving field, and new models are continually being developed to improve multi-lingual understanding and text generation in the most human-like interface.

Most notable multi-lingual LLMs:

  1. Google's Multilingual BERT (mBERT) Description: Part of Google's BERT models, mBERT is pre-trained on a large corpus covering 104 languages. It's particularly effective for tasks like named entity recognition and question-answering across multiple languages. Users: Researchers, developers building cross-lingual NLP applications, and companies requiring multi-language support in their products.
  2. XLM and XLM-R (Cross-lingual Language Model) by Facebook Description: XLM and its successor XLM-R are designed for cross-lingual understanding and are trained on 100 languages. XLM-R, in particular, has shown state-of-the-art performance on many cross-lingual benchmarks. Users: Academic researchers, developers in multilingual contexts, and businesses needing sophisticated language understanding across various languages.
  3. OpenAI's GPT-3 and GPT-4 Description: Although primarily trained in English, GPT-3 and GPT-4 have shown significant capabilities in understanding and generating text in multiple languages due to their vast training datasets. Users: A wide array of users including developers, creative professionals, researchers, and businesses leveraging AI for content creation, translation, and more in multiple languages.
  4. Hugging Face's multilingual models Description: Hugging Face offers a range of multilingual models, including variations of BERT, GPT, and others that are adapted for multi-language support. Users: A broad spectrum of users from academic to commercial sectors, utilizing these models for diverse applications like translation, sentiment analysis, and more across different languages.
  5. Baidu's ERNIE Description: ERNIE (Enhanced Representation through kNowledge Integration) is a multi-lingual model from Baidu that incorporates knowledge graphs into its pre-training for improved language understanding. Users: Primarily users and developers focused on the Chinese language and multi-lingual applications involving Chinese, including educational technology and content platforms.
  6. LASER (Language-Agnostic SEntence Representations) by Facebook Description: LASER is a toolkit for calculating sentence embeddings for 93 languages, facilitating tasks like text classification, translation, and information retrieval across languages. Users: Researchers and developers in the field of information retrieval and text analysis who require efficient cross-lingual capabilities.

Before listing the rest, I want to add that I prefer tables, but the LinkedIn newsletter format makes it incredibly difficult to add them - I have to save tables as pictures and then add them - so I am going with text format this time. There is no creative writing from me here; it is purely what is available online, triangulated, aggregated, and organized by me.

Chinese:

Listing all Chinese large language models would be an extensive task, given the rapid development and release of new models within China. Here are some of the most notable:

  1. BERT-Base, Chinese: A Chinese version of the BERT model, which was one of the first widely-used transformer-based language models.
  2. ERNIE (Baidu): Baidu's ERNIE (Enhanced Representation through kNowledge Integration) is a series of language representation models that have been designed to capture rich semantics from texts by incorporating knowledge graphs into pre-training.
  3. WoBERT: A Chinese pre-trained model based on BERT but optimized for Chinese character representations.
  4. CDial-GPT: A GPT-style model trained on large Mandarin Chinese dialogue datasets, commonly used for conversational AI and natural language understanding tasks in Chinese.
  5. CPM (Chinese Pre-trained Models): A series of large-scale pre-trained Chinese language models developed by the Beijing Academy of Artificial Intelligence (BAAI), with CPM-1, CPM-2, and later versions focused on various aspects of language understanding and generation.
  6. Tencent's RocketQA: An open-domain question answering system optimized for Chinese, utilizing dense passage retrieval and a deep learning ranking model.
  7. NEZHA (Huawei): A pre-trained language model developed by Huawei, which modifies the BERT architecture and introduces techniques like functional relative positioning to improve performance on Chinese language tasks.
  8. PanGu-α (Huawei): Another model by Huawei, known for its large scale and aimed at general-purpose language understanding and generation, similar in ambition to OpenAI's GPT models.

Hindi:

Though we are predominantly looking at Hindi, there are many local 'Indic' languages for which LLMs and NLP tools have been developed to cater to their unique linguistic features. There are numerous government-led efforts, community-driven open-source projects, and private-sector innovations, all contributing to the advancement of Hindi and Indic NLP applications and services:

  1. IndicBERT: A multi-lingual model that includes support for Hindi among other Indian languages. It's based on the ALBERT architecture and is suitable for a range of NLP tasks in Hindi, such as text classification, sentiment analysis, and question answering.
  2. MuRIL (Multilingual Representations for Indian Languages) by Google: Although MuRIL is a multilingual model, it offers substantial support for Hindi, providing high-quality pre-trained embeddings. It's optimized for better contextual representation of Hindi text, aiding in various NLP applications.
  3. Hindi BERT: A BERT-based model specifically pre-trained on a large corpus of Hindi text. It's designed to understand the context and nuances of the Hindi language, making it effective for tasks like text classification, named entity recognition, and more.
  4. ULMFiT for Hindi: Adaptation of the ULMFiT (Universal Language Model Fine-tuning) model for Hindi. It's particularly useful for text classification tasks and can be fine-tuned for other NLP applications in Hindi.
  5. iNLTK (Indian Natural Language Toolkit): Offers support for Hindi, including pre-trained models for language identification, text generation, and embedding generation. It's an accessible tool for developers looking to implement Hindi NLP tasks.
  6. Anuvaad: While focusing on machine translation, Anuvaad includes models trained specifically for translating between Hindi and other languages. It's particularly aimed at government and official documents but is versatile enough for broader applications.
  7. AI4Bharat's IndicNLP Corpus: Includes a substantial Hindi corpus as part of its collection, which can be used for training models on various NLP tasks in Hindi, such as text classification, sentiment analysis, and machine translation.
  8. Bhashini: A Government of India initiative aimed at developing digital public infrastructure for languages to enable the development of AI-based language technology solutions and foster innovation in Indian languages, including Hindi.
  9. Project Indus: A collaborative effort aimed at creating a robust NLP ecosystem for Indian languages, focused on developing datasets, models, and tools that enhance the processing and understanding of languages like Hindi.
  10. OpenHathi: Sarvam AI's open Hindi model series (see the announcement in the references). Open efforts like this contribute significantly to language technology by providing accessible models, tools, and resources for Hindi and other Indian languages.
  11. Krutrim: An Indian AI venture building foundation models and assistants with support for Indian languages, including Hindi.
  12. CoRover.ai: A private venture offering AI-powered conversational solutions - chatbots, virtual assistants, and customer support - in multiple languages, including Hindi. Companies like CoRover.ai contribute to the practical application of AI and NLP technologies in business and consumer contexts.

Spanish

Listing all Spanish Language Models (LLMs) is challenging due to the rapidly evolving landscape of natural language processing and the development of new models by various organizations worldwide:

  1. BERT-Base, Multilingual: This version of BERT is trained on 104 languages, including Spanish, and is widely used for various NLP tasks across multiple languages.
  2. mBERT (Multilingual BERT): Similar to BERT-Base, Multilingual, mBERT is designed to handle 104 languages, with Spanish being one of them. It's used for tasks like named entity recognition, part-of-speech tagging, and question-answering in Spanish.
  3. XLM-R (Cross-lingual Language Model - Roberta): XLM-R is an improved version of BERT for multilingual tasks, trained on 100 languages including Spanish, and achieves state-of-the-art performance on several cross-lingual benchmarks.
  4. GPT-3 with Multilingual Support: While primarily an English model, GPT-3 has demonstrated capabilities in various languages including Spanish, due to its vast and diverse training dataset.
  5. MARBERT: Primarily an Arabic-focused model (covered again in the Arabic section below); it appears in some multilingual lists, but it is not Spanish-specific.
  6. RoBERTa-Base, Multilingual: This model is a multilingual version of RoBERTa, optimized for more languages, including Spanish.
  7. ELECTRA Multilingual: An efficient and smaller model compared to BERT and GPT, ELECTRA Multilingual has been adapted for several languages, including Spanish.
  8. Spanish GPT-2 and GPT-3 Models: Some projects and institutions have taken the initiative to fine-tune or train GPT-2 and GPT-3 models specifically for Spanish, enhancing their performance on Spanish texts.
  9. BETO: A Spanish adaptation of BERT, BETO is trained specifically on Spanish language data, making it particularly effective for Spanish language tasks.


French

Similar to Spanish, several LLMs in the market have been developed or adapted for French, ranging from models trained specifically on French language data to multilingual models that include French among their languages:

  1. CamemBERT: A model specifically trained on French language data, CamemBERT is based on the RoBERTa architecture and is optimized for understanding and generating French text.
  2. FlauBERT: Another French-specific model, FlauBERT is designed to understand the nuances of French syntax and semantics, and is used for a variety of NLP tasks in French.
  3. BERT-Base, Multilingual: This version of BERT has been trained on 104 languages, including French, making it capable of handling various NLP tasks in French.
  4. mBERT (Multilingual BERT): Similar to BERT-Base, Multilingual, mBERT handles 104 languages, including French, and is widely used for tasks such as named entity recognition, part-of-speech tagging, and question-answering in French.
  5. XLM-R (Cross-lingual Language Model - Roberta): An improved version of BERT for multilingual tasks, XLM-R has been trained on 100 languages, including French, and achieves high performance on several cross-lingual benchmarks.
  6. GPT-3 with Multilingual Support: Although primarily an English model, GPT-3 has shown capabilities in various languages, including French, due to its extensive and diverse training dataset.
  7. RoBERTa-Base, Multilingual: This model extends the RoBERTa model to multiple languages, including French, providing improved performance on French language tasks.
  8. French GPT-2 and GPT-3 Models: There are initiatives and projects that have fine-tuned or trained GPT-2 and GPT-3 models specifically for French, enhancing their performance on tasks involving French text.

Arabic

For the Arabic language, LLMs have been developed or adapted to handle its unique morphology and script. Some models are specifically trained for Arabic, while others are multilingual models with strong support for Arabic:

  1. AraBERT: Inspired by BERT, AraBERT is specifically trained on a large corpus of Arabic text and is designed to understand and generate Arabic text, optimizing for the language's unique characteristics.
  2. MARBERT: MARBERT is tailored for Arabic, leveraging a large-scale corpus to improve performance on various NLP tasks, including sentiment analysis, text classification, and named entity recognition in Arabic.
  3. Arabic-BERT: Similar to AraBERT, Arabic-BERT is another adaptation of the original BERT model, fine-tuned specifically for the Arabic language to enhance its performance on Arabic NLP tasks.
  4. GigaBERT: A variant of BERT that has been specifically trained on a gigaword-size Arabic corpus, aiming to improve understanding and generation of Arabic text across a wide range of domains.
  5. QARiB: Developed by the Qatar Computing Research Institute, QARiB is a BERT-based model trained on a diverse set of Arabic dialects, aimed at understanding the nuances of different Arabic dialects and MSA (Modern Standard Arabic).
  6. AraELECTRA: An Arabic version of ELECTRA, AraELECTRA is designed to be efficient and effective for Arabic language processing, using a pre-training approach that differs from BERT by focusing on replacing tokens rather than masking them.
  7. BERT-Base, Multilingual: This version of BERT has been trained on 104 languages, including Arabic, and is capable of handling various NLP tasks in Arabic.
  8. mBERT (Multilingual BERT): mBERT is designed to support 104 languages, including Arabic, and is used for a variety of NLP tasks such as named entity recognition, part-of-speech tagging, and question-answering in Arabic.
  9. XLM-R (Cross-lingual Language Model - Roberta): XLM-R is trained on 100 languages, including Arabic, and achieves state-of-the-art performance on several cross-lingual benchmarks, including those involving Arabic.
  10. AraGPT2 & AraGPT3: Adaptations of the GPT-2 and GPT-3 models for Arabic, these models are fine-tuned to better handle Arabic text generation and understanding, taking into account the linguistic features of Arabic.


Bengali

Though Bengali is one of the most spoken languages in the world (7th in the list above), the advancements in Bengali NLP are not proportionately high. Most models are adapted from broader Indic efforts.

  1. IndicBERT: A multilingual model trained on 12 Indian languages, including Bengali, IndicBERT leverages the ALBERT architecture and is optimized for understanding and generating text in these languages.
  2. mBERT (Multilingual BERT): This version of BERT is designed to support 104 languages, including Bengali, and is used for a variety of natural language processing tasks such as named entity recognition, part-of-speech tagging, and question-answering in Bengali.
  3. XLM-R (Cross-lingual Language Model - Roberta): Trained on 100 languages, including Bengali, XLM-R is a powerful multilingual model that achieves high performance on several cross-lingual benchmarks, including those involving Bengali.
  4. MuRIL (Multilingual Representations for Indian Languages): Developed by Google, MuRIL supports 17 Indian languages along with English, including Bengali. It is designed to handle the linguistic nuances of Indian languages better and performs well on various NLP tasks.
  5. BNLP: A toolkit for Bengali natural language processing that includes pre-trained models for tasks like named entity recognition, part-of-speech tagging, and sentiment analysis in Bengali.
  6. BanglaBERT: Although information might be limited, there have been efforts to adapt BERT models specifically for Bengali, aiming to improve performance on Bengali language tasks by fine-tuning on Bengali text corpora.
  7. Bangla Electra: A model trained specifically for Bengali, utilizing the Electra pre-training approach, which is designed to be more efficient than the BERT model by using a generator-discriminator setup.


Portuguese

The development of Portuguese LLMs is similar to that of Spanish and French:

  1. BERTimbau: A BERT-based model specifically trained on a large corpus of Brazilian Portuguese text. BERTimbau is designed to understand and generate Portuguese text, optimizing for the nuances of the Brazilian Portuguese dialect.
  2. Portuguese GPT-2 and GPT-3 Models: There have been efforts to fine-tune or adapt GPT-2 and GPT-3 models for Portuguese, particularly focusing on generating coherent and contextually relevant Portuguese text.
  3. BERT-Base, Multilingual: This version of BERT has been trained on 104 languages, including Portuguese, making it capable of handling various NLP tasks in Portuguese.
  4. mBERT (Multilingual BERT): mBERT supports 104 languages, including Portuguese, and is used for tasks such as named entity recognition, part-of-speech tagging, and question-answering in Portuguese.
  5. XLM-R (Cross-lingual Language Model - Roberta): XLM-R is trained on 100 languages, including Portuguese, and achieves state-of-the-art performance on several cross-lingual benchmarks, including those involving Portuguese.
  6. Unicamp's Portuguese models: The University of Campinas (Unicamp) in Brazil has developed several NLP models specifically for Portuguese, including adaptations of BERT and other architectures for the Portuguese language.
  7. ParsBERT (Persian BERT): Sometimes listed alongside Portuguese resources because of the similar name, ParsBERT is actually a Persian-language BERT model rather than a Portuguese one; for Portuguese tasks, the models above are the better fit.
  8. Portuguese ELECTRA: There are adaptations of the ELECTRA model specifically fine-tuned for Portuguese, designed to be efficient and effective for Portuguese language processing.


Russian

Likewise for Russian: we are seeing rapid advancement from the AI community on Russian NLP and LLMs.

  1. RuBERT: Developed by DeepPavlov, RuBERT is a BERT-based model specifically pre-trained on a large corpus of Russian text. It is designed to understand and generate Russian text, optimizing for the language's unique grammatical and syntactical nuances.
  2. GPT-3 for Russian: Although GPT-3 is a multilingual model, it has shown capabilities in Russian thanks to its vast and diverse training dataset. There have also been efforts to fine-tune GPT-3 specifically for Russian to enhance its performance on tasks involving Russian text.
  3. mBERT (Multilingual BERT): This version of BERT supports 104 languages, including Russian, and is used for various NLP tasks such as named entity recognition, part-of-speech tagging, and question-answering in Russian.
  4. XLM-R (Cross-lingual Language Model - Roberta): XLM-R is trained on 100 languages, including Russian, and achieves high performance on several cross-lingual benchmarks, including those involving Russian.
  5. RuGPT: Inspired by the GPT architecture, RuGPT models are adapted specifically for Russian, with various versions available that are fine-tuned on Russian language datasets to enhance their performance in generating coherent and contextually relevant Russian text.
  6. SDSJ Task B Russian: Part of the Sberbank Data Science Journey competition, this model was specifically developed for a task involving Russian language processing, showcasing the capabilities of NLP models in handling complex Russian language tasks.
  7. Russian ELECTRA: Similar to other language-specific adaptations, there are versions of ELECTRA that have been fine-tuned for Russian, aiming to be more efficient and effective for Russian language processing.

Urdu

The development of large language models specifically for Urdu is still in its nascent stages compared to languages like English or Chinese:

  1. mBERT (Multilingual BERT): This version of BERT supports 104 languages, including Urdu, and is used for a variety of natural language processing tasks such as named entity recognition, part-of-speech tagging, and question-answering in Urdu.
  2. XLM-R (Cross-lingual Language Model - Roberta): Trained on 100 languages, including Urdu, XLM-R is a powerful multilingual model that achieves high performance on several cross-lingual benchmarks, including those involving Urdu.
  3. MuRIL (Multilingual Representations for Indian Languages): Although primarily focused on Indian languages, MuRIL also includes support for Urdu due to its usage in South Asia. Developed by Google, MuRIL is designed to handle the linguistic nuances of Indian languages and Urdu, performing well on various NLP tasks.
  4. Urduhack: While not a large language model in the same vein as BERT or GPT, Urduhack is a toolkit that provides pre-trained models for Urdu language processing, including tasks like text normalization, sentence tokenization, and pos-tagging.
  5. CRULP Urdu Phonetic Keyboard Layout: This tool is more about language input than a language model, but it's worth mentioning as part of the computational linguistics landscape for Urdu. Developed by the Center for Research in Urdu Language Processing (CRULP), it facilitates the typing of Urdu script on standard keyboards.
  6. BERT Multilingual Models Fine-tuned for Urdu: There have been efforts to fine-tune existing multilingual models like mBERT for Urdu-specific tasks, enhancing their performance for Urdu language processing.


References:

https://www.sarvam.ai/blog/announcing-openhathi-series

Researchers examine how multilingual BERT models encode grammatical features (techxplore.com)

The Role of LLMs in Multilingual Communication and Translation (LinkedIn)

Introducing The World's Largest Open Multilingual Language Model: BLOOM (huggingface.co)


Coming Up Next:

Please watch this list... I have an active queue of newsletters in draft. The queue has now grown to about 25, and I am running out of time to complete my research and writing. I also have 4 guest writers who are writing on topics where they are domain experts, so this list will keep changing based on what gets finished first. I could not be more excited about the volume of topics and the immense learning here.

  1. The rise of Small Language Models
  2. Where is Apple? Slow and steady wins the race?
  3. AI revolutionizing e-commerce marketplaces
  4. The environmental cost of AI
  5. The convergence of AI, Immigration and Privacy
  6. Decoding ML, DL, LLM, AGI, and the world of AI
  7. Major League sports, Cricket and Sports Analytics
  8. Intelligent Recruiting and human talent management
  9. Rhythms | A new AI-powered operating system to transform the future of work
  10. AI in Cybersecurity | Auth0/Okta
  11. AI predicting the future of human life and planet Earth.
  12. Q Star (GPT 5) and AGI (Artificial General Intelligence) - the good, bad and the ugly.


