EuroBERT – Advancing Multilingual NLP Through Open Collaboration
Multilingual NLP remains a persistent challenge. While large generative models like GPT-4 have transformed text generation, encoder-based models are still essential for search engines, document retrieval, and classification systems. However, existing multilingual encoder models often struggle with performance inconsistencies, short context windows, and limited domain adaptability, limiting their usefulness in real-world applications.
EuroBERT changes that. It is a great example of collaboration between government, the private sector, and academia, demonstrating how open research and cross-border cooperation can drive real innovation. Developed through a cross-European effort, EuroBERT brings together the MICS Laboratory at CentraleSupélec, Diabolocom, Artefact, Unbabel, and Instituto Superior Técnico & Universidade de Lisboa, with technological support from AMD and CINES and funding from the France 2030 program. By combining the expertise of machine learning researchers, industry engineers, and public institutions, the project delivers a practical, high-performance multilingual encoder model that overcomes the limitations of its predecessors, including BERT, ModernBERT, and XLM-RoBERTa.
EuroBERT is designed to push the boundaries of efficiency, scalability, and multilingual precision. It is the most advanced multilingual encoder model for retrieval (including retrieval-augmented generation, RAG), classification, and quality estimation (e.g., for summarization and translation) across 15 languages. With support for 8 major European languages and 7 of the world's most spoken non-European languages, it ensures broad linguistic coverage. Unlike earlier models, EuroBERT offers a long-context window of 8,192 tokens, allowing it to process entire documents instead of being constrained by the 512-token limit of most previous encoder models.
With 5 trillion tokens of training data—twice as much as typical encoder models and even surpassing generative models like Llama 2 (2 trillion tokens)—EuroBERT delivers unparalleled linguistic understanding without additional usage costs. Unlike previous models with limited transparency, it is fully open-source, with all training checkpoints, datasets, and code available for public use.
This initiative sets a powerful precedent for collaborative AI development, proving that no single institution or country can tackle these challenges alone, and that diverse organizations working together can achieve results beyond what any single entity could accomplish by itself. Other countries should take note: this model of cooperation can be replicated to build high-quality language models that better serve local linguistic and cultural needs. By organizing similar efforts, nations can strengthen their AI ecosystems and ensure their languages remain well represented in the rapidly evolving world of NLP.
Why EuroBERT Matters
Despite the progress in multilingual NLP, most encoder models have struggled with three major issues: inconsistent performance across languages, short context windows (typically capped at 512 tokens), and limited adaptability to specialized domains.
EuroBERT addresses these challenges by introducing a multilingual training approach, improved architectural efficiency, and a significantly longer context window.
How EuroBERT Improves on Previous Models
EuroBERT builds upon ModernBERT and previous multilingual encoder models like XLM-RoBERTa. The key advancements include:
1. Expanded Multilingual Training
EuroBERT was trained on a 5-trillion-token dataset spanning 15 languages, ensuring better linguistic representation. This diverse dataset includes high-quality sources, reducing biases and improving cross-lingual transfer learning.
2. Enhanced Architecture
Several design improvements contribute to EuroBERT’s efficiency:
3. Longer Context Support
Unlike BERT and XLM-RoBERTa, which are capped at 512 tokens, EuroBERT supports sequences of up to 8,192 tokens (matching ModernBERT's long-context capability), making it suitable for document retrieval, long-form summarization, and legal text analysis.
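As a concrete illustration, the following minimal sketch loads a EuroBERT checkpoint with the Hugging Face transformers library and encodes a long document. The model ID EuroBERT/EuroBERT-210m and the trust_remote_code flag are assumptions based on the public release; substitute whichever checkpoint and settings you actually use.

```python
# Minimal sketch: encoding a long document with an assumed EuroBERT checkpoint.
from transformers import AutoTokenizer, AutoModel

model_id = "EuroBERT/EuroBERT-210m"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

long_document = "EuroBERT can process long inputs in a single pass. " * 1000
inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=8192,      # EuroBERT's long-context window
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```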
4. Domain-Specific Knowledge
EuroBERT incorporates additional training on mathematical and programming data, making it more effective for tasks like code search and structured reasoning. This feature makes it particularly useful for developers, data scientists, and industries that require structured data interpretation.
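To make the code-search use case concrete, here is a hedged sketch of ranking code snippets against a natural-language query using mean-pooled encoder embeddings and cosine similarity. The checkpoint name is again an assumption, and a fine-tuned retrieval variant would likely work better than the raw pretrained model.

```python
# Hedged sketch: semantic code search with mean-pooled encoder embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "EuroBERT/EuroBERT-210m"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

def embed(texts):
    """Return one L2-normalized, mean-pooled embedding per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(summed / counts, dim=-1)

snippets = [
    "def mean(xs): return sum(xs) / len(xs)",
    "SELECT name FROM users WHERE active = 1;",
]
query_vec = embed(["python function that averages a list"])
scores = embed(snippets) @ query_vec.T                   # cosine similarities
print(snippets[int(scores.argmax())])
```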
Benchmark Performance
EuroBERT has been evaluated across several NLP benchmarks and consistently outperforms previous multilingual encoder models.
These results, shown on the EuroBERT GitHub, indicate that EuroBERT is not just a theoretical improvement; it provides tangible performance benefits across a range of real-world NLP applications.
Practical Applications for Businesses and Developers
EuroBERT’s improvements have direct implications for both enterprises and technical teams:
For Businesses
For Developers
The Open-Source and Cross-Border Collaboration Behind EuroBERT
EuroBERT’s development is a testament to the power of open collaboration. Unlike proprietary AI models developed in isolation, EuroBERT is the result of a coordinated effort across academia, industry, and government, demonstrating how pooling resources and expertise can accelerate NLP advancements.
I want to personally extend my deepest appreciation and congratulations to the entire EuroBERT team for their outstanding work in developing this groundbreaking multilingual encoder model. Your commitment to innovation, collaboration, and open research has resulted in a tool that will have a lasting impact on the NLP community.
By making EuroBERT fully open-source, you have not only pushed the boundaries of multilingual AI but have also given a valuable gift to researchers, developers, and businesses worldwide. The transparency, accessibility, and performance of this model set a new standard for what multilingual NLP can achieve.
This project brought together leading researchers, engineers, and AI practitioners from institutions across France, Portugal, and other European countries, including the MICS Laboratory at CentraleSupélec, Diabolocom, Artefact, Unbabel, and Instituto Superior Técnico & Universidade de Lisboa, with technological support from AMD and CINES.
By openly sharing datasets, training methodologies, and model architectures, EuroBERT ensures that the entire AI community—not just a select few organizations—can benefit from and build upon its innovations. The decision to make EuroBERT fully open-source, with all training checkpoints, datasets, and implementation details available, fosters transparency, reproducibility, and continuous improvement.
Why Open-Source Matters for NLP Development
Unlike closed models, which limit access and restrict adaptation, EuroBERT's open-source nature allows researchers, developers, and businesses to inspect, reproduce, and adapt the model freely.
A Model for Future Collaboration
The success of EuroBERT is a clear example of how cross-border and cross-sector partnerships can drive meaningful AI progress. Governments, research institutions, and enterprises worldwide should take note—by working together and prioritizing open research, they can build stronger, more inclusive AI systems that serve a broader range of users and linguistic communities.
By fostering a global, open-source NLP ecosystem, EuroBERT lays the groundwork for future innovations that will further improve multilingual AI and democratize access to powerful language models.
Why Other Countries and Languages Should Follow EuroBERT’s Example
While EuroBERT is a major breakthrough in multilingual NLP, it is only the beginning. Many languages around the world remain severely underrepresented in AI models, making it difficult for speakers of these languages to access high-quality NLP tools. If AI is to be truly inclusive, it must go beyond a handful of dominant languages and ensure global linguistic diversity is preserved and empowered.
Underrepresented Languages in AI
Most large-scale language models are disproportionately trained on English, Chinese, and a few widely spoken European languages, while many African, Indigenous, and regional languages have little to no representation in AI systems. This has serious implications for the communities that speak them.
Community Collaboration: The Key to Expanding Language Coverage
The success of EuroBERT proves that open-source, cross-sector collaboration is the most effective way to build powerful, multilingual AI models. No single company, institution, or country can tackle this alone. By working together—pooling research efforts, sharing datasets, and refining models—the global NLP community can ensure that every language has a place in AI.
Governments, researchers, and developers should take inspiration from EuroBERT and follow its collaborative, open-source playbook.
Time for Language Groups to Team Up and Collaborate
The success of EuroBERT proves that collaborative, open-source NLP models are the best way to advance multilingual AI. However, many languages are still severely underrepresented, leaving billions of people without access to high-quality AI-driven language tools. The next step is for linguistic communities, researchers, and institutions to team up and develop specialized encoder models that cover more regions, more languages, and more domains.
There are already some regional and domain-specific BERT models in existence, demonstrating the growing demand for specialized NLP solutions, such as CamemBERT (French), AraBERT (Arabic), AfriBERTa (several African languages), and SciBERT (scientific text).
While these models serve specific linguistic and industry needs, there are still huge gaps in many underrepresented languages—especially African, Indigenous, and low-resource Asian languages.
Not enough regional and domain-specific BERT models exist today.
Many languages remain completely unsupported.
It’s time to team up, collaborate, and build new models to fill these gaps.
Regional Multilingual Encoder Models to Build
1. AsiaBERT+: Covers a wide range of Asian languages across East Asia, South Asia, Southeast Asia, and Central Asia.
2. AfricaBERT: Supports both widely spoken and low-resource African languages.
3. LatAmBERT: Tailored for Latin American Spanish, Brazilian Portuguese, and Indigenous languages such as Quechua, Guarani, and Nahuatl.
4. SlavBERT: Covers Slavic languages including Russian, Polish, Czech, Slovak, Ukrainian, Bulgarian, Serbian, and Croatian.
5. IndoBERT: Focuses on South Asian languages such as Hindi, Bengali, Tamil, Telugu, Punjabi, Marathi, Gujarati, Urdu, and Sinhala.
6. ArabBERT++: Expands on the existing AraBERT to include Modern Standard Arabic, regional Arabic dialects, and related languages like Persian and Kurdish.
Domain-Specific Encoder Models
While general multilingual encoders like EuroBERT and XLM-R are designed to handle broad language tasks, domain-specific encoder models are optimized for specialized fields where precision, context awareness, and technical language are critical. These models, such as MedBERT for healthcare, LegalBERT for law, SciBERT-X for scientific research, and FinBERT-World for finance, are trained on highly structured, domain-specific datasets, ensuring they understand industry terminology, regulatory nuances, and contextual dependencies far better than general-purpose models. By fine-tuning on professionally curated data, these encoders enable more accurate legal text retrieval, medical diagnosis automation, patent analysis, and financial risk assessment, making AI applications safer, more reliable, and more effective in professional environments.
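As an illustration of how such a domain-specific encoder could be produced, here is a hedged fine-tuning sketch that adapts a general multilingual checkpoint into a small legal-clause classifier with the Hugging Face Trainer. The base checkpoint, label set, and two-example dataset are purely illustrative assumptions, not anyone's published recipe.

```python
# Hedged sketch: fine-tuning a general encoder into a domain classifier.
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

base = "xlm-roberta-base"  # placeholder; a EuroBERT checkpoint could also be used
labels = ["liability", "termination", "confidentiality"]  # invented label set

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=len(labels))

# Tiny in-memory dataset standing in for a professionally curated corpus.
data = Dataset.from_dict({
    "text": [
        "The supplier shall not be liable for indirect damages.",
        "Either party may terminate this agreement with 30 days notice.",
    ],
    "label": [0, 1],
}).map(lambda ex: tokenizer(ex["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="legal-clause-clf",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to=[],
    ),
    train_dataset=data,
    tokenizer=tokenizer,
)
trainer.train()
```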
Why These Models Matter
Start Building Today
The best way to kickstart multilingual NLP development is to leverage the tools already available. The EuroBERT GitHub repository offers a strong foundation for multilingual encoder models, providing:
- Pretrained models
- Training pipelines
- Multilingual benchmarks
Check out the EuroBERT GitHub repository for tools and resources.
By building on top of existing open-source work, the community can accelerate progress, expand NLP inclusivity, and develop AI that serves speakers of all languages, not just those that dominate today’s models. The next multilingual encoder model could be yours. Let’s build it together!
Once We Have More Models, What Comes Next? Can We Merge Them All?
Merging Existing Multilingual Encoder Models into One Large Model: What Would It Take?
Combining regional and domain-specific BERT models such as those listed above (e.g., EuroBERT, XLM-R, AraBERT, CamemBERT, AfriBERTa, SciBERT) into one massive, unified multilingual encoder would be a complex but potentially valuable endeavor. Here’s what it would take and the potential pros and cons of such an approach.
Steps to Merge Existing Models & Datasets
1. Standardizing & Aligning Datasets
Challenge:
Solution:
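As one illustration of what this step could involve, here is a hedged sketch that maps two corpora with different schemas onto a shared text-plus-language format and interleaves them with explicit sampling probabilities using the Hugging Face datasets library. The tiny in-memory corpora are stand-ins, not the datasets behind any of the models named above.

```python
# Hedged sketch: normalizing heterogeneous corpora into one schema and mixing them.
from datasets import Dataset, interleave_datasets

# Tiny in-memory stand-ins for real corpora that arrive in different schemas.
corpus_en = Dataset.from_dict({"content": ["The cat sat on the mat."]})
corpus_fr = Dataset.from_dict({"sentence": ["Le chat dort sur le tapis."]})

def normalize(ds, text_column, lang):
    # Map every corpus onto a shared {"text", "lang"} schema.
    return ds.map(
        lambda ex: {"text": ex[text_column], "lang": lang},
        remove_columns=ds.column_names,
    )

normalized = [
    normalize(corpus_en, "content", "en"),
    normalize(corpus_fr, "sentence", "fr"),
]

# Interleave with explicit probabilities so no single corpus dominates.
mixed = interleave_datasets(normalized, probabilities=[0.5, 0.5], seed=0)
print(mixed[0])
```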
2. Unifying Model Architectures & Training Objectives
Challenge:
Solution:
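One plausible common denominator is the masked-language-modeling objective that most BERT-style encoders already share. The sketch below shows that objective wired up with transformers' DataCollatorForLanguageModeling; the xlm-roberta-base checkpoint is a placeholder for whichever unified architecture a merged model would actually adopt.

```python
# Hedged sketch: a shared masked-language-modeling (MLM) training objective.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # placeholder
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# 15% random masking, the standard BERT-style objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

text = "A shared objective across all of the source models being merged. " * 5
batch = collator([tokenizer(text)])        # pads and builds masked labels
loss = model(**batch).loss                 # MLM loss on the masked positions
print(float(loss))
```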
3. Computing Resources for Training a Massive Model
Challenge:
Solution:
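To give a rough sense of scale, the snippet below applies the common approximation of roughly 6 x parameters x tokens training FLOPs. Every number in it (model size, token count, sustained accelerator throughput) is an assumption chosen for illustration, not a statement about EuroBERT's actual training budget.

```python
# Hedged back-of-the-envelope estimate of pretraining compute (~6 * N * D FLOPs).
params = 2.1e9            # assumed: a ~2B-parameter unified encoder
tokens = 5e12             # assumed: a 5-trillion-token corpus
flops = 6 * params * tokens

gpu_flops_per_sec = 3e14  # assumed sustained throughput per accelerator
gpu_days = flops / gpu_flops_per_sec / 86_400

print(f"Total training compute: {flops:.2e} FLOPs")
print(f"Roughly {gpu_days:,.0f} accelerator-days at the assumed throughput")
```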
4. Handling Tokenization & Vocabulary Expansion
Challenge:
Solution:
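A standard technique here is to extend an existing tokenizer with new tokens and resize the model's embedding matrix to match, as in the hedged sketch below. The base checkpoint and the added tokens are illustrative only, and the newly created embedding rows would still need to be trained or warm-started afterwards.

```python
# Hedged sketch: extending a tokenizer's vocabulary and resizing embeddings.
from transformers import AutoTokenizer, AutoModelForMaskedLM

base = "xlm-roberta-base"  # placeholder for whichever model is being extended
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Illustrative tokens the merged vocabulary should cover (domain terms, etc.).
new_tokens = ["eurobert", "swahili_term_example", "clause_42a"]
num_added = tokenizer.add_tokens(new_tokens)

# New rows in the embedding matrix are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```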
5. Preventing Overfitting to High-Resource Languages
Challenge:
Solution:
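A widely used mitigation is temperature-based sampling, in the spirit of how mBERT and XLM-R re-balance their training languages: corpus sizes are raised to a power alpha below 1 so that low-resource languages are sampled more often than their natural share. The corpus sizes and the alpha value below are invented for illustration.

```python
# Hedged sketch: temperature-based sampling to up-weight low-resource languages.
corpus_tokens = {"en": 2_000e9, "fr": 300e9, "sw": 5e9, "qu": 0.5e9}  # invented sizes
alpha = 0.3  # alpha < 1 flattens the distribution toward rare languages

raw = {lang: n ** alpha for lang, n in corpus_tokens.items()}
total = sum(raw.values())
sampling_probs = {lang: r / total for lang, r in raw.items()}

for lang, p in sampling_probs.items():
    natural = corpus_tokens[lang] / sum(corpus_tokens.values())
    print(f"{lang}: natural share {natural:.4f} -> sampled share {p:.4f}")
```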
Pros & Cons of a Merged Global Multilingual Encoder Model
Pros (Why It Might Be Worth It)
Cons (Challenges & Drawbacks)
Alternative: A Modular Approach Instead of One Monolithic Model
Instead of one massive model, a better approach could be a "Modular BERT Ecosystem", in which a shared multilingual backbone is paired with regional and domain-specific modules that are loaded only when needed, as sketched below.
This would allow scalability, efficiency, and adaptability while avoiding the pitfalls of a single, monolithic multilingual model.
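A minimal sketch of such an ecosystem, under the assumption that it is organized as a registry of specialist checkpoints with a multilingual fallback, might look like the following; all model IDs are placeholders that any regional or domain model could replace.

```python
# Hedged sketch: a registry of specialist encoders with a multilingual fallback.
from functools import lru_cache
from transformers import AutoTokenizer, AutoModel

REGISTRY = {
    ("fr", "general"): "camembert-base",                    # placeholder entries
    ("en", "science"): "allenai/scibert_scivocab_uncased",
    ("multi", "general"): "xlm-roberta-base",
}

@lru_cache(maxsize=4)
def load_encoder(lang: str, domain: str = "general"):
    """Pick the most specific available encoder, falling back to multilingual."""
    model_id = REGISTRY.get((lang, domain)) or REGISTRY[("multi", "general")]
    return AutoTokenizer.from_pretrained(model_id), AutoModel.from_pretrained(model_id)

tokenizer, model = load_encoder("fr")      # routes to the French specialist
print(type(model).__name__)
```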
Final Thoughts: Should We Merge Models?
A global multilingual encoder is possible, but it would require massive computing resources, thoughtful dataset balancing, and modular design to be practical. Instead of a single model, a hierarchical system of regional and domain-specific models could offer the best of both worlds—combining scalability with adaptability.
The key is collaboration: merging models will require open-source contributions from multiple countries, institutions, and research groups.
The Bottom Line: Open Collaboration is the Future of Multilingual NLP
EuroBERT is more than just another NLP model—it’s a proof of concept for what is possible when governments, academia, and industry collaborate across borders. It provides a high-performance, scalable solution for multilingual retrieval, classification, and structured information processing, addressing the limitations of previous encoder-based models such as BERT, ModernBERT, and XLM-RoBERTa.
With support for 15 languages, a 5-trillion-token training dataset, and a long-context processing capability of 8,192 tokens, EuroBERT establishes a new standard for multilingual NLP. By making it fully open-source, with all training checkpoints, datasets, and code publicly available, this project ensures that cutting-edge NLP technology is accessible and adaptable for global use.
But EuroBERT is also a call to action. No single institution or country can tackle these challenges alone—other nations must take note and act now. The dominance of a few languages in AI models risks leaving others behind. If countries want their languages to be properly represented in future AI systems, they must invest in large-scale, open collaborations to develop sovereign, high-quality multilingual models that serve their own linguistic and cultural needs.
The success of EuroBERT proves that real breakthroughs happen when governments, researchers, and industry align their efforts toward a common goal. The future of multilingual AI will not be built by observers, but by those who organize, collaborate, and contribute. The time to act is now.
Resources