Tips & Tricks: implementing multi-language with RAG
Dr. Lucia Stavarache
Executive Cognitive Architect & Technical Development Manager at IBM
As you delve into building a RAG solution and its adjacent families, patterns, and acronyms (Dense Retrieval, Sparse Retrieval, Hybrid Retrieval, Neural Retrieval, Memory-Augmented Generation, Contextual Retrieval, Hybrid, Agentic, or Modular RAG), it is worth anticipating that once the product stabilizes, the next request in line will be for multi-language support.
While this may seem straightforward in theory, with existing models supporting multi-language embeddings and powerful LLMs for multi-language inference, the practical implementation, especially for business use cases, presents some unique challenges. I've gathered some valuable lessons from our experience that I believe can assist in refining the results, as we've discovered that out-of-the-box solutions are not always sufficient.
Criteria to evaluate an Embedding Model
I have included in this table a few representative models that we considered for our RAG platform.
Embedding types and their associated use cases
*OOV = "Out-Of-Vocabulary." It refers to words or terms that are not present in the vocabulary of a model or system at the time of processing. Here are some key points about OOV words. This can happen frequently with names, rare words, slang, and newly coined terms.
Before selecting, answer these questions first
Making the decision
With these in mind, let's walk through the most common issues to help you decide on your approach. If you have enough resources and no requirement to control the model, it is always nice not to bother with hosting, maintaining, and tuning a model yourself, especially multi-language embedders, which can be very large; GPU availability is a factor, and having the right skills to manage these models is equally important. If, however, you want to control the embedding model, self-managing becomes an option and can be cheaper at scale, since you pay only for infrastructure rather than per token (as noted in the table, the cost per token can be high at volume).
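As a rough way to reason about the API-versus-self-hosted trade-off, the back-of-envelope sketch below compares a per-token API price against a fixed monthly GPU cost. All figures are placeholders you would replace with your own quotes, not actual vendor pricing:

```python
# Back-of-envelope sketch: hosted embedding API vs. self-hosted model.
# All figures below are illustrative assumptions, not real vendor prices.
api_price_per_1k_tokens = 0.0001   # USD per 1,000 tokens, placeholder
gpu_monthly_cost = 1200.0          # USD per month for one GPU instance + ops, placeholder

def monthly_api_cost(tokens_per_month: float) -> float:
    """Cost of embedding a given monthly token volume through a per-token API."""
    return tokens_per_month / 1000 * api_price_per_1k_tokens

# Break-even volume: above this many tokens per month, the fixed self-hosting
# cost becomes cheaper than paying per token.
break_even_tokens = gpu_monthly_cost / api_price_per_1k_tokens * 1000

print(f"Break-even at ~{break_even_tokens / 1e9:.1f} billion tokens/month")
for volume in [1e9, 5e9, 20e9]:
    print(f"{volume / 1e9:>4.0f}B tokens -> API ${monthly_api_cost(volume):,.0f} vs GPU ${gpu_monthly_cost:,.0f}")
```

The sketch deliberately ignores the engineering and skills cost mentioned above; in practice, that is often the deciding factor at moderate volumes.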
There are thousands of models out there, but very few are multi-language, and only a handful have good coverage across vocabularies, decent quality, and a verified origin; I tried to summarize those above. Quality becomes particularly important in multi-language scenarios, as most embedding models have been optimized for, or tested on, English or other Latin-script vocabularies and do not support other language families well. Let us look at the techniques for creating such an embedding model and why some produce higher quality than others:
In general, pretrained multilingual models are considered the best choice for their broad language coverage, effectiveness in zero-shot and few-shot tasks, and ease of use. Training such models is not cheap and requires a comprehensive corpus of documents, annotations, linguistic experts, and a good processing pipeline to avoid the common problems:
If your use case is not multi-language, or is narrowed to specific scenarios, it is best to use a highly performant English model. With multi-language embedders there are three points of measurement when it comes to RAG: the performance of the embedding model itself, the multi-language quality, and the token length you need, and all of these aspects have to be balanced.
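One quick check we found useful for the token-length aspect: the same content can cost a very different number of tokens depending on the language, which matters because many multilingual embedders have a hard input limit (often 512 tokens). The sketch below is illustrative only, again using xlm-roberta-base as a stand-in tokenizer and made-up sample sentences:

```python
# Sketch: token cost of the same sentence across languages, measured against
# a typical 512-token input limit. xlm-roberta-base is used only as an
# illustrative multilingual tokenizer; the sentences are made-up examples.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
MAX_TOKENS = 512  # common input limit for many multilingual embedding models

samples = {
    "en": "The contract must be renewed thirty days before it expires.",
    "de": "Der Vertrag muss dreißig Tage vor seinem Ablauf verlängert werden.",
    "ro": "Contractul trebuie reînnoit cu treizeci de zile înainte de expirare.",
    "ja": "契約は有効期限の30日前に更新しなければなりません。",
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{lang}: {n_tokens} tokens ({n_tokens / MAX_TOKENS:.1%} of the budget)")
```

Running this kind of check on your own corpus helps you size chunks so that non-English documents do not silently get truncated more aggressively than English ones.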
Lessons learned along the exploration pathway: we selected E5-multilanguage-base for the following reasons:
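For illustration, here is a minimal sketch of how an E5-family embedder can be wired into retrieval with the sentence-transformers library. The Hugging Face id intfloat/multilingual-e5-base and the "query:"/"passage:" prefixes follow the E5 model card conventions, and the texts are made-up examples rather than our production data:

```python
# Minimal sketch: multilingual retrieval with an E5-family embedder.
# Assumes the `sentence-transformers` package is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 models expect role prefixes: "query: " for questions, "passage: " for documents.
passages = [
    "passage: Refunds are processed within 14 days of receiving the returned item.",
    "passage: Rückerstattungen werden innerhalb von 14 Tagen nach Erhalt der Rücksendung bearbeitet.",
    "passage: Livrarea standard durează între 3 și 5 zile lucrătoare.",
]
query = "query: Cât durează procesarea unei rambursări?"  # Romanian: "How long does a refund take?"

passage_emb = model.encode(passages, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_emb, passage_emb)[0]
best = scores.argmax().item()
print(f"Best match (score {scores[best]:.3f}): {passages[best]}")
```

The cross-lingual behavior, a Romanian query matched against English and German passages, is exactly what we needed; scores and rankings will of course vary with your own corpus.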
Conclusions & Observations
The lesson we learned is that, so far in our testing, no model perfectly balances all the criteria, and most of the work comes from fine-tuning and classic NLP combined with an understanding of language and cultural aspects. To conclude, whenever you have a multi-language use case, build your products with the same diverse, culturally agnostic mentality you want your users to experience.
The lines above are a short summary of the lessons learned over a couple of sprints of RAG effort in the multi-language space, and we are still learning every day.
Thank you,
Larise