Tips & Tricks: implementing multi-language support in RAG


As you delve into building a RAG solution, in whatever flavor it takes (dense, sparse, hybrid, or neural retrieval; memory-augmented generation; contextual, agentic, or modular RAG), it's important to anticipate that once the product stabilizes, the next request in line will be multi-language support.

While this may seem straightforward in theory, since existing models support multilingual embeddings and powerful LLMs handle multilingual inference, the practical implementation, especially for business use cases, presents unique challenges. I've gathered some lessons from our experience that can help refine the results, as we've discovered that out-of-the-box solutions are not always sufficient.

Criteria to evaluate an Embedding Model

In the table below, I included a few representative models we considered for our RAG platform.

Embedding types and their associated use cases

*OOV = "out-of-vocabulary": words or terms that are not present in a model's vocabulary at processing time. This happens frequently with names, rare words, slang, and newly coined terms.
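To make the OOV concern concrete, here is a minimal sketch of how you might measure what fraction of a corpus falls outside a model's vocabulary before committing to it. The vocabulary and token list below are illustrative stand-ins, not a real tokenizer's data:

```python
# Sketch: estimating the OOV rate of a token stream against a vocabulary.
# `vocab` stands in for a real tokenizer's vocabulary (hypothetical data).

def oov_rate(tokens: list[str], vocab: set[str]) -> float:
    """Fraction of tokens not found in the vocabulary."""
    if not tokens:
        return 0.0
    misses = sum(1 for t in tokens if t not in vocab)
    return misses / len(tokens)

vocab = {"the", "retrieval", "model", "language"}
tokens = ["the", "multilingual", "retrieval", "model", "Ulaanbaatar"]
print(oov_rate(tokens, vocab))  # 2 of 5 tokens are OOV -> 0.4
```

In practice you would run this with the candidate model's own tokenizer over a sample of each target language; a high OOV rate is an early warning that quality will degrade for that language.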


Before selecting a model, answer these questions first

  1. Free vs. paid?
  2. Self-managed (more control) vs. hosted API?
  3. Community, and how was the training data obtained? When building for a large-scale client implementation, origin, lifecycle, and maintenance are essential factors to bear in mind.
  4. Quality: what are the expectations?
  5. RAG fitness: is the model well suited for RAG and your use case?
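One way to keep this decision honest is to turn the five questions into a weighted scorecard. The weights and candidate scores below are purely illustrative assumptions, not benchmark numbers:

```python
# Sketch: a weighted scorecard over the five selection criteria.
# All weights and per-candidate scores are made up for illustration.

CRITERIA = {"cost": 0.15, "control": 0.20, "provenance": 0.20,
            "quality": 0.25, "rag_fitness": 0.20}

def score(candidate: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, each in [0, 1]."""
    return sum(CRITERIA[c] * candidate.get(c, 0.0) for c in CRITERIA)

hosted_api = {"cost": 0.4, "control": 0.2, "provenance": 0.9,
              "quality": 0.9, "rag_fitness": 0.8}
self_hosted = {"cost": 0.8, "control": 0.9, "provenance": 0.7,
              "quality": 0.7, "rag_fitness": 0.8}
print(score(hosted_api), score(self_hosted))
```

The point is not the numbers but the discipline: forcing each candidate to be scored on the same criteria makes the trade-offs explicit before anyone falls in love with a benchmark.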


Making the decision

With these in mind, let's walk through the most common issues to help you decide on your approach. If you have enough resources and no requirement to control the model, it is always nice not to bother with hosting, maintenance, and performance tuning, especially for multi-language embedders, which can be very large; GPU availability is a factor, and having the right skills to manage these models is equally important. If, however, you would like to control the embedding model, self-managing becomes an option and can be cheaper at scale, given that you pay only for infrastructure, not per token (as noted in the table, per-token cost can be high at volume).
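The API-vs-self-hosted trade-off can be reduced to a back-of-the-envelope break-even calculation. The dollar figures below are invented for illustration; substitute your real GPU and API quotes:

```python
# Sketch: break-even between a pay-per-token embedding API and self-hosting.
# All prices are hypothetical placeholders.

def api_monthly_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """What the hosted API would bill for this monthly volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def breakeven_tokens(gpu_monthly_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume above which self-hosting becomes cheaper."""
    return gpu_monthly_usd / usd_per_million_tokens * 1_000_000

# e.g. a $1,500/month GPU node vs. $0.10 per 1M embedded tokens
print(f"{breakeven_tokens(1500, 0.10):,.0f} tokens/month")
```

Remember to load the self-hosted side with the hidden costs the paragraph above mentions: ops skills, retraining, and the larger footprint of multilingual models.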

There are thousands of models out there, but few support multiple languages, and only a handful have good coverage across vocabularies, decent quality, and a verified origin; I tried to summarize those in the table above. As for quality, this aspect becomes particularly important in multi-language scenarios: most embedding models have been perfected and tested on English or Latin-script vocabularies but do not support other language families well. Let's look at the techniques for creating such an embedding model and why some yield higher quality than others:

In general, pretrained multilingual models are considered the best for their broad language coverage, effectiveness in zero-shot and few-shot tasks, and ease of use. Training such models is not cheap and requires a comprehensive corpus of documents, annotations, linguistic experts, and a good processing pipeline to avoid the common problems:

  • Uneven performance: they perform inconsistently across different languages.
  • Large model size: they require significant computational resources.
  • Less customization: limited ability to fine-tune for specific languages or tasks.

If your use case is not multi-language, or is narrowed to specific scenarios, it is best to use highly performant English models. With multi-language embedders, there are three points of measurement when it comes to RAG: the performance of the embedding model itself, the multi-language quality, and the token length you need; all of these aspects need to be balanced.

Lessons learned along the exploration pathway: we selected multilingual-e5-base for the following reasons:

  • Cost, plus we wanted to manage the model and be able to retrain it
  • Balanced performance across the three measurement points
  • A friendly license with good community support
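One E5-specific detail worth knowing: the multilingual-e5 family was trained with instruction-style prefixes, so queries and passages are embedded with different input prefixes (per the model card). A minimal sketch of that convention; the actual embedding call is omitted and would go through whatever client or library you use:

```python
# Sketch of the multilingual-e5 input convention: queries and passages
# receive different prefixes before embedding. The embedding call itself
# is intentionally omitted; these helpers only prepare the input strings.

def prepare_query(text: str) -> str:
    return f"query: {text}"

def prepare_passage(text: str) -> str:
    return f"passage: {text}"

# Example: prefix document chunks before sending them to the embedder.
chunks = ["An English chunk", "Монгол хэл"]
inputs = [prepare_passage(c) for c in chunks]
print(inputs[0])  # "passage: An English chunk"
```

Skipping these prefixes is a common silent failure mode: the model still returns vectors, but retrieval quality drops noticeably.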


Conclusions & Observations

  • There are quality discrepancies not only among different families of languages (e.g., Latin vs. Mongolian) but also within a family;
  • You need native speakers when testing, especially with RAG;
  • Token size may need to be adjusted between families;
  • Vocabulary and grammar are essential: you simply cannot cut mid-word in Mongolian-family languages, as words are not separated by spaces, and the resulting fragments make no sense;
  • Cleansing, parsing, and normalization of your chunks and corpus become even more important for RAG, as characters you might strip for English are required in other vocabularies;
  • You need multiple testers across all the languages you want to support, and constant feedback;
  • Very good language detectors are necessary, especially when a single document mixes two or more languages;
  • Multilingual output is more prone to HAP (hate, abuse, profanity), cultural biases, and nonsense.
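The normalization point above deserves a concrete illustration: an ASCII-only "cleaning" pass, common in English-centric pipelines, silently destroys every non-Latin script in the corpus. A small sketch contrasting that with Unicode-aware normalization (NFKC plus whitespace collapsing), which keeps all scripts intact:

```python
# Sketch: why English-centric cleaning breaks multilingual corpora.
# naive_clean() drops every non-ASCII character; unicode_clean() normalizes
# compatibility forms and whitespace while preserving all scripts.

import unicodedata

def naive_clean(text: str) -> str:
    # Common but dangerous: keeps only ASCII characters.
    return text.encode("ascii", "ignore").decode("ascii")

def unicode_clean(text: str) -> str:
    # Unicode-aware: normalize to NFKC, collapse runs of whitespace.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

chunk = "Монгол хэл"          # Cyrillic-script Mongolian
print(repr(naive_clean(chunk)))    # ' '  (the text is gone)
print(repr(unicode_clean(chunk)))  # 'Монгол хэл'
```

The same caution applies to regex character classes, punctuation stripping, and lowercasing rules: each must be audited per script before being applied to a multilingual corpus.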

The lesson we learned is that, so far in our testing, no model perfectly balances all the criteria, and most of the work comes from fine-tuning and classic NLP, combined with understanding the language and cultural aspects. To conclude, whenever you have a multi-language use case, build your product with the same diverse and culturally agnostic mentality you want your users to consume.

The above lines are a short summary of the lessons we learned over a couple of sprints of RAG work in the multi-language space, and we are still learning every day.


Thank you,

Larise


