Optimizing retrievers for AI

Language models come in many sizes and forms, but they still cannot understand your private and continuously growing data. So even with OpenAI's GPT models and now DeepSeek, which come close to human-level performance, it is difficult to solve your business use cases without retrievers that pull meaningful data for the generation model to reason over. Yes, you can fine-tune, but the result will still be limited to the knowledge available at training time. That is the main reason RAG (Retrieval-Augmented Generation) is one of the most searched terms on the internet.

Google Trends, last 3 months: searches for 'RAG'.

Google Trends, last 3 months: searches for 'LLM fine tuning' and 'fine tune'.

Are you experiencing the struggle below?

I am using a SOTA model that I have thoroughly tested, and it delivers the expected results when combined with the right prompts and knowledge. However, at runtime, the retriever is not selecting the most relevant content, leading to suboptimal inputs for the generation model. As a result, the model struggles to generate accurate and contextually appropriate outputs. My retriever is not providing the expected quality!

If you are semantically aligned with the above statement, then let's dig in further.

Disclaimer: Don't use your embedding models (retrievers) just yet to check your semantic alignment with the above statement :) Let's first understand them.

Understanding retrievers

Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based search with generative AI to produce more accurate, contextual, and updated responses. Unlike standard LLMs, which rely only on pre-trained knowledge, RAG dynamically retrieves external documents to improve its answers.

If you run a 1-gram analysis on the above two paragraphs, you will see that 'retrieval' occurs twice, and both TF (term frequency) and KL divergence (the relative importance of a term to the topic) add weight to it. So even by this mathematical evaluation, retrievers are important to understand, well, besides the obvious reasons :)

Please note that, as shared, retrieval can be performed from any external source that provides context on the topic in question. However, the most widely used and efficient method is semantic search, i.e. vector search. The diagram below depicts the vector-search-based retrieval flow.

Created by: Rahul Mathur

The above diagram separates the retriever-based generation flow from the offline knowledge base creation within vector stores. Offline knowledge embedding is a pre-processing step that requires:

crawlers & scrapers, document loaders, embedding models, and vector stores.

Retrieval, on the other hand, is a query-time knowledge selection step performed on top of the vector store. A retriever is a combination of the following (a minimal sketch follows the list):

  1. Embedding model: Converts textual data into numerical vector representations. Embedding models operate in an n-dimensional vector space, where each dimension captures a specific characteristic of the data. At query time, the same model embeds the incoming query so that it can be matched against the vectors already stored in the database.
  2. Similarity search algorithm: Finds the closest matching vectors to the query using similarity metrics. Common choices include cosine similarity, Euclidean distance, k-NN, and approximate nearest neighbor methods such as HNSW.
  3. Vector store/DB: You don't necessarily need a vector database; you can use an indexing library like FAISS instead. However, vector databases are easier and more efficient to manage. Chroma, Milvus, Mosaic, and Pinecone are a few examples. These databases help with scalability, out-of-the-box ranking, sparse vector search, and seamless integration with embedding models.
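
To make these three pieces concrete, here is a minimal, illustrative sketch in Python. It uses the sentence-transformers library as the embedding model, plain NumPy cosine similarity as the search algorithm, and an in-memory array as a stand-in for a vector store (in practice you would swap this for FAISS, Chroma, Milvus, etc.). The model name and corpus are placeholders, not a recommendation.

```python
# Minimal retriever sketch: embedding model + cosine similarity + in-memory "store".
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Embedding model: converts text into dense vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

# Offline step: embed the knowledge base (here just a tiny in-memory corpus).
corpus = [
    "RAG retrieves external documents to ground LLM answers.",
    "Vector databases store embeddings for similarity search.",
    "Fine-tuning adapts a model to domain-specific data.",
]
corpus_vectors = model.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[tuple[str, float]]:
    """2. Similarity search: cosine similarity = dot product of normalized vectors."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vectors @ query_vector
    best = np.argsort(scores)[::-1][:top_k]
    # 3. "Vector store": a NumPy array stands in for FAISS/Chroma/Milvus here.
    return [(corpus[i], float(scores[i])) for i in best]

print(retrieve("How does RAG use external knowledge?"))
```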

Optimizations to achieve quality

It entirely depends on the use case whether optimizations are necessary. A simple use case can often be built using the basic options available through standard integrations. However, for more complex scenarios, you can explore the following options iteratively to enhance the quality of your RAG solution:

Best suited embedding model

Created by: Rahul Mathur

The embedding model you use can make or break your RAG solution. Select the model that aligns best with your use case. You can choose a model based on the following factors (a quick token-limit check is sketched after the list):

  1. Data: text (BERT, GPT, etc.), images (CLIP), or specialized areas like medical or legal. Choose a model trained on datasets of the same nature as yours.
  2. Performance or latency: If speed and low-latency are key requirements for your solution, you might need to opt for lighter models like DistilBERT or ALBERT, which are optimized for fast inference at the cost of a slight trade-off in accuracy.
  3. Token size: Choose an embedding model whose token limit is compatible with the chunk size you require. This ensures the model can effectively process, store, and retrieve meaningful chunks without losing context. For example, GPT-3 has a higher token limit than BERT, which may make it more suitable for handling larger documents or queries.
  4. Leaderboards: Checking leaderboards and benchmarks can help you identify top-performing models on specific metrics. For example, the Hugging Face MTEB (Massive Text Embedding Benchmark) leaderboard or other ranking platforms can give you an idea of how different models compare on common NLP tasks.
  5. Fine tuning potential: Consider whether the model allows fine-tuning based on your specific data. Pre-trained models like BERT or T5 can be fine-tuned for your particular domain or task, which might provide significant performance improvements.
  6. Scalability & resources: Evaluate how the embedding model scales with increasing data size. Some models might require substantial computational resources (like GPUs) to function effectively at scale, which can be an important consideration if you're dealing with large datasets or real-time processing.
  7. Model availability & ecosystem: Assess the availability of the model in the ecosystem you are working with. Some models, like GPT-3, are available through API-based access, while others may require hosting on a specific platform (e.g., Hugging Face, TensorFlow). Ensure the model integrates seamlessly into your tech stack.
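
As a quick illustration of point 3, you can sanity-check your chunk sizes against a candidate model's token limit before committing to it. The sketch below uses the Hugging Face transformers tokenizer; the model name and chunk texts are placeholders for illustration.

```python
# Sanity-check chunk sizes against a candidate embedding model's token limit.
# Assumes: pip install transformers
from transformers import AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder candidate model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

chunks = [
    "Short chunk that easily fits the context window.",
    "A much longer chunk pulled from a large document. " * 50,  # pretend this came from your loader
]

limit = tokenizer.model_max_length  # e.g. 512 for many BERT-style encoders
for i, chunk in enumerate(chunks):
    n_tokens = len(tokenizer.encode(chunk, add_special_tokens=True))
    status = "OK" if n_tokens <= limit else "TOO LONG - will be truncated"
    print(f"chunk {i}: {n_tokens} tokens (limit {limit}) -> {status}")
```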

Hybrid search with learned sparse retrieval

This technique enhances the final output by refining the results obtained from dense vector search. Methods such as BM25 and TF-IDF are employed to retrieve documents containing the relevant tokens. Additionally, advanced techniques like term expansion can further improve retrieval effectiveness on top of these algorithms; a rough sketch of sparse-dense fusion follows. I would love to write about these techniques in future articles.
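
As a rough illustration of hybrid retrieval, the sketch below blends BM25 keyword scores with dense cosine scores through a simple weighted sum. The rank_bm25 library, the blending weight, and the corpus are illustrative assumptions, not a prescribed recipe; production systems often use the fusion built into their vector database instead.

```python
# Hybrid retrieval sketch: weighted fusion of BM25 (sparse) and dense scores.
# Assumes: pip install rank-bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

corpus = [
    "RAG retrieves external documents to ground LLM answers.",
    "Vector databases store embeddings for similarity search.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]

# Sparse side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# Dense side: normalized sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
doc_vectors = model.encode(corpus, normalize_embeddings=True)

def hybrid_search(query: str, alpha: float = 0.5) -> list[tuple[str, float]]:
    """alpha blends the two signals (0 = BM25 only, 1 = dense only)."""
    sparse = np.array(bm25.get_scores(query.lower().split()))
    dense = doc_vectors @ model.encode([query], normalize_embeddings=True)[0]

    # Min-max normalize each score list so they are comparable before blending.
    def norm(x: np.ndarray) -> np.ndarray:
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    combined = alpha * norm(dense) + (1 - alpha) * norm(sparse)
    order = np.argsort(combined)[::-1]
    return [(corpus[i], float(combined[i])) for i in order]

print(hybrid_search("keyword search with BM25"))
```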

Train Embedding Models

If pre-trained models don’t provide the accuracy you need, consider training your own embeddings. Training custom embeddings on domain-specific data allows the system to better understand your specific context and deliver more relevant results, improving the overall quality of your solution. By using frameworks like sentence-transformers and training on triplet datasets, you can make the model understand the nature of the data in your domain. The picture below illustrates how embedding models can be tuned to reshape the embedding space of your data.

Created by: Rahul Mathur
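
For reference, here is a hedged sketch of what triplet-based fine-tuning can look like with the classic sentence-transformers fit API. The triplet examples, base model, hyperparameters, and output path are placeholders; real training needs a sizeable domain-specific triplet dataset and a held-out evaluation split.

```python
# Sketch: fine-tune an embedding model with triplet loss (sentence-transformers).
# Assumes: pip install sentence-transformers
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

# Each triplet: (anchor, positive, negative) drawn from your domain data.
train_examples = [
    InputExample(texts=[
        "What is the claim settlement process?",            # anchor
        "Claims are settled within 30 days of approval.",   # positive
        "Our cafeteria opens at 9 am.",                      # negative
    ]),
    # ... thousands more domain triplets in practice
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# Pull the embedding space so domain-relevant pairs land closer together.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,            # placeholder hyperparameters
    warmup_steps=10,
)
model.save("my-domain-embedder")  # hypothetical output path
```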

Multi Modal Approach

A multi-modal approach leverages various data types (text, images, audio, etc.) for retrieval, enriching the process by incorporating more contextual information and improving result relevance. Even when dealing solely with text, a multi-modal approach can be used to ensemble results for better accuracy. However, this comes with a trade-off in terms of increased latency and cost.
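
One lightweight way to experiment with this, assuming a text-image use case, is a CLIP-style checkpoint exposed through sentence-transformers, which embeds captions and images into the same vector space. The model name, image path, and captions below are placeholders for illustration only.

```python
# Multi-modal sketch: embed text and images into one space with a CLIP model.
# Assumes: pip install sentence-transformers pillow
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # placeholder CLIP checkpoint

captions = [
    "an architecture diagram of a RAG pipeline",
    "a photo of a cat sleeping on a laptop",
]
caption_vectors = model.encode(captions, normalize_embeddings=True)

# Hypothetical image from your knowledge base.
image_vector = model.encode(Image.open("diagram.png"), normalize_embeddings=True)

# Cosine scores tell us which caption best matches the image.
print(util.cos_sim(image_vector, caption_vectors))
```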

Evaluation

This is not an optimization itself but a key step in ensuring effectiveness. Always test and evaluate! Continuously assess the quality of your retrieval results and tweak your model, vectors, or approach as needed. Implement real-time feedback loops from users and use precision/recall metrics to fine-tune your RAG setup, ensuring it stays relevant and accurate.
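
A bare-bones starting point, assuming you have labeled relevant documents for each query, is to compute precision@k and recall@k over your retriever's ranked output. The helper and the example IDs below are illustrative only.

```python
# Evaluation sketch: precision@k and recall@k for retrieval results.

def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """retrieved: ranked doc IDs from the retriever; relevant: ground-truth doc IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 2 of the 3 relevant docs appear in the retriever's top 3.
retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = {"doc2", "doc9", "doc5"}
print(precision_recall_at_k(retrieved, relevant, k=3))  # -> (0.666..., 0.666...)
```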

Conclusion

Optimizing retrieval is not a one-size-fits-all approach—it depends on the complexity of the use case, data modality, and performance requirements. By iteratively refining your embedding models, retrieval techniques (dense & sparse), multi-modal approaches, and evaluation strategies, you can significantly enhance the quality of your RAG (Retrieval-Augmented Generation) solutions.

A hybrid retrieval approach—combining dense vector search for semantic understanding and sparse vector search for precise keyword matching—offers the best of both worlds. Additionally, custom-trained embeddings and multi-modal methods can further boost contextual relevance.

Ultimately, balancing accuracy, latency, and cost is key. The right retrieval optimizations ensure that AI-driven systems produce reliable, context-aware, and efficient responses, driving better decision-making and user experiences.

