To Embed or not to Embed ...
Arun Krishnan
Entrepreneur, technology and business leader, and author; experienced Data Science, AI, and GenAI professional, driving technology and business growth.
By now, everyone ought to be familiar with the Retrieval-Augmented Generation (RAG) approach: documents or text files are broken into chunks, the chunks are embedded into numerical vectors and stored in a vector database, and the chunks relevant to a query are retrieved using cosine similarity or some other similarity metric.
This has enabled the development of a wide range of applications without having to resort to the more expensive option of fine-tuning models. The current wave of Generative AI applications is driven, in large part, by the RAG approach.
This works well for content creation, translation or summarisation of text. However, when it comes to numerical and structured data, this approach might not work too well.
Why? What's the difference? Can't the LLM simply treat numerical data as just another form of text? Why not embed numerical tables and use them in the same way?
All very valid questions. However, the devil, as always, is in the details. Let's take a CSV file with rows and rows of comma-separated data. We know that the text will be broken into chunks of, say, 1024 tokens each. When this happens, our nice, structured table gets broken into several unrecognisable pieces, as the sketch below shows. When those chunks are queried, the results can be, to put it mildly, surprising.
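To make this concrete, here is a minimal sketch of naive fixed-size chunking applied to tabular data. The CSV content is made up, and the chunk size is set to 50 characters purely for demonstration; real pipelines chunk by tokens, but the failure mode is the same:

```python
# A minimal sketch of naive fixed-size chunking applied to CSV data.
# The data and the 50-character chunk size are illustrative only;
# real RAG pipelines typically chunk by tokens (e.g. 1024).

csv_text = (
    "product,quarter,sales\n"
    "Widget A,Q1,12000\n"
    "Widget A,Q2,15500\n"
    "Widget B,Q1,9800\n"
    "Widget B,Q2,11200\n"
)

chunk_size = 50  # characters, for demonstration only

# Slice the text into fixed-size pieces with no regard for row boundaries.
chunks = [csv_text[i:i + chunk_size] for i in range(0, len(csv_text), chunk_size)]

for n, chunk in enumerate(chunks):
    print(f"--- chunk {n} ---")
    print(chunk)

# With this chunk size, chunk 0 ends with "Widget A,Q" and chunk 1 begins
# with "2,15500", so the row linking Widget A, Q2 and 15500 exists in no
# single chunk. A query about Widget A's Q2 sales can only ever retrieve
# a fragment of the fact.
```

No single retrieved chunk contains the complete record, which is exactly the boundary problem described next.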
If the data our query demands does NOT sit on those chunk boundaries, we might, just might, get correct information back. If, however, the data does sit on a boundary, the responses might be no different from hallucinations.
A further disadvantage is that we can't ask complicated queries such as, say, "Show me the average sales by product by quarter, last year". The LLM will, in most cases, have no way to unpack the start and end dates of the data required to answer such a query.
A much better way is to use the LLM as a front end for analysing the question and translating it into a database query. You would, of course, have to give the prompt enough information for the LLM to generate the query with the right table and field names.
Why would this work?
Because the LLM, being a language model, is great at translation. And an SQL query is, after all, just another language. The LLM can easily translate between human language and a database query, provided all the relevant information is supplied, as the sketch below illustrates.
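Here is a minimal text-to-SQL sketch using the OpenAI Python SDK. The `sales` table, its columns, and the model name are assumptions made for this example; in practice you would supply your own schema and use whichever LLM your stack provides:

```python
# A minimal text-to-SQL sketch. The `sales` table, its columns, and the
# model name are hypothetical; adapt them to your own schema and LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The prompt carries the schema, so the LLM can generate the query
# with the right table and field names.
SCHEMA = """
Table: sales
Columns:
  product   TEXT    -- product name
  sale_date DATE    -- date of the sale
  amount    NUMERIC -- sale value
"""

def question_to_sql(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {
                "role": "system",
                "content": (
                    "Translate the user's question into a single SQL query "
                    "for the following schema. Return only the SQL.\n" + SCHEMA
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

# For "Show me the average sales by product by quarter, last year",
# a well-prompted LLM can produce something along the lines of:
#
#   SELECT product,
#          EXTRACT(QUARTER FROM sale_date) AS quarter,
#          AVG(amount) AS avg_sales
#   FROM sales
#   WHERE EXTRACT(YEAR FROM sale_date) =
#         EXTRACT(YEAR FROM CURRENT_DATE) - 1
#   GROUP BY product, quarter
#   ORDER BY product, quarter;
```

The generated SQL can then be run directly against the database, with the result set returned to the user, and, if you like, summarised back into natural language by the same LLM.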
Want to learn more? Looking for something like this for your organisation? iLink can certainly help you with that. Let's connect!