The Altar of Embeddings
Retrieval Augmented Generation (RAG) has emerged as the hot concept when building applications using Large Language Models (LLMs). A key component of RAG pipelines is the use of embeddings. As I worked with embeddings, I was surprised by both their power and their inscrutability. I visualize a few examples below to build intuition about embeddings and end with some insights on how to use them better.
RAG pipelines provide proprietary documents and data to the LLM and improve the relevance and accuracy of its responses. A RAG pipeline collects the proprietary documents, identifies the chunks of text within them that are relevant to a question, passes those chunks to the LLM along with the question, and instructs the LLM to answer the question using the supplied chunks of text. Embeddings play a vital role in RAG pipelines. Embeddings encode text into vectors and preserve the semantic relationships within the chunks of text. Embedding text into vectors makes it quick and cheap to find similar documents or passages.
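As a minimal sketch of that retrieval step, the snippet below embeds a few placeholder chunks and a question and ranks the chunks by cosine similarity. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; the article does not name a specific model or dataset, so the chunks and model choice here are illustrative only.

```python
# Minimal sketch of the retrieval step in a RAG pipeline.
# Assumes the sentence-transformers library; the model name and the
# example chunks are placeholders, not taken from the article.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Queen were a British rock band formed in London in 1970.",
    "In chess, the queen is the most powerful piece on the board.",
    "A duchess ranks immediately below a queen in the British peerage.",
]
question = "Which rock band was formed in London in 1970?"

# Encode the chunks once; encode the question at query time.
chunk_vectors = model.encode(chunks, convert_to_tensor=True)
question_vector = model.encode(question, convert_to_tensor=True)

# Cosine similarity gives a quick, cheap measure of semantic closeness.
scores = util.cos_sim(question_vector, chunk_vectors)[0]
top_k = scores.argsort(descending=True)[:2].tolist()
for idx in top_k:
    print(f"{scores[idx].item():.3f}  {chunks[idx]}")
```

The top-scoring chunks are what would be passed to the LLM along with the question.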
Jay Alammar’s excellent article, The Illustrated Word2vec (https://jalammar.github.io/illustrated-word2vec/), is highly recommended for more details. His article serves as the inspiration for visualizing vectors here. Visualizing vectors provides the opportunity to see semantically (meaningfully) similar text close together.
In the first example, I search for the five nearest neighbors of the word “queen” within a space of other words for aristocracy. In this vector space, the word “queen” (blue dot) is “nearer” to the words “king”, “prince”, “princess”, “baroness” and “duchess” than it is to the words “duke” or “baron”, which is expected.
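A sketch of that kind of nearest-word search is below, assuming gensim and its downloadable pretrained GloVe vectors; the article does not say which word-embedding model it uses, and the candidate list simply mirrors the aristocracy words above (all assumed to be in the model’s vocabulary).

```python
# Sketch of a top-5 nearest-word search, assuming gensim and its
# pretrained GloVe vectors; the article does not name its model.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained word vectors

candidates = ["king", "prince", "princess", "baroness",
              "duchess", "duke", "baron"]

# Rank the candidate words by cosine similarity to "queen".
ranked = sorted(candidates,
                key=lambda w: vectors.similarity("queen", w),
                reverse=True)

for word in ranked[:5]:
    print(f"{word}: {vectors.similarity('queen', word):.3f}")
```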
In the second example, I added more words to the search space, this time from the domain of chess, since the word “queen” is common to both aristocracy and chess. As expected, the chess words and the aristocracy words cluster together, indicating (semantic) similarity.
Let me extend the example to include names of rock bands (another space where the word “queen” would be found). Expectedly, the rock bands form a distinct cluster. It is also interesting that the word “cream” seems to have formed its own small cluster along with the words “black” and “white” (likely because “cream” in this space is more related to coffee than to the great rock supergroup). This is a preview of the unpredictable results we get with embeddings. I use the word unpredictable to suggest the results are hard to anticipate, not that they are fickle or untrustworthy.
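For readers who want to reproduce this kind of plot, here is a sketch of projecting embeddings to two dimensions, assuming sentence-transformers, scikit-learn’s PCA and matplotlib; the article does not state how its plots were produced, and the word list below is illustrative rather than the exact dataset used.

```python
# Sketch of projecting embeddings to 2-D for plotting. Assumes
# sentence-transformers, scikit-learn and matplotlib; the word list
# is illustrative, not the article's exact dataset.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

model = SentenceTransformer("all-MiniLM-L6-v2")

words = ["queen", "king", "duchess", "baron",          # aristocracy
         "rook", "bishop", "pawn", "checkmate",        # chess
         "Queen", "Cream", "The Who", "Led Zeppelin"]  # rock bands

embeddings = model.encode(words)

# Reduce the high-dimensional vectors to 2-D so clusters become visible.
points = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1])
for (x, y), word in zip(points, words):
    plt.annotate(word, (x, y))
plt.title("2-D projection of embeddings")
plt.show()
```

Note that any 2-D projection of a high-dimensional space distorts distances, so plots like these are an intuition aid rather than a faithful map of the vector space.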
Embeddings are useful because they work on sentences too, not just words. In the following example, I embed an introductory sentence about each of several British rock bands and search for the sentence about “The Shadows”. In the results, the sentence about “The Shadows” (a band formed in the 1950s) is closer to the sentences about bands formed in the 1960s (green dots) than it is to the sentences about bands formed later, in the 70s, 80s and 90s. It is great that I can take a sentence, measure how close it is to other sentences, and quickly find the closest ones. This “closeness” is the basis of semantic search: I can embed all the sentences, embed a question, retrieve the sentences whose embeddings are closest to the embedding of the question, and then use those closest sentences to form a response to the question.
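The “closeness” in these searches is typically cosine similarity between the embedding vectors, though dot product and Euclidean distance are also used in practice. A minimal sketch in plain NumPy, with tiny made-up vectors purely for illustration:

```python
# Cosine similarity between two embedding vectors: a common measure
# of "closeness" in semantic search (other metrics are also used).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means same direction, 0.0 means unrelated, -1.0 means opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", purely for illustration.
question  = np.array([0.9, 0.1, 0.0])
sentence1 = np.array([0.8, 0.2, 0.1])   # semantically close
sentence2 = np.array([0.0, 0.3, 0.9])   # semantically far

print(cosine_similarity(question, sentence1))  # closer to 1.0
print(cosine_similarity(question, sentence2))  # closer to 0.0
```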
Let me change the sentence about “The Shadows” to state that they were a band formed in 1984. The search result ignores “Blur” and other bands that are now chronologically closer (from the 70s, 80s and 90s) and still indicates the closest matches are the sentences about the bands formed in the 60s.
In the first example with the bands, I assumed the year of formation of the band was the determinant influencing the closeness. Of course, it isn’t that simple: the embedding is influenced by the information in the sentence (the year of formation) and possibly by the genre of music, or by how often the phrase or words occurred in the original training dataset. It is this unpredictability that needs to be recognized and managed when using embeddings.
Let me test another example. In this next dataset, I search the vector space for “The Jimi Hendrix Experience” amongst other bands from the same era but distinct in their sound, from softer (The Beach Boys, The Beatles) to harder (Cream, The Who). I also capitalize Queen and Cream so that they are treated as proper nouns. The results are surprising: “The Who” and “Cream” don’t even show up in the top five nearest matches when searching for “The Jimi Hendrix Experience”. I would expect Jimi Hendrix to be closer to The Who and Cream than to the Monkees, at least. It is quite clear that the results can be unpredictable, which makes it important to be cautious when using embeddings for more complex use cases.
Embeddings are “relative”: where a word or sentence lands relative to its neighbors will change based on the model and the dataset. Nearest-neighbor searches, though mathematically precise, can produce unexpected results. It is useful to remember the diagram of Anscombe’s quartet, which also illustrates a situation where mathematical precision and human intuition diverge from each other. Moreover, embeddings are produced by a neural network trained on a dataset, so it is not possible to scrutinize the embedding of a particular word or sentence. There is no “explanation” of why the embedding for a word or a concept is what it is.
A few basic steps helped improve the results when searching with embeddings.
Learning about embeddings has been a rewarding exercise: they provide a powerful tool for semantic search but also present a challenge in their inscrutability. Embeddings sometimes yield unexpected results, but I suppose that is to be expected since we are searching through words and concepts, and language does tend to be subjective. These simple examples (with instructive results) temper my enthusiasm when using embeddings. I can be confident that the results will be relevant, but they won’t be perfect. However, I am still going to be an enthusiastic trekker to the altar of embeddings, albeit with caution.
P.S. Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics (mean, variance, correlation) yet have very different distributions and appear very different when graphed. It is one of my favorite concepts, and I tend to refer to it to provide context whenever the results of forecasting and model fitting are counterintuitive.