The Altar of Embeddings
Retrieval Augmented Generation (RAG) has emerged as the hot concept when building applications using Large Language Models (LLMs). A key component of RAG pipelines is the use of embeddings. As I worked with embeddings, I was surprised by both their power and their inscrutability. I visualize a few examples below to build intuition about embeddings and end with some insights on how to use them better.
RAG pipelines provide proprietary documents and data to the LLM and improve the relevance and accuracy of its responses. A RAG pipeline collects the proprietary documents, identifies the chunks of text within them that are relevant to a question, passes those chunks to the LLM along with the question, and instructs the LLM to answer the question using the supplied chunks of text. Embeddings play a vital role in RAG pipelines. Embeddings encode text into vectors and preserve the semantic relationships within the chunks of text. Embedding text into vectors makes it quick and cheap to find similar documents or passages.
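As a minimal sketch of that retrieval step, the snippet below embeds a few placeholder chunks and a question and ranks the chunks by cosine similarity. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model; the article does not name a specific model or dataset, so the chunks and model choice here are illustrative only.

```python
# Minimal sketch of the retrieval step in a RAG pipeline.
# Assumes the sentence-transformers library; the model name and the
# example chunks are placeholders, not taken from the article.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Queen were a British rock band formed in London in 1970.",
    "In chess, the queen is the most powerful piece on the board.",
    "A duchess ranks immediately below a queen in the British peerage.",
]
question = "Which rock band was formed in London in 1970?"

# Encode the chunks once; encode the question at query time.
chunk_vectors = model.encode(chunks, convert_to_tensor=True)
question_vector = model.encode(question, convert_to_tensor=True)

# Cosine similarity gives a quick, cheap measure of semantic closeness.
scores = util.cos_sim(question_vector, chunk_vectors)[0]
top_k = scores.argsort(descending=True)[:2].tolist()
for idx in top_k:
    print(f"{scores[idx].item():.3f}  {chunks[idx]}")
```

The top-scoring chunks are what would be passed to the LLM along with the question.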
Jay Alammar’s excellent article, The Illustrated Word2vec (https://jalammar.github.io/illustrated-word2vec/), is highly recommended for more details. His article serves as the inspiration for visualizing vectors here. Visualizing vectors provides the opportunity to see semantically (meaningfully) similar text close together.
In the first example, I search for the five nearest neighbors of the word “queen” within a space of other words for aristocracy. In this vector space, the word “queen” (blue dot) is “nearer” to the words “king”, “prince”, “princess”, “baroness” and “duchess” than it is to the words “duke” or “baron”, which is expected.
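A sketch of that kind of nearest-word search is below, assuming gensim and its downloadable pretrained GloVe vectors; the article does not say which word-embedding model it uses, and the candidate list simply mirrors the aristocracy words above (all assumed to be in the model’s vocabulary).

```python
# Sketch of a top-5 nearest-word search, assuming gensim and its
# pretrained GloVe vectors; the article does not name its model.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained word vectors

candidates = ["king", "prince", "princess", "baroness",
              "duchess", "duke", "baron"]

# Rank the candidate words by cosine similarity to "queen".
ranked = sorted(candidates,
                key=lambda w: vectors.similarity("queen", w),
                reverse=True)

for word in ranked[:5]:
    print(f"{word}: {vectors.similarity('queen', word):.3f}")
```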
In the second example, I added more words to the search space, this time from the domain of chess, since the word “queen” is common to both aristocracy and chess. As expected, the chess words and the aristocracy words cluster together, indicating (semantic) similarity.
Let me extend the example to include names of rock bands (another space where the word “queen” would be found). Expectedly, the rock bands form a distinct cluster. It is also interesting that the word “cream” seems to have formed its own small cluster along with the words “black” and “white” (likely because “cream” in this space is more related to coffee than to the great rock supergroup). This is a preview of the unpredictable results we get with embeddings. I use the word unpredictable to suggest the results are hard to anticipate, not that they are fickle or untrustworthy.
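For readers who want to reproduce this kind of plot, here is a sketch of projecting embeddings to two dimensions, assuming sentence-transformers, scikit-learn’s PCA and matplotlib; the article does not state how its plots were produced, and the word list below is illustrative rather than the exact dataset used.

```python
# Sketch of projecting embeddings to 2-D for plotting. Assumes
# sentence-transformers, scikit-learn and matplotlib; the word list
# is illustrative, not the article's exact dataset.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

model = SentenceTransformer("all-MiniLM-L6-v2")

words = ["queen", "king", "duchess", "baron",          # aristocracy
         "rook", "bishop", "pawn", "checkmate",        # chess
         "Queen", "Cream", "The Who", "Led Zeppelin"]  # rock bands

embeddings = model.encode(words)

# Reduce the high-dimensional vectors to 2-D so clusters become visible.
points = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1])
for (x, y), word in zip(points, words):
    plt.annotate(word, (x, y))
plt.title("2-D projection of embeddings")
plt.show()
```

Note that any 2-D projection of a high-dimensional space distorts distances, so plots like these are an intuition aid rather than a faithful map of the vector space.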
Embeddings are useful because they work on sentences too, not just words. In the following example, I embed an introductory sentence about each of several British rock bands and search for the sentence about “The Shadows”. In the results, the sentence about “The Shadows” (a band formed in the 1950s) is closer to the sentences about bands formed in the 1960s (green dots) than it is to the sentences about bands formed later, in the 70s, 80s and 90s. It is great that I can take a sentence, measure how close it is to other sentences, and quickly find the closest ones. This “closeness” is the basis of semantic search: I can embed all the sentences, embed a question, retrieve the sentences whose embeddings are closest to the embedding of the question, and then use those closest sentences to form a response to the question.
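The “closeness” in these searches is typically cosine similarity between the embedding vectors, though dot product and Euclidean distance are also used in practice. A minimal sketch in plain NumPy, with tiny made-up vectors purely for illustration:

```python
# Cosine similarity between two embedding vectors: a common measure
# of "closeness" in semantic search (other metrics are also used).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means same direction, 0.0 means unrelated, -1.0 means opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", purely for illustration.
question  = np.array([0.9, 0.1, 0.0])
sentence1 = np.array([0.8, 0.2, 0.1])   # semantically close
sentence2 = np.array([0.0, 0.3, 0.9])   # semantically far

print(cosine_similarity(question, sentence1))  # closer to 1.0
print(cosine_similarity(question, sentence2))  # closer to 0.0
```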
Let me change the sentence about “The Shadows” to state that they were a band formed in 1984. The search result ignores “Blur” and other bands that are now chronologically closer (from the 70s, 80s and 90s) and still indicates the closest matches are the sentences about the bands formed in the 60s.
In the first example with the bands, I assumed the year of formation of the band was the determinant influencing the closeness. Of course, it isn’t that simple: the embedding is influenced by the information in the sentence (the year of formation) and possibly by the genre of music, or by how often the phrase or words occurred in the original training dataset. It is this unpredictability that needs to be recognized and managed when using embeddings.
Let me test another example. In this next dataset, I search the vector space for “The Jimi Hendrix Experience” amongst other bands from the same era but distinct in their sound, from softer (The Beach Boys, The Beatles) to harder (Cream, The Who). I also capitalize Queen and Cream so that they are treated as proper nouns. The results are surprising: “The Who” and “Cream” don’t even show up in the top five nearest matches when searching for “The Jimi Hendrix Experience”. I would expect Jimi Hendrix to be closer to The Who and Cream than to the Monkees, at least. It is quite clear that the results can be unpredictable, which makes it important to be cautious when using embeddings for more complex use cases.
Embeddings are “relative”: where a word or sentence lands relative to its neighbors will change based on the model and the dataset. Nearest-neighbor searches, though mathematically precise, can produce unexpected results. It is useful to remember the diagram of Anscombe’s quartet, which also illustrates a situation where mathematical precision and human intuition diverge from each other. Moreover, embeddings are produced by a neural network trained on a dataset, so it is not possible to scrutinize the embedding of a particular word or sentence. There is no “explanation” of why the embedding for a word or a concept is what it is.
A few basic steps helped improve the results when searching with embeddings.
Learning about embeddings has been a rewarding exercise: they provide a powerful tool for semantic search but also present a challenge in their inscrutability. Embeddings sometimes yield unexpected results, but I suppose that is to be expected since we are searching through words and concepts, and language does tend to be subjective. These simple examples (with instructive results) temper my enthusiasm when using embeddings. I can be confident that the results will be relevant, but they won’t be perfect. However, I am still going to be an enthusiastic trekker to the altar of embeddings, albeit with caution.
P.S. Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics (mean, variance, correlation) yet have very different distributions and appear very different when graphed. It is one of my favorite concepts, and I tend to refer to it to provide context whenever the results of forecasting and model fitting are counterintuitive.