The "Magical" Ingredient of LLMs: Vector Embeddings
generated with Dall-E 3

The "Magical" Ingredient of LLMs: Vector Embeddings

In the field of machine learning, vector embeddings have emerged as a central component of large language models (LLMs). These embeddings, learned during the pre-training phase, enable models to map text—be it words, sentences, or paragraphs—into a high-dimensional numerical vector space in a manner that preserves semantic meaning. This article explains vector embeddings, their creation, and their transformative role across various applications including Retrieval-Augmented Generation (RAG) systems.

The Essence of Vector Embeddings

At the heart of every LLM lies the ability to generate vector embeddings. These are high-dimensional representations that encapsulate the semantic properties of text. During the pre-training phase, the model learns these embeddings by analyzing vast corpora of text data, gradually fine-tuning the mappings such that semantically similar text items are positioned close to one another in the vector space.

This is done with transformer networks, specifically the encoder part of the transformer, as shown in the image below. The input text is tokenized (converted into numbers corresponding to words or word parts) and processed by several attention layers before a feed-forward network produces the hidden representation, which is another term for the vector embedding (also called the latent space representation). By considering the context of words and sentences, the LLM is able to deduce the semantics of words and also learn about synonyms.
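As a minimal illustration (not taken from the article's own code), the tokenization step can be inspected with the Hugging Face transformers library; the model name here is only an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Embeddings preserve semantics")   # split into words / word parts
ids = tokenizer.convert_tokens_to_ids(tokens)                  # the corresponding numbers
print(tokens)   # word-piece tokens; rare words may be split into several pieces
print(ids)      # numeric token IDs that are fed into the attention layers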

Encoder architecture (from: "Attention Is All You Need", A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Advances in Neural Information Processing Systems, 2017)

The resulting proximity-based mapping is a cornerstone of numerous applications. For instance, in recommender systems for music streaming, online shopping, job search portals, or online dating, the ability to find semantically related items quickly is crucial. When a user interacts with these systems, the underlying mechanism often involves a proximity search within this embedding space, identifying items that are near the user’s interests.

Learning and Storing Embeddings

The creation of these embeddings is entirely automatic, a by-product of the model's extensive pre-training. Once generated, these vectors are stored in a vector database, which supports rapid approximate proximity (nearest-neighbor) searches. It is essential that the same embedding model is used both for storing items and for querying, ensuring consistency and reliability in the search results.
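A minimal sketch of this store-and-query pattern is shown below; FAISS is used here purely as an example of a vector index (the article does not prescribe a specific vector database), and the random vectors stand in for real embeddings:

import numpy as np
import faiss

dim = 768                                        # dimensionality of the embeddings
index = faiss.IndexFlatIP(dim)                   # inner-product index (cosine after normalization)

item_vectors = np.random.rand(1000, dim).astype("float32")   # placeholder for real item embeddings
faiss.normalize_L2(item_vectors)                 # normalize so inner product equals cosine similarity
index.add(item_vectors)

query_vector = np.random.rand(1, dim).astype("float32")      # must come from the SAME embedding model
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, 5)      # the 5 closest stored items
print(ids, scores)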

Traditional programming methods in computer science are unable to create such semantics-preserving embeddings. The complexity and subtle nuances captured by transformer models are beyond the reach of hand-written rules, illustrating the specific capabilities of machine learning techniques.

Visualizing Embeddings

To illustrate the concept, let's conduct an experiment by embedding six words each from five different topical areas: finance, fruits, animals, sports, and countries. The topical areas and the respective words embedded are:

- Finance: bank, mortgage, money, investment, loan, credit application

- Fruits: apple, banana, orange, grape, mango, pineapple

- Animals: lion, tiger, elephant, giraffe, zebra, kangaroo

- Sports: soccer, basketball, tennis, baseball, football, cricket

- Countries: USA, Canada, Mexico, Brazil, Argentina, UK

Below are the results for four different pre-trained embedding models, which can be obtained from sites like Hugging Face (https://huggingface.co/). These embeddings, all 768-dimensional, are projected down to two dimensions using t-SNE (https://distill.pub/2016/misread-tsne/) for visualization purposes.

Embedding with "bert-base-uncased"

Hugging Face model card: https://huggingface.co/google-bert/bert-base-uncased

Below, the complete Python code for embedding a set of words with a specific model ("bert-base-uncased") via the transformers library from Hugging Face is shown. Similar code can be found in the Google Colab notebook linked at the end of the article.

Example code for using a pretrained embedding model
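Since the original listing appears as an image, the following sketch reconstructs the essential steps; mean pooling over the token embeddings is an assumption, and the original code's exact pooling choice may differ:

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

words = ["bank", "mortgage", "apple", "banana", "lion", "tiger"]   # subset of the word list above

embeddings = []
with torch.no_grad():
    for word in words:
        inputs = tokenizer(word, return_tensors="pt")
        outputs = model(**inputs)
        # average the token embeddings to obtain one 768-dimensional vector per word
        embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy())

print(len(embeddings), embeddings[0].shape)      # 6 vectors, each of dimension 768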

Since 768-dimensional data is hard to visualize, we use t-SNE to project it down to two dimensions. Moreover, for each topic group we shade the convex hull in order to see whether the members of the group are mapped to similar locations. For this model this is largely the case, although the "animals" topic group overlaps with "fruits" and "finance". One should note, however, that an overlap in the t-SNE-generated 2D space does not imply any overlap in the 768-dimensional embedding space.
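A sketch of the projection and hull shading, continuing from the embedding sketch above (the topic labels are placeholders for the full five-topic word list):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from scipy.spatial import ConvexHull

X = np.vstack(embeddings)                        # 768-dimensional word vectors from the sketch above
labels = ["finance", "finance", "fruits", "fruits", "animals", "animals"]   # topic of each word

X_2d = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(X)  # perplexity must be < number of samples

for topic in sorted(set(labels)):
    idx = [i for i, t in enumerate(labels) if t == topic]
    pts = X_2d[idx]
    plt.scatter(pts[:, 0], pts[:, 1], label=topic)
    if len(pts) >= 3:                            # a convex hull needs at least three points
        hull = ConvexHull(pts)
        plt.fill(pts[hull.vertices, 0], pts[hull.vertices, 1], alpha=0.2)
plt.legend()
plt.show()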

Embedding with "distilbert-base-uncased"

Hugging Face model card: https://huggingface.co/distilbert/distilbert-base-uncased

Here a perfect separation of the topic groups is achieved, illustrating that the encoder model has learned a semantics-preserving mapping from text to vectors.

Embedding with "all-mpnet-base-v2"

Hugging Face model card: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Also with "all-mpnet-base-v2" a perfect separation of the topic groups is acchieved. The arrangement is different from the results from other models (not only in the t-SNE representation shon but also in the highdimensional embedding space). This is no problem, however, since one should use always the same embedding model within one application context. Different embeddings are incompatible, therefore proximity searches only make sense among vectors stemming from the same embedding.

Embedding with "bge-base-en-v1.5"

Hugging Face model card: https://huggingface.co/BAAI/bge-base-en-v1.5

Again, a perfect, albeit different, separation of the topic groups is achieved. A common property of all shown LLM embeddings is that the topic areas "fruits" and "animals" appear in neighboring positions in the t-SNE display. This could be a coincidence, however.


Embedding at random leads to unusable results

The above visualizations of LLM-based embeddings demonstrate how topic areas cluster into distinct regions, even after dimensionality reduction. This is far from trivial, as a comparison with a "dumb" random embedding shows. For this purpose we mapped each word to a random 768-dimensional vector. After applying t-SNE, huge overlaps of the topic areas can be observed. No meaningful grouping is visible, and the representation seems useless for further processing steps.
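For reference, the random baseline can be sketched as follows (the number of words and the plotting step follow the setup above):

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(seed=0)
n_words, dim = 30, 768                           # 5 topics with 6 words each, as above
random_vectors = rng.standard_normal((n_words, dim))   # random vectors instead of learned embeddings

random_2d = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(random_vectors)
print(random_2d.shape)                           # (30, 2); plotted as above, the topic hulls overlap heavily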

Application: Proximity Search in the Embedding Space

The utility of the above LLM-generated embeddings is evident in various practical scenarios. For example, we can retrieve semantically similar items by embedding a query term and searching for its neighbors in the embedding space. As mentioned earlier, this is the basis of various recommender systems in popular commercial websites (e.g. Spotify, Netflix, Amazon, Tinder).

Instead of using the Euclidean distance, cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity) is often preferred because it only considers the angle between vectors, making it independent of vector length and solely focused on relative orientation. The embedding model used here is "bge-base-en-v1.5" (https://huggingface.co/BAAI/bge-base-en-v1.5).
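A sketch of such a proximity search is shown below; the sentence-transformers library is used here as one convenient way to load "bge-base-en-v1.5" (the article's own code may load the model differently), and the vocabulary is a small excerpt of the word list above:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

vocabulary = ["money", "investment", "bank", "pineapple", "banana",
              "giraffe", "tiger", "soccer", "baseball", "USA", "UK"]
vocab_vectors = model.encode(vocabulary, normalize_embeddings=True)   # unit-length vectors

def closest_words(query, k=5):
    # cosine similarity reduces to a dot product because all vectors are normalized
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    similarities = vocab_vectors @ query_vector
    ranking = np.argsort(-similarities)[:k]
    return [(vocabulary[i], float(similarities[i])) for i in ranking]

print(closest_words("bitcoin"))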

Let's examine the nearest neighbors in embedding space for the following query words:

bitcoin, melon, gazelle, swimming, Uganda        

Closest words to 'bitcoin':

  • money
  • investment
  • bank
  • USA
  • UK

These results show that "bitcoin" is closely related to financial terms and regions where cryptocurrency is prevalent.

[Disclosure: The above comment and the comments regarding the other query words are fully AI-generated. Also, the software for the examples was largely created by appropriately prompting ChatGPT 4o.]

Closest words to 'melon':

  • pineapple
  • banana
  • mango
  • orange
  • grape

"Melon" is naturally grouped with other fruits, illustrating the semantic understanding of the embedding.

Closest words to 'gazelle':

  • giraffe
  • tiger
  • kangaroo
  • lion
  • elephant

"Gazelle" is associated with other animals, demonstrating the model's grasp of zoological categories.

Closest words to 'swimming':

  • soccer
  • table tennis
  • orange
  • baseball
  • USA

Interestingly, "swimming" is grouped with sports, but also includes "orange" and "USA," highlighting how contextual nuances can influence embeddings.

Closest words to 'Uganda':

  • banana
  • giraffe
  • elephant
  • kangaroo
  • Argentina

"Uganda" is linked with a mix of regional and semantic associations, reflecting the model's complex understanding of geography and context.

Using Embedding Models in RAG Systems

In Retrieval-Augmented Generation (RAG) systems, as illustrated in the image below, embeddings play a critical role. The process begins with a query, which is converted into a high-dimensional vector representation using an embedding model like the ones used in the examples above. This vector is then used to search a vector store index within a database, retrieving relevant context. The retrieved context is subsequently provided to the LLM, which generates a comprehensive answer by leveraging both the input query and the retrieved context. This approach enables the LLM to produce more accurate and informed responses by integrating specific, relevant information from the database, thus enhancing the quality and relevance of the generated answers.

excerpt from: https://medium.com/@krtarunsingh/advanced-rag-techniques-unlocking-the-next-level-040c205b95bc
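A minimal sketch of this retrieval step is given below; all names (embed, vector_index, llm) are placeholders rather than a specific library's API:

def answer_with_rag(query, embed, vector_index, llm, k=3):
    query_vector = embed(query)                              # same embedding model as used for indexing
    context_chunks = vector_index.search(query_vector, k=k)  # k most similar stored text chunks
    context = "\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm(prompt)                                       # the LLM generates the final answer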

Conclusion

The examples shown in this article underscore the power of vector embeddings in capturing and leveraging semantic relationships. Understanding these embeddings helps to develop ideas for applications of LLMs in a huge number of scenarios (e.g., recommender systems or RAG systems) where text-based information needs to be processed. One can state that these embeddings are one of the most important aspects of large language models and modern AI.

Colab Notebook

All experiments described above can be repeated and modified using this Google Colab notebook: https://colab.research.google.com/drive/1Srp0nyETMDILnkAmSp3O70ecQnxd3I7j?usp=sharing

Please note that the obtained results may vary slightly due to different random generator seeds.

