The "Magical" Ingredient of LLMs: Vector Embeddings
generated with Dall-E 3

The "Magical" Ingredient of LLMs: Vector Embeddings

In the field of machine learning, vector embeddings have emerged as a central component of large language models (LLMs). These embeddings, learned during the pre-training phase, enable models to map text—be it words, sentences, or paragraphs—into a high-dimensional numerical vector space in a manner that preserves semantic meaning. This article explains vector embeddings, their creation, and their transformative role across various applications including Retrieval-Augmented Generation (RAG) systems.

The Essence of Vector Embeddings

At the heart of every LLM lies the ability to generate vector embeddings. These are high-dimensional representations that encapsulate the semantic properties of text. During the pre-training phase, the model learns these embeddings by analyzing vast corpora of text data, gradually fine-tuning the mappings such that semantically similar text items are positioned close to one another in the vector space.

This is done with transformer networks, specifically the encoder part of the transformer, as shown in the image below. The input text is tokenized (converted into numbers corresponding to words or word parts) and processed by several attention layers before a feed-forward network produces the hidden representation, which is another term for the vector embedding (also called the latent space representation). By considering the context of words and sentences, the LLM is able to deduce the semantics of words and also learn about synonyms.
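As a minimal illustration (not taken from the article's own code), the tokenization step can be inspected with the Hugging Face transformers library; the model name here is only an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Embeddings preserve semantics")   # split into words / word parts
ids = tokenizer.convert_tokens_to_ids(tokens)                  # the corresponding numbers
print(tokens)   # word-piece tokens; rare words may be split into several pieces
print(ids)      # numeric token IDs that are fed into the attention layers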

Encoder architecture (from: "Attention Is All You Need", A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Advances in Neural Information Processing Systems, 2017)

The resulting proximity-based mapping is a cornerstone of numerous applications. For instance, in recommender systems for music streaming, online shopping, job search portals, or online dating, the ability to find semantically related items quickly is crucial. When a user interacts with these systems, the underlying mechanism often involves a proximity search within this embedding space, identifying items that are near the user’s interests.

Learning and Storing Embeddings

The creation of these embeddings is entirely automatic, a by-product of the model's extensive pre-training. Once generated, these vectors are stored in a vector database, which supports rapid approximate proximity (nearest-neighbor) searches. It is essential that the same embedding model is used both for storing items and for querying, ensuring consistency and reliability in the search results.
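A minimal sketch of this store-and-query pattern is shown below; FAISS is used here purely as an example of a vector index (the article does not prescribe a specific vector database), and the random vectors stand in for real embeddings:

import numpy as np
import faiss

dim = 768                                        # dimensionality of the embeddings
index = faiss.IndexFlatIP(dim)                   # inner-product index (cosine after normalization)

item_vectors = np.random.rand(1000, dim).astype("float32")   # placeholder for real item embeddings
faiss.normalize_L2(item_vectors)                 # normalize so inner product equals cosine similarity
index.add(item_vectors)

query_vector = np.random.rand(1, dim).astype("float32")      # must come from the SAME embedding model
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, 5)      # the 5 closest stored items
print(ids, scores)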

Traditional programming methods in computer science are unable to create such semantics-preserving embeddings. The complexity and subtle nuances captured by transformer models are beyond the reach of hand-written rules, illustrating the specific capabilities of machine learning techniques.

Visualizing Embeddings

To illustrate the concept, let's conduct an experiment by embedding six words each from five different topical areas: finance, fruits, animals, sports, and countries. The topical areas and the respective words embedded are:

- Finance: bank, mortgage, money, investment, loan, credit application

- Fruits: apple, banana, orange, grape, mango, pineapple

- Animals: lion, tiger, elephant, giraffe, zebra, kangaroo

- Sports: soccer, basketball, tennis, baseball, football, cricket

- Countries: USA, Canada, Mexico, Brazil, Argentina, UK

Below are the results for four different pre-trained embedding models, which can be obtained from sites like Hugging Face (https://huggingface.co/). These embeddings, all 768-dimensional, are projected down to two dimensions using t-SNE (https://distill.pub/2016/misread-tsne/) for visualization purposes.

Embedding with "bert-base-uncased"

Hugging Face model card: https://huggingface.co/google-bert/bert-base-uncased

Below, the complete Python code for embedding a set of words with a specific model ("bert-base-uncased") via the transformers library from Hugging Face is shown. Similar code can be found in the Google Colab notebook linked at the end of the article.

Example code for using a pretrained embedding model
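Since the original listing appears as an image, the following sketch reconstructs the essential steps; mean pooling over the token embeddings is an assumption, and the original code's exact pooling choice may differ:

import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

words = ["bank", "mortgage", "apple", "banana", "lion", "tiger"]   # subset of the word list above

embeddings = []
with torch.no_grad():
    for word in words:
        inputs = tokenizer(word, return_tensors="pt")
        outputs = model(**inputs)
        # average the token embeddings to obtain one 768-dimensional vector per word
        embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy())

print(len(embeddings), embeddings[0].shape)      # 6 vectors, each of dimension 768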

Since 768-dimensional data is hard to visualize, we use t-SNE to project it down to two dimensions. Moreover, for each topic group we shade the convex hull in order to see whether the members of the group are mapped to similar locations. For this model this is largely the case, although the "animals" topic group overlaps with "fruits" and "finance". One should note, however, that an overlap in the t-SNE-generated 2D space does not imply any overlap in the 768-dimensional embedding space.
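A sketch of the projection and hull shading, continuing from the embedding sketch above (the topic labels are placeholders for the full five-topic word list):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from scipy.spatial import ConvexHull

X = np.vstack(embeddings)                        # 768-dimensional word vectors from the sketch above
labels = ["finance", "finance", "fruits", "fruits", "animals", "animals"]   # topic of each word

X_2d = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(X)  # perplexity must be < number of samples

for topic in sorted(set(labels)):
    idx = [i for i, t in enumerate(labels) if t == topic]
    pts = X_2d[idx]
    plt.scatter(pts[:, 0], pts[:, 1], label=topic)
    if len(pts) >= 3:                            # a convex hull needs at least three points
        hull = ConvexHull(pts)
        plt.fill(pts[hull.vertices, 0], pts[hull.vertices, 1], alpha=0.2)
plt.legend()
plt.show()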

Embedding with "distilbert-base-uncased"

Hugging Face model card: https://huggingface.co/distilbert/distilbert-base-uncased

Here a perfect separation of the topic groups is achieved, illustrating that the encoder model has learned a semantics-preserving mapping from text to vectors.

Embedding with "all-mpnet-base-v2"

Hugging Face model card: https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Also with "all-mpnet-base-v2" a perfect separation of the topic groups is acchieved. The arrangement is different from the results from other models (not only in the t-SNE representation shon but also in the highdimensional embedding space). This is no problem, however, since one should use always the same embedding model within one application context. Different embeddings are incompatible, therefore proximity searches only make sense among vectors stemming from the same embedding.

Embedding with "bge-base-en-v1.5"

Hugging Face model card: https://huggingface.co/BAAI/bge-base-en-v1.5

Again, a perfect, albeit different, separation of the topic groups is achieved. A common property of all shown LLM embeddings is that the topic areas "fruits" and "animals" appear in neighboring positions in the t-SNE display. This could be a coincidence, however.


Embedding at random leads to unusable results

The above visualizations of LLM-based embeddings demonstrate how topic areas cluster into distinct regions, even after dimensionality reduction. This is far from trivial, as a comparison with a "dumb" random embedding shows. For this purpose we mapped each word to a random 768-dimensional vector. After applying t-SNE, huge overlaps of the topic areas can be observed. No meaningful grouping is visible, and the representation seems useless for further processing steps.
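For reference, the random baseline can be sketched as follows (the number of words and the plotting step follow the setup above):

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(seed=0)
n_words, dim = 30, 768                           # 5 topics with 6 words each, as above
random_vectors = rng.standard_normal((n_words, dim))   # random vectors instead of learned embeddings

random_2d = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(random_vectors)
print(random_2d.shape)                           # (30, 2); plotted as above, the topic hulls overlap heavily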

Application: Proximity Search in the Embedding Space

The utility of the above LLM-generated embeddings is evident in various practical scenarios. For example, we can retrieve semantically similar items by embedding a query term and searching for its neighbors in the embedding space. As mentioned earlier, this is the basis of various recommender systems in popular commercial websites (e.g. Spotify, Netflix, Amazon, Tinder).

Instead of using the Euclidean distance, cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity) is often preferred because it only considers the angle between vectors, making it independent of vector length and solely focused on relative orientation. The embedding model used here is "bge-base-en-v1.5" (https://huggingface.co/BAAI/bge-base-en-v1.5).
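A sketch of such a proximity search is shown below; the sentence-transformers library is used here as one convenient way to load "bge-base-en-v1.5" (the article's own code may load the model differently), and the vocabulary is a small excerpt of the word list above:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

vocabulary = ["money", "investment", "bank", "pineapple", "banana",
              "giraffe", "tiger", "soccer", "baseball", "USA", "UK"]
vocab_vectors = model.encode(vocabulary, normalize_embeddings=True)   # unit-length vectors

def closest_words(query, k=5):
    # cosine similarity reduces to a dot product because all vectors are normalized
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    similarities = vocab_vectors @ query_vector
    ranking = np.argsort(-similarities)[:k]
    return [(vocabulary[i], float(similarities[i])) for i in ranking]

print(closest_words("bitcoin"))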

Let's examine the nearest neighbors in embedding space for the following query words:

bitcoin, melon, gazelle, swimming, Uganda        

Closest words to 'bitcoin':

  • money
  • investment
  • bank
  • USA
  • UK

These results show that "bitcoin" is closely related to financial terms and regions where cryptocurrency is prevalent.

[Disclosure: The above comment and the comments regarding the other query words are fully AI-generated. Also, the software for the examples was largely created by appropriately prompting ChatGPT 4o.]

Closest words to 'melon':

  • pineapple
  • banana
  • mango
  • orange
  • grape

"Melon" is naturally grouped with other fruits, illustrating the semantic understanding of the embedding.

Closest words to 'gazelle':

  • giraffe
  • tiger
  • kangaroo
  • lion
  • elephant

"Gazelle" is associated with other animals, demonstrating the model's grasp of zoological categories.

Closest words to 'swimming':

  • soccer
  • table tennis
  • orange
  • baseball
  • USA

Interestingly, "swimming" is grouped with sports, but also includes "orange" and "USA," highlighting how contextual nuances can influence embeddings.

Closest words to 'Uganda':

  • banana
  • giraffe
  • elephant
  • kangaroo
  • Argentina

"Uganda" is linked with a mix of regional and semantic associations, reflecting the model's complex understanding of geography and context.

Using Embedding Models in RAG Systems

In Retrieval-Augmented Generation (RAG) systems, as illustrated in the image below, embeddings play a critical role. The process begins with a query, which is converted into a high-dimensional vector representation using an embedding model like the ones used in the examples above. This vector is then used to search a vector store index within a database, retrieving relevant context. The retrieved context is subsequently provided to the LLM, which generates a comprehensive answer by leveraging both the input query and the retrieved context. This approach enables the LLM to produce more accurate and informed responses by integrating specific, relevant information from the database, thus enhancing the quality and relevance of the generated answers.

excerpt from: https://medium.com/@krtarunsingh/advanced-rag-techniques-unlocking-the-next-level-040c205b95bc
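A minimal sketch of this retrieval step is given below; all names (embed, vector_index, llm) are placeholders rather than a specific library's API:

def answer_with_rag(query, embed, vector_index, llm, k=3):
    query_vector = embed(query)                              # same embedding model as used for indexing
    context_chunks = vector_index.search(query_vector, k=k)  # k most similar stored text chunks
    context = "\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm(prompt)                                       # the LLM generates the final answer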

Conclusion

The examples shown in this article underscore the power of vector embeddings in capturing and leveraging semantic relationships. Understanding these embeddings helps to develop ideas for applications of LLMs in a huge number of scenarios (e.g., recommender systems or RAG systems) where text-based information needs to be processed. One can state that these embeddings are one of the most important aspects of large language models and modern AI.

Colab Notebook

All experiments described above can be repeated and modified using this Google Colab notebook: https://colab.research.google.com/drive/1Srp0nyETMDILnkAmSp3O70ecQnxd3I7j?usp=sharing

Please note that the obtained results may vary slightly due to different random generator seeds.

