Understanding Word Embedding in NLP using Sentence Transformers
[Image: 3D vector space with words like 'Apple', 'fruit', 'iPhone', and 'Mango' positioned to indicate their semantic relationships]

Word embeddings are a crucial concept in Natural Language Processing (NLP) that involves representing words or phrases in a high-dimensional vector space. This representation enables us to capture the semantic similarity between different words or phrases based on their context. One of the popular ways to generate word embeddings is by using pre-trained models like sentence-transformers/all-MiniLM-L6-v2 from the Sentence Transformers library. In this article, we will explore how to use this model to create embeddings and measure similarity between sentences.

What is Word Embedding?

Word embedding is a technique used in NLP to map words or phrases to vectors of real numbers. This mapping is done in such a way that words with similar meanings are located close to each other in the vector space. This representation allows algorithms to understand the semantic relationships between words, making it easier to perform tasks like sentiment analysis, text classification, and more.
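To build intuition, here is a toy sketch with made-up three-dimensional vectors; the numbers are purely illustrative (real models such as all-MiniLM-L6-v2 use hundreds of dimensions), but they show how "closeness" in the vector space is measured:

import numpy as np

# Toy 3-dimensional "embeddings" (illustrative values only, not produced by a real model)
apple_fruit = np.array([0.9, 0.1, 0.2])
mango       = np.array([0.8, 0.2, 0.1])
iphone      = np.array([0.1, 0.9, 0.7])

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(apple_fruit, mango))   # high value -> similar meanings sit close together
print(cosine(apple_fruit, iphone))  # lower value -> different meanings sit further apart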

For a broader introduction to word embeddings, see: https://towardsdatascience.com/a-guide-to-word-embeddings-8a23817ab60f

How does it work?

The sentence-transformers/all-MiniLM-L6-v2 model is a pre-trained transformer model that has been fine-tuned for generating sentence embeddings. It takes a sentence as input and outputs a fixed-size vector representation of that sentence. This vector captures the semantic meaning of the sentence, allowing us to compare the similarity between different sentences.
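In code, this is a single encode call. Here is a minimal sketch (the model weights are downloaded automatically on first use); note that the model also accepts a list of sentences and returns one vector per sentence:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encoding a list of sentences returns one 384-dimensional vector per sentence
vectors = model.encode(["Apple is a fruit", "Apple iPhone is expensive"])
print(vectors.shape)   # (2, 384)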

Implementation in Python

First, we need to install the sentence_transformers library:

!pip install sentence_transformers        


Now, we can use the following code to generate embeddings and calculate the similarity between sentences:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load the pre-trained model once so it is not reloaded on every call
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

def text_embedding(text):
    # Encode the text into a fixed-size vector; normalizing the embedding lets us
    # use a plain dot product as cosine similarity later on
    return model.encode(text, normalize_embeddings=True)

def vector_similarity(vec1, vec2):
    # Dot product of two normalized vectors equals their cosine similarity
    return np.dot(np.squeeze(np.array(vec1)), np.squeeze(np.array(vec2)))

phrase1    = "Apple is a fruit"
embedding1 = text_embedding(phrase1)
print(len(embedding1))

phrase2    = "Apple iPhone is expensive"
embedding2 = text_embedding(phrase2)
print(len(embedding2))

phrase3    = "Mango is a fruit"
embedding3 = text_embedding(phrase3) 
print(len(embedding3))

phrase4    = "There is a new Apple iPhone"
embedding4 = text_embedding(phrase4)
print(len(embedding4))

print(vector_similarity(embedding1,embedding3))
print(vector_similarity(embedding1,embedding4))

print(vector_similarity(embedding2,embedding3))
print(vector_similarity(embedding2,embedding4))

Output:

384
384
384
384
0.67738634
0.38097996
0.15007737
0.6433086        


In this code, we define two functions:

  • text_embedding(text): This function takes a text string as input and returns its embedding using the sentence-transformers/all-MiniLM-L6-v2 model.
  • vector_similarity(vec1, vec2): This function computes the dot product of two vectors. Because the embeddings are normalized to unit length, this dot product is exactly their cosine similarity, a measure of how similar the vectors are.

We then create embeddings for four different phrases and print their lengths to verify that they are all the same size. Finally, we calculate and print the similarity between different pairs of embeddings.

Let's go through it step by step:

In the first part of this example, we compare two sentences: "Apple is a fruit" and "Apple iPhone is expensive." Both contain the word Apple, but in completely different contexts: in the first it is the fruit, and in the second it is the iPhone.

If we check the length of both vectors, we find that each has 384 dimensions, which is the fixed embedding size of this model. Despite the fixed size, each embedding captures the semantic meaning of its sentence.

The goal of the example is to measure how close the vectors of the four sentences are: "Apple is a fruit," "Apple iPhone is expensive," "Mango is a fruit," and "There is a new Apple iPhone." The closer two vectors are, the closer their meanings.
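As a quick check, the sketch below reuses the phrases and the text_embedding and vector_similarity functions defined above to print every pairwise similarity at once:

phrases = [phrase1, phrase2, phrase3, phrase4]
embeddings = [embedding1, embedding2, embedding3, embedding4]

# Compare every sentence against every other sentence
for i in range(len(phrases)):
    for j in range(i + 1, len(phrases)):
        score = vector_similarity(embeddings[i], embeddings[j])
        print(f"{phrases[i]!r} vs {phrases[j]!r}: {score:.3f}")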

The vector_similarity function, shown again below, returns the similarity between two vectors as a dot product. Because the embeddings are normalized, this dot product equals the cosine similarity: a value of 1 means the vectors are identical, and the closer the value is to 1, the closer the two sentences are in meaning.

def vector_similarity(vec1, vec2):
    return np.dot(np.squeeze(np.array(vec1)),np.squeeze(np.array(vec2)))        
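Because the embeddings were normalized at encoding time, this plain dot product matches the full cosine-similarity formula. Here is a small sketch to verify the equivalence; cosine_similarity below is a helper written just for this check, not a library function:

import numpy as np

def cosine_similarity(vec1, vec2):
    # Full cosine formula: dot product divided by the product of the vector norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Both calls print (numerically) the same value for normalized embeddings
print(vector_similarity(embedding1, embedding3))
print(cosine_similarity(embedding1, embedding3))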

Let's read the printed similarities. "Apple is a fruit" and "Mango is a fruit" score about 0.68, so they are close in meaning. "Apple is a fruit" compared with "There is a new Apple iPhone" scores only about 0.38: the embedding differentiates Apple the fruit from Apple the iPhone even though both sentences contain the word Apple. The lowest score, about 0.15, is between "Apple iPhone is expensive" and "Mango is a fruit," which share almost no meaning. In contrast, "Apple iPhone is expensive" and "There is a new Apple iPhone" score about 0.64, a much higher value, because both are about the iPhone.


Application in Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a technique used in NLP that combines retrieval and generation to improve the performance of language models. In RAG, a retriever first fetches relevant documents or sentences based on the input query, and then a generator uses this retrieved information to generate a response.

Word embeddings play a crucial role in the retrieval step of RAG. By representing sentences as embeddings, we can efficiently search for the most relevant documents or sentences to a given query. This is typically done by calculating the similarity between the query embedding and the embeddings of the documents in the database, and retrieving the ones with the highest similarity.
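To make the retrieval step concrete, here is a minimal sketch that reuses the text_embedding and vector_similarity functions from earlier; the documents list and the query are made up for illustration, and a real system would store the document embeddings in a vector database rather than a Python list:

documents = [
    "Apple is a fruit rich in fiber.",
    "The new Apple iPhone was released this year.",
    "Mango trees grow in tropical climates.",
]

# Pre-compute one embedding per document
doc_embeddings = [text_embedding(doc) for doc in documents]

def retrieve(query, top_k=2):
    # Embed the query, score every document, and return the top_k highest-scoring ones
    query_embedding = text_embedding(query)
    scores = [vector_similarity(query_embedding, emb) for emb in doc_embeddings]
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]

for score, doc in retrieve("Which fruits can I eat?"):
    print(f"{score:.3f}  {doc}")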

By using pre-trained models like sentence-transformers/all-MiniLM-L6-v2, we can leverage the power of transformer architectures to generate high-quality embeddings that capture the semantic meaning of sentences, making them highly effective for retrieval tasks in RAG.

In summary, word embeddings are a fundamental concept in NLP that enables machines to understand the semantic relationships between words and sentences. By using pre-trained models like sentence-transformers/all-MiniLM-L6-v2, we can easily generate embeddings for use in various NLP tasks, including Retrieval Augmented Generation.

