Understanding Word Embedding in NLP using Sentence Transformers
Rany ElHousieny, PhD
Generative AI ENGINEERING MANAGER | ex-Microsoft | AI Solutions Architect | Generative AI & NLP Expert | Proven Leader in AI-Driven Innovation | Former Microsoft Research & Azure AI | Software Engineering Manager
Word embeddings are a crucial concept in Natural Language Processing (NLP) that involves representing words or phrases in a high-dimensional vector space. This representation enables us to capture the semantic similarity between different words or phrases based on their context. One of the popular ways to generate word embeddings is by using pre-trained models like sentence-transformers/all-MiniLM-L6-v2 from the Sentence Transformers library. In this article, we will explore how to use this model to create embeddings and measure similarity between sentences.
What is Word Embedding?
Word embedding is a technique used in NLP to map words or phrases to vectors of real numbers. This mapping is done in such a way that words with similar meanings are located close to each other in the vector space. This representation allows algorithms to understand the semantic relationships between words, making it easier to perform tasks like sentiment analysis, text classification, and more.
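As a quick, concrete illustration of this idea, here is a minimal sketch using the same sentence-transformers/all-MiniLM-L6-v2 model discussed below; the word choices ("cat", "dog", "car") are just illustrative, but related words should score closer to each other than unrelated ones:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode three single words as normalized vectors, so the dot product is the cosine similarity.
cat, dog, car = model.encode(["cat", "dog", "car"], normalize_embeddings=True)

print(np.dot(cat, dog))  # related words: expect a higher score
print(np.dot(cat, car))  # unrelated words: expect a lower score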
How does it work?
The sentence-transformers/all-MiniLM-L6-v2 model is a pre-trained transformer model that has been fine-tuned for generating sentence embeddings. It takes a sentence as input and outputs a fixed-size vector representation of that sentence. This vector captures the semantic meaning of the sentence, allowing us to compare the similarity between different sentences.
Implementation in Python
First, we need to install the sentence_transformers library:
!pip install sentence_transformers
Now, we can use the following code to generate embeddings and calculate the similarity between sentences:
from sentence_transformers import SentenceTransformer
import numpy as np

def text_embedding(text):
    # Load the pre-trained model and encode the text into a normalized 384-dimensional vector.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    return model.encode(text, normalize_embeddings=True)

def vector_similarity(vec1, vec2):
    # With normalized embeddings, the dot product equals the cosine similarity.
    return np.dot(np.squeeze(np.array(vec1)), np.squeeze(np.array(vec2)))

phrase1 = "Apple is a fruit"
embedding1 = text_embedding(phrase1)
print(len(embedding1))

phrase2 = "Apple iPhone is expensive"
embedding2 = text_embedding(phrase2)
print(len(embedding2))

phrase3 = "Mango is a fruit"
embedding3 = text_embedding(phrase3)
print(len(embedding3))

phrase4 = "There is a new Apple iPhone"
embedding4 = text_embedding(phrase4)
print(len(embedding4))

print(vector_similarity(embedding1, embedding3))
print(vector_similarity(embedding1, embedding4))
print(vector_similarity(embedding2, embedding3))
print(vector_similarity(embedding2, embedding4))
The output:

384
384
384
384
0.67738634
0.38097996
0.15007737
0.6433086
In this code, we define two functions: text_embedding, which encodes a piece of text into a normalized embedding vector, and vector_similarity, which computes the dot product between two embeddings.
We then create embeddings for four different phrases and print their lengths to verify that they are all the same size. Finally, we calculate and print the similarity between different pairs of embeddings.
Let's go through it step by step:
In the first part of this example, we compare two sentences: "Apple is a fruit" and "Apple iPhone is expensive." Both contain the word Apple, but in completely different contexts: the first refers to the fruit, and the second to the iPhone.
If we check the length of both vectors, we will find they are both 384, which is the fixed embedding size of this model. Even though the size is fixed, each embedding still captures the semantic meaning of its sentence.
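If you prefer to ask the model for its output size directly instead of checking the length of an embedding, the library exposes it (a small sketch, assuming a reasonably recent version of sentence-transformers):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Reports the fixed output dimension of the model (384 for all-MiniLM-L6-v2).
print(model.get_sentence_embedding_dimension())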
The goal of the example is to measure the distance between the vectors of four sentences: "Apple is a fruit," "Apple iPhone is expensive," "Mango is a fruit," and "There is a new Apple iPhone." The closer two vectors are, the closer their meanings.
The following function uses the dot product to measure how similar two vectors are. Because the embeddings are normalized, the dot product is the cosine similarity: a value of 1 means the vectors are identical, and the closer the value is to 1, the closer the two sentences are in meaning, and vice versa.
def vector_similarity(vec1, vec2):
    return np.dot(np.squeeze(np.array(vec1)), np.squeeze(np.array(vec2)))
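Because we passed normalize_embeddings=True when encoding, each vector has (approximately) unit length, which is why a plain dot product behaves like a cosine similarity. You can verify this with a quick check, reusing embedding1 and vector_similarity from the code above:

import numpy as np

# Each normalized embedding has an L2 norm of roughly 1.0,
# so a vector dotted with itself gives roughly 1.0.
print(np.linalg.norm(embedding1))
print(vector_similarity(embedding1, embedding1))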
So, to measure the similarity between the first and second sentences ("Apple is a fruit" and "Apple iPhone is expensive"), we compare their two embeddings.
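This particular pair is not among the four comparisons printed above, but it can be reproduced with the same helper (a one-line sketch reusing embedding1 and embedding2 from earlier):

print(vector_similarity(embedding1, embedding2))  # roughly 0.35: the two sentences are not close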
As you can see, the score is about 0.35, which means the sentences are not close. The embedding was able to differentiate between Apple the iPhone and Apple the fruit. However, if you compare sentences 2 and 4 ("Apple iPhone is expensive" and "There is a new Apple iPhone"), you get a higher score of about 0.64.
Application in Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) is a technique used in NLP that combines retrieval and generation to improve the performance of language models. In RAG, a retriever first fetches relevant documents or sentences based on the input query, and then a generator uses this retrieved information to generate a response.
Word embeddings play a crucial role in the retrieval step of RAG. By representing sentences as embeddings, we can efficiently search for the most relevant documents or sentences to a given query. This is typically done by calculating the similarity between the query embedding and the embeddings of the documents in the database, and retrieving the ones with the highest similarity.
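As a rough illustration of that retrieval step, here is a minimal sketch (the documents, query, and variable names are made up for the example) that ranks a handful of documents against a query using the same model and dot-product similarity as above:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# A tiny, made-up document store; in a real RAG system this would be your knowledge base.
documents = [
    "Apple is a fruit rich in fiber.",
    "The Apple iPhone is a popular smartphone.",
    "Mangoes are tropical fruits.",
]

# Embed the documents once and the query at request time (normalized, so dot product = cosine similarity).
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode("Which fruits are healthy?", normalize_embeddings=True)

# Score every document against the query and retrieve the best match.
scores = np.dot(doc_embeddings, query_embedding)
best = int(np.argmax(scores))
print(documents[best], scores[best])

In practice the scoring and lookup would usually be handled by a vector database, but the idea is the same: the retrieved text is then passed to the generator as context.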
By using pre-trained models like sentence-transformers/all-MiniLM-L6-v2, we can leverage the power of transformer architectures to generate high-quality embeddings that capture the semantic meaning of sentences, making them highly effective for retrieval tasks in RAG.
In summary, word embeddings are a fundamental concept in NLP that enables machines to understand the semantic relationships between words and sentences. By using pre-trained models like sentence-transformers/all-MiniLM-L6-v2, we can easily generate embeddings for use in various NLP tasks, including Retrieval Augmented Generation.