Understanding Word Embedding in NLP using Sentence Transformers
Rany ElHousieny, PhD
Generative AI ENGINEERING MANAGER | ex-Microsoft | AI Solutions Architect | Generative AI & NLP Expert | Proven Leader in AI-Driven Innovation | Former Microsoft Research & Azure AI | Software Engineering Manager
Word embeddings are a crucial concept in Natural Language Processing (NLP) that involves representing words or phrases in a high-dimensional vector space. This representation enables us to capture the semantic similarity between different words or phrases based on their context. One of the popular ways to generate word embeddings is by using pre-trained models like sentence-transformers/all-MiniLM-L6-v2 from the Sentence Transformers library. In this article, we will explore how to use this model to create embeddings and measure similarity between sentences.
What is Word Embedding?
Word embedding is a technique used in NLP to map words or phrases to vectors of real numbers. This mapping is done in such a way that words with similar meanings are located close to each other in the vector space. This representation allows algorithms to understand the semantic relationships between words, making it easier to perform tasks like sentiment analysis, text classification, and more.
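As a quick, concrete illustration of this idea, here is a minimal sketch using the same sentence-transformers/all-MiniLM-L6-v2 model discussed below; the word choices ("cat", "dog", "car") are just illustrative, but related words should score closer to each other than unrelated ones:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Encode three single words as normalized vectors, so the dot product is the cosine similarity.
cat, dog, car = model.encode(["cat", "dog", "car"], normalize_embeddings=True)

print(np.dot(cat, dog))  # related words: expect a higher score
print(np.dot(cat, car))  # unrelated words: expect a lower score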
How does it work?
The sentence-transformers/all-MiniLM-L6-v2 model is a pre-trained transformer model that has been fine-tuned for generating sentence embeddings. It takes a sentence as input and outputs a fixed-size vector representation of that sentence. This vector captures the semantic meaning of the sentence, allowing us to compare the similarity between different sentences.
Implementation in Python
First, we need to install the sentence_transformers library:
!pip install sentence_transformers
Now, we can use the following code to generate embeddings and calculate the similarity between sentences:
from sentence_transformers import SentenceTransformer
import numpy as np

def text_embedding(text):
    # Load the pre-trained model and encode the text into a normalized 384-dimensional vector.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    return model.encode(text, normalize_embeddings=True)

def vector_similarity(vec1, vec2):
    # With normalized embeddings, the dot product equals the cosine similarity.
    return np.dot(np.squeeze(np.array(vec1)), np.squeeze(np.array(vec2)))

phrase1 = "Apple is a fruit"
embedding1 = text_embedding(phrase1)
print(len(embedding1))

phrase2 = "Apple iPhone is expensive"
embedding2 = text_embedding(phrase2)
print(len(embedding2))

phrase3 = "Mango is a fruit"
embedding3 = text_embedding(phrase3)
print(len(embedding3))

phrase4 = "There is a new Apple iPhone"
embedding4 = text_embedding(phrase4)
print(len(embedding4))

print(vector_similarity(embedding1, embedding3))
print(vector_similarity(embedding1, embedding4))
print(vector_similarity(embedding2, embedding3))
print(vector_similarity(embedding2, embedding4))
The output:

384
384
384
384
0.67738634
0.38097996
0.15007737
0.6433086
In this code, we define two functions: text_embedding, which encodes a piece of text into a normalized embedding vector, and vector_similarity, which computes the dot product between two embeddings.
We then create embeddings for four different phrases and print their lengths to verify that they are all the same size. Finally, we calculate and print the similarity between different pairs of embeddings.
Let's go through it step by step:
In the first part of this example, we compare two sentences: "Apple is a fruit" and "Apple iPhone is expensive." Both contain the word Apple, but in completely different contexts: the first refers to the fruit, and the second to the iPhone.
If we check the length of both vectors, we will find they are both 384, which is the fixed embedding size of this model. Even though the size is fixed, each embedding still captures the semantic meaning of its sentence.
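If you prefer to ask the model for its output size directly instead of checking the length of an embedding, the library exposes it (a small sketch, assuming a reasonably recent version of sentence-transformers):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Reports the fixed output dimension of the model (384 for all-MiniLM-L6-v2).
print(model.get_sentence_embedding_dimension())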
The goal of the example is to measure the distance between the vectors of four sentences: "Apple is a fruit," "Apple iPhone is expensive," "Mango is a fruit," and "There is a new Apple iPhone." The closer two vectors are, the closer their meanings.
The following function uses the dot product to measure how similar two vectors are. Because the embeddings are normalized, the dot product is the cosine similarity: a value of 1 means the vectors are identical, and the closer the value is to 1, the closer the two sentences are in meaning, and vice versa.
def vector_similarity(vec1, vec2):
    return np.dot(np.squeeze(np.array(vec1)), np.squeeze(np.array(vec2)))
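Because we passed normalize_embeddings=True when encoding, each vector has (approximately) unit length, which is why a plain dot product behaves like a cosine similarity. You can verify this with a quick check, reusing embedding1 and vector_similarity from the code above:

import numpy as np

# Each normalized embedding has an L2 norm of roughly 1.0,
# so a vector dotted with itself gives roughly 1.0.
print(np.linalg.norm(embedding1))
print(vector_similarity(embedding1, embedding1))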
So, to measure the similarity between the first and second sentences ("Apple is a fruit" and "Apple iPhone is expensive"), we compare their two embeddings.
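This particular pair is not among the four comparisons printed above, but it can be reproduced with the same helper (a one-line sketch reusing embedding1 and embedding2 from earlier):

print(vector_similarity(embedding1, embedding2))  # roughly 0.35: the two sentences are not close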
As you can see, the score is about 0.35, which means the sentences are not close. The embedding was able to differentiate between Apple the iPhone and Apple the fruit. However, if you compare sentences 2 and 4 ("Apple iPhone is expensive" and "There is a new Apple iPhone"), you get a higher score of about 0.64.
Application in Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) is a technique used in NLP that combines retrieval and generation to improve the performance of language models. In RAG, a retriever first fetches relevant documents or sentences based on the input query, and then a generator uses this retrieved information to generate a response.
Word embeddings play a crucial role in the retrieval step of RAG. By representing sentences as embeddings, we can efficiently search for the most relevant documents or sentences to a given query. This is typically done by calculating the similarity between the query embedding and the embeddings of the documents in the database, and retrieving the ones with the highest similarity.
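As a rough illustration of that retrieval step, here is a minimal sketch (the documents, query, and variable names are made up for the example) that ranks a handful of documents against a query using the same model and dot-product similarity as above:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# A tiny, made-up document store; in a real RAG system this would be your knowledge base.
documents = [
    "Apple is a fruit rich in fiber.",
    "The Apple iPhone is a popular smartphone.",
    "Mangoes are tropical fruits.",
]

# Embed the documents once and the query at request time (normalized, so dot product = cosine similarity).
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode("Which fruits are healthy?", normalize_embeddings=True)

# Score every document against the query and retrieve the best match.
scores = np.dot(doc_embeddings, query_embedding)
best = int(np.argmax(scores))
print(documents[best], scores[best])

In practice the scoring and lookup would usually be handled by a vector database, but the idea is the same: the retrieved text is then passed to the generator as context.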
By using pre-trained models like sentence-transformers/all-MiniLM-L6-v2, we can leverage the power of transformer architectures to generate high-quality embeddings that capture the semantic meaning of sentences, making them highly effective for retrieval tasks in RAG.
In summary, word embeddings are a fundamental concept in NLP that enables machines to understand the semantic relationships between words and sentences. By using pre-trained models like sentence-transformers/all-MiniLM-L6-v2, we can easily generate embeddings for use in various NLP tasks, including Retrieval Augmented Generation.