Unpacking Word Embeddings: A Journey Through Modern NLP Techniques


In the world of Natural Language Processing (NLP), word embeddings have revolutionized the way we understand and process text. Whether you’re building chatbots, recommendation systems, or sentiment analysis tools, word embeddings form the backbone of many modern NLP applications.

In this article, we'll dive deep into what word embeddings are, explore different techniques to create them, and provide Python code examples for each. By the end, you'll have a solid understanding of how to leverage these powerful tools in your own projects.


"Words are like living beings. The closer you observe, the more nuanced they become."


What Are Word Embeddings?

Word embeddings are dense vector representations of words that capture their meanings, syntactic properties, and relationships with other words. Unlike traditional one-hot encoding, where words are represented as sparse vectors, word embeddings map words into a continuous vector space where semantically similar words are closer together.

Imagine a 3D space where the word "king" is near "queen," and "man" is near "woman." Word embeddings allow for such meaningful relationships to be represented in a form that algorithms can easily process.
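
To make the idea concrete, here is a minimal, self-contained sketch. The three-dimensional vectors below are hand-picked for illustration only (they are not learned from any corpus); the point is that cosine similarity between vectors acts as a proxy for how related two words are.


import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; values near 1 mean similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings", hand-picked purely for illustration
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower: unrelated words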


Different Techniques for Creating Word Embeddings

1. Word2Vec

Word2Vec is one of the most popular techniques for creating word embeddings. Developed by Google, it uses a shallow, two-layer neural network to learn word associations from a large corpus of text.

Word2Vec has two main approaches:

  • Skip-gram: Predicts surrounding words given a current word.
  • Continuous Bag of Words (CBOW): Predicts the current word given surrounding words.


import gensim
from gensim.models import Word2Vec

# Sample sentences
sentences = [["machine", "learning", "is", "fun"], 
             ["python", "is", "a", "great", "language"], 
             ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # sg=0 for CBOW, sg=1 for Skip-gram

# Get word embeddings
vector = model.wv['machine']
print(vector)
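
Beyond looking up raw vectors, the trained model can be queried for nearest neighbours and pairwise similarities. With a toy corpus this small the numbers are essentially noise, but the calls below are the ones you would use on a real corpus:


# Words most similar to 'machine' according to the trained model
print(model.wv.most_similar('machine', topn=3))

# Cosine similarity between two in-vocabulary words
print(model.wv.similarity('machine', 'learning'))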
        

Diagrammatic Representation:

To visualize the relationships between words using Word2Vec, you can use t-SNE to reduce the dimensionality of the word vectors.


import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sample words to visualize
words = ['machine', 'learning', 'python', 'language', 'deep', 'fun', 'great', 'subset']

# Get vectors for the words as a single array
word_vectors = np.array([model.wv[word] for word in words])

# Apply t-SNE to reduce dimensions to 2D for visualization
# (perplexity must be smaller than the number of samples, here 8 words)
tsne = TSNE(n_components=2, random_state=42, perplexity=3)
word_vecs_2d = tsne.fit_transform(word_vectors)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')

for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))

plt.title('2D Visualization of Word Embeddings using t-SNE')
plt.show()
        


Relationship between Words

Graph Explanation:

  • Nodes: Represent words in the vocabulary (the unique words in the corpus).
  • Edges: Represent relationships between words based on their co-occurrences in the corpus (see the sketch below).
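
Here is a minimal sketch of building such a graph from the toy sentences used above. It relies on networkx (an extra dependency, assumed purely for illustration): words become nodes, and an edge is added whenever two words co-occur in the same sentence.


from itertools import combinations
import networkx as nx

# Toy sentences reused from the Word2Vec example above
sentences = [["machine", "learning", "is", "fun"],
             ["python", "is", "a", "great", "language"],
             ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]

# Nodes are unique words; edges connect words that co-occur in a sentence
graph = nx.Graph()
for sentence in sentences:
    for w1, w2 in combinations(set(sentence), 2):
        graph.add_edge(w1, w2)

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
print("Neighbours of 'learning':", sorted(graph.neighbors('learning')))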

Real-World Application:

  • Search Engines: Improving query results by understanding the context of search terms.


2. GloVe (Global Vectors for Word Representation)

GloVe is another widely-used method for word embeddings. Developed by Stanford, GloVe captures global statistical information by leveraging word co-occurrence matrices. Unlike Word2Vec, which focuses on local context, GloVe looks at the global context of words in a corpus.

import numpy as np
import gensim.downloader as api

# Load GloVe pre-trained model
glove_model = api.load("glove-wiki-gigaword-100")

# Get word embedding
vector = glove_model['learning']
print(vector)
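
Because the GloVe vectors are pre-trained on a large corpus, they support the classic vector-arithmetic analogies. The exact neighbour list depends on the downloaded vectors, but 'queen' typically ranks at or near the top here:


# Vector arithmetic: king - man + woman ≈ queen
print(glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))

# Nearest neighbours of a single word
print(glove_model.most_similar('learning', topn=5))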
        

Diagrammatic Representation:

You can also visualize GloVe embeddings similarly using t-SNE.


# Sample words to visualize
words = ['king', 'queen', 'man', 'woman', 'apple', 'orange', 'fruit', 'computer']

# Get vectors for the words as a single array
word_vectors = np.array([glove_model[word] for word in words])

# Apply t-SNE to reduce dimensions to 2D for visualization
# (perplexity must be smaller than the number of samples, here 8 words)
tsne = TSNE(n_components=2, random_state=42, perplexity=3)
word_vecs_2d = tsne.fit_transform(word_vectors)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')

for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))

plt.title('2D Visualization of GloVe Embeddings using t-SNE')
plt.show()
        


GloVe Embeddings

Graph Explanation:

  • Co-occurrence Matrix: Shows how often each word appears with every other word in the corpus (see the counting sketch below).
  • Vector Representation: Derived from matrix factorization, capturing global word relationships.
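
As a rough illustration of the counting step, the sketch below builds a co-occurrence matrix from the toy sentences used earlier, with a symmetric window of one word. GloVe itself goes on to factorize a weighted version of such a matrix; this sketch only shows where the counts come from.


from collections import Counter
import numpy as np

# Toy sentences reused from the earlier examples
sentences = [["machine", "learning", "is", "fun"],
             ["python", "is", "a", "great", "language"],
             ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]

# Count co-occurrences within a symmetric window of 1 word
window = 1
counts = Counter()
for sentence in sentences:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                counts[(word, sentence[j])] += 1

# Assemble the counts into a dense vocabulary-by-vocabulary matrix
vocab = sorted({w for s in sentences for w in s})
index = {w: k for k, w in enumerate(vocab)}
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
for (w1, w2), count in counts.items():
    matrix[index[w1], index[w2]] = count

print(vocab)
print(matrix)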

Real-World Application:

  • Recommendation Systems: Understanding user preferences by analyzing reviews or feedback.


3. FastText

FastText, developed by Facebook, is an extension of Word2Vec that considers subword information, making it robust to rare words or misspellings. FastText represents words as bags of character n-grams, which means it can generate embeddings for out-of-vocabulary (OOV) words by using the n-grams that compose them.

from gensim.models import FastText

# Sample sentences
sentences = [["machine", "learning", "is", "fun"], 
             ["python", "is", "a", "great", "language"], 
             ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]

# Train FastText model
model = FastText(sentences, vector_size=100, window=5, min_count=1)

# Get word embedding
vector = model.wv['learning']
print(vector)
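
The subword trick is easiest to see with a word the model never saw during training. FastText still returns a vector for it, composed from the character n-grams it shares with known words:


# 'learnings' is not in the training vocabulary...
print('learnings' in model.wv.key_to_index)   # False

# ...but FastText can still build a vector for it from its character n-grams
oov_vector = model.wv['learnings']
print(oov_vector.shape)

# Its similarity to 'learning' is typically high, since the two share most n-grams
print(model.wv.similarity('learnings', 'learning'))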
        

Diagrammatic Representation:

To visualize FastText embeddings, you can again use t-SNE.

# Sample words to visualize
words = ['machine', 'learning', 'python', 'language', 'deep', 'fun', 'great', 'subset']

# Get vectors for the words as a single array
word_vectors = np.array([model.wv[word] for word in words])

# Apply t-SNE to reduce dimensions to 2D for visualization
# (perplexity must be smaller than the number of samples, here 8 words)
tsne = TSNE(n_components=2, random_state=42, perplexity=3)
word_vecs_2d = tsne.fit_transform(word_vectors)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')

for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))

plt.title('2D Visualization of FastText Embeddings using t-SNE')
plt.show()
        


FastText Embeddings


Graph Explanation:

  • Character n-grams: Represent subword components that form the basis for word embeddings (see the extraction sketch below).
  • Word Vector Composition: Derived from the sum or average of n-gram vectors.
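
Here is a small sketch of the n-gram extraction, with '<' and '>' marking word boundaries as FastText does. It only shows which subword units a word is split into; the hashing and vector lookup that gensim performs on top of this are omitted.


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word, with '<' and '>' marking the word boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# The word's vector is then composed from the vectors of these subword units
print(char_ngrams("learning", min_n=3, max_n=4))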

Real-World Application:

  • Spell Checkers: Providing suggestions for misspelled words based on learned embeddings.


4. BERT (Bidirectional Encoder Representations from Transformers)

BERT, developed by Google, takes word embeddings to the next level by providing context-aware embeddings. Unlike previous models, BERT understands the context of a word in a sentence by looking at both the left and right sides of the word.


from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample sentence
sentence = "Machine learning is fascinating."

# Tokenize and encode the sentence
input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)

# Get BERT embeddings
with torch.no_grad():
    outputs = model(input_ids)
    last_hidden_states = outputs.last_hidden_state

print(last_hidden_states)
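
To see the context-awareness directly, the sketch below (reusing the tokenizer and model loaded above) compares the embedding of the word "bank" in two different sentences. The two vectors are not identical, and their cosine similarity is typically well below 1.0, although the exact value depends on the model.


import torch.nn.functional as F

# The same word gets a different vector depending on its context
contexts = ["He deposited cash at the bank.",
            "She sat on the bank of the river."]

bank_vectors = []
for text in contexts:
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
    bank_vectors.append(hidden[tokens.index("bank")])

# Cosine similarity between the two contextual embeddings of 'bank'
print(F.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0).item())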
        

Diagrammatic Representation:

To visualize BERT embeddings, we'll use t-SNE after pooling the word embeddings.


import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sample sentences
sentences = ["Machine learning is fascinating.", 
             "Deep learning models require large amounts of data.",
             "Natural language processing is a subfield of AI."]

# Tokenize and encode sentences
tokens = [tokenizer.encode(sentence, add_special_tokens=True) for sentence in sentences]

# Pad sequences to the same length (0 is the [PAD] token id for bert-base-uncased)
max_len = max(len(token) for token in tokens)
padded_tokens = [token + [0] * (max_len - len(token)) for token in tokens]

# Convert to a tensor and build an attention mask so padding is ignored
input_ids = torch.tensor(padded_tokens)
attention_mask = (input_ids != 0).long()

# Get BERT embeddings
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    last_hidden_states = outputs.last_hidden_state

# Mean-pool over the real (non-padding) tokens to get one vector per sentence
mask = attention_mask.unsqueeze(-1).float()
sentence_embeddings = ((last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

# Apply t-SNE to reduce dimensions to 2D for visualization
# (perplexity must be smaller than the number of samples, here 3 sentences)
tsne = TSNE(n_components=2, random_state=42, perplexity=2)
sentence_vecs_2d = tsne.fit_transform(sentence_embeddings)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(sentence_vecs_2d[:, 0], sentence_vecs_2d[:, 1], marker='o')

for i, sentence in enumerate(sentences):
    plt.annotate(f"Sentence {i+1}", xy=(sentence_vecs_2d[i, 0], sentence_vecs_2d[i, 1]))

plt.title('2D Visualization of BERT Sentence Embeddings using t-SNE')
plt.show()
        


BERT Embeddings Representation


Graph Explanation:

  • Transformer Architecture: Processes input text by considering the context from all directions.
  • Dynamic Embeddings: Captures context-dependent meanings for words.

Real-World Application:

  • Chatbots: Understanding user intent by interpreting the context of words in sentences.


Limitations


  1. Bias Inherited from Training Data: Word embeddings are typically trained on large text corpora, which can contain various biases present in the data. These biases, whether related to gender, race, or other social factors, get encoded into the embeddings, leading to models that may inadvertently propagate or amplify them. For instance, embeddings might associate "doctor" with male pronouns more strongly than with female pronouns, reflecting and perpetuating societal stereotypes (a small probe of this is sketched below).
  2. Challenges in Capturing Complex Word Relationships: While word embeddings can effectively capture linear relationships (e.g., "king" is to "queen" as "man" is to "woman"), they struggle with more complex, non-linear relationships. For example, static embeddings may not effectively differentiate between words with multiple meanings (polysemy) or capture the context-dependent nuances of words in different sentences. This limitation arises because they map each word to a fixed point in a vector space, without accounting for the dynamic nature of language, where meaning shifts with context.

Note that these challenges can be mitigated with techniques such as regularization and, more effectively, by using contextual embeddings like BERT and GPT.
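
As a rough illustration of the first limitation, the sketch below reuses the pre-trained GloVe vectors loaded earlier and compares how close a few occupation words sit to the pronouns "he" and "she". The exact numbers depend on the downloaded vectors, and a similarity gap is only an indicator, not a rigorous bias audit.


# Rough probe of gendered associations in the pre-trained GloVe vectors
for occupation in ["doctor", "nurse", "engineer"]:
    sim_he = glove_model.similarity(occupation, "he")
    sim_she = glove_model.similarity(occupation, "she")
    print(f"{occupation}: sim(he)={sim_he:.3f}  sim(she)={sim_she:.3f}")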


Conclusion

Word embeddings have transformed NLP by providing a way to capture the semantic meaning of words in a form that machines can understand. From Word2Vec to BERT, each technique offers unique strengths and applications. Whether you're working on search engines, recommendation systems, or chatbots, understanding and leveraging word embeddings can significantly enhance the performance of your models.

Feel free to experiment with the Python code provided in this article to create your own word embeddings and visualizations. Happy coding!


Final Thoughts

"The true power of language lies in its ability to connect words and meanings seamlessly. Word embeddings unlock this potential for machines, bringing us closer to truly intelligent systems."


Stay Connected!!

Happy Learning!
