Unpacking Word Embeddings: A Journey Through Modern NLP Techniques
Aashish Singh
Lead ML Engineer @ Orange Business | MBA | Generative AI Innovator | Tech Blogger | Helping Community Grow | Certified ML Developer | AI Solutions Pioneer
In the world of Natural Language Processing (NLP), word embeddings have revolutionized the way we understand and process text. Whether you’re building chatbots, recommendation systems, or sentiment analysis tools, word embeddings form the backbone of many modern NLP applications.
In this article, we'll dive deep into what word embeddings are, explore different techniques to create them, and provide Python code examples for each. By the end, you'll have a solid understanding of how to leverage these powerful tools in your own projects.
"Words are like living beings. The closer you observe, the more nuanced they become."
What Are Word Embeddings?
Word embeddings are dense vector representations of words that capture their meanings, syntactic properties, and relationships with other words. Unlike traditional one-hot encoding, where words are represented as sparse vectors, word embeddings map words into a continuous vector space where semantically similar words are closer together.
Imagine a 3D space where the word "king" is near "queen," and "man" is near "woman." Word embeddings allow for such meaningful relationships to be represented in a form that algorithms can easily process.
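As a minimal illustration (the numbers below are invented purely for demonstration and do not come from any trained model), compare one-hot vectors with dense embedding vectors: one-hot vectors of different words are always orthogonal, so they carry no notion of similarity, while dense vectors can place related words close together.
import numpy as np
# One-hot encoding: every pair of distinct words has zero similarity
one_hot_king = np.array([1, 0, 0, 0])
one_hot_queen = np.array([0, 1, 0, 0])
# Hypothetical dense embeddings (toy 4-dimensional values, chosen by hand)
dense_king = np.array([0.8, 0.65, 0.1, 0.05])
dense_queen = np.array([0.75, 0.7, 0.15, 0.1])
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine(one_hot_king, one_hot_queen))  # 0.0 -- one-hot vectors carry no similarity
print(cosine(dense_king, dense_queen))      # close to 1.0 -- dense vectors capture relatedness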
Different Techniques for Creating Word Embeddings
1. Word2Vec
Word2Vec is one of the most popular techniques for creating word embeddings. Developed by Google, it uses a shallow, two-layer neural network to learn word associations from a large corpus of text.
Word2Vec has two main approaches: CBOW (Continuous Bag of Words), which predicts a target word from its surrounding context words, and Skip-gram, which predicts the surrounding context words from a target word. The sg parameter in the code below switches between them.
import gensim
from gensim.models import Word2Vec
# Sample sentences
sentences = [["machine", "learning", "is", "fun"],
["python", "is", "a", "great", "language"],
["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0) # sg=0 for CBOW, sg=1 for Skip-gram
# Get word embeddings
vector = model.wv['machine']
print(vector)
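Once the model is trained, you can query it for similar words. This is a quick sketch using the toy model above, so the neighbours only reflect the three sample sentences; with a real corpus they become genuinely semantic:
# Nearest neighbours of a word in the learned embedding space
print(model.wv.most_similar('machine', topn=3))
# Cosine similarity between two words
print(model.wv.similarity('machine', 'learning'))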
Diagrammatic Representation:
To visualize the relationships between words using Word2Vec, you can use t-SNE to reduce the dimensionality of the word vectors.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Sample words to visualize
words = ['machine', 'learning', 'python', 'language', 'deep', 'fun', 'great', 'subset']
# Get vectors for the words
word_vectors = [model.wv[word] for word in words]
# Apply t-SNE to reduce dimensions to 2D for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=5)  # perplexity must be smaller than the number of points (8 here)
word_vecs_2d = tsne.fit_transform(word_vectors)
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))
plt.title('2D Visualization of Word Embeddings using t-SNE')
plt.show()
Graph Explanation: Each point in the scatter plot is a word placed at its 2D t-SNE coordinates. Words the model treats as similar, here simply words that co-occur in the toy sentences such as "machine" and "learning", tend to land closer together; with a larger corpus these clusters become genuinely semantic.
Real-World Application: Word2Vec embeddings are widely used for semantic search, document similarity, and recommendation systems, where items a user interacts with can be treated like words in a sentence and embedded the same way.
2. GloVe (Global Vectors for Word Representation)
GloVe is another widely used method for creating word embeddings. Developed at Stanford, it is trained on a global word co-occurrence matrix built from the entire corpus. Unlike Word2Vec, which learns from local context windows, GloVe directly leverages these corpus-wide co-occurrence statistics.
import numpy as np
import gensim.downloader as api
# Load GloVe pre-trained model
glove_model = api.load("glove-wiki-gigaword-100")
# Get word embedding
vector = glove_model['learning']
print(vector)
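To make the classic "king − man + woman ≈ queen" relationship concrete, you can run an analogy query with vector arithmetic. This is a short sketch using gensim's KeyedVectors API on the loaded GloVe vectors; the exact ranking depends on the pre-trained model:
# Which word relates to "woman" the way "king" relates to "man"?
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
print(result)  # "queen" typically ranks at or near the top
# Related pairs score higher than unrelated ones
print(glove_model.similarity('apple', 'fruit'))
print(glove_model.similarity('apple', 'computer'))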
Diagrammatic Representation:
You can also visualize GloVe embeddings similarly using t-SNE.
# Sample words to visualize
words = ['king', 'queen', 'man', 'woman', 'apple', 'orange', 'fruit', 'computer']
# Get vectors for the words
word_vectors = [glove_model[word] for word in words]
# Apply t-SNE to reduce dimensions to 2D for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=5)  # perplexity must be smaller than the number of points (8 here)
word_vecs_2d = tsne.fit_transform(word_vectors)
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))
plt.title('2D Visualization of GloVe Embeddings using t-SNE')
plt.show()
Graph Explanation: With the pre-trained GloVe vectors, related words tend to cluster in the t-SNE plot, for example "king", "queen", "man", and "woman" near one another and "apple", "orange", and "fruit" in a separate group, while "computer" sits apart.
Real-World Application: Pre-trained GloVe vectors are a popular drop-in feature layer for text classification, sentiment analysis, and information retrieval, especially when labelled training data is too limited to learn embeddings from scratch.
3. FastText
FastText, developed by Facebook, is an extension of Word2Vec that considers subword information, making it robust to rare words or misspellings. FastText represents words as bags of character n-grams, which means it can generate embeddings for out-of-vocabulary (OOV) words by using the n-grams that compose them.
from gensim.models import FastText
# Sample sentences
sentences = [["machine", "learning", "is", "fun"],
["python", "is", "a", "great", "language"],
["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]
# Train FastText model
model = FastText(sentences, vector_size=100, window=5, min_count=1)
# Get word embedding
vector = model.wv['learning']
print(vector)
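To illustrate the subword idea, here is a hedged sketch using the toy model trained above (the absolute numbers are not meaningful on such a tiny corpus): FastText can still return a vector for a misspelled word that never appeared in training, because it composes the vector from character n-grams, whereas a plain Word2Vec model would raise a KeyError.
# "learnign" (misspelled) is not in the training vocabulary,
# but FastText builds an embedding from its character n-grams
oov_vector = model.wv['learnign']
print(oov_vector.shape)
# Its n-grams overlap heavily with "learning", so the two words stay close
print(model.wv.similarity('learnign', 'learning'))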
Diagrammatic Representation:
To visualize FastText embeddings, you can again use t-SNE.
# Sample words to visualize
words = ['machine', 'learning', 'python', 'language', 'deep', 'fun', 'great', 'subset']
# Get vectors for the words
word_vectors = [model.wv[word] for word in words]
# Apply t-SNE to reduce dimensions to 2D for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=5)  # perplexity must be smaller than the number of points (8 here)
word_vecs_2d = tsne.fit_transform(word_vectors)
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))
plt.title('2D Visualization of FastText Embeddings using t-SNE')
plt.show()
Graph Explanation: The layout resembles the Word2Vec plot, but because FastText builds every vector from character n-grams, morphologically related or rare words tend to end up close to their frequent relatives even with very little training data.
Real-World Application: FastText shines on noisy, user-generated text such as search queries, tweets, and product reviews, and on morphologically rich languages, because it can embed misspelled or previously unseen words.
4. BERT (Bidirectional Encoder Representations from Transformers)
BERT, developed by Google, takes word embeddings to the next level by providing context-aware embeddings. Unlike previous models, BERT understands the context of a word in a sentence by looking at both the left and right sides of the word.
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Sample sentence
sentence = "Machine learning is fascinating."
# Tokenize and encode the sentence
input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
# Get BERT embeddings
with torch.no_grad():
    outputs = model(input_ids)
    last_hidden_states = outputs.last_hidden_state
print(last_hidden_states)
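To see what "context-aware" means in practice, here is a small sketch comparing the embedding of the same word in two different sentences. It reuses the tokenizer and model loaded above and assumes the word "bank" is kept as a single token by the bert-base-uncased tokenizer (which it is):
# The word "bank" appears in two different contexts
sent_a = "She sat on the bank of the river."
sent_b = "He deposited cash at the bank."
def bank_vector(sentence):
    # Encode the sentence and locate the position of the "bank" token
    inputs = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
    idx = tokens.index('bank')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[0, idx]
# Cosine similarity is below 1.0 because BERT encodes the surrounding context
vec_a, vec_b = bank_vector(sent_a), bank_vector(sent_b)
print(torch.nn.functional.cosine_similarity(vec_a, vec_b, dim=0).item())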
Diagrammatic Representation:
To visualize BERT embeddings, we'll use t-SNE after pooling the word embeddings.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Sample sentences
sentences = ["Machine learning is fascinating.",
"Deep learning models require large amounts of data.",
"Natural language processing is a subfield of AI."]
# Tokenize and encode sentences
tokens = [tokenizer.encode(sentence, add_special_tokens=True) for sentence in sentences]
# Pad sequences to the same length and build an attention mask so padding is ignored
max_len = max(len(token) for token in tokens)
padded_tokens = [token + [0] * (max_len - len(token)) for token in tokens]
attention_mask = [[1] * len(token) + [0] * (max_len - len(token)) for token in tokens]
# Convert to tensors
input_ids = torch.tensor(padded_tokens)
attention_mask = torch.tensor(attention_mask)
# Get BERT embeddings
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    last_hidden_states = outputs.last_hidden_state
# Mean-pool the token embeddings, excluding padding positions
mask = attention_mask.unsqueeze(-1)
sentence_embeddings = ((last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)).numpy()
# Apply t-SNE to reduce dimensions to 2D for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=2)  # perplexity must be smaller than the number of sentences (3 here)
sentence_vecs_2d = tsne.fit_transform(sentence_embeddings)
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(sentence_vecs_2d[:, 0], sentence_vecs_2d[:, 1], marker='o')
for i, sentence in enumerate(sentences):
    plt.annotate(f"Sentence {i+1}", xy=(sentence_vecs_2d[i, 0], sentence_vecs_2d[i, 1]))
plt.title('2D Visualization of BERT Sentence Embeddings using t-SNE')
plt.show()
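Because a t-SNE plot of only three points is hard to interpret, it can also help to compare the pooled sentence embeddings directly. This is a small sketch that reuses sentence_embeddings from the code above and assumes scikit-learn is available:
from sklearn.metrics.pairwise import cosine_similarity
# Pairwise cosine similarities between the three sentence embeddings
sim_matrix = cosine_similarity(sentence_embeddings)
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"Sentence {i+1} vs Sentence {j+1}: {sim_matrix[i, j]:.3f}")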
Graph Explanation: Each point now represents a whole sentence via its mean-pooled BERT embedding; sentences with related meaning tend to sit closer together, although with only three points the t-SNE layout should be read loosely.
Real-World Application: Contextual BERT embeddings power semantic search, question answering, named entity recognition, and sentiment analysis, where the meaning of a word depends heavily on its surrounding context.
Limitations
Classical word embeddings do have limitations: they assign a single vector to each word regardless of context, so polysemous words such as "bank" get one blended representation; they struggle with out-of-vocabulary words unless subword information is used; they inherit biases present in the training corpus; and they can overfit when trained on small datasets. Please note that these challenges can largely be overcome with regularization, subword models like FastText, and contextual embeddings such as BERT and GPT.
Conclusion
Word embeddings have transformed NLP by providing a way to capture the semantic meaning of words in a form that machines can understand. From Word2Vec to BERT, each technique offers unique strengths and applications. Whether you're working on search engines, recommendation systems, or chatbots, understanding and leveraging word embeddings can significantly enhance the performance of your models.
Feel free to experiment with the Python code provided in this article to create your own word embeddings and visualizations. Happy coding!
Final Thoughts
"The true power of language lies in its ability to connect words and meanings seamlessly. Word embeddings unlock this potential for machines, bringing us closer to truly intelligent systems."
Stay Connected!!
Happy Learning!