Unpacking Word Embeddings: A Journey Through Modern NLP Techniques


In the world of Natural Language Processing (NLP), word embeddings have revolutionized the way we understand and process text. Whether you’re building chatbots, recommendation systems, or sentiment analysis tools, word embeddings form the backbone of many modern NLP applications.

In this article, we'll dive deep into what word embeddings are, explore different techniques to create them, and provide Python code examples for each. By the end, you'll have a solid understanding of how to leverage these powerful tools in your own projects.


"Words are like living beings. The closer you observe, the more nuanced they become."


What Are Word Embeddings?

Word embeddings are dense vector representations of words that capture their meanings, syntactic properties, and relationships with other words. Unlike traditional one-hot encoding, where words are represented as sparse vectors, word embeddings map words into a continuous vector space where semantically similar words are closer together.

Imagine a 3D space where the word "king" is near "queen," and "man" is near "woman." Word embeddings allow for such meaningful relationships to be represented in a form that algorithms can easily process.
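
To make the idea concrete, here is a minimal, self-contained sketch. The three-dimensional vectors below are hand-picked for illustration only (they are not learned from any corpus); the point is that cosine similarity between vectors acts as a proxy for how related two words are.


import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two vectors; values near 1 mean similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings", hand-picked purely for illustration
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # lower: unrelated words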


Different Techniques for Creating Word Embeddings

1. Word2Vec

Word2Vec is one of the most popular techniques for creating word embeddings. Developed by Google, it uses a shallow, two-layer neural network to learn word associations from a large corpus of text.

Word2Vec has two main approaches:

  • Skip-gram: Predicts surrounding words given a current word.
  • Continuous Bag of Words (CBOW): Predicts the current word given surrounding words.


import gensim
from gensim.models import Word2Vec

# Sample sentences
sentences = [["machine", "learning", "is", "fun"], 
             ["python", "is", "a", "great", "language"], 
             ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]

# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # sg=0 for CBOW, sg=1 for Skip-gram

# Get word embeddings
vector = model.wv['machine']
print(vector)
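
Beyond looking up raw vectors, the trained model can be queried for nearest neighbours and pairwise similarities. With a toy corpus this small the numbers are essentially noise, but the calls below are the ones you would use on a real corpus:


# Words most similar to 'machine' according to the trained model
print(model.wv.most_similar('machine', topn=3))

# Cosine similarity between two in-vocabulary words
print(model.wv.similarity('machine', 'learning'))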
        

Diagrammatic Representation:

To visualize the relationships between words using Word2Vec, you can use t-SNE to reduce the dimensionality of the word vectors.


import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sample words to visualize
words = ['machine', 'learning', 'python', 'language', 'deep', 'fun', 'great', 'subset']

# Get vectors for the words as a single array
word_vectors = np.array([model.wv[word] for word in words])

# Apply t-SNE to reduce dimensions to 2D for visualization
# (perplexity must be smaller than the number of samples, here 8 words)
tsne = TSNE(n_components=2, random_state=42, perplexity=3)
word_vecs_2d = tsne.fit_transform(word_vectors)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')

for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))

plt.title('2D Visualization of Word Embeddings using t-SNE')
plt.show()
        


Relationship between Words

Graph Explanation:

  • Nodes: Represent words in the vocabulary (the unique words in the corpus).
  • Edges: Represent relationships between words based on their co-occurrences in the corpus (see the sketch below).
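
Here is a minimal sketch of building such a graph from the toy sentences used above. It relies on networkx (an extra dependency, assumed purely for illustration): words become nodes, and an edge is added whenever two words co-occur in the same sentence.


from itertools import combinations
import networkx as nx

# Toy sentences reused from the Word2Vec example above
sentences = [["machine", "learning", "is", "fun"],
             ["python", "is", "a", "great", "language"],
             ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]

# Nodes are unique words; edges connect words that co-occur in a sentence
graph = nx.Graph()
for sentence in sentences:
    for w1, w2 in combinations(set(sentence), 2):
        graph.add_edge(w1, w2)

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
print("Neighbours of 'learning':", sorted(graph.neighbors('learning')))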

Real-World Application:

  • Search Engines: Improving query results by understanding the context of search terms.


2. GloVe (Global Vectors for Word Representation)

GloVe is another widely-used method for word embeddings. Developed by Stanford, GloVe captures global statistical information by leveraging word co-occurrence matrices. Unlike Word2Vec, which focuses on local context, GloVe looks at the global context of words in a corpus.

import numpy as np
import gensim.downloader as api

# Load GloVe pre-trained model
glove_model = api.load("glove-wiki-gigaword-100")

# Get word embedding
vector = glove_model['learning']
print(vector)
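
Because the GloVe vectors are pre-trained on a large corpus, they support the classic vector-arithmetic analogies. The exact neighbour list depends on the downloaded vectors, but 'queen' typically ranks at or near the top here:


# Vector arithmetic: king - man + woman ≈ queen
print(glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))

# Nearest neighbours of a single word
print(glove_model.most_similar('learning', topn=5))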
        

Diagrammatic Representation:

You can also visualize GloVe embeddings similarly using t-SNE.


# Sample words to visualize
words = ['king', 'queen', 'man', 'woman', 'apple', 'orange', 'fruit', 'computer']

# Get vectors for the words as a single array
word_vectors = np.array([glove_model[word] for word in words])

# Apply t-SNE to reduce dimensions to 2D for visualization
# (perplexity must be smaller than the number of samples, here 8 words)
tsne = TSNE(n_components=2, random_state=42, perplexity=3)
word_vecs_2d = tsne.fit_transform(word_vectors)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')

for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))

plt.title('2D Visualization of GloVe Embeddings using t-SNE')
plt.show()
        


GloVe Embeddings

Graph Explanation:

  • Co-occurrence Matrix: Shows how often each word appears with every other word in the corpus (see the counting sketch below).
  • Vector Representation: Derived from matrix factorization, capturing global word relationships.
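
As a rough illustration of the counting step, the sketch below builds a co-occurrence matrix from the toy sentences used earlier, with a symmetric window of one word. GloVe itself goes on to factorize a weighted version of such a matrix; this sketch only shows where the counts come from.


from collections import Counter
import numpy as np

# Toy sentences reused from the earlier examples
sentences = [["machine", "learning", "is", "fun"],
             ["python", "is", "a", "great", "language"],
             ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]

# Count co-occurrences within a symmetric window of 1 word
window = 1
counts = Counter()
for sentence in sentences:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                counts[(word, sentence[j])] += 1

# Assemble the counts into a dense vocabulary-by-vocabulary matrix
vocab = sorted({w for s in sentences for w in s})
index = {w: k for k, w in enumerate(vocab)}
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
for (w1, w2), count in counts.items():
    matrix[index[w1], index[w2]] = count

print(vocab)
print(matrix)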

Real-World Application:

  • Recommendation Systems: Understanding user preferences by analyzing reviews or feedback.


3. FastText

FastText, developed by Facebook, is an extension of Word2Vec that considers subword information, making it robust to rare words or misspellings. FastText represents words as bags of character n-grams, which means it can generate embeddings for out-of-vocabulary (OOV) words by using the n-grams that compose them.

from gensim.models import FastText

# Sample sentences
sentences = [["machine", "learning", "is", "fun"], 
             ["python", "is", "a", "great", "language"], 
             ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]]

# Train FastText model
model = FastText(sentences, vector_size=100, window=5, min_count=1)

# Get word embedding
vector = model.wv['learning']
print(vector)
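
The subword trick is easiest to see with a word the model never saw during training. FastText still returns a vector for it, composed from the character n-grams it shares with known words:


# 'learnings' is not in the training vocabulary...
print('learnings' in model.wv.key_to_index)   # False

# ...but FastText can still build a vector for it from its character n-grams
oov_vector = model.wv['learnings']
print(oov_vector.shape)

# Its similarity to 'learning' is typically high, since the two share most n-grams
print(model.wv.similarity('learnings', 'learning'))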
        

Diagrammatic Representation:

To visualize FastText embeddings, you can again use t-SNE.

# Sample words to visualize
words = ['machine', 'learning', 'python', 'language', 'deep', 'fun', 'great', 'subset']

# Get vectors for the words as a single array
word_vectors = np.array([model.wv[word] for word in words])

# Apply t-SNE to reduce dimensions to 2D for visualization
# (perplexity must be smaller than the number of samples, here 8 words)
tsne = TSNE(n_components=2, random_state=42, perplexity=3)
word_vecs_2d = tsne.fit_transform(word_vectors)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(word_vecs_2d[:, 0], word_vecs_2d[:, 1], marker='o')

for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vecs_2d[i, 0], word_vecs_2d[i, 1]))

plt.title('2D Visualization of FastText Embeddings using t-SNE')
plt.show()
        


FastText Embeddings


Graph Explanation:

  • Character n-grams: Represent subword components that form the basis for word embeddings (see the extraction sketch below).
  • Word Vector Composition: Derived from the sum or average of n-gram vectors.
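
Here is a small sketch of the n-gram extraction, with '<' and '>' marking word boundaries as FastText does. It only shows which subword units a word is split into; the hashing and vector lookup that gensim performs on top of this are omitted.


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word, with '<' and '>' marking the word boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# The word's vector is then composed from the vectors of these subword units
print(char_ngrams("learning", min_n=3, max_n=4))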

Real-World Application:

  • Spell Checkers: Providing suggestions for misspelled words based on learned embeddings.


4. BERT (Bidirectional Encoder Representations from Transformers)

BERT, developed by Google, takes word embeddings to the next level by providing context-aware embeddings. Unlike previous models, BERT understands the context of a word in a sentence by looking at both the left and right sides of the word.


from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample sentence
sentence = "Machine learning is fascinating."

# Tokenize and encode the sentence
input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)

# Get BERT embeddings
with torch.no_grad():
    outputs = model(input_ids)
    last_hidden_states = outputs.last_hidden_state

print(last_hidden_states)
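
To see the context-awareness directly, the sketch below (reusing the tokenizer and model loaded above) compares the embedding of the word "bank" in two different sentences. The two vectors are not identical, and their cosine similarity is typically well below 1.0, although the exact value depends on the model.


import torch.nn.functional as F

# The same word gets a different vector depending on its context
contexts = ["He deposited cash at the bank.",
            "She sat on the bank of the river."]

bank_vectors = []
for text in contexts:
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
    bank_vectors.append(hidden[tokens.index("bank")])

# Cosine similarity between the two contextual embeddings of 'bank'
print(F.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0).item())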
        

Diagrammatic Representation:

To visualize BERT embeddings, we'll use t-SNE after pooling the word embeddings.


import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Sample sentences
sentences = ["Machine learning is fascinating.", 
             "Deep learning models require large amounts of data.",
             "Natural language processing is a subfield of AI."]

# Tokenize and encode sentences
tokens = [tokenizer.encode(sentence, add_special_tokens=True) for sentence in sentences]

# Pad sequences to the same length (0 is the [PAD] token id for bert-base-uncased)
max_len = max(len(token) for token in tokens)
padded_tokens = [token + [0] * (max_len - len(token)) for token in tokens]

# Convert to a tensor and build an attention mask so padding is ignored
input_ids = torch.tensor(padded_tokens)
attention_mask = (input_ids != 0).long()

# Get BERT embeddings
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    last_hidden_states = outputs.last_hidden_state

# Mean-pool over the real (non-padding) tokens to get one vector per sentence
mask = attention_mask.unsqueeze(-1).float()
sentence_embeddings = ((last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)).numpy()

# Apply t-SNE to reduce dimensions to 2D for visualization
# (perplexity must be smaller than the number of samples, here 3 sentences)
tsne = TSNE(n_components=2, random_state=42, perplexity=2)
sentence_vecs_2d = tsne.fit_transform(sentence_embeddings)

# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(sentence_vecs_2d[:, 0], sentence_vecs_2d[:, 1], marker='o')

for i, sentence in enumerate(sentences):
    plt.annotate(f"Sentence {i+1}", xy=(sentence_vecs_2d[i, 0], sentence_vecs_2d[i, 1]))

plt.title('2D Visualization of BERT Sentence Embeddings using t-SNE')
plt.show()
        


BERT Embeddings Representation


Graph Explanation:

  • Transformer Architecture: Processes input text by considering the context from all directions.
  • Dynamic Embeddings: Captures context-dependent meanings for words.

Real-World Application:

  • Chatbots: Understanding user intent by interpreting the context of words in sentences.


Limitations


  1. Bias Inherited from Training Data: Word embeddings are typically trained on large text corpora, which can contain various biases present in the data. These biases, whether related to gender, race, or other social factors, get encoded into the embeddings, leading to models that may inadvertently propagate or amplify them. For instance, embeddings might associate "doctor" with male pronouns more strongly than with female pronouns, reflecting and perpetuating societal stereotypes (a small probe of this is sketched below).
  2. Challenges in Capturing Complex Word Relationships: While word embeddings can effectively capture linear relationships (e.g., "king" is to "queen" as "man" is to "woman"), they struggle with more complex, non-linear relationships. For example, static embeddings may not effectively differentiate between words with multiple meanings (polysemy) or capture the context-dependent nuances of words in different sentences. This limitation arises because they map each word to a fixed point in a vector space, without accounting for the dynamic nature of language, where meaning shifts with context.

Note that these challenges can be mitigated with techniques such as regularization and, more effectively, by using contextual embeddings like BERT and GPT.
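
As a rough illustration of the first limitation, the sketch below reuses the pre-trained GloVe vectors loaded earlier and compares how close a few occupation words sit to the pronouns "he" and "she". The exact numbers depend on the downloaded vectors, and a similarity gap is only an indicator, not a rigorous bias audit.


# Rough probe of gendered associations in the pre-trained GloVe vectors
for occupation in ["doctor", "nurse", "engineer"]:
    sim_he = glove_model.similarity(occupation, "he")
    sim_she = glove_model.similarity(occupation, "she")
    print(f"{occupation}: sim(he)={sim_he:.3f}  sim(she)={sim_she:.3f}")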


Conclusion

Word embeddings have transformed NLP by providing a way to capture the semantic meaning of words in a form that machines can understand. From Word2Vec to BERT, each technique offers unique strengths and applications. Whether you're working on search engines, recommendation systems, or chatbots, understanding and leveraging word embeddings can significantly enhance the performance of your models.

Feel free to experiment with the Python code provided in this article to create your own word embeddings and visualizations. Happy coding!


Final Thoughts

"The true power of language lies in its ability to connect words and meanings seamlessly. Word embeddings unlock this potential for machines, bringing us closer to truly intelligent systems."


Stay Connected!!

Happy Learning!
