Evolution of Word Embeddings: A Journey Through NLP History
Rany ElHousieny, PhD
Senior Software Engineering Manager (EX-Microsoft) | Generative AI Leader @ Clearwater Analytics | Generative AI, Conversational AI Solutions Architect
Word embeddings have revolutionized the field of natural language processing (NLP) by providing a way to represent words in a continuous vector space, capturing semantic and syntactic relationships. This article will explore the historical progression of word embedding techniques, highlighting key papers and providing Python examples for each method.
Word embeddings are a type of word representation that allows words with similar meanings to have similar representations. They are a distributed representation for text and are arguably one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.
The Beginning: One-Hot Encoding
One-Hot Encoding is the simplest form of word representation, where each word is encoded as a unique vector. Each word is represented as a vector of zeros with a single one at the index corresponding to the word in the vocabulary. However, this method does not capture any semantic information.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Sample vocabulary
vocab = ['cat', 'dog', 'fish']
# One-hot encoding
encoder = OneHotEncoder(sparse_output=False)  # 'sparse_output' replaces the older 'sparse' argument in scikit-learn >= 1.2
one_hot = encoder.fit_transform(np.array(vocab).reshape(-1, 1))
print(one_hot)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
This code performs one-hot encoding on a sample vocabulary using the OneHotEncoder class from the scikit-learn library: the encoder learns the three categories in vocab and returns one row per word.
In the output above, 'cat' is represented as [1, 0, 0], 'dog' as [0, 1, 0], and 'fish' as [0, 0, 1]. Each word is a vector with a 1 in the position corresponding to its category and 0s elsewhere.
Advancement: TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) was an early attempt to capture the importance of words in a document. This method weighs words based on their frequency in a document and across all documents, giving more importance to words that are rare across the corpus. It is more informative than one-hot encoding, but it still does not capture word semantics.
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample documents
docs = ['the cat sat on the mat', 'the dog barked at the cat']
# TF-IDF encoding
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(tfidf.toarray())
[[0.         0.         0.30253071 0.         0.42519636 0.42519636 0.42519636 0.60506143]
 [0.42519636 0.42519636 0.30253071 0.42519636 0.         0.         0.         0.60506143]]
This code demonstrates how to perform TF-IDF (Term Frequency-Inverse Document Frequency) encoding on a sample set of documents using the TfidfVectorizer class from the scikit-learn library. Here's a breakdown of each step:
1. Import the TfidfVectorizer class: The TfidfVectorizer class is imported from the sklearn.feature_extraction.text module. This class converts a collection of raw documents into a matrix of TF-IDF features.
2. Define the sample documents: A list called docs is created, containing two strings. Each string represents a document.
3. Instantiate the TfidfVectorizer: An instance of the TfidfVectorizer class is created, named vectorizer. This object will be used to compute the TF-IDF matrix.
4. Compute the TF-IDF matrix: The fit_transform method of the vectorizer object is called with the docs list as an argument. This method first fits the vectorizer to the data (i.e., learns the vocabulary and IDF vector) and then transforms the data into a TF-IDF matrix. The result is stored in the variable tfidf.
5. Print the TF-IDF matrix: The toarray() method is called on the tfidf sparse matrix to convert it into a dense NumPy array, which is then printed.
Output Explanation:
The output is a 2x8 matrix, where each row corresponds to a document in docs, and each column corresponds to a unique word in the vocabulary of the documents. The values in the matrix represent the TF-IDF scores of the words in the documents.
The exact values of the TF-IDF scores depend on the formula used by the TfidfVectorizer: each term frequency (TF) is multiplied by an inverse document frequency (IDF) that decreases the weight of words appearing in many documents, and each document vector is then L2-normalized by default.
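To make these numbers concrete, here is a minimal sketch that reproduces the score for 'cat' in the first document by hand, assuming scikit-learn's defaults (smooth_idf=True and L2 row normalization):
import numpy as np
# Term frequencies and document frequencies for the first document,
# "the cat sat on the mat", over the two-document corpus
n_docs = 2
tf = {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
df = {'the': 2, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1}
# scikit-learn's smoothed IDF: ln((1 + n_docs) / (1 + df)) + 1
raw = {w: tf[w] * (np.log((1 + n_docs) / (1 + df[w])) + 1) for w in tf}
# L2-normalize the document vector, as TfidfVectorizer does by default
norm = np.sqrt(sum(v ** 2 for v in raw.values()))
print(raw['cat'] / norm)  # ~0.30253071, matching the 'cat' entry in the first row above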
Breakthrough: Word2Vec
Word2Vec, introduced by Mikolov et al. at Google in 2013, uses shallow neural networks to learn word associations from large text corpora. It comes in two flavors, CBOW (predicting a word from its surrounding context) and Skip-gram (predicting the context from a word), and words that appear in similar contexts end up with similar vectors.
!pip install gensim
from gensim.models import Word2Vec
# Sample sentences
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'barked', 'at', 'the', 'cat']]
# Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
word_vectors = model.wv
print(word_vectors['cat'])
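Each word now has a 100-dimensional vector. As a quick illustration of the learned associations, you can query the model for nearest neighbours (with a toy corpus this small the similarities are not very meaningful, but the API is the same on real data):
# Words most similar to 'cat' by cosine similarity in the learned space
print(word_vectors.most_similar('cat', topn=3))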
Refinement: GloVe
GloVe (Global Vectors for Word Representation), developed at Stanford by Pennington et al. (2014), combines the advantages of matrix factorization techniques (like those used in latent semantic analysis) with the context-based learning of Word2Vec. It is trained on the global word-word co-occurrence matrix of a corpus.
Note: Python examples for GloVe typically use pre-trained vectors, because training GloVe from scratch is computationally expensive.
import gensim.downloader as api
# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-100')
print(glove_model['cat'])
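Because these vectors were trained on a large corpus, similarity queries are far more meaningful than in the toy Word2Vec example. A small illustrative check (the exact neighbours may vary slightly by model version):
# Nearest neighbours of 'cat' in the pre-trained GloVe space
print(glove_model.most_similar('cat', topn=5))
# The classic analogy test: king - man + woman ≈ queen
print(glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))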
Expansion: FastText
Developed by Facebook (Bojanowski et al., 2017), FastText extends Word2Vec to consider subword information, such as character n-grams. This allows it to capture the meaning of shorter words and suffixes/prefixes, and to build vectors even for out-of-vocabulary words, making it especially effective for languages with rich morphology.
from gensim.models import FastText
# FastText model (trained on the same toy sentences as the Word2Vec example above)
fasttext_model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)
print(fasttext_model.wv['cat'])
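Because FastText composes word vectors from character n-grams, it can also produce a vector for a word that never appeared in the training data, something Word2Vec cannot do. A small illustrative check (the word 'kitten' is not in the toy corpus above):
# 'kitten' is out-of-vocabulary, yet FastText builds a vector from its character n-grams
print('kitten' in fasttext_model.wv.key_to_index)  # False: not in the learned vocabulary
print(fasttext_model.wv['kitten'][:5])  # still returns (the first 5 dimensions of) a 100-dim vector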
Deep Contextualization: ELMo
ELMo (Embeddings from Language Models), developed by the Allen Institute for AI (Peters et al., 2018), introduced deep contextualized word representations that model both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., polysemy). Unlike the static embeddings above, the vector ELMo assigns to a word depends on the whole sentence it appears in.
Note: Due to the complexity of ELMo, we typically use pre-trained models; the example below relies on an older allennlp release in which ElmoEmbedder is still available.
from allennlp.commands.elmo import ElmoEmbedder
elmo = ElmoEmbedder()
# Get ELMo embeddings for a sentence
tokens = ["I", "have", "a", "cat"]
vectors = elmo.embed_sentence(tokens)
print(vectors.shape)  # (3, 4, 1024): one 1024-dimensional vector per token from each of ELMo's three layers
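ELMo's three layers capture different kinds of information (the lower layers lean more syntactic, the upper ones more semantic). A common and simple way to obtain a single vector per token is to average the layers; a minimal sketch:
import numpy as np
# Average ELMo's three layers to obtain one 1024-dimensional vector per token
token_vectors = np.mean(vectors, axis=0)
print(token_vectors.shape)  # (4, 1024)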
Revolution: BERT and Transformers
BERT (Bidirectional Encoder Representations from Transformers), introduced by Google (Devlin et al., 2018), revolutionized NLP. It uses the Transformer encoder to create context-sensitive embeddings by considering both the left and right context in all layers of the model.
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Encode text
input_text = "I have a cat"
encoded_input = tokenizer(input_text, return_tensors='pt')
output = model(**encoded_input)
print(output.last_hidden_state.shape)  # (1, number_of_tokens, 768) for bert-base-uncased
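To see the context sensitivity in action, here is a small illustrative sketch (reusing the tokenizer and model above) that compares the embedding of the word 'bank' in two different sentences; because BERT assigns the word a different vector in each context, the cosine similarity is noticeably below 1.0:
import torch
def embedding_of(word, sentence):
    # Contextual embedding of `word` taken from BERT's last hidden layer
    enc = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc['input_ids'][0])
    return hidden[tokens.index(word)]
v_money = embedding_of('bank', "I deposited cash at the bank")
v_river = embedding_of('bank', "We sat on the bank of the river")
print(torch.cosine_similarity(v_money, v_river, dim=0))  # well below 1.0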
Transformer-based Embeddings:
Following BERT, several other transformer-based models like RoBERTa, GPT (Generative Pretrained Transformer), T5 (Text-to-Text Transfer Transformer), and others have been developed, which provide powerful and context-aware word embeddings.
1. GPT (Generative Pretrained Transformer)
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
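For the base gpt2 checkpoint, last_hidden_states has shape (batch_size, sequence_length, 768); each position holds a context-aware embedding of the corresponding token:
print(last_hidden_states.shape)  # e.g. torch.Size([1, sequence_length, 768])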
2. RoBERTa (Robustly Optimized BERT Approach)
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)
last_hidden_states = outputs.last_hidden_state
3. T5 (Text-to-Text Transfer Transformer)
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "Das Haus ist wunderbar."
4. BART (Bidirectional and Auto-Regressive Transformers)
from transformers import BartTokenizer, BartForConditionalGeneration
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
inputs = tokenizer("The <mask> is very tall.", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
logits = outputs.logits
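The logits can be used to fill in the masked token; a minimal sketch that decodes the model's top prediction for the <mask> position (the exact word predicted depends on the checkpoint):
# Locate the <mask> token and decode the highest-scoring prediction for it
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
predicted_id = logits[0, mask_index].argmax().item()
print(tokenizer.decode([predicted_id]))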
These models have continued to push the boundaries of what's possible in NLP, offering improvements in various tasks such as language understanding, generation, and translation.