Evolution of Word Embeddings: A Journey Through NLP History

Word embeddings have revolutionized the field of natural language processing (NLP) by providing a way to represent words in a continuous vector space, capturing semantic and syntactic relationships. This article will explore the historical progression of word embedding techniques, highlighting key papers and providing Python examples for each method.


Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

The Beginning: One-Hot Encoding

One-Hot Encoding is the simplest form of word representation, where each word is encoded as a unique vector. Each word is represented as a vector of zeros with a single one at the index corresponding to the word in the vocabulary. However, this method does not capture any semantic information.

  • Date: Pre-2000s
  • Key Paper: Not applicable (common knowledge)

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample vocabulary
vocab = ['cat', 'dog', 'fish']

# One-hot encoding
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn versions before 1.2
one_hot = encoder.fit_transform(np.array(vocab).reshape(-1, 1))

print(one_hot)
        
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]        

This code demonstrates how to perform one-hot encoding on a sample vocabulary using the OneHotEncoder class from the scikit-learn library. Here's a breakdown of each step:

  • Import the OneHotEncoder class: The OneHotEncoder class is imported from the sklearn.preprocessing module. This class is used to convert categorical variables into a one-hot numerical representation.
  • Define the sample vocabulary: A list called vocab is created, containing three words: 'cat', 'dog', and 'fish'. These words represent the categorical variables that we want to encode.
  • Instantiate the OneHotEncoder: An instance of the OneHotEncoder class is created, named encoder. The parameter sparse_output=False (sparse=False on scikit-learn versions before 1.2) ensures that the output is a dense NumPy array instead of the default sparse matrix.
  • Reshape the vocabulary and perform one-hot encoding: The vocab list is converted into a numpy array and reshaped to have a single column (using .reshape(-1, 1)). This is necessary because fit_transform expects a 2D array as input. The reshaped array is then passed to the fit_transform method of the encoder object. This method first fits the encoder to the data (i.e., learns the unique categories) and then transforms the data into a one-hot encoded array.

  • Result (one_hot): The variable one_hot now contains the one-hot encoded representation of the vocab list. Each row in one_hot corresponds to a word in vocab, and each column corresponds to a unique category (word). A 1 marks the column of that word's category; every other entry is 0.

In this output, 'cat' is represented as [1, 0, 0], 'dog' as [0, 1, 0], and 'fish' as [0, 0, 1]: each word is a vector with a 1 in the position corresponding to its category and 0s elsewhere.
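
A quick way to see the limitation mentioned above is to compare the vectors directly: every pair of distinct one-hot vectors is orthogonal, so the encoding carries no notion of 'cat' being more similar to 'dog' than to 'fish'. A minimal check, continuing the snippet above:

import numpy as np

cat, dog, fish = one_hot[0], one_hot[1], one_hot[2]

# All distinct word pairs have dot product 0: one-hot vectors encode identity, not similarity
print(np.dot(cat, dog))   # 0.0
print(np.dot(cat, fish))  # 0.0
print(np.dot(cat, cat))   # 1.0: only a word matched with itself scores non-zero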


Advancement: TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) was an early attempt to capture the importance of words in a document. The method weighs each word by its frequency within a document and its rarity across all documents, giving more weight to rare, discriminative words. It is more informative than one-hot encoding but still does not capture word semantics.

  • Date: 1970s
  • Key Paper: Sparck Jones, K. (1972). "A statistical interpretation of term specificity and its application in retrieval."

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
docs = ['the cat sat on the mat', 'the dog barked at the cat']

# TF-IDF encoding
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(tfidf.toarray())
        
[[0.         0.         0.30253071 0.         0.42519636 0.42519636
  0.42519636 0.60506143]
 [0.42519636 0.42519636 0.30253071 0.42519636 0.         0.
  0.         0.60506143]]        


This code demonstrates how to perform TF-IDF (Term Frequency-Inverse Document Frequency) encoding on a sample set of documents using the TfidfVectorizer class from the scikit-learn library. Here's a breakdown of each step:


Import the TfidfVectorizer class:

The TfidfVectorizer class is imported from the sklearn.feature_extraction.text module. This class is used to convert a collection of raw documents into a matrix of TF-IDF features.


Define the sample documents:

A list called docs is created, containing two strings. Each string represents a document.

Instantiate the TfidfVectorizer:

An instance of the TfidfVectorizer class is created, named vectorizer. This object will be used to compute the TF-IDF matrix.


Compute the TF-IDF matrix:

The fit_transform method of the vectorizer object is called with the docs list as an argument. This method first fits the vectorizer to the data (i.e., learns the vocabulary and idf vector) and then transforms the data into a TF-IDF matrix. The result is stored in the variable tfidf.

Print the TF-IDF matrix:

The toarray() method is called on the tfidf sparse matrix to convert it into a dense numpy array, which is then printed.

Output Explanation:

The output is a 2x8 matrix, where each row corresponds to a document in docs, and each column corresponds to a unique word in the vocabulary of the documents. The values in the matrix represent the TF-IDF scores of the words in the documents.


The exact values of the TF-IDF scores depend on the formula used by TfidfVectorizer, which combines the frequency of a word in a document (TF) with its inverse document frequency (IDF); the IDF factor decreases the weight of words that appear in many documents. By default, scikit-learn uses a smoothed IDF and then L2-normalizes each document's row.
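
To make the column-to-word mapping and the scoring concrete, the sketch below (continuing the snippet above) prints the learned vocabulary and recomputes one score by hand. It assumes scikit-learn's defaults, i.e. smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, and uses get_feature_names_out, which is available from scikit-learn 1.0 onward.

import numpy as np

# Map each column of the TF-IDF matrix back to its word (columns are sorted alphabetically)
print(vectorizer.get_feature_names_out())
# ['at' 'barked' 'cat' 'dog' 'mat' 'on' 'sat' 'the']

# Recompute the score of 'the' in the first document by hand
n_docs = 2
tf = {'cat': 1, 'mat': 1, 'on': 1, 'sat': 1, 'the': 2}   # term counts in document 1
df = {'cat': 2, 'mat': 1, 'on': 1, 'sat': 1, 'the': 2}   # document frequencies in the corpus
idf = {w: np.log((1 + n_docs) / (1 + df[w])) + 1 for w in tf}

raw = {w: tf[w] * idf[w] for w in tf}
norm = np.sqrt(sum(v ** 2 for v in raw.values()))
print(raw['the'] / norm)   # ~0.605, matching the last column of the first row above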

Breakthrough: Word2Vec

Word2Vec, developed at Google, uses a shallow neural network to learn dense word vectors from the contexts in which words appear. It comes in two training architectures: continuous bag-of-words (CBOW), which predicts a word from its surrounding context, and skip-gram, which predicts the surrounding context from a word.

  • Date: 2013
  • Key Paper: Mikolov, T., et al. (2013). "Efficient Estimation of Word Representations in Vector Space."

!pip install gensim        


from gensim.models import Word2Vec

# Sample sentences
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'barked', 'at', 'the', 'cat']]

# Word2Vec model (the default sg=0 trains CBOW; pass sg=1 for skip-gram)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
word_vectors = model.wv

print(word_vectors['cat'])
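
Once trained, the vectors can be queried for similarity. A brief usage sketch follows; on this two-sentence toy corpus the neighbours are essentially random, so meaningful results require a much larger training corpus.

# Nearest neighbours and pairwise similarity from the learned vectors
print(word_vectors.most_similar('cat', topn=3))
print(word_vectors.similarity('cat', 'dog'))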
        


Refinement: GloVe

GloVe (Global Vectors for Word Representation), developed at Stanford, combines the advantages of global matrix factorization techniques (like those used in latent semantic analysis) with the local, context-based learning of Word2Vec. It is trained on the word-word co-occurrence matrix of a corpus.

  • Date: 2014
  • Key Paper: Pennington, J., Socher, R., & Manning, C. (2014). "GloVe: Global Vectors for Word Representation."

Note: Python example for GloVe typically involves using pre-trained vectors due to the computational complexity of training.

import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-100')

print(glove_model['cat'])
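
The object returned by gensim's downloader is a KeyedVectors instance, so the same query interface applies; for example, nearest neighbours and the classic king/queen analogy:

# Nearest neighbours in the pre-trained GloVe space
print(glove_model.most_similar('cat', topn=5))

# The classic analogy test: vector('king') - vector('man') + vector('woman') ≈ vector('queen')
print(glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))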

        


Expansion: FastText

Developed by Facebook, FastText extends Word2Vec to consider subword information, such as character n-grams. This allows it to capture the meaning of shorter words and suffixes/prefixes, making it more effective for languages with rich morphology.

  • Date: 2016
  • Key Paper: Bojanowski, P., et al. (2016). "Enriching Word Vectors with Subword Information."

from gensim.models import FastText

# FastText model (trained on the same toy sentences used in the Word2Vec example)
fasttext_model = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)

print(fasttext_model.wv['cat'])
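
The practical payoff of the subword n-grams is handling out-of-vocabulary words: FastText can compose a vector for a token it never saw in training, which Word2Vec cannot. A short sketch using the deliberately misspelled token 'catt':

# 'catt' is not in the training vocabulary...
print('catt' in fasttext_model.wv.key_to_index)   # False

# ...but FastText still builds a vector for it from its character n-grams
print(fasttext_model.wv['catt'][:5])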

        


Deep Contextualization: ELMo

ELMo (Embeddings from Language Models), developed by the Allen Institute for AI, introduced deep contextualized word representations. They model both complex characteristics of word use (e.g., syntax and semantics) and how those uses vary across linguistic contexts (i.e., polysemy), so the same word receives different vectors in different sentences.

  • Date: 2018
  • Key Paper: Peters, M. E., et al. (2018). "Deep contextualized word representations."

Note: Due to the complexity of training ELMo, pre-trained models are typically used; the example below relies on the ElmoEmbedder helper from the allennlp library (available in its earlier releases).

from allennlp.commands.elmo import ElmoEmbedder

# Load a pre-trained ELMo model (the weights are downloaded on first use)
elmo = ElmoEmbedder()

# Get ELMo embeddings for a sentence
tokens = ["I", "have", "a", "cat"]
vectors = elmo.embed_sentence(tokens)

# Shape (3, 4, 1024): three layers, four tokens, a 1024-dimensional vector each
print(vectors.shape)
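
Because the vectors are context-dependent, the same surface word gets different embeddings in different sentences. The sketch below illustrates this by comparing top-layer (index 2) vectors for 'bank' in two senses using cosine similarity; the sentences are purely illustrative.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sent1 = ["I", "deposited", "money", "at", "the", "bank"]
sent2 = ["We", "sat", "on", "the", "river", "bank"]

# Top-layer vector of the final token ('bank') in each sentence
bank_financial = elmo.embed_sentence(sent1)[2][-1]
bank_river = elmo.embed_sentence(sent2)[2][-1]

# A static embedding would give identical vectors; ELMo's differ because the contexts differ
print(cosine(bank_financial, bank_river))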

        


Revolution: BERT and Transformers

BERT (Bidirectional Encoder Representations from Transformers), developed by Google, revolutionized NLP. It uses the Transformer encoder architecture to create context-sensitive embeddings by considering both left and right context in all layers of the model.

  • Date: 2018
  • Key Paper: Devlin, J., et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding."

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Encode text and run it through the model
input_text = "I have a cat"
encoded_input = tokenizer(input_text, return_tensors='pt')
output = model(**encoded_input)

# One contextual vector per token, including [CLS] and [SEP]
print(output.last_hidden_state.shape)   # torch.Size([1, 6, 768])

        



Transformer-based Embeddings

Following BERT, several other transformer-based models like RoBERTa, GPT (Generative Pretrained Transformer), T5 (Text-to-Text Transfer Transformer), and others have been developed, which provide powerful and context-aware word embeddings.

1. GPT (Generative Pretrained Transformer)

  • Date: June 2018
  • Key Paper: Radford, A., et al. (2018). "Improving Language Understanding by Generative Pre-Training."
  • Python Example:

from transformers import GPT2Tokenizer, GPT2Model

# GPT-2 is used here as the publicly available successor to the original GPT
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
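
Each row of last_hidden_state is the contextual embedding of one input token. A short sketch lining the vectors up with GPT-2's byte-pair tokens:

# Align each contextual vector with the token it represents
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

print(last_hidden_states.shape)          # (batch_size, sequence_length, 768 hidden dimensions for gpt2)
for token, vector in zip(tokens, last_hidden_states[0]):
    print(token, vector[:3].tolist())    # first few dimensions of each token's embedding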

        

2. RoBERTa (Robustly Optimized BERT Approach)

  • Date: July 2019
  • Key Paper: Liu, Y., et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach."
  • Python Example:

from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

        

3. T5 (Text-to-Text Transfer Transformer)

  • Date: October 2019
  • Key Paper: Raffel, C., et al. (2019). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer."
  • Python Example:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

input_ids = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

        

4. BART (Bidirectional and Auto-Regressive Transformers)

  • Date: October 2019
  • Key Paper: Lewis, M., et al. (2019). "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension."
  • Python Example:

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

inputs = tokenizer("The <mask> is very tall.", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])

loss = outputs.loss
logits = outputs.logits
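
Since the input contains a <mask> token, the logits at that position can be decoded to see which words the model proposes as fills, along the lines of the mask-filling examples in the transformers documentation:

# Locate the <mask> position and read off the five most likely replacements
masked_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

print(tokenizer.decode(predictions))   # candidate fills for the masked word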

        

These models have continued to push the boundaries of what's possible in NLP, offering improvements in various tasks such as language understanding, generation, and translation.


