Word Embeddings: Making Text Understandable to Machines
Varghese Chacko
Technology Executive | Director of Engineering & AI Strategy | Enterprise AI, GenAI & Automation Leader | Scaling AI-Powered Cloud & DevOps | Digital Transformation
Words hold meaning. When humans read a sentence, we understand more than just the individual words – we comprehend the sentiment, the narrative, the nuances, and the rich tapestry of language. But for a computer, words are just symbols, devoid of any intrinsic meaning. This is where word embeddings come into play.
Introduction to Word Vectors
Before we dive deep into the popular models, it’s essential to understand the foundation: the word vector. At its core, a word vector is a numerical representation of a word. This numeric form allows the machine to perform algebraic operations on words, potentially revealing semantic relationships such as similarity and analogy.
The principle behind word vectors is that the semantic meaning of a word can be represented by its context. As the adage goes in NLP, “You shall know a word by the company it keeps.” So, words appearing in similar contexts would have vectors close to each other in the embedding space.
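As a quick, hedged illustration of what those algebraic operations can reveal, the sketch below loads a small set of pre-trained vectors through Gensim’s downloader (the model name “glove-wiki-gigaword-50” is one of the sets the downloader typically exposes, and fetching it requires a download) and runs the classic king − man + woman analogy:

import gensim.downloader as api

# Load a small set of pre-trained vectors (downloads on first use)
vectors = api.load("glove-wiki-gigaword-50")

# Vector arithmetic: king - man + woman should land near "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Words that appear in similar contexts sit close together
print(vectors.similarity("ice", "solid"))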
Word2Vec
Developed by a team led by Tomas Mikolov at Google, Word2Vec is perhaps the most popular technique for learning word embeddings. It employs a shallow neural network and comes in two flavors:
- Continuous Bag of Words (CBOW): Predicts the target word (e.g., ‘apple’) from its surrounding context words (‘I ate an … pie’).
- Skip-Gram: Does the inverse; it predicts the context words from the target word.
The elegance of Word2Vec lies in its simplicity and scalability, making it one of the first choices for researchers and industry teams producing word embeddings.
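To make the CBOW/Skip-Gram distinction concrete, here is a minimal sketch (on a toy corpus, so the resulting vectors are only illustrative) showing how Gensim switches between the two architectures via the sg parameter:

from gensim.models import Word2Vec

sentences = [["I", "ate", "an", "apple", "pie"],
             ["I", "love", "apple", "pie"]]

# sg=0 (the default) trains CBOW: predict the target word from its context
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 trains Skip-Gram: predict the context words from the target word
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv.most_similar("apple", topn=2))
print(skipgram_model.wv.most_similar("apple", topn=2))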
GloVe (Global Vectors for Word Representation)
Developed at Stanford, GloVe constructs explicit word-word co-occurrence statistics from massive datasets. The central idea is to capture global statistical information: how often words co-occur with one another across the entire corpus.
For instance, “ice” co-occurs with “solid” far more often than with “fashion”, just as “steam” co-occurs with “gas”; GloVe’s training objective is built around ratios of such co-occurrence probabilities.
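GloVe itself is trained from a co-occurrence matrix. As a rough sketch of what that global statistic looks like, the toy code below counts window-based co-occurrences (this is only the counting step, not the GloVe training objective itself):

from collections import defaultdict

corpus = [["ice", "is", "solid"],
          ["steam", "is", "gas"],
          ["ice", "is", "cold"]]
window = 2

# Count how often each pair of words appears within the window
cooccur = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooccur[(word, sentence[j])] += 1

print(cooccur[("ice", "solid")], cooccur[("ice", "gas")])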
FastText
FastText, introduced by Facebook’s AI Research lab, takes a slightly different approach. Unlike Word2Vec, which treats the word as the smallest unit to train on, FastText looks one level below, at subword units: the character n-grams that make up each word.
Consider the word “apple-ish”. While Word2Vec might treat it as an entirely new word, FastText would recognize that it is related to “apple” because it can break it down into subwords.
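Here is a hedged sketch of the subword idea. The helper below (extract_ngrams is an illustrative name, not a Gensim function) lists the character n-grams a FastText-style model would work with, showing the overlap between “apple” and “apple-ish”:

def extract_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word, FastText-style,
    with < and > marking the word boundaries."""
    padded = f"<{word}>"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            ngrams.add(padded[i:i + n])
    return ngrams

# The two words share many n-grams, so their vectors end up related
shared = extract_ngrams("apple") & extract_ngrams("apple-ish")
print(shared)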
As we delve deeper into the realms of text analytics and representation, it helps to get hands-on. The examples below use Gensim, a popular Python library for training and loading word embeddings.
You can install gensim using both pip and conda:
Using pip:
pip install gensim
Using conda:
If you’re using Anaconda or Miniconda, you can install gensim from the conda-forge channel:
conda install -c conda-forge gensim
You can generate word embeddings using Word2Vec, GloVe, and FastText:
1. Word2Vec using Gensim
from gensim.models import Word2Vec
sentences = [["I", "love", "JotLore"],
["Word", "embeddings", "are", "useful"],
["Gensim", "provides", "easy", "tools", "for", "Word2Vec"]]
# Train Word2Vec model
model_w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model_w2v.save("word2vec.model")
# Retrieve the vector for 'JotLore' (a word that appears in the training sentences)
vector_jotlore = model_w2v.wv['JotLore']
print(vector_jotlore)
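Continuing from the snippet above (model_w2v stays in scope), the trained model can also be queried for similarity; on this toy corpus the numbers are essentially noise, so treat it purely as an API illustration:

# Words most similar to 'JotLore' according to the toy model
print(model_w2v.wv.most_similar("JotLore", topn=3))

# Cosine similarity between two in-vocabulary words
print(model_w2v.wv.similarity("Word", "embeddings"))

# Reload the saved model later
loaded_model = Word2Vec.load("word2vec.model")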
2. GloVe using Gensim
To use GloVe embeddings in Python, one common approach is to convert pre-trained GloVe vectors to Word2Vec format and then use Gensim to load and manipulate them.
First, you need to convert GloVe vectors to the Word2Vec format. You can do this using the glove2word2vec script provided by Gensim.
from gensim.scripts.glove2word2vec import glove2word2vec
# Assuming you have downloaded the GloVe vectors and they are stored in 'glove.txt'
glove_input_file = 'glove.txt'
word2vec_output_file = 'glove_word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)
from gensim.models import KeyedVectors
# Load the converted GloVe vectors
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
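With the vectors loaded, the KeyedVectors object can be queried just like a Word2Vec model. The words below assume a standard pre-trained GloVe release (for example the Wikipedia/Gigaword vectors); recent Gensim versions can also skip the conversion step entirely by calling load_word2vec_format with no_header=True directly on the GloVe file.

# Nearest neighbours of 'ice' in the pre-trained space
print(glove_model.most_similar("ice", topn=5))

# The classic analogy test: king - man + woman
print(glove_model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))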
3. FastText using Gensim
from gensim.models import FastText
sentences = [["I", "love", "JotLore"],
["Word", "embeddings", "are", "useful"],
["Gensim", "provides", "easy", "tools", "for", "FastText"]]
# Train FastText model
model_ft = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)
model_ft.save("fasttext.model")
# Retrieve vector for 'JotLore'
vector_nlp_ft = model_ft.wv['JotLore']
print(vector_nlp_ft)
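Continuing from the snippet above, the real payoff of FastText is out-of-vocabulary handling: because vectors are composed from character n-grams, even a word that never appeared in training still gets an embedding.

# 'JotLores' never occurs in the training sentences, yet FastText
# builds a vector for it from its character n-grams
oov_vector = model_ft.wv["JotLores"]
print(oov_vector.shape)

# Similarity between a known word and its unseen variant
print(model_ft.wv.similarity("JotLore", "JotLores"))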
Please note that while these snippets demonstrate the usage, real-world applications involve much larger datasets to effectively capture the intricacies of language. Ensure that you have the necessary library installed (gensim in this case) using pip or conda before you run the code.
Word embeddings are a cornerstone of modern Natural Language Processing. Their ability to capture semantic relationships and contextual nuances makes them the foundation for a wide range of downstream tasks, from classification and search to machine translation.
The source code for all the examples discussed is readily available on GitHub. Dive in, experiment, and enhance your practical understanding by accessing the real-time code snippets. Happy coding! View Source Code in GitHub
Next: Deep Learning in NLP: From RNNs to Transformers