Word Embeddings: Making Text Understandable to Machines

Words hold meaning. When humans read a sentence, we understand more than just the individual words – we comprehend the sentiment, the narrative, the nuances, and the rich tapestry of language. But for a computer, words are just symbols, devoid of any intrinsic meaning. This is where word embeddings come into play.

Introduction to Word Vectors

Before we dive deep into the popular models, it’s essential to understand the foundation: the word vector. At its core, a word vector is a numerical representation of a word. This numeric form allows the machine to use algebraic operations on words, potentially revealing semantic relationships. For example, in a perfect world, a high-quality embedding might allow us to compute “King” - “Man” + “Woman” and obtain a result close to “Queen.”

The principle behind word vectors is that the semantic meaning of a word can be represented by its context. As the adage goes in NLP, “You shall know a word by the company it keeps.” So, words appearing in similar contexts would have vectors close to each other in the embedding space.
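
To make the idea concrete, here is a small sketch using pre-trained GloVe vectors shipped through Gensim's downloader (installing Gensim is covered later in this article). The model name below is just one of several pre-trained options and downloads on first use, so treat it as an illustrative example rather than a prescribed setup:

import gensim.downloader as api

# Download (on first use) and load pre-trained 100-dimensional GloVe vectors
wv = api.load("glove-wiki-gigaword-100")

# Words that appear in similar contexts sit close together in the embedding space
print(wv.most_similar("coffee", topn=3))

# Vector arithmetic: king - man + woman should land near queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))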

Word2Vec

Developed by a team led by Tomas Mikolov at Google, Word2Vec is perhaps the most popular technique to learn word embeddings. It employs neural networks and comes in two primary training algorithms:

  1. Continuous Bag of Words (CBOW): Predicts target words (e.g., ‘apple’) from surrounding context words (‘I ate an … pie’).
  2. Skip-Gram: Does the inverse, predicting the surrounding context words from the target word.

The elegance of Word2Vec lies in its simplicity and scalability, making it one of the first choices for researchers and industry practitioners who need to produce word embeddings.
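
In Gensim, switching between the two training algorithms is a single parameter. Here is a minimal sketch with a toy corpus; the sg flag selects CBOW (0, the default) or Skip-Gram (1):

from gensim.models import Word2Vec

sentences = [["i", "ate", "an", "apple", "pie"],
             ["she", "baked", "an", "apple", "tart"]]

# sg=0 selects CBOW (predict the target word from its context)
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 selects Skip-Gram (predict context words from the target word)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["apple"][:5])      # first few dimensions of the CBOW vector
print(skipgram_model.wv["apple"][:5])  # first few dimensions of the Skip-Gram vector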

GloVe (Global Vectors for Word Representation)

Developed by Stanford, GloVe constructs explicit word-word co-occurrence statistics from massive datasets. The central idea here is to capture the global statistical information of a corpus. GloVe operates on the assumption that the ratios of word co-occurrence probabilities carry meaning.

For instance, a probe word like “solid” co-occurs with “ice” far more often than with “steam”, while “gas” shows the opposite pattern, and an unrelated word like “fashion” co-occurs rarely with either. The ratio of these co-occurrence probabilities therefore reveals which words are meaningfully related to which concept.
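
The following toy sketch illustrates that intuition. The co-occurrence counts below are invented purely for demonstration; real GloVe training estimates them from a massive corpus:

# Hypothetical co-occurrence counts (made up for illustration only)
cooccurrence = {
    "ice":   {"solid": 190, "gas": 2,   "water": 300, "fashion": 1},
    "steam": {"solid": 3,   "gas": 180, "water": 310, "fashion": 1},
}

def prob(word, probe):
    # P(probe | word): how often 'probe' appears near 'word', normalised
    total = sum(cooccurrence[word].values())
    return cooccurrence[word][probe] / total

for probe in ["solid", "gas", "water", "fashion"]:
    ratio = prob("ice", probe) / prob("steam", probe)
    print(f"P({probe}|ice) / P({probe}|steam) = {ratio:.2f}")

# Large ratio  -> probe relates to 'ice' (e.g. 'solid')
# Small ratio  -> probe relates to 'steam' (e.g. 'gas')
# Ratio near 1 -> probe is equally (ir)relevant to both (e.g. 'water', 'fashion')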

FastText

FastText, introduced by Facebook’s AI Research lab, takes a slightly different approach. Unlike Word2Vec, which considers a word as the smallest unit to train on, FastText looks at a level below, at subword units. This method is incredibly useful for morphologically rich languages and words that weren’t seen during training.

Consider the word “apple-ish”. While Word2Vec might treat it as an entirely new word, FastText would recognise that it is related to “apple”, because it can break the word down into subword units that the two share.
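
The helper below is a simplified illustration of that decomposition, not the library's internal code; FastText adds boundary markers and typically uses character n-grams of length 3 to 6:

# A rough sketch of how FastText breaks a word into character n-grams
def char_ngrams(word, n_min=3, n_max=6):
    token = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<app', 'appl', ...]
# Many of these n-grams are shared with 'apple-ish', which is why FastText
# can build a sensible vector for words it never saw during training.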

As we delve deeper into the realms of text analytics and representation, it’s essential to be equipped with the right tools. While NLTK, spaCy, and TextBlob have their strengths and will continue to be staples in the NLP toolkit, Gensim emerges as a powerhouse specifically tailored for semantic modeling on a large scale. Whether you’re looking to create dense word embeddings, discover latent topics in huge text corpora, or embark on other vector space adventures, Gensim might just be the ally you need!

You can install gensim using either pip or conda:

Using pip:

pip install gensim        

Using conda:

If you’re using Anaconda or Miniconda, you can install gensim from the conda-forge channel:

conda install -c conda-forge gensim        

Here is how you can generate word embeddings with Word2Vec, GloVe, and FastText using Gensim:

1. Word2Vec using Gensim

from gensim.models import Word2Vec
sentences = [["I", "love", "JotLore"],
             ["Word", "embeddings", "are", "useful"],
             ["Gensim", "provides", "easy", "tools", "for", "Word2Vec"]]

# Train Word2Vec model
model_w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model_w2v.save("word2vec.model")

# Retrieve the vector for a word that is actually in the training vocabulary
vector_embeddings = model_w2v.wv['embeddings']
print(vector_embeddings)
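
Continuing from the snippet above, you can also query the trained model for similar words and reload it later. With a three-sentence toy corpus the similarity scores are essentially noise; on a realistic corpus they become meaningful:

# Query the trained model for the nearest neighbours of a word
print(model_w2v.wv.most_similar('embeddings', topn=3))

# The saved model can be reloaded later for further training or querying
loaded_w2v = Word2Vec.load("word2vec.model")
print(loaded_w2v.wv['embeddings'][:5])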

2. GloVe using Gensim

To use GloVe embeddings in Python, one common approach is to convert pre-trained GloVe vectors to Word2Vec format and then use Gensim to load and manipulate them.

First, you need to convert GloVe vectors to the Word2Vec format. You can do this using the glove2word2vec script provided by Gensim.

from gensim.scripts.glove2word2vec import glove2word2vec

# Assuming you have downloaded the GloVe vectors and they are stored in 'glove.txt'
glove_input_file = 'glove.txt'
word2vec_output_file = 'glove_word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)

from gensim.models import KeyedVectors

# Load the converted GloVe vectors
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# Note: newer Gensim releases (4.x) can skip the conversion step and load the raw
# GloVe file directly with KeyedVectors.load_word2vec_format('glove.txt', binary=False, no_header=True)
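
Once loaded, the GloVe vectors behave like any other set of Gensim keyed vectors. A brief sketch, assuming the conversion above completed and the queried words exist in the pre-trained vocabulary:

# Nearest neighbours and pairwise similarity from the pre-trained GloVe vectors
print(glove_model.most_similar('ice', topn=5))
print(glove_model.similarity('ice', 'steam'))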

3. FastText using Gensim

from gensim.models import FastText

sentences = [["I", "love", "JotLore"],
             ["Word", "embeddings", "are", "useful"],
             ["Gensim", "provides", "easy", "tools", "for", "FastText"]]

# Train FastText model
model_ft = FastText(sentences, vector_size=100, window=5, min_count=1, workers=4)
model_ft.save("fasttext.model")

# Retrieve vector for 'JotLore'
vector_nlp_ft = model_ft.wv['JotLore']
print(vector_nlp_ft)        
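
Continuing from the snippet above, the real payoff of FastText is handling words that were never seen during training. The misspelled word below is purely illustrative:

# 'JotLores' never appeared in the training sentences, yet FastText can still
# assemble a vector for it from the character n-grams it shares with 'JotLore'
print('JotLores' in model_ft.wv.key_to_index)         # False: not in the vocabulary
vector_oov = model_ft.wv['JotLores']                  # still returns a vector
print(model_ft.wv.similarity('JotLore', 'JotLores'))  # similarity comes from the shared subwords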

Please note that while this demonstrates the usage, real-world applications would involve much larger datasets to effectively capture the intricacies of the language. Ensure that you have the necessary libraries installed (gensim in this case) using pip or conda before you run the code.

Word embeddings are a cornerstone of modern Natural Language Processing. Their ability to capture semantic relationships and contextual nuances makes them a favorite tool in the NLP toolkit. Whether you choose Word2Vec’s neural approach, GloVe’s statistical method, or FastText’s subword magic, the essence remains the same: converting words into numbers with meaning. As we continue to advance in the realm of artificial intelligence, these embeddings will play a pivotal role in helping machines understand us a little bit better.


The source code for all the examples discussed is readily available on GitHub. Dive in, experiment, and deepen your practical understanding by running the code yourself. Happy coding! View Source Code in GitHub

Next: Deep Learning in NLP: From RNNs to Transformers

Previous: Basic Text Representation: Bag of Words & TF-IDF

