A Short Introduction to Using Word2Vec for Text Classification

Machine learning applications for natural language are an extremely important tool in the data scientist’s toolbox. Use cases include auto-detecting the language of a website, detecting spam in your spam filter, and auto-completing search queries. When you’re working with text data, an important use case is text classification, where the data scientist is tasked with creating an algorithm that can figure out what a piece of text is all about (what its tag should be) based on what is written in the document. This can be used in a myriad of examples we see every day, tagging things such as blog articles, app descriptions, and reviews.

In many cases traditional text classification can be difficult to scale, because as the number of classes in the taxonomy increases, the amount of training data required increases as well. Moreover, with taxonomy counts in the thousands or tens of thousands, it can become increasingly expensive to gather a sufficient volume of labeled text examples for each taxonomic class.

One solution to this problem is to move to Word2Vec for the processing of your unstructured text data. Word2Vec (W2V) is an algorithm that takes every word in your vocabulary—that is, the text you are classifying—and turns it into a unique vector that can be added, subtracted, and manipulated in other ways just like a vector in space.

At a high level, the W2V embedding of your vocabulary into a vector space is a kind of “side effect” of building certain neural net algorithms designed to do tasks like autocompletion or predicting likely adjacent words in a document. As the neural net “reads” through document after document, learning how to represent the vocabulary in a format that it can process in its “hidden layer(s)” in order to predict the most likely missing words, the algorithm learns something about the relations that the terms in the vocabulary have with respect to one another, based on the frequencies with which they occur together. These patterns end up encoded in a matrix that, after enough training, is able to map any word in the vocabulary to a vector in a much lower-dimensional vector space.
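
To make this concrete, here is a minimal sketch of training such an embedding with the gensim library (the toy corpus and parameter values below are purely illustrative, and gensim 4.x is assumed):

    # Minimal Word2Vec training sketch using gensim (assumes gensim 4.x).
    # The corpus below is a toy stand-in for your own tokenized documents.
    from gensim.models import Word2Vec

    corpus = [
        ["the", "king", "ruled", "the", "kingdom"],
        ["the", "queen", "ruled", "the", "kingdom"],
        ["the", "man", "walked", "to", "town"],
        ["the", "woman", "walked", "to", "town"],
    ]

    model = Word2Vec(
        sentences=corpus,
        vector_size=100,  # dimensionality of the embedding space
        window=5,         # how many neighboring words count as context
        min_count=1,      # keep even rare words in this toy example
        sg=1,             # 1 = skip-gram, 0 = CBOW
    )

    # Every word in the vocabulary now maps to a dense vector.
    print(model.wv["king"].shape)  # (100,)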

Once embedded, these word-vectors end up displaying very interesting relationships with one another. Since vectors can be added and subtracted, we can ask questions by creating word vector equations, like what happens if we add and subtract the word vectors for 'King,' 'Man,' and 'Woman' as follows:

'King' − 'Man' + 'Woman' ≈ ?
When you take the vector for 'King' and add to it the difference vector produced by subtracting the 'Man' vector from the 'Woman' vector, the resulting vector turns out to land remarkably close to the word embedding for the term 'Queen.'
This works because what the neural network learned about the relative frequencies of terms ends up encoded in the W2V matrix. Analogous relationships, like the difference in relative occurrences between 'Man' and 'Woman,' end up matching the difference in relative occurrences between 'King' and 'Queen' in ways that the W2V embedding captures.
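
You can check this analogy directly in code against any pretrained word embedding. Here is a hedged sketch using gensim's downloader; the model name is one of the standard pretrained options, and the exact nearest neighbor and similarity score will vary with the embedding used:

    # Query the King - Man + Woman analogy against a pretrained embedding.
    # Downloading the model requires an internet connection on first use.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")  # pretrained word vectors

    # 'king' - 'man' + 'woman' ≈ ?
    result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
    print(result)  # typically something like [('queen', 0.77)]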

Doc2Vec is an application of Word2Vec that expands the tool to be used on entire documents, such as an article. In its simplest form, “naive” Doc2Vec takes the Word2Vec vectors of every word in your text and aggregates them together by taking a normalized sum or arithmetic mean of the terms. As you add your word vectors together over and over again, most of the terms will only show up as noise and cancel each other out, a random walk, so to speak. But while the walk is mostly random, it will actually have a bit of drift. By looking at that drift (the aggregate direction of the text’s vectors), you end up getting the overall topic direction.
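
A minimal sketch of this naive aggregation is below, assuming the pretrained vectors object from the previous snippet (any gensim KeyedVectors would do); the naive_doc2vec helper is just an illustrative name:

    import numpy as np

    def naive_doc2vec(tokens, keyed_vectors):
        """Average the word vectors of all in-vocabulary tokens in a document."""
        in_vocab = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
        if not in_vocab:
            return np.zeros(keyed_vectors.vector_size)
        return np.mean(in_vocab, axis=0)

    doc = "it was the best of times it was the worst of times".split()
    doc_vector = naive_doc2vec(doc, vectors)  # one vector for the whole document
    print(doc_vector.shape)  # (100,)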

For example: we might imagine averaging all the words in the book A Tale of Two Cities. If you convert the entire text to Word2Vec vectors, the direction of the resulting single aggregate vector will drift towards embedded concepts such as “class struggle,” representing the major theme of the book.  

In this way Doc2Vec allows the data scientist to represent entire documents as single vectors, while retaining much (if not all) of the semantic information about kings, queens, etc. that might otherwise be only sparsely encoded in a word-frequency-count representation of one's documents. Moreover, this low-dimensional representation can have great advantages in avoiding the curse-of-dimensionality pitfalls that a data scientist would otherwise have to struggle with.

By leveraging even naive Doc2Vec techniques, a data scientist has a method for classifying a host of text samples, enabling her to cheaply and efficiently classify anything from blog articles to social media posts to app descriptions and more.
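
Putting the pieces together, here is a hypothetical end-to-end sketch: each document becomes its averaged word vector, and an ordinary classifier is trained on top. The texts and labels are made up, and naive_doc2vec and vectors are reused from the earlier snippets:

    # Hypothetical classification pipeline on top of naive Doc2Vec features.
    # Reuses naive_doc2vec() and the pretrained `vectors` from the sketches above.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    texts = [
        "a great new photo sharing app for your phone",
        "this app helps you track your daily runs",
        "breaking news on the latest election results",
        "the senate passed a new budget bill today",
    ]
    labels = ["app", "app", "news", "news"]

    X = np.vstack([naive_doc2vec(t.split(), vectors) for t in texts])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    new_doc = naive_doc2vec("an app for editing photos".split(), vectors)
    print(clf.predict(new_doc.reshape(1, -1)))  # hopefully ['app']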

 

This article was written with Bo Moore and originally posted on the Galvanize blog.

Paulo de Assis Nascimento

Consultant for System Integration at T-Systems do Brasil

6 years ago

Very good explanation of how to put it all together with word2vec. I didn't know this. That's why I have only used doc2vec so far, but it has no parameter to set whether I want to use skip-gram or CBOW.

William B. Claster

Associate Professor of Data Science, School of Engineering at Northeastern University | PhD in Data Science, Prof. Emeritus: APU, formerly at Waseda and Sophia

7 years ago

Thanks Mike. Simple, clear and suggestive.

Satya Prasad Gunnam

Support Engineering Leader | Operations & Strategy | Generative AI Thought Leader

8 years ago

Thanks Mike. Just came across this while reading about Doc2Vec. I wanted to know your thoughts on using Doc2Vec for finding related cases/articles when a new case is created in the context of technical support (a case being a technical support case created by a customer/user). The idea is to see if we can provide the customer/support engineer with relevant/related articles by using Doc2Vec instead of them searching on keywords, so that it is automated (no manual search) and the context/semantic meaning is used.
