A Short Introduction to Using Word2Vec for Text Classification

Machine learning applications for natural language are an extremely important tool in the data scientist’s toolbox. Use cases include auto-detecting the language of a website, detecting spam in your spam filter, and auto-completing search queries. When you’re working with text data, an important use case is text classification, where the data scientist is tasked with creating an algorithm that can figure out what a piece of text is all about (what its tag should be) based on what is written in the document. This can be used in a myriad of examples we see every day, tagging things such as blog articles, app descriptions, and reviews.

In many cases traditional text classification can be difficult to scale, because as the number of classes in the taxonomy increases, the amount of training data required increases as well. Moreover, with taxonomy counts in the thousands or tens of thousands, it can become increasingly expensive to gather a sufficient volume of labeled text examples for each taxonomic class.

One solution to this problem is to move to Word2Vec for the processing of your unstructured text data. Word2Vec (W2V) is an algorithm that takes every word in your vocabulary—that is, the text you are classifying—and turns it into a unique vector that can be added, subtracted, and manipulated in other ways just like a vector in space.

At a high level, the W2V embedding of your vocabulary into a vector space is a kind of “side effect” of building certain neural net algorithms designed to do tasks like autocompletion or predicting likely adjacent words in a document. As the neural net “reads” through document after document, learning how to represent the vocabulary in a format that it can process in its “hidden layer(s)” in order to predict the most likely missing words, the algorithm learns something about the relations that the terms in the vocabulary have with respect to one another, based on the frequencies with which they occur together. These patterns end up encoded in a matrix that, after enough training, is able to map any word in the vocabulary to a vector in a much lower-dimensional vector space.
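
To make this concrete, here is a minimal sketch of training such an embedding with the gensim library (the toy corpus and parameter values below are purely illustrative, and gensim 4.x is assumed):

    # Minimal Word2Vec training sketch using gensim (assumes gensim 4.x).
    # The corpus below is a toy stand-in for your own tokenized documents.
    from gensim.models import Word2Vec

    corpus = [
        ["the", "king", "ruled", "the", "kingdom"],
        ["the", "queen", "ruled", "the", "kingdom"],
        ["the", "man", "walked", "to", "town"],
        ["the", "woman", "walked", "to", "town"],
    ]

    model = Word2Vec(
        sentences=corpus,
        vector_size=100,  # dimensionality of the embedding space
        window=5,         # how many neighboring words count as context
        min_count=1,      # keep even rare words in this toy example
        sg=1,             # 1 = skip-gram, 0 = CBOW
    )

    # Every word in the vocabulary now maps to a dense vector.
    print(model.wv["king"].shape)  # (100,)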

Once embedded, these word-vectors end up displaying very interesting relationships with one another. Since vectors can be added and subtracted, we can ask questions by creating word vector equations, like what happens if we add and subtract the word vectors for 'King,' 'Man,' and 'Woman' as follows:

'King' − 'Man' + 'Woman' ≈ ?
When you take the vector for 'King' and add to it the difference vector produced by subtracting the 'Man' vector from the 'Woman' vector, the resulting vector turns out to land remarkably close to the word embedding for the term 'Queen.'
This works because what the neural network learned about the relative frequencies of terms ends up encoded in the W2V matrix. Analogous relationships, like the difference in relative occurrences between 'Man' and 'Woman,' end up matching the difference in relative occurrences between 'King' and 'Queen' in ways that the W2V embedding captures.
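
You can check this analogy directly in code against any pretrained word embedding. Here is a hedged sketch using gensim's downloader; the model name is one of the standard pretrained options, and the exact nearest neighbor and similarity score will vary with the embedding used:

    # Query the King - Man + Woman analogy against a pretrained embedding.
    # Downloading the model requires an internet connection on first use.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")  # pretrained word vectors

    # 'king' - 'man' + 'woman' ≈ ?
    result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
    print(result)  # typically something like [('queen', 0.77)]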

Doc2Vec is an application of Word2Vec that expands the tool to be used on entire documents, such as an article. In its simplest form, “naive” Doc2Vec takes the Word2Vec vectors of every word in your text and aggregates them together by taking a normalized sum or arithmetic mean of the terms. As you add your word vectors together over and over again, most of the terms will only show up as noise and cancel each other out, a random walk, so to speak. But while the walk is mostly random, it will actually have a bit of drift. By looking at that drift (the aggregate direction of the text’s vectors), you end up getting the overall topic direction.
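
A minimal sketch of this naive aggregation is below, assuming the pretrained vectors object from the previous snippet (any gensim KeyedVectors would do); the naive_doc2vec helper is just an illustrative name:

    import numpy as np

    def naive_doc2vec(tokens, keyed_vectors):
        """Average the word vectors of all in-vocabulary tokens in a document."""
        in_vocab = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
        if not in_vocab:
            return np.zeros(keyed_vectors.vector_size)
        return np.mean(in_vocab, axis=0)

    doc = "it was the best of times it was the worst of times".split()
    doc_vector = naive_doc2vec(doc, vectors)  # one vector for the whole document
    print(doc_vector.shape)  # (100,)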

For example: we might imagine averaging all the words in the book A Tale of Two Cities. If you convert the entire text to Word2Vec vectors, the direction of the resulting single aggregate vector will drift towards embedded concepts such as “class struggle,” representing the major theme of the book.  

In this way Doc2Vec allows the data scientist to represent entire documents as single vectors, while retaining much (if not all) of the semantic information about kings, queens, etc. that might otherwise be only sparsely encoded in a word-frequency-count representation of one's documents. Moreover, this low-dimensional representation can have great advantages in avoiding the curse-of-dimensionality pitfalls that a data scientist would otherwise have to struggle with.

By leveraging even naive Doc2Vec techniques, a data scientist has a method for classifying a host of text samples, enabling her to cheaply and efficiently classify anything from blog articles to social media posts to app descriptions and more.
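
Putting the pieces together, here is a hypothetical end-to-end sketch: each document becomes its averaged word vector, and an ordinary classifier is trained on top. The texts and labels are made up, and naive_doc2vec and vectors are reused from the earlier snippets:

    # Hypothetical classification pipeline on top of naive Doc2Vec features.
    # Reuses naive_doc2vec() and the pretrained `vectors` from the sketches above.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    texts = [
        "a great new photo sharing app for your phone",
        "this app helps you track your daily runs",
        "breaking news on the latest election results",
        "the senate passed a new budget bill today",
    ]
    labels = ["app", "app", "news", "news"]

    X = np.vstack([naive_doc2vec(t.split(), vectors) for t in texts])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    new_doc = naive_doc2vec("an app for editing photos".split(), vectors)
    print(clf.predict(new_doc.reshape(1, -1)))  # hopefully ['app']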

 

This article was written with Bo Moore and originally posted on the Galvanize blog.

Paulo de Assis Nascimento

Consultant for System Integration at T-Systems do Brasil

6 years ago

Very good explanation of how to put it all together with word2vec. I didn't know this. That's why I have only used doc2vec so far, but it has no parameter to set whether I want to use skip-gram or CBOW.

William B. Claster

Associate Professor of Data Science, School of Engineering at Northeastern University | PhD in Data Science, Prof. Emeritus: APU, formerly at Waseda and Sophia

7 years ago

Thanks Mike. Simple, clear and suggestive.

Satya Prasad Gunnam

Support Engineering Leader | Operations & Strategy | Generative AI Thought Leader

8 years ago

Thanks Mike. Just came across this while reading about Doc2Vec. I wanted to know your thoughts on using Doc2Vec for finding related cases/articles when a new case is created in the context of technical support (a case being a technical support case created by a customer/user). The idea is to see if we can provide the customer/support engineer with relevant/related articles by using Doc2Vec instead of them searching on keywords, so that it is automated (no manual search) and the context/semantic meaning is used.
