Word2vec - Word embeddings used in NLP
This isn't a "This Week in Machine Learning" (TWiML) article, but one to demystify, for myself (a complete novice in this space) and hopefully a few others, some of the jargon floating around. Of late, alongside the commonly used phrases like "neural network", "convolutional neural net", "natural language processing", etc…. I came across the phrase "word2vec/word embedding".
What is "Word2vec" and how is it applicable in the world of AI? This is super cool to begin with.
As I understand it, the original idea comes from linguists (credit to J. R. Firth, back in the 1950s), although "Word2vec" itself is fairly new (credit to Google, in 2013). The basic idea driving it is that "you shall know a word by the company it keeps"!
If you want to know what a word means, look at the context… simple :)… it gives the clue… well, this is how I teach my son… simple for a human… but not for a computer.
Word2vec takes advantage of this by saying the "embedding" of a word is defined by the context it appears in. Words appearing in similar contexts are related, and so will end up with similar vectors across a corpus.
For example (try filling in the blank):
I went for a walk ______
The answer could be "yesterday" or "outside"… but the main takeaway is that it's the context which drives it here :)
Could this also help with guessing when to use singular vs plural, or present vs past tense?
Step back, why vectors?
OK, everything we have talked about so far is around words, and we want computers to extract meaning from a huge text blob. For a computer, we need a numerical representation of this input data, as we know computers work with numbers. So, effectively, word embedding converts words into vectors: a word such as "yesterday" might be represented as, say, 64 numbers (the number of dimensions is a design choice). Each word is reduced to a vector, and for word embeddings to work we need related words to be close to each other, for example "yesterday" and "today".
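Just to make "close to each other" concrete, here is a tiny sketch. The numbers below are made up purely for illustration (real embeddings are learned and usually have 50-300 dimensions); "close" is usually measured with cosine similarity:

```python
import numpy as np

# Made-up 4-dimensional "embeddings", purely for illustration;
# real word2vec vectors are learned and typically much longer.
vectors = {
    "yesterday": np.array([0.9, 0.1, 0.4, 0.8]),
    "today":     np.array([0.8, 0.2, 0.5, 0.7]),
    "banana":    np.array([0.1, 0.9, 0.2, 0.1]),
}

def cosine_similarity(a, b):
    # Close to 1.0 means the vectors point the same way; lower means less related.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors["yesterday"], vectors["today"]))   # high
print(cosine_similarity(vectors["yesterday"], vectors["banana"]))  # lower
```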
To take this to the next level: if we take the vector for man, subtract the vector for woman, and add the vector for queen, would it land on the vector for king?

MAN - WOMAN + QUEEN ≈ KING
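If you want to try this yourself, the gensim library can download pre-trained word vectors for you. A rough sketch, assuming you're happy to use one of its smaller downloadable sets (the exact top result depends on which vectors you load):

```python
import gensim.downloader as api

# Downloads a small set of pre-trained GloVe vectors on first run;
# the larger "word2vec-google-news-300" set is also available.
vectors = api.load("glove-wiki-gigaword-50")

# man - woman + queen ~= king, rearranged into gensim's
# "positive" and "negative" terms.
result = vectors.most_similar(positive=["man", "queen"], negative=["woman"], topn=1)
print(result)  # with most pre-trained sets, expect something like [('king', ...)]
```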
This is where word2vec comes in, as one way of learning word embeddings. It's essentially a shallow neural network.
At a high level, you feed in a word, which gets mapped to a vector (the word embedding), and the output the network is trained to produce is a context word.
For instance, if we feed "walk" from our example above into word2vec, the output we would expect is "yesterday" or "today" or "outside", depending on the corpus it was trained on.
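To make that concrete, the skip-gram flavour of word2vec literally turns sentences into (input word, context word) pairs with a small sliding window. A minimal sketch, assuming our example sentence was completed as "I went for a walk outside yesterday":

```python
# Build (target, context) training pairs the way skip-gram does,
# using a window of 2 words on either side of the target.
sentence = "i went for a walk outside yesterday".split()
window = 2

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# Pairs where the input word is "walk": the network is trained to
# predict each of these context words when it sees "walk".
print([p for p in pairs if p[0] == "walk"])
# [('walk', 'for'), ('walk', 'a'), ('walk', 'outside'), ('walk', 'yesterday')]
```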
Effectively, Word2vec looks for word embeddings with similar values (i.e. vectors that are close together) to find the output context word.
The beauty of this is that you needn't label anything in your text to get a result, so it's basically unsupervised.
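In practice that means you hand a library raw, unlabelled sentences and it learns the embeddings on its own. A minimal sketch with gensim (the toy corpus here is far too small to learn anything meaningful; real training needs lots of text):

```python
from gensim.models import Word2Vec

# Toy corpus: just tokenised sentences, no labels of any kind.
sentences = [
    "i went for a walk yesterday".split(),
    "i went for a walk outside".split(),
    "she went for a run today".split(),
]

# vector_size is the embedding dimension (the 64 mentioned above),
# window is the context size, min_count=1 keeps every word.
model = Word2Vec(sentences, vector_size=64, window=2, min_count=1, epochs=50)

print(model.wv["yesterday"].shape)          # (64,) - one vector per word
print(model.wv.most_similar("yesterday"))   # neighbours by cosine similarity
```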
Just to add, besides "Word2vec" there are other ways to generate word embeddings, many of them based on a "co-occurrence matrix" (GloVe, linked below, is a well-known example). OK, this is where I personally needed to have paid more attention at school to linear algebra and matrices. Not going into it… duh!… but matrix decomposition is at the heart of it: Singular Value Decomposition, gradient descent, etc. to get the word vectors.
Each word is represented as a row and each context word as a column of the matrix. Recommender systems use the same mathematical model (items as rows vs users as columns, with ratings as the entries).
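For the curious, here is roughly what that count-based route can look like, as a sketch only on a toy corpus (GloVe itself does something more sophisticated than a plain SVD):

```python
import numpy as np

sentences = [
    "i went for a walk yesterday".split(),
    "i went for a walk outside".split(),
]

# Word-by-context co-occurrence matrix: rows are words, columns are
# context words, entries count appearances within a +/-2 word window.
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))

window = 2
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if j != i:
                M[index[w], index[s[j]]] += 1

# Truncated SVD: keep the top-k singular vectors as dense word vectors.
k = 3
U, S, Vt = np.linalg.svd(M)
word_vectors = U[:, :k] * S[:k]
print(word_vectors[index["yesterday"]])  # a 3-dimensional vector for "yesterday"
```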
I hope this has helped someone out there; going any further would turn this into a technical/mathematical article, and I am not the right person to write that one. Below are some links.
Word2vec paper: https://arxiv.org/abs/1301.3781
GloVe paper: https://nlp.stanford.edu/pubs/glove.pdf
GloVe webpage: https://nlp.stanford.edu/projects/glove/