ML : Recommendation System
Objective
We will try to find the similarity of a new document with the existing documents in the corpus. There are Python libraries that can solve this problem in a few lines of code, and similar libraries exist in other programming languages too. But we will not use any of these; we will solve the problem with our own implementation, just to understand the maths behind it. Use an existing, tested library for production code.
Maths Refresher
We will be using a bit of maths to do it. Below is just a refresher of the maths concepts used in this problem.
Vector
For a layman, consider a vector as a list of numbers. You might want to see how a vector can be represented in a matrix: it is either a row vector or a column vector.
Multiplication of 2 Vectors
If we have two vectors as below (using 3-dimensional space; it is the same for any number of dimensions)
v1 = (a1, b1, c1)
v2 = (a2, b2, c2)
then the multiplication (dot product) of v1 and v2 is
v1 · v2 = a1*a2 + b1*b2 + c1*c2
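In Python, a from-scratch sketch of this could look as below (the function name dot is my own):

```python
def dot(v1, v2):
    # a1*a2 + b1*b2 + c1*c2 for two equal-length vectors
    return sum(x * y for x, y in zip(v1, v2))

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```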
Length of a Vector
The length of a vector, e.g. v1, which is represented by ||v1||, is √(a1² + b1² + c1²).
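And a matching hand-rolled sketch for the length (again, an illustration, not a library call):

```python
import math

def length(v):
    # ||v|| = sqrt(a^2 + b^2 + c^2)
    return math.sqrt(sum(x * x for x in v))

print(length([3, 4]))  # sqrt(9 + 16) = 5.0
```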
Cosine Similarity Between Two Vectors
(There are other ways than cosine to find similarity, e.g. Euclidean distance.)
Vector And Document
We talk about vectors below because we will convert each document into a vector. The vector similarity discussed in this section will then be used to measure document similarity once we have converted documents into vectors.
Projection Of A Vector Over Another Vector
We will talk about the projection of a vector over another vector. The larger the angle between two vectors, the smaller the projection of one onto the other.
Consider the three images above.
First image - There is a 90° angle between the two vectors, so they are entirely different vectors with no similarity at all. There is no projection of the vertical line onto the horizontal line.
Second image - The second line has some projection onto the horizontal line, so there is more similarity between the vectors than in the first image.
Third image - The second line has even more projection onto the horizontal line, hence more similarity than in the second image.
If both vectors overlap each other, the angle is zero and the projection is 100%, i.e. these are duplicate documents.
Now consider this: which function
- results in 1 (100% similarity) when the angle is 0, and
- results in 0 (0% similarity) when the angle is 90°?
It is “cosine”.
Cosine Similarity Equation
Cosine similarity between 2 vectors v1 and v2 is measured by this equation:
cosine(v1, v2) = (v1 · v2) / (||v1|| * ||v2||)
v1 · v2 = dot product of the vectors. We discussed it above.
||v1|| * ||v2|| = multiplication of the lengths of the 2 vectors. This too we discussed above.
We can also write the above equation as below for cosine similarity:
cosSim(v1, v2) = (a1*a2 + b1*b2 + c1*c2) / ( √(a1² + b1² + c1²) * √(a2² + b2² + c2²) )
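Putting the dot product and length together, a minimal hand-rolled cosine similarity could look like this (cos_sim is my own name for it):

```python
import math

def cos_sim(v1, v2):
    dot = sum(x * y for x, y in zip(v1, v2))
    len1 = math.sqrt(sum(x * x for x in v1))
    len2 = math.sqrt(sum(x * x for x in v2))
    return dot / (len1 * len2)

print(cos_sim([1, 0], [0, 1]))  # 0.0 -> 90 degrees, no similarity
print(cos_sim([1, 2], [2, 4]))  # ~1.0 -> same direction, duplicates
```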
Problem Statement
Let’s assume we already have the below 3 documents in the corpus.
Doc1 = carrot, spinach, onion, ginger
Doc2 = carrot, ice cream, cereals, bread
Doc3 = eggs, chicken, ginger, bread
Now we have a 4th document:
Doc4 = carrot, spinach, onion, chicken
Problem - We want to match Doc4 against the existing documents in the corpus and find out which corpus document it has maximum similarity with.
(We can use the same concept for page ranking, to suggest “users who added x also added y” in a cart, or to flag “a similar document already exists” in other business scenarios.)
Concepts - Tf - Idf
Tf - Term Frequency
Term frequency is the number of times a term appears in a document. Let's draw a table of raw term counts for the above input documents:

Term        Doc1   Doc2   Doc3
carrot      1      1      0
spinach     1      0      0
onion       1      0      0
ginger      1      0      1
ice cream   0      1      0
cereals     0      1      0
bread       0      1      1
eggs        0      0      1
chicken     0      0      1
Term Frequency Adjusted for Document Length
To give equal importance to large and small documents, we use term frequency adjusted for document length. Hence we can define term frequency as
tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
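A minimal sketch of this tf definition, treating a document as a plain list of terms (the names are mine):

```python
def tf(term, doc):
    # doc is a list of terms; tf = count of term / total terms in doc
    return doc.count(term) / len(doc)

doc1 = ["carrot", "spinach", "onion", "ginger"]
print(tf("carrot", doc1))  # 1/4 = 0.25
```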
We will rewrite the above table as below. Zero divided by any number is zero, hence the zero cells stay zero.
(Note - In this example, tf adjusted for document length might not be strictly needed, as all document lengths are the same.)
Tf for terms (each document has 4 terms, so every present term gets tf = 1/4 = 0.25):

Term        Doc1   Doc2   Doc3
carrot      0.25   0.25   0
spinach     0.25   0      0
onion       0.25   0      0
ginger      0.25   0      0.25
ice cream   0      0.25   0
cereals     0      0.25   0
bread       0      0.25   0.25
eggs        0      0      0.25
chicken     0      0      0.25
Idf - Inverse Document Frequency
There are many words which are repeated everywhere and won't be helpful in determining cosine similarity, e.g. “is”, “the”, etc.
Idf is a measure of how important a term is. It weighs down frequent terms and scales up rare terms. The formula for calculating idf for a term t is
idf(t) = log(total number of documents / number of documents having term t in them)
We will use log of base 2. Let's calculate idf for each term in the input documents.
Idf for terms
Total number of documents we have in the corpus = 3
Carrot (appears in 2 documents) = log2(3/2) = 0.585
Spinach (appears in 1 document) = log2(3/1) = 1.585
Onion (appears in 1 document) = log2(3/1) = 1.585
Ginger (appears in 2 documents) = log2(3/2) = 0.585
Ice cream (appears in 1 document) = log2(3/1) = 1.585
Cereals (appears in 1 document) = log2(3/1) = 1.585
Bread (appears in 2 documents) = log2(3/2) = 0.585
Eggs (appears in 1 document) = log2(3/1) = 1.585
Chicken (appears in 1 document) = log2(3/1) = 1.585
We can draw the idf table like this for the corpus documents. It is just to help us visualize:

Term        idf
carrot      0.585
spinach     1.585
onion       1.585
ginger      0.585
ice cream   1.585
cereals     1.585
bread       0.585
eggs        1.585
chicken     1.585
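A small sketch that reproduces these idf values (the corpus list and function name are mine):

```python
import math

corpus = [
    ["carrot", "spinach", "onion", "ginger"],
    ["carrot", "ice cream", "cereals", "bread"],
    ["eggs", "chicken", "ginger", "bread"],
]

def idf(term, docs):
    # log2(total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / containing)

print(round(idf("carrot", corpus), 3))   # log2(3/2) = 0.585
print(round(idf("spinach", corpus), 3))  # log2(3/1) = 1.585
```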
Tf-Idf
As is obvious, tf-idf is “term frequency - inverse document frequency”.
It is the frequency of a term in a document, weighted down by the number of documents in which the term occurs.
tf-idf(t, d) = tf(t, d) * idf(t)
Tf-idf table (tf * idf; e.g. carrot in Doc1: 0.25 * 0.585 = 0.14625):

Term        Doc1      Doc2      Doc3
carrot      0.14625   0.14625   0
spinach     0.39625   0         0
onion       0.39625   0         0
ginger      0.14625   0         0.14625
ice cream   0         0.39625   0
cereals     0         0.39625   0
bread       0         0.14625   0.14625
eggs        0         0         0.39625
chicken     0         0         0.39625
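A short sketch that builds the tf-idf vector of Doc1 over the whole vocabulary. Note it reproduces the table values up to rounding (the table rounds idf to 3 decimals before multiplying):

```python
import math

corpus = [
    ["carrot", "spinach", "onion", "ginger"],
    ["carrot", "ice cream", "cereals", "bread"],
    ["eggs", "chicken", "ginger", "bread"],
]
vocab = ["carrot", "spinach", "onion", "ginger",
         "ice cream", "cereals", "bread", "eggs", "chicken"]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    idf = math.log2(len(docs) / sum(1 for d in docs if term in d))
    return tf * idf

# tf-idf vector for Doc1 over the whole vocabulary
print([round(tf_idf(t, corpus[0], corpus), 5) for t in vocab])
# [0.14624, 0.39624, 0.39624, 0.14624, 0.0, 0.0, 0.0, 0.0, 0.0]
```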
How - Summary
How are we going to find the similarity of doc4 with the existing doc1, doc2 and doc3?
We will find the frequency of each word in each document and weigh each of these words down using idf.
After this, we will have a vector for each document consisting of the tf-idf of each word. We will also create the tf-idf vector of the new document.
Now all we have to do is find the cosine similarity of the new document's vector with each of the existing ones (doc1, doc2 and doc3). Whichever gives the maximum similarity is the closest to doc4. A runnable sketch of this whole flow follows below.
In the sections after that, a lot of maths is used. If you are not in the mood for maths, feel free to stop reading here, as you have already got the concept.
If you wish to continue, then if possible use a notebook and pen to validate each step below and get hands-on experience.
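For readers who prefer code to hand calculation, here is a minimal end-to-end sketch of the whole flow under the same assumptions (all names are mine; as noted in the objective, a real system would use a tested library):

```python
import math

# Corpus documents (from the problem statement) and the new document.
corpus = {
    "doc1": ["carrot", "spinach", "onion", "ginger"],
    "doc2": ["carrot", "ice cream", "cereals", "bread"],
    "doc3": ["eggs", "chicken", "ginger", "bread"],
}
doc4 = ["carrot", "spinach", "onion", "chicken"]

# Vocabulary: every term seen in the corpus, in a fixed order.
vocab = sorted({term for doc in corpus.values() for term in doc})

def tf(term, doc):
    # Term frequency adjusted for document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log2(total documents / documents containing the term).
    containing = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / containing) if containing else 0.0

def tfidf_vector(doc, docs):
    return [tf(t, doc) * idf(t, docs) for t in vocab]

def cos_sim(v1, v2):
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = list(corpus.values())
v4 = tfidf_vector(doc4, docs)  # idf is always taken w.r.t. the corpus
for name, doc in corpus.items():
    print(name, round(cos_sim(tfidf_vector(doc, docs), v4), 2))
# doc1 0.8
# doc2 0.05
# doc3 0.37
```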
Find Similarity Of New Doc With Existing Ones
The problem at hand is to find which existing document is most similar to this new document:
Doc4= carrot, spinach, onion, chicken
Tf for new document
Let's calculate tf for this document. Doc4 has 4 terms, each appearing once, so tf = 1/4 = 0.25 for each of carrot, spinach, onion and chicken (and 0 for every other term).
Tf-Idf for new document
Calculate idf with respect to the existing corpus; we reuse the idf values already computed from the corpus:
carrot = 0.25 * 0.585 = 0.14625
spinach = 0.25 * 1.585 = 0.39625
onion = 0.25 * 1.585 = 0.39625
chicken = 0.25 * 1.585 = 0.39625
(all other terms have tf-idf = 0)
Find Similarity
Remember the cosine similarity formula. Pasting it here again for reference:
cosSim(v1, v2) = (a1*a2 + b1*b2 + c1*c2) / ( √(a1² + b1² + c1²) * √(a2² + b2² + c2²) )
We will calculate the cosine similarity of doc4 with each of doc1, doc2 and doc3.
cosSim(doc4, doc1)
Note - square of 0 is 0.
cosSim(doc4, doc1) =
(0.14625*0.14625 + 0.39625*0.39625 + 0.39625*0.39625 + 0.14625*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0.39625)
/
( √(0.14625² + 0.39625² + 0.39625² + 0.14625² + 0 + 0 + 0 + 0 + 0) * √(0.14625² + 0.39625² + 0.39625² + 0 + 0 + 0 + 0 + 0 + 0.39625²) )
= 0.335 / (0.6 * 0.7) = 0.335 / 0.42 ≈ 0.80
So, cosSim(doc4, doc1) ≈ 0.80
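As a quick cross-check of this arithmetic, the same number falls out of a few lines of Python (vectors written in the corpus term order used above):

```python
import math

def cos_sim(v1, v2):
    dot = sum(x * y for x, y in zip(v1, v2))
    return dot / (math.sqrt(sum(x * x for x in v1)) *
                  math.sqrt(sum(x * x for x in v2)))

# Term order: carrot, spinach, onion, ginger, ice cream,
# cereals, bread, eggs, chicken
doc1 = [0.14625, 0.39625, 0.39625, 0.14625, 0, 0, 0, 0, 0]
doc4 = [0.14625, 0.39625, 0.39625, 0, 0, 0, 0, 0, 0.39625]

print(cos_sim(doc4, doc1))  # ~0.80
```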
cosSim(doc4, doc2)
Note
- square of 0 is 0
- 0 multiplied by any number is zero. Hence below 0 * anyNum is just marked as 0.
cosSim(doc4, doc2) =
(0.14625*0.14625 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0)
/
( √(0.14625² + 0 + 0 + 0 + 0.39625² + 0.39625² + 0.14625² + 0 + 0) * √(0.14625² + 0.39625² + 0.39625² + 0 + 0 + 0 + 0 + 0 + 0.39625²) )
= 0.021 / (0.6 * 0.7) = 0.021 / 0.42 = 0.05
So, cosSim(doc4, doc2) = 0.05
cosSim(doc4, doc3)
Note
- square of 0 is 0
- 0 multiplied by any number is zero. Hence below 0 * anyNum is just marked as 0.
cosSim(doc4, doc3) =
(0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0.39625*0.39625)
/
( √(0 + 0 + 0 + 0.14625² + 0 + 0 + 0.14625² + 0.39625² + 0.39625²) * √(0.14625² + 0.39625² + 0.39625² + 0 + 0 + 0 + 0 + 0 + 0.39625²) )
= 0.157 / (0.6 * 0.7) = 0.157 / 0.42 ≈ 0.37
So, cosSim(doc4, doc3) ≈ 0.37
Conclusion
cosSim(doc4, doc1) ≈ 0.80
cosSim(doc4, doc2) = 0.05
cosSim(doc4, doc3) ≈ 0.37
Hence Doc4 is most similar to Doc1, compared to Doc2 and Doc3.
PS - I have written this in one sitting and did the maths myself. If there is a mistake, feel free to point it out. Thanks.