ML : Recommendation System
Objective
We will try to find the similarity of a new document with the existing documents in the corpus. There are Python libraries that can solve this problem in a few lines of code, and similar libraries exist in other programming languages too. But we will not use any of these; we will solve the problem with our own implementation, just to understand the maths behind it. Use an existing, tested library for production code.
Maths Refresher
We will be using a bit of maths to do it. Below is just a refresher of the maths concepts used in this problem.
Vector
For a layman, consider a vector as a list of numbers. You might want to see how a vector can be represented in a matrix: it is either a row vector or a column vector.
Multiplication of 2 Vectors
If we have two vectors as below (using 3-dimensional space; it is the same for any number of dimensions)
v1 = (a1, b1, c1)
v2 = (a2, b2, c2)
then the multiplication (dot product) of v1 and v2 is
v1 · v2 = a1*a2 + b1*b2 + c1*c2
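In Python, a from-scratch sketch of this could look as below (the function name dot is my own):

```python
def dot(v1, v2):
    # a1*a2 + b1*b2 + c1*c2 for two equal-length vectors
    return sum(x * y for x, y in zip(v1, v2))

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```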
Length of a Vector
The length of a vector, e.g. v1, which is represented by ||v1||, is √(a1² + b1² + c1²).
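And a matching hand-rolled sketch for the length (again, an illustration, not a library call):

```python
import math

def length(v):
    # ||v|| = sqrt(a^2 + b^2 + c^2)
    return math.sqrt(sum(x * x for x in v))

print(length([3, 4]))  # sqrt(9 + 16) = 5.0
```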
Cosine Similarity Between Two Vectors
(There are other ways than cosine to find similarity, e.g. Euclidean distance.)
Vector And Document
We talk about vectors below because we will convert each document into a vector. The vector similarity discussed in this section will then be used to measure document similarity once we have converted documents into vectors.
Projection Of A Vector Over Another Vector
We will talk about the projection of a vector over another vector. The larger the angle between two vectors, the smaller the projection of one onto the other.
Consider the three images above.
First image - There is a 90° angle between the two vectors, so they are entirely different vectors with no similarity at all. There is no projection of the vertical line onto the horizontal line.
Second image - The second line has some projection onto the horizontal line, so there is more similarity between the vectors than in the first image.
Third image - The second line has even more projection onto the horizontal line, hence more similarity than in the second image.
If both vectors overlap each other, the angle is zero and the projection is 100%, i.e. these are duplicate documents.
Now consider this: which function
- results in 1 (100% similarity) when the angle is 0, and
- results in 0 (0% similarity) when the angle is 90°?
It is “cosine”.
Cosine Similarity Equation
Cosine similarity between 2 vectors v1 and v2 is measured by this equation:
cosine(v1, v2) = (v1 · v2) / (||v1|| * ||v2||)
v1 · v2 = dot product of the vectors. We discussed it above.
||v1|| * ||v2|| = multiplication of the lengths of the 2 vectors. This too we discussed above.
We can also write the above equation as below for cosine similarity:
cosSim(v1, v2) = (a1*a2 + b1*b2 + c1*c2) / ( √(a1² + b1² + c1²) * √(a2² + b2² + c2²) )
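Putting the dot product and length together, a minimal hand-rolled cosine similarity could look like this (cos_sim is my own name for it):

```python
import math

def cos_sim(v1, v2):
    dot = sum(x * y for x, y in zip(v1, v2))
    len1 = math.sqrt(sum(x * x for x in v1))
    len2 = math.sqrt(sum(x * x for x in v2))
    return dot / (len1 * len2)

print(cos_sim([1, 0], [0, 1]))  # 0.0 -> 90 degrees, no similarity
print(cos_sim([1, 2], [2, 4]))  # ~1.0 -> same direction, duplicates
```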
Problem Statement
Let’s assume we already have the below 3 documents in the corpus.
Doc1 = carrot, spinach, onion, ginger
Doc2 = carrot, ice cream, cereals, bread
Doc3 = eggs, chicken, ginger, bread
Now we have a 4th document:
Doc4 = carrot, spinach, onion, chicken
Problem - We want to match Doc4 against the existing documents in the corpus and find out which corpus document it has maximum similarity with.
(We can use the same concept for page ranking, to suggest “users who added x also added y” in a cart, or to flag “a similar document already exists” in other business scenarios.)
Concepts - Tf - Idf
Tf - Term Frequency
Term frequency is the number of times a term appears in a document. Let's draw a table of raw term counts for the above input documents:

Term        Doc1   Doc2   Doc3
carrot      1      1      0
spinach     1      0      0
onion       1      0      0
ginger      1      0      1
ice cream   0      1      0
cereals     0      1      0
bread       0      1      1
eggs        0      0      1
chicken     0      0      1
Term Frequency Adjusted for Document Length
To give equal importance to large and small documents, we use term frequency adjusted for document length. Hence we can define term frequency as
tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
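A minimal sketch of this tf definition, treating a document as a plain list of terms (the names are mine):

```python
def tf(term, doc):
    # doc is a list of terms; tf = count of term / total terms in doc
    return doc.count(term) / len(doc)

doc1 = ["carrot", "spinach", "onion", "ginger"]
print(tf("carrot", doc1))  # 1/4 = 0.25
```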
We will rewrite the above table as below. Zero divided by any number is zero, hence the zero cells stay zero.
(Note - In this example, tf adjusted for document length might not be strictly needed, as all document lengths are the same.)
Tf for terms (each document has 4 terms, so every present term gets tf = 1/4 = 0.25):

Term        Doc1   Doc2   Doc3
carrot      0.25   0.25   0
spinach     0.25   0      0
onion       0.25   0      0
ginger      0.25   0      0.25
ice cream   0      0.25   0
cereals     0      0.25   0
bread       0      0.25   0.25
eggs        0      0      0.25
chicken     0      0      0.25
Idf - Inverse Document Frequency
There are many words which are repeated everywhere and won't be helpful in determining cosine similarity, e.g. “is”, “the”, etc.
Idf is a measure of how important a term is. It weighs down frequent terms and scales up rare terms. The formula for calculating idf for a term t is
idf(t) = log(total number of documents / number of documents having term t in them)
We will use log of base 2. Let's calculate idf for each term in the input documents.
Idf for terms
Total number of documents we have in the corpus = 3
Carrot (appears in 2 documents) = log2(3/2) = 0.585
Spinach (appears in 1 document) = log2(3/1) = 1.585
Onion (appears in 1 document) = log2(3/1) = 1.585
Ginger (appears in 2 documents) = log2(3/2) = 0.585
Ice cream (appears in 1 document) = log2(3/1) = 1.585
Cereals (appears in 1 document) = log2(3/1) = 1.585
Bread (appears in 2 documents) = log2(3/2) = 0.585
Eggs (appears in 1 document) = log2(3/1) = 1.585
Chicken (appears in 1 document) = log2(3/1) = 1.585
We can draw the idf table like this for the corpus documents. It is just to help us visualize:

Term        idf
carrot      0.585
spinach     1.585
onion       1.585
ginger      0.585
ice cream   1.585
cereals     1.585
bread       0.585
eggs        1.585
chicken     1.585
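A small sketch that reproduces these idf values (the corpus list and function name are mine):

```python
import math

corpus = [
    ["carrot", "spinach", "onion", "ginger"],
    ["carrot", "ice cream", "cereals", "bread"],
    ["eggs", "chicken", "ginger", "bread"],
]

def idf(term, docs):
    # log2(total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / containing)

print(round(idf("carrot", corpus), 3))   # log2(3/2) = 0.585
print(round(idf("spinach", corpus), 3))  # log2(3/1) = 1.585
```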
Tf-Idf
As is obvious, tf-idf is “term frequency - inverse document frequency”.
It is the frequency of a term in a document, weighted down by the number of documents in which the term occurs.
tf-idf(t, d) = tf(t, d) * idf(t)
Tf-idf table (tf * idf; e.g. carrot in Doc1: 0.25 * 0.585 = 0.14625):

Term        Doc1      Doc2      Doc3
carrot      0.14625   0.14625   0
spinach     0.39625   0         0
onion       0.39625   0         0
ginger      0.14625   0         0.14625
ice cream   0         0.39625   0
cereals     0         0.39625   0
bread       0         0.14625   0.14625
eggs        0         0         0.39625
chicken     0         0         0.39625
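A short sketch that builds the tf-idf vector of Doc1 over the whole vocabulary. Note it reproduces the table values up to rounding (the table rounds idf to 3 decimals before multiplying):

```python
import math

corpus = [
    ["carrot", "spinach", "onion", "ginger"],
    ["carrot", "ice cream", "cereals", "bread"],
    ["eggs", "chicken", "ginger", "bread"],
]
vocab = ["carrot", "spinach", "onion", "ginger",
         "ice cream", "cereals", "bread", "eggs", "chicken"]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    idf = math.log2(len(docs) / sum(1 for d in docs if term in d))
    return tf * idf

# tf-idf vector for Doc1 over the whole vocabulary
print([round(tf_idf(t, corpus[0], corpus), 5) for t in vocab])
# [0.14624, 0.39624, 0.39624, 0.14624, 0.0, 0.0, 0.0, 0.0, 0.0]
```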
How - Summary
How are we going to find the similarity of doc4 with the existing doc1, doc2 and doc3?
We will find the frequency of each word in each document and weigh each of these words down using idf.
After this, we will have a vector for each document consisting of the tf-idf of each word. We will also create the tf-idf vector of the new document.
Now all we have to do is find the cosine similarity of the new document's vector with each of the existing ones (doc1, doc2 and doc3). Whichever gives the maximum similarity is the closest to doc4. A runnable sketch of this whole flow follows below.
In the sections after that, a lot of maths is used. If you are not in the mood for maths, feel free to stop reading here, as you have already got the concept.
If you wish to continue, then if possible use a notebook and pen to validate each step below and get hands-on experience.
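For readers who prefer code to hand calculation, here is a minimal end-to-end sketch of the whole flow under the same assumptions (all names are mine; as noted in the objective, a real system would use a tested library):

```python
import math

# Corpus documents (from the problem statement) and the new document.
corpus = {
    "doc1": ["carrot", "spinach", "onion", "ginger"],
    "doc2": ["carrot", "ice cream", "cereals", "bread"],
    "doc3": ["eggs", "chicken", "ginger", "bread"],
}
doc4 = ["carrot", "spinach", "onion", "chicken"]

# Vocabulary: every term seen in the corpus, in a fixed order.
vocab = sorted({term for doc in corpus.values() for term in doc})

def tf(term, doc):
    # Term frequency adjusted for document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log2(total documents / documents containing the term).
    containing = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / containing) if containing else 0.0

def tfidf_vector(doc, docs):
    return [tf(t, doc) * idf(t, docs) for t in vocab]

def cos_sim(v1, v2):
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

docs = list(corpus.values())
v4 = tfidf_vector(doc4, docs)  # idf is always taken w.r.t. the corpus
for name, doc in corpus.items():
    print(name, round(cos_sim(tfidf_vector(doc, docs), v4), 2))
# doc1 0.8
# doc2 0.05
# doc3 0.37
```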
Find Similarity Of New Doc With Existing Ones
The problem at hand is to find which existing document is most similar to this new document:
Doc4= carrot, spinach, onion, chicken
Tf for new document
Let's calculate tf for this document. Doc4 has 4 terms, each appearing once, so tf = 1/4 = 0.25 for each of carrot, spinach, onion and chicken (and 0 for every other term).
Tf-Idf for new document
Calculate idf with respect to the existing corpus; we reuse the idf values already computed from the corpus:
carrot = 0.25 * 0.585 = 0.14625
spinach = 0.25 * 1.585 = 0.39625
onion = 0.25 * 1.585 = 0.39625
chicken = 0.25 * 1.585 = 0.39625
(all other terms have tf-idf = 0)
Find Similarity
Remember the cosine similarity formula. Pasting it here again for reference:
cosSim(v1, v2) = (a1*a2 + b1*b2 + c1*c2) / ( √(a1² + b1² + c1²) * √(a2² + b2² + c2²) )
We will calculate the cosine similarity of doc4 with each of doc1, doc2 and doc3.
cosSim(doc4, doc1)
Note - square of 0 is 0.
cosSim(doc4, doc1) =
(0.14625*0.14625 + 0.39625*0.39625 + 0.39625*0.39625 + 0.14625*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0.39625)
/
( √(0.14625² + 0.39625² + 0.39625² + 0.14625² + 0 + 0 + 0 + 0 + 0) * √(0.14625² + 0.39625² + 0.39625² + 0 + 0 + 0 + 0 + 0 + 0.39625²) )
= 0.335 / (0.6 * 0.7) = 0.335 / 0.42 ≈ 0.80
So, cosSim(doc4, doc1) ≈ 0.80
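As a quick cross-check of this arithmetic, the same number falls out of a few lines of Python (vectors written in the corpus term order used above):

```python
import math

def cos_sim(v1, v2):
    dot = sum(x * y for x, y in zip(v1, v2))
    return dot / (math.sqrt(sum(x * x for x in v1)) *
                  math.sqrt(sum(x * x for x in v2)))

# Term order: carrot, spinach, onion, ginger, ice cream,
# cereals, bread, eggs, chicken
doc1 = [0.14625, 0.39625, 0.39625, 0.14625, 0, 0, 0, 0, 0]
doc4 = [0.14625, 0.39625, 0.39625, 0, 0, 0, 0, 0, 0.39625]

print(cos_sim(doc4, doc1))  # ~0.80
```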
cosSim(doc4, doc2)
Note
- square of 0 is 0
- 0 multiplied by any number is zero. Hence below 0 * anyNum is just marked as 0.
cosSim(doc4, doc2) =
(0.14625*0.14625 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0)
/
( √(0.14625² + 0 + 0 + 0 + 0.39625² + 0.39625² + 0.14625² + 0 + 0) * √(0.14625² + 0.39625² + 0.39625² + 0 + 0 + 0 + 0 + 0 + 0.39625²) )
= 0.021 / (0.6 * 0.7) = 0.021 / 0.42 = 0.05
So, cosSim(doc4, doc2) = 0.05
cosSim(doc4, doc3)
Note
- square of 0 is 0
- 0 multiplied by any number is zero. Hence below 0 * anyNum is just marked as 0.
cosSim(doc4, doc3) =
(0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0.39625*0.39625)
/
( √(0 + 0 + 0 + 0.14625² + 0 + 0 + 0.14625² + 0.39625² + 0.39625²) * √(0.14625² + 0.39625² + 0.39625² + 0 + 0 + 0 + 0 + 0 + 0.39625²) )
= 0.157 / (0.6 * 0.7) = 0.157 / 0.42 ≈ 0.37
So, cosSim(doc4, doc3) ≈ 0.37
Conclusion
cosSim(doc4, doc1) ≈ 0.80
cosSim(doc4, doc2) = 0.05
cosSim(doc4, doc3) ≈ 0.37
Hence Doc4 is most similar to Doc1, compared to Doc2 and Doc3.
PS - I have written this in one sitting and did the maths myself. If there is a mistake, feel free to point it out. Thanks.