Measuring Text Similarity in Python
Note: This article has been taken from a post on my blog.
A while ago, I shared a paper on LinkedIn that talked about measuring similarity between two text strings using something called Word Mover's Distance (WMD). The paper can be found here.
In this post, I'll talk about different methods to calculate similarity between text strings. In general, computers can't understand text the way they understand numbers, so the text needs to be converted to vectors, which are then used for most text-based functions.
Metric Type I
## example in Python 2.7.11 (required modules sklearn, pandas)
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> import pandas as pd
## initialize TfidfVectorizer. More can be read at
## https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer
>>> tfidf = TfidfVectorizer()
>>> string = 'This is a small sentence to show how text is converted to vector representation'
>>> y = tfidf.fit_transform([string])
## fit_transform returns a sparse matrix (the sparsity is not visible here, but it will be with a large corpus)
## to know how these tfidf values are created, please google; this has
## two components - tf and idf
>>> y.toarray()
array([[ 0.24253563, 0.24253563, 0.48507125, 0.24253563, 0.24253563,
0.24253563, 0.24253563, 0.24253563, 0.24253563, 0.48507125,
0.24253563]])
## look at the words in vocabulary and their indices corresponding to the array
## above
>>> tfidf.vocabulary_
{u'show': 5, u'this': 8, u'text': 7, u'is': 2, u'sentence': 4, u'to': 9, u'vector': 10, u'how': 1, u'small': 6, u'representation': 3, u'converted': 0}
## get the feature names with the correct indices
>>> tfidf.get_feature_names()
[u'converted', u'how', u'is', u'representation', u'sentence', u'show', u'small', u'text', u'this', u'to', u'vector']
## convert the tfidf vector to a pandas dataframe
>>> df = pd.DataFrame(y.toarray(), columns = tfidf.get_feature_names())
## print the dataframe
>>> df
converted how is representation sentence show \
0 0.242536 0.242536 0.485071 0.242536 0.242536 0.242536
small text this to vector
0 0.242536 0.242536 0.242536 0.485071 0.242536
The small code above shows how to convert a string to a vector representation, which can then be fed to machine learning algorithms. However, there is a downside to this representation: the vectors don't convey the order of the words in the sentence, meaning that even if the words are shuffled, the vector representation remains the same. Imagine each sentence as a point in an N-dimensional space, just as we have points in 2D or 3D space.
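The order-invariance of bag-of-words representations can be shown with a quick standard-library sketch, using collections.Counter as a stand-in for a count-based vectorizer:

```python
from collections import Counter

# Two sentences with the same words in a different order
s1 = "this is a small sentence"
s2 = "sentence small a is this"

# A bag-of-words representation only counts words, ignoring order
bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1 == bow2)  # True: shuffling the words leaves the representation unchanged
```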
Now, using the above vector representation, there are different ways in which similarities between two strings could be calculated:
- Cosine - It measures the cosine of the angle between two vectors, computed as the dot product of the vectors divided by the product of their magnitudes. Just as we had a vector representation of one sentence above, other sentences will have their own representations, which are used for the similarity calculation.
- Euclidean - It is the "ordinary" straight-line distance between two points in Euclidean space. As I said before, each vector representation can be thought of as a point in an N-dimensional space, and the distance between two such points gives an idea of how far/near they are relative to other strings.
Other useful metrics include Manhattan, Chebyshev, Minkowski, Jaccard, and Mahalanobis distances. The mathematics for these can be found on sklearn's website.
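The two measures above can be written out in a few lines of plain Python (sklearn's `cosine_similarity` and `euclidean_distances` do the same thing over whole matrices); the toy vectors below are illustrative stand-ins for tf-idf rows:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points in N-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy tf-idf-like vectors for two short sentences
v1 = [1.0, 0.0, 1.0]
v2 = [0.0, 1.0, 1.0]

print(round(cosine_similarity(v1, v2), 2))   # 0.5
print(round(euclidean_distance(v1, v2), 2))  # 1.41
```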
These vector based methods scale really well with the length of the text.
Metric Type II
There exists a fuzzy-matching approach that compares two strings character by character. It has implementations in both R (called fuzzywuzzyR) and Python (called fuzzywuzzy, which is built on the standard-library difflib module). Using this, we can calculate different ratios that give a perspective on the relative similarity of different strings. The following are the ratios that can be calculated:
- Partial token set ratio
- Partial token sort ratio
- Token set ratio
- QRatio
- WRatio
- UQRatio
- UWRatio
Details of each ratio could be read here. However, one thing to keep in mind is these methods don't really scale well with the length of text.
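Under the hood, fuzzywuzzy's ratios build on difflib's SequenceMatcher, so the basic idea can be sketched with the standard library alone (the `simple_ratio` helper below is my own name, not a fuzzywuzzy function):

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Character-level similarity in [0, 1], the building block of fuzzy matchers."""
    return SequenceMatcher(None, a, b).ratio()

# Near-identical strings score close to 1.0; unrelated strings score near 0.0
print(simple_ratio("new york mets", "new york meats"))
print(simple_ratio("new york mets", "atlanta braves"))
```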
Metric Type III
Another way of measuring similarity between text strings is to treat them as sequences. These measures include Levenshtein, Hamming, Jaccard, Sorensen, and more, and the distance package in Python can be used to compute them. Edit distances such as Levenshtein measure the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other, and each kind of edit can be assigned a different weight.
>>> import distance
>>> distance.levenshtein("lenvestein", "levenshtein")
3
>>> distance.hamming("hamming", "hamning")
1
These metrics don't really scale well with the length of the text.
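For readers without the distance package installed, both metrics are short enough to implement directly; this is a plain dynamic-programming sketch, not the package's own code:

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s1 into s2 (dynamic programming)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            insert = current[j] + 1
            delete = previous[j + 1] + 1
            substitute = previous[j] + (c1 != c2)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(levenshtein("kitten", "sitting"))  # 3
print(hamming("hamming", "hamning"))     # 1
```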
Metric Type IV
Lately, word embeddings have been used to calculate the similarity between text strings. Consider two strings - "Trump speaks to the media in Dallas" & "The President greets the press in Texas". All the methods discussed above will convey that these two texts are not similar, but they are. Word embeddings (such as word2vec and GloVe) can successfully convey this information.
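The idea can be sketched with made-up 3-dimensional vectors standing in for trained word2vec/GloVe embeddings; the numbers below are illustrative and not from any real model, but they show how sentences with no words in common can still end up close together:

```python
import math

# Hypothetical 3-d embeddings; a real model (word2vec, GloVe) would learn
# vectors like these from a large corpus, placing related words near each other
embeddings = {
    "trump":  [0.90, 0.10, 0.00], "president": [0.85, 0.15, 0.05],
    "speaks": [0.40, 0.40, 0.20], "greets":    [0.42, 0.38, 0.20],
    "media":  [0.10, 0.90, 0.00], "press":     [0.12, 0.88, 0.02],
    "dallas": [0.00, 0.10, 0.90], "texas":     [0.05, 0.12, 0.88],
}

def sentence_vector(words):
    """Average the word vectors, ignoring words outside the toy vocabulary."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

v1 = sentence_vector("trump speaks to the media in dallas".split())
v2 = sentence_vector("the president greets the press in texas".split())
print(round(cosine(v1, v2), 2))  # close to 1.0, even with no words in common
```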
Code for all the above approaches could be found at my github https://github.com/analyticsbot/machine-learning/tree/master/quora_question_pairs