Measuring Text Similarity in Python

Note: This article has been taken from a post on my blog.

A while ago, I shared a paper on LinkedIn about measuring similarity between two text strings using Word Mover's Distance (WMD). The paper can be found here.

In this post, I'll talk about different methods of calculating similarity between text strings. In general, computers can't understand text the way they understand numbers, so text needs to be converted to vectors, which are then used by most text-based functions.

Metric Type I

## example in Python 2.7.11 (required modules sklearn, pandas)
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> import pandas as pd

## initialize TfidfVectorizer. More can be read at
## https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
>>> tfidf = TfidfVectorizer()
>>> string = 'This is a small sentence to show how text is converted to vector representation'
>>> y = tfidf.fit_transform([string])
## y is in sparse matrix form (not visible here, but in a large corpus it will be)
## tf-idf values have two components - tf (term frequency) and idf (inverse
## document frequency)
>>> y.toarray()
array([[ 0.24253563,  0.24253563,  0.48507125,  0.24253563,  0.24253563,
         0.24253563,  0.24253563,  0.24253563,  0.24253563,  0.48507125,
         0.24253563]])

## look at the words in the vocabulary and their indices corresponding to the
## array above
>>> tfidf.vocabulary_
{u'show': 5, u'this': 8, u'text': 7, u'is': 2, u'sentence': 4, u'to': 9, u'vector': 10, u'how': 1, u'small': 6, u'representation': 3, u'converted': 0}

## get the feature names with the correct indices
>>> tfidf.get_feature_names()
[u'converted', u'how', u'is', u'representation', u'sentence', u'show', u'small', u'text', u'this', u'to', u'vector']
## convert the tfidf vector to a pandas dataframe
>>> df = pd.DataFrame(y.toarray(), columns = tfidf.get_feature_names())

## print the dataframe
>>> df
   converted       how        is  representation  sentence      show  \
0   0.242536  0.242536  0.485071        0.242536  0.242536  0.242536   

      small      text      this        to    vector  
0  0.242536  0.242536  0.242536  0.485071  0.242536

The short code above shows how to convert a string to a vector representation that can then be fed to machine learning algorithms. However, this representation has a downside: the vectors don't convey the order of the words, so even if the words of the sentence are shuffled, the vector representation remains the same. Imagine this sentence as a point in an N-dimensional space, just as we have a point in a 2D or 3D space.
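To see that order-insensitivity concretely, the small sketch below (an illustrative example, not from the original post) vectorizes a sentence and a shuffled copy of it and shows the TF-IDF rows come out identical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

original = 'this is a small sentence to show how text is converted to vector representation'
shuffled = 'converted to vector representation this is a small sentence to show how text is'

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform([original, shuffled]).toarray()

# both sentences contain the same words with the same counts,
# so their bag-of-words TF-IDF vectors are identical
print((vectors[0] == vectors[1]).all())  # True
```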

Now, using the above vector representation, there are different ways in which similarities between two strings could be calculated:

  • Cosine - It measures the cosine of the angle between two vectors; in mathematical terms, it is the dot product of the vectors divided by the product of their magnitudes. Just as we had a vector representation of one sentence above, other sentences will have their own representations, which are used for the similarity calculation.
  • Euclidean - It is the "ordinary" straight-line distance between two points in Euclidean space. As I said before, each vector representation can be seen as a point in an N-dimensional space, and the distance between two such points gives an idea of how near or far they are relative to other strings.
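The two metrics above can be sketched with sklearn's pairwise helpers; the sentences here are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

sentences = ['the cat sat on the mat', 'a cat was sitting on the mat']

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(sentences)

# cosine similarity: 1.0 means identical direction, 0.0 means no shared terms
cos = cosine_similarity(vectors[0], vectors[1])[0, 0]
# euclidean distance: straight-line distance between the two points
# in vocabulary space
euc = euclidean_distances(vectors[0], vectors[1])[0, 0]
print('cosine: %.3f, euclidean: %.3f' % (cos, euc))
```

The two sentences share several words ("cat", "mat", "on", "the"), so the cosine similarity lands strictly between 0 and 1.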

Other useful metrics include Manhattan, Chebyshev, Minkowski, Jaccard, and Mahalanobis distances; the mathematics for these can be found on sklearn's website.

These vector-based methods scale really well with the length of the text.

Metric Type II

There is also fuzzy string matching, which compares two strings character by character. It has implementations in both R (called fuzzywuzzyR) and Python (called fuzzywuzzy, built on the standard library's difflib). Using this we can calculate different ratios that give a perspective on the relative similarity of different strings. The following are the ratios that can be calculated:

  • Partial token set ratio
  • Partial token sort ratio
  • Token set ratio
  • QRatio
  • WRatio
  • UQRatio
  • UWRatio

Details of each ratio can be read here. However, one thing to keep in mind is that these methods don't really scale well with the length of the text.
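Since fuzzywuzzy's plain ratio is built on the standard library's difflib, the idea can be sketched without any extra packages; the token-sort helper below is a homemade stand-in for fuzzywuzzy's token sort ratio, not its actual implementation:

```python
from difflib import SequenceMatcher

a = 'Trump speaks to the media'
b = 'the media speaks to Trump'

# plain character-level ratio in [0, 1], as difflib computes it
plain = SequenceMatcher(None, a, b).ratio()

def token_sort_ratio(s1, s2):
    # sort the words first so word order stops mattering
    # (the same idea behind fuzzywuzzy's token sort ratio)
    t1 = ' '.join(sorted(s1.lower().split()))
    t2 = ' '.join(sorted(s2.lower().split()))
    return SequenceMatcher(None, t1, t2).ratio()

# the two strings use the same words in a different order,
# so the token-sorted ratio is 1.0 while the plain ratio is below 1.0
print(round(plain, 2), token_sort_ratio(a, b))
```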

Metric Type III

Another way of measuring similarity between text strings is by treating them as sequences. These include Levenshtein, Hamming, Jaccard, Sorensen, and more, and the distance package in Python can be used for them. Edit-based distances such as Levenshtein measure the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one text into the other, and each type of edit can be assigned a different weight.

>>> import distance
>>> distance.levenshtein("lenvestein", "levenshtein")
3
>>> distance.hamming("hamming", "hamning")
1

These metrics don't really scale well with the length of the text.
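The distance package is third-party; for reference, Levenshtein distance can also be sketched in a few lines of plain Python using the classic dynamic-programming recurrence:

```python
def levenshtein(s, t):
    # classic dynamic-programming edit distance: insertions,
    # deletions, and substitutions all cost 1
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("lenvestein", "levenshtein"))  # 3, matching the distance package
```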

Metric Type IV

Lately, word embeddings have been used to calculate the similarity between text strings. Consider two strings - "Trump speaks to the media in Dallas" & "The President greets the press in Texas". All the methods discussed above would say these two texts are not similar, but they are. Word embeddings (such as word2vec and GloVe) can successfully convey this information.
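A minimal sketch of the idea: average each sentence's word vectors and compare the averages with cosine similarity. The tiny 3-dimensional vectors below are made up for illustration; in practice you would load real 100-300 dimensional word2vec or GloVe vectors (e.g. via gensim):

```python
import numpy as np

# toy embeddings standing in for pretrained word2vec/GloVe vectors;
# related words are given nearby vectors by construction
embeddings = {
    'trump':     np.array([0.9, 0.1, 0.0]),
    'president': np.array([0.8, 0.2, 0.1]),
    'media':     np.array([0.1, 0.9, 0.2]),
    'press':     np.array([0.2, 0.8, 0.3]),
    'dallas':    np.array([0.1, 0.2, 0.9]),
    'texas':     np.array([0.2, 0.1, 0.8]),
}

def sentence_vector(words):
    # average the word vectors: a common, simple sentence-embedding baseline
    return np.mean([embeddings[w] for w in words if w in embeddings], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = sentence_vector(['trump', 'media', 'dallas'])
b = sentence_vector(['president', 'press', 'texas'])
print(round(cosine(a, b), 3))  # close to 1 despite zero shared words
```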

Code for all the above approaches can be found on my GitHub: https://github.com/analyticsbot/machine-learning/tree/master/quora_question_pairs
