Measuring Text Similarity in Python
Note: This article has been taken from a post on my blog.
A while ago, I shared a paper on LinkedIn that talked about measuring similarity between two text strings using something called Word Mover's Distance (WMD). The paper can be found here.
In this post, I'll talk about different methods to calculate similarity between text strings. In general, computers can't understand text the way they understand numbers, so the text needs to be converted to vectors, which are then used for most text-based functions.
Metric Type I
## example in Python 2.7.11 (required modules sklearn, pandas)
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> import pandas as pd
## initialize TfidfVectorizer. More can be read at
## https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer
>>> tfidf = TfidfVectorizer()
>>> string = 'This is a small sentence to show how text is converted to vector representation'
>>> y = tfidf.fit_transform([string])
## fit_transform returns a sparse matrix (the sparsity is not visible here, but it will be with a large corpus)
## to know how these tfidf values are created, please google; this has
## two components - tf and idf
>>> y.toarray()
array([[ 0.24253563, 0.24253563, 0.48507125, 0.24253563, 0.24253563,
0.24253563, 0.24253563, 0.24253563, 0.24253563, 0.48507125,
0.24253563]])
## look at the words in vocabulary and their indices corresponding to the array
## above
>>> tfidf.vocabulary_
{u'show': 5, u'this': 8, u'text': 7, u'is': 2, u'sentence': 4, u'to': 9, u'vector': 10, u'how': 1, u'small': 6, u'representation': 3, u'converted': 0}
## get the feature names with the correct indices
>>> tfidf.get_feature_names()
[u'converted', u'how', u'is', u'representation', u'sentence', u'show', u'small', u'text', u'this', u'to', u'vector']
## convert the tfidf vector to a pandas dataframe
>>> df = pd.DataFrame(y.toarray(), columns = tfidf.get_feature_names())
## print the dataframe
>>> df
converted how is representation sentence show \
0 0.242536 0.242536 0.485071 0.242536 0.242536 0.242536
small text this to vector
0 0.242536 0.242536 0.242536 0.485071 0.242536
The small code above shows how to convert a string to a vector representation, which can then be fed to machine learning algorithms. However, there is a downside to this representation: the vectors don't convey the order of the words in the sentence, meaning that even if the words are shuffled, the vector representation remains the same. Imagine each sentence as a point in an N-dimensional space, just as we have points in 2D or 3D space.
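The order-invariance of bag-of-words representations can be shown with a quick standard-library sketch, using collections.Counter as a stand-in for a count-based vectorizer:

```python
from collections import Counter

# Two sentences with the same words in a different order
s1 = "this is a small sentence"
s2 = "sentence small a is this"

# A bag-of-words representation only counts words, ignoring order
bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1 == bow2)  # True: shuffling the words leaves the representation unchanged
```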
Now, using the above vector representation, there are different ways in which similarities between two strings could be calculated:
- Cosine - It measures the cosine of the angle between two vectors, computed as the dot product of the vectors divided by the product of their magnitudes. Just as we had a vector representation of one sentence above, other sentences will have their own representations, which are used for the similarity calculation.
- Euclidean - It is the "ordinary" straight-line distance between two points in Euclidean space. As I said before, each vector representation can be thought of as a point in an N-dimensional space, and the distance between two such points gives an idea of how far/near they are relative to other strings.
Other useful metrics include Manhattan, Chebyshev, Minkowski, Jaccard, and Mahalanobis distances. The mathematics for these can be found on sklearn's website.
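The two measures above can be written out in a few lines of plain Python (sklearn's `cosine_similarity` and `euclidean_distances` do the same thing over whole matrices); the toy vectors below are illustrative stand-ins for tf-idf rows:

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points in N-dimensional space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy tf-idf-like vectors for two short sentences
v1 = [1.0, 0.0, 1.0]
v2 = [0.0, 1.0, 1.0]

print(round(cosine_similarity(v1, v2), 2))   # 0.5
print(round(euclidean_distance(v1, v2), 2))  # 1.41
```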
These vector based methods scale really well with the length of the text.
Metric Type II
There exists a fuzzy-matching approach that compares two strings character by character. It has implementations in both R (called fuzzywuzzyR) and Python (called fuzzywuzzy, which is built on the standard-library difflib module). Using this, we can calculate different ratios that give a perspective on the relative similarity of different strings. The following are the ratios that can be calculated:
- Partial token set ratio
- Partial token sort ratio
- Token set ratio
- QRatio
- WRatio
- UQRatio
- UWRatio
Details of each ratio could be read here. However, one thing to keep in mind is these methods don't really scale well with the length of text.
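Under the hood, fuzzywuzzy's ratios build on difflib's SequenceMatcher, so the basic idea can be sketched with the standard library alone (the `simple_ratio` helper below is my own name, not a fuzzywuzzy function):

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Character-level similarity in [0, 1], the building block of fuzzy matchers."""
    return SequenceMatcher(None, a, b).ratio()

# Near-identical strings score close to 1.0; unrelated strings score near 0.0
print(simple_ratio("new york mets", "new york meats"))
print(simple_ratio("new york mets", "atlanta braves"))
```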
Metric Type III
Another way of measuring similarity between text strings is to treat them as sequences. These measures include Levenshtein, Hamming, Jaccard, Sorensen, and more, and the distance package in Python can be used to compute them. Edit distances such as Levenshtein measure the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other, and each kind of edit can be assigned a different weight.
>>> import distance
>>> distance.levenshtein("lenvestein", "levenshtein")
3
>>> distance.hamming("hamming", "hamning")
1
These metrics don't really scale well with the length of the text.
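For readers without the distance package installed, both metrics are short enough to implement directly; this is a plain dynamic-programming sketch, not the package's own code:

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s1 into s2 (dynamic programming)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            insert = current[j] + 1
            delete = previous[j + 1] + 1
            substitute = previous[j] + (c1 != c2)
            current.append(min(insert, delete, substitute))
        previous = current
    return previous[-1]

def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(levenshtein("kitten", "sitting"))  # 3
print(hamming("hamming", "hamning"))     # 1
```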
Metric Type IV
Lately, word embeddings have been used to calculate the similarity between text strings. Consider two strings - "Trump speaks to the media in Dallas" & "The President greets the press in Texas". All the methods discussed above will convey that these two texts are not similar, but they are. Word embeddings (such as word2vec and GloVe) can successfully convey this information.
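The idea can be sketched with made-up 3-dimensional vectors standing in for trained word2vec/GloVe embeddings; the numbers below are illustrative and not from any real model, but they show how sentences with no words in common can still end up close together:

```python
import math

# Hypothetical 3-d embeddings; a real model (word2vec, GloVe) would learn
# vectors like these from a large corpus, placing related words near each other
embeddings = {
    "trump":  [0.90, 0.10, 0.00], "president": [0.85, 0.15, 0.05],
    "speaks": [0.40, 0.40, 0.20], "greets":    [0.42, 0.38, 0.20],
    "media":  [0.10, 0.90, 0.00], "press":     [0.12, 0.88, 0.02],
    "dallas": [0.00, 0.10, 0.90], "texas":     [0.05, 0.12, 0.88],
}

def sentence_vector(words):
    """Average the word vectors, ignoring words outside the toy vocabulary."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

v1 = sentence_vector("trump speaks to the media in dallas".split())
v2 = sentence_vector("the president greets the press in texas".split())
print(round(cosine(v1, v2), 2))  # close to 1.0, even with no words in common
```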
Code for all the above approaches could be found at my github https://github.com/analyticsbot/machine-learning/tree/master/quora_question_pairs