Word Similarity Matrix - Python Code

When you have a list of phrases and want to quantify how similar they are to one another, this function is one way to do so.

(continued from https://www.dhirubhai.net/posts/eliasdabbas_python-datascience-textanalysis-activity-7044204796282572801-11nH )

TL;DR:

import advertools as adv
import pandas as pd


def word_similarity(text_list):
    # Split each phrase into single words (unigrams)
    tokenized = adv.word_tokenize(text_list, 1)
    similarity_matrix = []
    for i, sent_i in enumerate(tokenized):
        templist = []
        for j, sent_j in enumerate(tokenized):
            # Count the words the two phrases have in common
            templist.append(len(set(sent_i).intersection(sent_j)))
        similarity_matrix.append(templist)
    sim_df = pd.DataFrame(similarity_matrix)
    # A phrase's similarity with itself is not meaningful, so mask the diagonal
    for i, _ in enumerate(sim_df):
        sim_df.loc[i, i] = pd.NA
    return sim_df

The above code is a very simple implementation and not meant for large-scale use, but it does the job quite well for a few thousand phrases.

For example, if we start with this text list:

text_list = [
    'blue green red',
    'blue green yellow',
    'blue black white',
    'white red purple',
    'magenta teal gray',
]

When we run the function, we get the following matrix:

(image: the similarity matrix produced by the function)
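The matrix can also be reproduced without the image. This sketch stands in for `adv.word_tokenize(text_list, 1)` with a plain `.split()`, an assumption that works here only because the phrases contain no punctuation:

```python
import pandas as pd

text_list = [
    'blue green red',
    'blue green yellow',
    'blue black white',
    'white red purple',
    'magenta teal gray',
]

# Stand-in for adv.word_tokenize(text_list, 1): simple whitespace split
tokenized = [phrase.split() for phrase in text_list]

sim_df = pd.DataFrame(
    [[len(set(a).intersection(b)) for b in tokenized] for a in tokenized]
)
# Mask the diagonal: a phrase's overlap with itself is not informative
for i, _ in enumerate(sim_df):
    sim_df.loc[i, i] = pd.NA

# Resulting counts (diagonal masked):
# [<NA>, 2, 1, 1, 0]
# [2, <NA>, 1, 0, 0]
# [1, 1, <NA>, 1, 0]
# [1, 0, 1, <NA>, 0]
# [0, 0, 0, 0, <NA>]
```

For instance, 'blue green red' and 'blue green yellow' share two words ('blue', 'green'), while 'magenta teal gray' shares nothing with any other phrase.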

The similarity between a document (phrase) and itself is not useful in this context, so it is set to NaN to avoid including it in any calculations. For more context, we can place the phrase text on the index and column names to see them (although this is impractical with thousands of documents):

df = word_similarity(text_list)
df.columns = text_list
df.index = text_list
df['average'] = df.apply('mean', axis=1)
df.style.background_gradient(subset=['average'], cmap='cividis').format('{:.1f}')
(image: the matrix labeled with phrase text, with an 'average' column highlighted)

One useful step is to take the average for each document as a rough quantification of its similarity to the rest: the lower the average, the more unique the document (it has little in common with the other docs).

More can be done, like counting the non-zero values per row, or comparing each row's average with the overall column mean. We can also include the length of each phrase for better context: a two-word phrase is much more likely to find similar phrases than a ten-word one.
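Those extra features could be sketched as follows. This builds on the same simple-split stand-in for `adv.word_tokenize` used earlier, and the column names here are my own, not from the original post:

```python
import pandas as pd

text_list = [
    'blue green red',
    'blue green yellow',
    'blue black white',
    'white red purple',
    'magenta teal gray',
]

# Stand-in for adv.word_tokenize(text_list, 1): simple whitespace split
tokenized = [phrase.split() for phrase in text_list]

sim_df = pd.DataFrame(
    [[len(set(a).intersection(b)) for b in tokenized] for a in tokenized]
)
for i, _ in enumerate(sim_df):
    sim_df.loc[i, i] = pd.NA

# Nullable integer dtype so mean/sum skip the NA diagonal cleanly
sim_df = sim_df.astype('Int64')

summary = pd.DataFrame({
    'phrase': text_list,
    # mean overlap with the other phrases (diagonal NA is skipped)
    'average': sim_df.mean(axis=1),
    # how many other phrases share at least one word
    'nonzero_count': sim_df.gt(0).sum(axis=1),
    # phrase length, for context: short phrases match more easily
    'num_words': [len(words) for words in tokenized],
})
```

For 'blue green red' this gives an average of 1.0 and three phrases with at least one word in common, while 'magenta teal gray' shares a word with nothing.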


Counts or percentages?

So far we have simply counted the words, but we could have also calculated the fraction as a percentage. The interesting thing about this is that it is not symmetrical. Take two phrases:

  • one two
  • one two three four

There are two common words in this case, but what is the common-word fraction?

From the perspective of the first phrase, it is 100% because all of its words are common with the second one. From the second however, it is only 50%.
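That asymmetry is easy to compute directly. A minimal sketch, normalizing the shared-word count by each phrase's own (unique) word count:

```python
phrase_a = 'one two'.split()
phrase_b = 'one two three four'.split()

# Words the two phrases share
common = set(phrase_a).intersection(phrase_b)

# Fraction of each phrase's own words that are shared:
frac_a = len(common) / len(set(phrase_a))  # from the first phrase's perspective
frac_b = len(common) / len(set(phrase_b))  # from the second phrase's perspective

print(frac_a, frac_b)  # 1.0 0.5
```

Dividing by each phrase's own vocabulary size is what makes the measure directional: the same two shared words are all of the first phrase but only half of the second.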

What else?

Einstein Soares (SEO Specialist | Technical SEO | SEO Strategist) · 1y

Amazing, one of the problems that we SEOs needed help with! Great work.

Simone De Palma (Technical SEO Specialist | Data Analyst Practitioner | Founder of SEO Depths) · 1y

Cool stuff. It would be great if instead of plain text we could use URLs - perhaps we could already though, gotta try.
