Word Similarity Matrix - Python Code

When you have a list of phrases and want to quantify how similar they are to one another, this function is one way to do so.

(continued from https://www.dhirubhai.net/posts/eliasdabbas_python-datascience-textanalysis-activity-7044204796282572801-11nH )

TL;DR:

import advertools as adv
import pandas as pd


def word_similarity(text_list):
    # Split each phrase into single words (unigrams)
    tokenized = adv.word_tokenize(text_list, 1)
    similarity_matrix = []
    for i, sent_i in enumerate(tokenized):
        templist = []
        for j, sent_j in enumerate(tokenized):
            # Count the words the two phrases have in common
            templist.append(len(set(sent_i).intersection(sent_j)))
        similarity_matrix.append(templist)
    sim_df = pd.DataFrame(similarity_matrix)
    # A phrase's similarity with itself is not meaningful, so mask the diagonal
    for i, _ in enumerate(sim_df):
        sim_df.loc[i, i] = pd.NA
    return sim_df

The above code is a very simple implementation and not meant for large-scale use, but it does the job quite well for a few thousand phrases.

For example, if we start with this text list:

text_list = [
    'blue green red',
    'blue green yellow',
    'blue black white',
    'white red purple',
    'magenta teal gray',
]

When we run the function, we get the following matrix:

(image: the similarity matrix produced by the function)
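The matrix can also be reproduced without the image. This sketch stands in for `adv.word_tokenize(text_list, 1)` with a plain `.split()`, an assumption that works here only because the phrases contain no punctuation:

```python
import pandas as pd

text_list = [
    'blue green red',
    'blue green yellow',
    'blue black white',
    'white red purple',
    'magenta teal gray',
]

# Stand-in for adv.word_tokenize(text_list, 1): simple whitespace split
tokenized = [phrase.split() for phrase in text_list]

sim_df = pd.DataFrame(
    [[len(set(a).intersection(b)) for b in tokenized] for a in tokenized]
)
# Mask the diagonal: a phrase's overlap with itself is not informative
for i, _ in enumerate(sim_df):
    sim_df.loc[i, i] = pd.NA

# Resulting counts (diagonal masked):
# [<NA>, 2, 1, 1, 0]
# [2, <NA>, 1, 0, 0]
# [1, 1, <NA>, 1, 0]
# [1, 0, 1, <NA>, 0]
# [0, 0, 0, 0, <NA>]
```

For instance, 'blue green red' and 'blue green yellow' share two words ('blue', 'green'), while 'magenta teal gray' shares nothing with any other phrase.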

The similarity between a document (phrase) and itself is not useful in this context, so it is set to NaN to avoid including it in any calculations. For more context, we can place the phrase text on the index and column names to see them (although this is impractical with thousands of documents):

df = word_similarity(text_list)
df.columns = text_list
df.index = text_list
df['average'] = df.apply('mean', axis=1)
df.style.background_gradient(subset=['average'], cmap='cividis').format('{:.1f}')
(image: the matrix labeled with phrase text, with an 'average' column highlighted)

One useful step is to take the average for each document as a rough quantification of its similarity to the rest: the lower the average, the more unique the document (it has little in common with the other docs).

More can be done, like counting the non-zero values per row, or comparing each row's average with the overall column mean. We can also include the length of each phrase for better context: a two-word phrase is much more likely to find similar phrases than a ten-word one.
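Those extra features could be sketched as follows. This builds on the same simple-split stand-in for `adv.word_tokenize` used earlier, and the column names here are my own, not from the original post:

```python
import pandas as pd

text_list = [
    'blue green red',
    'blue green yellow',
    'blue black white',
    'white red purple',
    'magenta teal gray',
]

# Stand-in for adv.word_tokenize(text_list, 1): simple whitespace split
tokenized = [phrase.split() for phrase in text_list]

sim_df = pd.DataFrame(
    [[len(set(a).intersection(b)) for b in tokenized] for a in tokenized]
)
for i, _ in enumerate(sim_df):
    sim_df.loc[i, i] = pd.NA

# Nullable integer dtype so mean/sum skip the NA diagonal cleanly
sim_df = sim_df.astype('Int64')

summary = pd.DataFrame({
    'phrase': text_list,
    # mean overlap with the other phrases (diagonal NA is skipped)
    'average': sim_df.mean(axis=1),
    # how many other phrases share at least one word
    'nonzero_count': sim_df.gt(0).sum(axis=1),
    # phrase length, for context: short phrases match more easily
    'num_words': [len(words) for words in tokenized],
})
```

For 'blue green red' this gives an average of 1.0 and three phrases with at least one word in common, while 'magenta teal gray' shares a word with nothing.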


Counts or percentages?

So far we have simply counted the words, but we could have also calculated the fraction as a percentage. The interesting thing about this is that it is not symmetrical. Take two phrases:

  • one two
  • one two three four

There are two common words in this case, but what is the common-word fraction?

From the perspective of the first phrase, it is 100% because all of its words are common with the second one. From the second however, it is only 50%.
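That asymmetry is easy to compute directly. A minimal sketch, normalizing the shared-word count by each phrase's own (unique) word count:

```python
phrase_a = 'one two'.split()
phrase_b = 'one two three four'.split()

# Words the two phrases share
common = set(phrase_a).intersection(phrase_b)

# Fraction of each phrase's own words that are shared:
frac_a = len(common) / len(set(phrase_a))  # from the first phrase's perspective
frac_b = len(common) / len(set(phrase_b))  # from the second phrase's perspective

print(frac_a, frac_b)  # 1.0 0.5
```

Dividing by each phrase's own vocabulary size is what makes the measure directional: the same two shared words are all of the first phrase but only half of the second.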

What else?

Einstein Soares (SEO Specialist | Technical SEO | SEO Strategist) · 1y

Amazing, one of the problems that we SEOs needed help with! Great work.

Simone De Palma (Technical SEO Specialist | Data Analyst Practitioner | Founder of SEO Depths) · 1y

Cool stuff. It would be great if instead of plain text we could use URLs - perhaps we could already though, gotta try.
