
Inside the TOP 1000 tags on Medium.com

Florin Badita

Data Mining for Startups (Series A-C) | +10k websites scrapped, collected billion of data points, +10 yrs exp | Do you need data that is difficult to find? Contact me - Forbes U30 / Tedx Speaker

Published: August 30, 2024

During my three-month stay in the US, one of my pet projects was to download all of the posts on medium.com. It took around one week to write the script and another week to download all of the posts.

The total database size is 8GB. I ended up with 6M posts carrying a total of 9.2M tags. Counting only the unique tags, we end up with 620K tags.

I then filtered the data and extracted the stats only for the tags that are used in at least 1000 different posts.

There are 1016 unique tags that are used in at least 1000 posts.
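The filtering step above can be sketched roughly as follows. This is a minimal illustration, not the script used for the article: the post structure (a dict with a `tags` list) and the function name `frequent_tags` are assumptions made for the example, and the sample data is made up.

```python
from collections import Counter

def frequent_tags(posts, min_posts=1000):
    """Return {tag: post_count} for tags used in at least min_posts distinct posts."""
    counts = Counter()
    for post in posts:
        # set() so that a tag repeated inside one post is counted only once
        counts.update(set(post["tags"]))
    return {tag: n for tag, n in counts.items() if n >= min_posts}

# Tiny made-up sample; the threshold is lowered so the filter is visible.
sample_posts = [
    {"id": 1, "tags": ["startup", "seo"]},
    {"id": 2, "tags": ["startup", "poetry"]},
    {"id": 3, "tags": ["startup"]},
]
print(frequent_tags(sample_posts, min_posts=2))
```

With the real data set, the default threshold of 1000 would correspond to the cut-off described above.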

You can download the CSV file with the TOP 1000 values from this gist.

All the visualizations are posted on Tableau Public, at this link.

Let's plot the Average Image Count / Average Reading Time.

Using the data, we plot on the Y axis the average image count of posts with a specific tag, and on the X axis the average reading time of posts with that tag.
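One way to compute the two per-tag averages used in this plot is sketched below. The field names `image_count` and `reading_time` and the function `tag_averages` are hypothetical, chosen only for this example; the sample posts are made up.

```python
from collections import defaultdict

def tag_averages(posts):
    """Return {tag: (average image count, average reading time)} over posts with that tag."""
    totals = defaultdict(lambda: [0, 0.0, 0])  # images, reading time, number of posts
    for post in posts:
        for tag in set(post["tags"]):
            entry = totals[tag]
            entry[0] += post["image_count"]
            entry[1] += post["reading_time"]
            entry[2] += 1
    return {tag: (imgs / n, minutes / n) for tag, (imgs, minutes, n) in totals.items()}

# Made-up sample posts (reading time in minutes).
sample_posts = [
    {"tags": ["travel"], "image_count": 6, "reading_time": 4.0},
    {"tags": ["travel", "life"], "image_count": 2, "reading_time": 2.0},
]
for tag, (avg_images, avg_minutes) in tag_averages(sample_posts).items():
    print(tag, avg_images, avg_minutes)
```

Each tag then becomes one point: average reading time on the X axis, average image count on the Y axis.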

Top 10 posts per:

Average Recommends

Number of Posts

Image Count

Reading Time

One thing we can do with this data is to plot on the Y axis the number of posts with a specific tag, and on the X axis the number of distinct users who published posts with that tag.

Above the Line / Below the Line explanation.

To get a better understanding of what it means for a value to be above or below the line, look at this graph. It shows 4 data points, one per tag. Every tag has around 3100 posts, but a different number of distinct users.

One general rule that seems to hold in the majority of cases is that the tags above the line are more specific to a domain, and the tags below the line are more general terms.

Above the Line

For the tags above the black line that traverses the chart, a smaller group of people is writing all of the articles for that specific tag:

For example, for the IFTTT tag, there are 55,151 posts, and only 1,331 distinct users.

If we divide the number of posts by the number of distinct users, we get the average number of posts written by an individual.

For the IFTTT tag we have 55,151 posts / 1,331 distinct users = 41 posts per user.

Let's take another example: the SEO tag. Here we have 47,271 posts, written by 6,471 users. That means that, on average, each user has written 7 posts with that tag.

For the Poetry tag, the ratio is more balanced, with each user writing 2.6 posts.
For the Startup tag, the most used tag on medium, the ratio is 2.2 posts/user.
For the Politics tag, the ratio is 2.1 posts/user.
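The ratio is simple arithmetic and can be checked with a few lines of code. The counts below are the ones quoted above for IFTTT and SEO; the helper name `posts_per_user` is just an illustrative choice.

```python
def posts_per_user(posts, users):
    """Average number of posts written per distinct user for one tag."""
    return posts / users

# (post count, distinct users) as quoted in the text
tags = {
    "IFTTT": (55151, 1331),
    "SEO": (47271, 6471),
}
for name, (posts, users) in tags.items():
    print(f"{name}: {posts_per_user(posts, users):.1f} posts per user")
```

Rounded to whole numbers, these are the 41 and 7 posts per user mentioned in the text.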

This makes sense if we consider that in these categories there are people who are more specialized and passionate about a particular field, and they tend to write more about that topic.

Also, these tags come from domains that require in-depth experience with the topic.

For the IFTTT and SEO tags, I'm guessing somebody uses the tag to do some SEO spam.

Below the Line

For the tags below the black line that traverses the chart, a larger and more diverse group of people is contributing articles for that specific tag.

You don't see the same monopoly of a small group of users writing about a specific tag.

Also, these tags seem to be more generic (Life, Teens, Election, New Years Resolutions, Internet and other tags that are more generic and that everybody has an opinion about).

These are not topics that require specialized skills and knowledge to write about.

For example, for the medium tag, there are 16,313 posts and 10,267 distinct users.

If we divide the number of posts by the number of distinct users, we get the average number of posts written by an individual.

For the medium tag we have 16,313 posts / 10,267 distinct users = 1.6 posts per user.

Let's take another example.

The death tag. Here we have 8,445 posts, written by 6,363 users.

That means that, on average, each user has written 1.3 posts with that tag. For the "Social Media" tag, the ratio is 1.9 posts/user. For the Relationships tag, the ratio is 1.7 posts/user.
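To make the above/below distinction concrete, here is a small sketch that classifies a tag against a reference line. The article does not say how the black line was drawn, so the line used here (posts = 2.0 x users, passing through the origin) is purely an assumption picked because it separates the examples quoted in the text; the function name `side_of_line` is also hypothetical.

```python
def side_of_line(posts, users, slope=2.0):
    """Classify a tag against the reference line posts = slope * users.

    The slope of 2.0 posts per user is an assumed value, not taken from the chart.
    """
    return "above" if posts > slope * users else "below"

# (post count, distinct users) as quoted in the text
tags = {
    "IFTTT": (55151, 1331),
    "SEO": (47271, 6471),
    "medium": (16313, 10267),
    "death": (8445, 6363),
}
for name, (posts, users) in tags.items():
    print(name, side_of_line(posts, users))
```

Under this assumed line, IFTTT and SEO land above it, while medium and death land below it.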

Let's zoom in to get a better understanding of the data.

In total, we have 1000 tags, and the vast majority of them are in the bottom left corner.

In this image you can see the 3 zoom levels that we will dive into to get a better understanding of the data, along with the number of tags that we have in each of the three views.

Zoom Level 1

In the first zoom level, we are left with 932 tags, with a total post count of 3M (see the summary in the chart).

We can see the same trend emerging: the majority of the tags above the line are specific to an industry (blockchain, investing, film) or centered on a subgroup of people (Vietnam, Japan, Espanol).

The tags below the line are more generic, consisting of human states (Fear, Depression), generic terms (work, future, thanksgiving), etc.

One thing that becomes apparent at this point is that the posts below the line, meaning the more generic ones, are getting more recommendations on average than the ones above the line. (The bigger the circle for a tag, the more recommendations its posts got.)

Zoom Level 2

At zoom level 2, we are left with 658 tags, with a total post count of 1.4M (see the summary in the chart).

We can see the same trend as in zoom level 1: the posts that are more specific (above the line) have fewer recommendations than the more generic ones (below the line).

Zoom Level 3

At zoom level 3, we are left with 312 tags, with a total post count of 437K (see the summary in the chart).

At this zoom level it's clear that the posts that are more general, the ones below the line, are getting more recommendations than the ones above the line.

To test this theory, we selected 156 points that are below the line and calculated their average and median values.

We did the same with 119 points that are above the line (more specific topics). The results are:

Above the Line: average recommendations = 4.04, median recommendations = 3.23

Below the Line: average recommendations = 6.58, median recommendations = 4.73
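A comparison like this can be reproduced with Python's standard `statistics` module. The sketch below uses a handful of made-up per-tag recommendation averages, not the 156 and 119 points from the chart, and the helper name `summarize` is an assumption for the example.

```python
from statistics import mean, median

def summarize(values):
    """Return (average, median) of a list of per-tag average recommendations."""
    return mean(values), median(values)

# Made-up sample values, only to show the calculation.
above_the_line = [2.0, 3.0, 4.0, 7.0]
below_the_line = [3.0, 5.0, 6.0, 10.0]

for label, values in (("Above the Line", above_the_line), ("Below the Line", below_the_line)):
    avg, med = summarize(values)
    print(f"{label}: average = {avg:.2f}, median = {med:.2f}")
```

Reporting the median next to the average helps show that the difference is not driven by a few tags with unusually high recommendation counts.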

This is just the tip of the iceberg of what we can do with and learn from this data set. I am looking for ideas about what to do next with it.

If you want to contribute and join me in the quest of playing with the data set, send an email to [email protected].
