Inside the TOP 1000 tags on Medium.com
Florin Badita
Data Mining for Startups (Series A-C) | +10k websites scraped, collected billions of data points, +10 yrs exp | Do you need data that is difficult to find? Contact me - Forbes U30 / Tedx Speaker
During my 3-month stay in the US, one of my pet projects was to download all of the posts on medium.com. It took around 1 week to write the script, and another week to download all of the posts.
The total database size is 8GB. I ended up with 6M posts carrying a total of 9.2M tags. Counting only the unique tags, we end up with 620K.
I then filtered the data and extracted the stats for only the tags that are used in at least 1000 different posts.
There are 1016 unique tags that are used in at least 1000 posts.
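The filtering step can be sketched in a few lines. This is only an illustration, not the script used for this post: the `posts` structure, the sample data, and the `tags_with_min_posts` helper are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample: each post is represented by its list of tags.
posts = [
    {"id": 1, "tags": ["SEO", "Marketing"]},
    {"id": 2, "tags": ["SEO", "Life"]},
    {"id": 3, "tags": ["Life"]},
    {"id": 4, "tags": ["SEO", "SEO"]},  # the same tag repeated on one post
]

def tags_with_min_posts(posts, min_posts):
    """Return {tag: post_count} for tags used in at least min_posts different posts."""
    counts = Counter()
    for post in posts:
        # set() so that a tag repeated within one post is counted only once
        counts.update(set(post["tags"]))
    return {tag: n for tag, n in counts.items() if n >= min_posts}

# With the real data set the threshold would be 1000; 2 is used for the toy sample.
popular = tags_with_min_posts(posts, min_posts=2)
print(popular)
```

Counting each tag once per post matters here, because the threshold is about the number of different posts, not the number of tag occurrences.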
You can download the CSV file with the TOP 1000 values from this gist.
All the visualizations are posted on Tableau Public, at this link.
Let's plot the Average Image Count / Average Reading Time
Using the data, we plot on the Y axis the average image count of the posts with a specific tag, and on the X axis the average reading time of the posts with that tag.
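The two coordinates of each tag are grouped averages. A minimal sketch of that aggregation, assuming each post record has `tags`, `image_count`, and `reading_time` fields (these field names and the sample values are hypothetical, not taken from the real data set):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample posts; reading_time is in minutes.
posts = [
    {"tags": ["Photography"], "image_count": 8, "reading_time": 3.0},
    {"tags": ["Photography", "Travel"], "image_count": 12, "reading_time": 5.0},
    {"tags": ["Travel"], "image_count": 4, "reading_time": 7.0},
]

def per_tag_averages(posts):
    """Return {tag: (avg_reading_time, avg_image_count)}, i.e. the (x, y) point of each tag."""
    images = defaultdict(list)
    times = defaultdict(list)
    for post in posts:
        for tag in set(post["tags"]):
            images[tag].append(post["image_count"])
            times[tag].append(post["reading_time"])
    return {tag: (mean(times[tag]), mean(images[tag])) for tag in images}

averages = per_tag_averages(posts)
for tag, (x, y) in sorted(averages.items()):
    print(f"{tag}: avg reading time = {x}, avg image count = {y}")
```

A post with several tags contributes to the average of each of its tags, which is why the post tagged both Photography and Travel appears in both groups.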
Top 10 posts per:
Average Recommends
Number of Posts
Image Count
Reading Time
One thing that we can do with this data is to plot on the Y axis the count of posts with a specific tag, and on the X axis the count of distinct users who wrote posts with that tag.
Above the Line / Below the Line explanation.
To get a better understanding of what it means for a value to be above or below the line, look at this graph. It shows 4 data points (4 tags); every tag has around 3100 posts, but a different number of distinct users.
One general rule that seems to work in the majority of cases is that the tags above the line are more specific to a domain, while the tags below the line are more general terms.
Above the Line
For the tags above the black line that traverses the chart, a smaller group of people is writing all of the articles for that specific tag:
For example, for the IFTTT tag, there are 55,151 posts, and only 1,331 distinct users.
If we divide the number of posts by the number of distinct users, we get the average amount of posts written by an individual.
For the IFTTT tag we have 55,151 posts / 1,331 distinct users ≈ 41 posts per user.
Let's take another example: the SEO tag. Here we have 47,271 posts written by 6,471 users. That means that, on average, each user has written about 7 posts with that tag.
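The same arithmetic as a short snippet, using the post and user counts quoted above (the `posts_per_user` helper is just a name chosen for this example):

```python
def posts_per_user(post_count, distinct_users):
    """Average number of posts written per distinct user for one tag."""
    return post_count / distinct_users

# (post count, distinct users) for the two tags discussed above
tag_stats = {
    "IFTTT": (55_151, 1_331),
    "SEO": (47_271, 6_471),
}

for tag, (post_count, distinct_users) in tag_stats.items():
    ratio = posts_per_user(post_count, distinct_users)
    print(f"{tag}: {ratio:.1f} posts per user")
```

This prints 41.4 for IFTTT and 7.3 for SEO, which match the rounded values of 41 and 7 used in the text.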
This makes sense if we consider that in these categories there are people who are more specialized and passionate about a particular field, and they tend to write more about that topic.
Also, these tags come from domains that require in-depth experience with the topic.
For the IFTTT and SEO tags, I'm guessing somebody uses the tag to do some SEO spam.
Below the Line
For the tags below the black line that traverses the chart, a larger and more diverse group of people is contributing articles for that specific tag.
You don’t see the same monopoly of a small group of users writing about a specific tag.
Also, these tags seem to be more generic (Life Teens, Election, New Years Resolutions, Internet, and other tags on topics that everybody has an opinion about).
These are not topics that require specialized skills or knowledge to write about.
For example, for the medium tag, there are 16,313 posts and 10,267 distinct users.
If we divide the number of posts by the number of distinct users, we get the average amount of posts written by an individual.
For the medium tag we have 16,313 posts / 10,267 distinct users ≈ 1.6 posts per user.
Let's take another example: the death tag. Here we have 8,445 posts written by 6,363 users.
That means that, on average, each user has written 1.3 posts with that tag. For the “Social Media” tag, the ratio is 1.9 posts/user. For the Relationships tag, the ratio is 1.7 posts/user.
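To make the above/below distinction concrete, here is a rough sketch that classifies the four tags whose numbers are quoted above. The post does not say how the black line in the chart was computed, so the line here is an assumption: a least-squares line forced through the origin and fitted on these four points only. The real chart uses all the tags, so its slope will differ.

```python
# (distinct users, posts) per tag, taken from the figures quoted above.
tags = {
    "IFTTT": (1_331, 55_151),
    "SEO": (6_471, 47_271),
    "medium": (10_267, 16_313),
    "death": (6_363, 8_445),
}

def fit_slope_through_origin(points):
    """Least-squares slope of y = slope * x (assumed form of the line)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx

def side_of_line(users, posts, slope):
    """'above' if the tag has more posts than the line predicts for its user count."""
    return "above" if posts > slope * users else "below"

slope = fit_slope_through_origin(tags.values())
sides = {tag: side_of_line(users, posts, slope) for tag, (users, posts) in tags.items()}

for tag, side in sides.items():
    print(f"{tag}: {side} the line")
```

With this assumed line, IFTTT and SEO land above it and medium and death land below it, in agreement with the grouping described in the text.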
Let's zoom in to get a better understanding of the data.
In total, we have 1000 tags, and the vast majority of them are in the bottom left corner.
In this image you can see the 3 zoom levels that we will dive into to get a better understanding of the data, as well as the number of tags that we have in each of the three views.
Zoom Level 1
In the first zoom level, we are left with 932 tags, with a total post count of 3M (See summary in the chart)
We can see the same trend emerging: the majority of tags above the line are specific to an industry (blockchain, investing, film) or to a subgroup of people (Vietnam, Japan, Espanol).
The tags below the line are more generic, consisting of human states (Fear, Depression), generic terms (work, future, thanksgiving), etc.
One thing that becomes apparent at this point is that the posts below the line, meaning the ones with more generic tags, get on average more recommendations than the ones above the line (the bigger the circle for each tag, the more recommendations its posts got).
Zoom Level 2
At zoom level 2, we are left with 658 tags, with a total post count of 1.4M (See summary in the chart)
We can see the same trend as at zoom level 1: the posts that are more specific (above the line) have fewer recommendations than the more generic ones (below the line).
Zoom Level 3
At zoom level 3, we are left with 312 tags, with a total post count of 437K (See summary in the chart)
At this zoom level it's clear that the posts that are more general, the ones below the line, are getting more recommendations than the ones above the line.
To test this theory, we selected 156 points that are below the line and calculated the average and median of their recommendations.
We did the same with 119 points that are above the line (more specific topics). The results are:
Above the Line: average recommendations = 4.04, median recommendations = 3.23
Below the Line: average recommendations = 6.58, median recommendations = 4.73
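The average and median comparison can be reproduced with Python's standard `statistics` module. The numbers below are made up purely to show the calculation; they are not the 119 and 156 points from the data set, and the `summarize` helper is a name chosen for this example.

```python
from statistics import mean, median

# Made-up average-recommendation values per tag, for illustration only.
above_line = [1.2, 2.5, 3.0, 3.4, 9.9]   # more specific tags
below_line = [2.0, 4.1, 4.7, 6.3, 15.4]  # more generic tags

def summarize(values):
    """Return the mean and median of a list of values, rounded to 2 decimals."""
    return {"mean": round(mean(values), 2), "median": round(median(values), 2)}

for name, values in (("Above the Line", above_line), ("Below the Line", below_line)):
    stats = summarize(values)
    print(f"{name}: average = {stats['mean']}, median = {stats['median']}")
```

Reporting the median next to the average is useful here because a few tags with very many recommendations can pull the average up, while the median is much less affected by them.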
This is just the tip of the iceberg of what we can do with, and learn from, this data set. I'm searching for ideas of what to do next with it.
If you want to contribute and join me in the quest of playing with the data set, send an email to [email protected].