Identifying Clickbaits Using Machine Learning
Abhishek Thakur
AI @ Arcee | Ex-Hugging Face | World’s 1st 4x Kaggle GrandMaster | 158k+ LinkedIn Followers, 100k+ YouTube Subscribers
Bait is something used to lure fish. Clickbait is similar: it is used to lure humans to websites. However, studies have found that humans are more intelligent than fish, and more dangerous too.
Facebook started detecting clickbaits in late 2014 and recently announced that it is going to reduce the number of clickbaits that appear in the news feed (https://newsroom.fb.com/news/2016/08/news-feed-fyi-further-reducing-clickbait-in-feed/). With clickbaits being penalized, it becomes important to check whether written content uses clickbait titles. If it does, the content will be penalized and will only rarely appear in top search results or the Facebook news feed.
This post gives an overview of what clickbaits are and how to recognize them using machine learning.
What do clickbait titles look like?
Some examples of clickbaits are as follows:
- 10 things Apple didn’t tell you about the new iPhone
- What happened next will surprise you
- This is what the actor/actress from the 90s looks like now
- What did Donald Trump just say about Obama and Clinton
- 9 things you must have to be a good data scientist
- How owning an iPhone boosts up your sex life
- and there are many more....
We see a pattern, don’t we? The titles are very interesting, and in some way frustrating too. Users want to click on them to find out what the title is hinting at. Of course, these kinds of titles seldom lead to good content, and that is why they are classified as clickbaits.
One website very popular for these kinds of clickbait-y titles is BuzzFeed. Well, it’s not just BuzzFeed; there are thousands of sites that rely on clickbaits to get traffic. But with Google and Facebook penalizing them, how long is that going to last? Not very long, I guess.
Unlike other applied machine learning posts, this post will not cover the very basics of machine learning; instead, we dive right into the topic and the analysis.
Detecting Clickbaits
To detect clickbaits, we must first collect some data. To create a dataset of clickbait and non-clickbait titles and approach this as a supervised learning problem, I crawled titles from websites like BuzzFeed Buzz, Clickhole, etc. for the former category, and titles from trusted news websites like The New York Times for the latter. This way I collected ~10,000 titles, roughly 5,000 from each category.
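As a rough illustration of the crawling step, something like the sketch below can be used. The URL and the h2 selector are assumptions; each site's actual markup has to be inspected first.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical sketch: the URL and the tag used for titles are assumptions;
# real sites need per-site inspection (and respect for robots.txt).
response = requests.get("https://www.buzzfeed.com/buzz")
soup = BeautifulSoup(response.text, "html.parser")
titles = [h.get_text(strip=True) for h in soup.find_all("h2")]
```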
I used two different models for identifying clickbaits, which are discussed below:
Method 1: Term Frequency - Inverse Document Frequency (TF-IDF)
The first method was a very simple TF-IDF analysis. I used both character and word analyzers with n-gram ranges of (1, 1), (1, 2) and (1, 3). Everybody in the machine learning community knows scikit-learn (https://scikit-learn.org/stable/), and that is what I used.
For character analyzer:
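A minimal sketch with scikit-learn's TfidfVectorizer; the exact parameter values used in the original experiments are not shown, so these are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["10 things Apple didn't tell you about the new iPhone",
          "Senate passes budget resolution"]  # placeholder titles

# Character-level TF-IDF over n-grams of 1 to 3 characters.
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X_char = char_vectorizer.fit_transform(titles)
```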
And for word analyzer:
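Again a sketch with illustrative parameters, reusing the titles list from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF over unigrams and bigrams.
word_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                  stop_words="english")
X_word = word_vectorizer.fit_transform(titles)
```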
The TF-IDF vectorizer is very powerful and often provides great performance. The following graph shows which words contribute most to clickbait titles:
Similarly, for non-clickbaits, the top words are:
We see how numbers are very clickbait-y. This is because most clickbait titles start with a number [X things no one ever told you about something] or [X things you won’t believe about something unless you see this].
I used two different machine learning models, namely Logistic Regression and Gradient Boosting. To evaluate their performance, I used the following metrics:
- Area under the ROC Curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
- Precision (https://en.wikipedia.org/wiki/Precision_and_recall)
- Recall (https://en.wikipedia.org/wiki/Precision_and_recall)
- F1-Score (https://en.wikipedia.org/wiki/F1_score)
Without going into the details of these evaluation metrics, let me just tell you that 1.0 is the best score one can get, and for ROC AUC a score of 0.5 corresponds to random guessing. 0.0 is obviously the worst.
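All four metrics are available in scikit-learn; a minimal sketch with placeholder labels and predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

y_valid = np.array([1, 0, 1, 1, 0])          # placeholder true labels
probs = np.array([0.9, 0.2, 0.7, 0.4, 0.1])  # placeholder predicted probabilities

preds = (probs >= 0.5).astype(int)  # threshold probabilities at 0.5
print("AUC:", roc_auc_score(y_valid, probs))
print("Precision:", precision_score(y_valid, preds))
print("Recall:", recall_score(y_valid, preds))
print("F1:", f1_score(y_valid, preds))
```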
To avoid overfitting (https://en.wikipedia.org/wiki/Overfitting), I used 5-fold stratified cross-validation.
The following figure shows how the sampling works: with stratified sampling, the training and validation sets keep the same ratio of positive and negative labels.
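A minimal sketch of the cross-validation loop with scikit-learn's StratifiedKFold, on placeholder data; the exact tuned hyperparameters from the original experiments are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Placeholder data; in the post, X would be the TF-IDF (or word2vec) features.
X, y = make_classification(n_samples=1000, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, y):
    # GradientBoostingClassifier can be swapped in the same way.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[valid_idx])[:, 1]
    print("fold AUC:", roc_auc_score(y[valid_idx], probs))
```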
After some simple hyperparameter tuning of the models mentioned above, I obtained the following scores:
For Logistic Regression:
- ROC AUC Score = 0.987319021551
- Precision Score = 0.950326797386
- Recall Score = 0.939276485788
- F1 Score = 0.944769330734
And the ROC Curve:
For Gradient Boosting:
- ROC AUC Score = 0.969700677962
- Precision Score = 0.95756718529
- Recall Score = 0.874677002584
- F1 Score = 0.914247130317
And the ROC curve:
We see that these models are good, actually very good. But TF-IDF doesn’t necessarily capture everything all the time. Using the models above, a title like “Barack Obama” got an 80% probability of being a clickbait while “Donald Trump” got 15%, and I instantly knew that this model wasn’t enough to capture everything and that we needed something more powerful. I decided to give word2vec a try, which is discussed in the following subsection.
Method 2: Word2Vec
Word2Vec creates a multi-dimensional vector for every word in the English vocabulary (or rather, in the corpus it has been trained on). Word2Vec embeddings are very popular in natural language processing and often provide great insights. Wikipedia provides a good explanation of what these embeddings are and how they are generated (https://en.wikipedia.org/wiki/Word2vec).
Word2Vec can be used to represent words, and words with similar meanings end up very close to each other in the word2vec space. An example is shown in the following figure:
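The same idea can be explored in code with gensim; the pretrained Google News model below is an assumption for illustration (the original experiments trained their own 200-dimensional vectors):

```python
import gensim.downloader as api

# Downloads pretrained 300-dimensional Google News vectors (large download).
model = api.load("word2vec-google-news-300")
print(model.most_similar("Obama", topn=5))  # nearby words in embedding space
```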
Similarly, we can also represent sentences (here, titles) using word2vec:
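The post does not spell out the exact aggregation used; a common assumption is simple mean pooling of the word vectors, reusing the model loaded above:

```python
import numpy as np

def title_vector(title, model):
    """Average the word2vec vectors of all in-vocabulary words in a title."""
    words = [w for w in title.split() if w in model]
    if not words:
        return np.zeros(model.vector_size)  # fall back for unseen vocabulary
    return np.mean([model[w] for w in words], axis=0)

vec = title_vector("10 things Apple didn't tell you about the new iPhone", model)
```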
We represent each word (and each sentence/title) as a 200-dimensional vector. A good way to visualize word2vec embeddings is to decompose these large vectors into two dimensions using t-SNE (https://lvdmaaten.github.io/tsne/). This visualization is presented in the following figure:
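A sketch of that visualization; X_w2v stands for the matrix of averaged title vectors and y for the 0/1 labels, both filled with random placeholders here:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Placeholders: one averaged word2vec vector per title, plus 0/1 labels.
X_w2v = np.random.rand(200, 200)
y = np.random.randint(0, 2, size=200)

X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_w2v)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10, cmap="coolwarm")
plt.title("t-SNE of word2vec title embeddings")
plt.show()
```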
We see that the word2vec representation alone can distinguish between clickbaits and non-clickbaits, without even using a model on top of it. This means a machine learning model trained on these vectors should improve the classification further. I used the same two machine learning models on the processed data. The evaluation scores are provided below:
For Logistic Regression:
- ROC AUC Score = 0.981149604411
- Precision Score = 0.936280884265
- Recall Score = 0.93023255814
- F1 Score = 0.933246921581
And the ROC Curve:
For Gradient Boosting:
- ROC AUC Score = 0.981312768055
- Precision Score = 0.939947780679
- Recall Score = 0.93023255814
- F1 Score = 0.935064935065
And the ROC Curve:
We can see that the scores have improved quite substantially in the case of the gradient boosting model.
To enhance the results and incorporate both the TF-IDF and Word2Vec features, I used an ensemble of the models from both methods. The results were surprisingly good.
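The exact ensembling scheme is not specified here; a simple assumption is equal-weight averaging of the predicted probabilities of the four models:

```python
import numpy as np

# Placeholder probabilities from the four models: LR and GBM on TF-IDF
# features, and LR and GBM on word2vec features.
lr_tfidf_probs = np.array([0.91, 0.12, 0.78])
gbm_tfidf_probs = np.array([0.88, 0.20, 0.70])
lr_w2v_probs = np.array([0.93, 0.15, 0.66])
gbm_w2v_probs = np.array([0.90, 0.18, 0.72])

# Equal-weight averaging of predicted clickbait probabilities (an assumption).
ensemble_probs = np.mean(
    [lr_tfidf_probs, gbm_tfidf_probs, lr_w2v_probs, gbm_w2v_probs], axis=0
)
```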
Conclusion: Stop using clickbaits. They might give you some extra traffic for now but it’s not going to last long.
P.S.: This article was previously titled as “10 things no one ever told you about clickbaits” ;)
I can be reached at: abhishek4 [at] gmail [dot] com