Identifying Clickbaits Using Machine Learning
Abhishek Thakur
AI @ Arcee | Ex-Hugging Face | World’s 1st 4x Kaggle GrandMaster | 158k+ LinkedIn Followers, 100k+ YouTube Subscribers
Bait is something used to lure fish. Clickbait is similar: it is used to lure humans to websites. However, studies have found that humans are more intelligent than fish, and more dangerous too.
Facebook started detecting clickbaits in late 2014 and recently announced that it is going to reduce the number of clickbaits that appear in the news feed (https://newsroom.fb.com/news/2016/08/news-feed-fyi-further-reducing-clickbait-in-feed/). With clickbaits being penalized, it becomes important to check whether written content uses clickbait titles. If it does, the content will be penalized and will only rarely appear in top search results or the Facebook news feed.
This post gives an overview of what clickbaits are and how to recognize them using machine learning.
What do clickbait titles look like?
Some examples of clickbaits are as follows:
- 10 things Apple didn’t tell you about the new iPhone
- What happened next will surprise you
- This is what the actor/actress from the 90s looks like now
- What did Donald Trump just say about Obama and Clinton
- 9 things you must have to be a good data scientist
- How owning an iPhone boosts up your sex life
- and there are many more....
We see a pattern, don’t we? The titles are very interesting, and in some way frustrating too. Users want to click on them to find out what the title is hinting at. Of course, these kinds of titles seldom lead to good content, and that is why they are classified as clickbaits.
One website very popular for these kinds of clickbait-y titles is BuzzFeed. Well, it’s not just BuzzFeed; there are thousands of sites that rely on clickbaits to get traffic. But with Google and Facebook penalizing them, how long is that going to last? Not very long, I guess.
Unlike other applied machine learning posts, this post will not cover the very basics of machine learning; instead, we dive right into the topic and the analysis.
Detecting Clickbaits
To detect clickbaits, we must first collect some data. To create a dataset of clickbait and non-clickbait titles and approach this as a supervised learning problem, I crawled titles from websites like BuzzFeed Buzz, Clickhole, etc. for the former category, and titles from trusted news websites like The New York Times for the latter. This way I collected ~10,000 titles, roughly 5,000 from each category.
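As a rough illustration of the crawling step, something like the sketch below can be used. The URL and the h2 selector are assumptions; each site's actual markup has to be inspected first.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical sketch: the URL and the tag used for titles are assumptions;
# real sites need per-site inspection (and respect for robots.txt).
response = requests.get("https://www.buzzfeed.com/buzz")
soup = BeautifulSoup(response.text, "html.parser")
titles = [h.get_text(strip=True) for h in soup.find_all("h2")]
```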
I used two different models for identifying clickbaits, which are discussed below:
Method 1: Term Frequency - Inverse Document Frequency (TF-IDF)
The first method was a very simple TF-IDF analysis. I used both character and word analyzers with n-gram ranges of (1, 1), (1, 2) and (1, 3). Everybody in the machine learning community knows scikit-learn (https://scikit-learn.org/stable/), and that is what I used.
For character analyzer:
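A minimal sketch with scikit-learn's TfidfVectorizer; the exact parameter values used in the original experiments are not shown, so these are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["10 things Apple didn't tell you about the new iPhone",
          "Senate passes budget resolution"]  # placeholder titles

# Character-level TF-IDF over n-grams of 1 to 3 characters.
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X_char = char_vectorizer.fit_transform(titles)
```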
And for word analyzer:
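Again a sketch with illustrative parameters, reusing the titles list from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF over unigrams and bigrams.
word_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                  stop_words="english")
X_word = word_vectorizer.fit_transform(titles)
```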
The TF-IDF vectorizer is very powerful and often provides great performance. The following graph shows which words contribute most to clickbait titles:
Similarly, for non-clickbaits, the top words are:
We see how numbers are very clickbait-y. This is because most clickbait titles start with a number [X things no one ever told you about something] or [X things you won’t believe about something unless you see this].
I used two different machine learning models, namely Logistic Regression and Gradient Boosting. To evaluate their performance, I used the following metrics:
- Area under the ROC Curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
- Precision (https://en.wikipedia.org/wiki/Precision_and_recall)
- Recall (https://en.wikipedia.org/wiki/Precision_and_recall)
- F1-Score (https://en.wikipedia.org/wiki/F1_score)
Without going into the details of these evaluation metrics, let me just tell you that 1.0 is the best score one can get, and for ROC AUC a score of 0.5 corresponds to random guessing. 0.0 is obviously the worst.
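All four metrics are available in scikit-learn; a minimal sketch with placeholder labels and predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

y_valid = np.array([1, 0, 1, 1, 0])          # placeholder true labels
probs = np.array([0.9, 0.2, 0.7, 0.4, 0.1])  # placeholder predicted probabilities

preds = (probs >= 0.5).astype(int)  # threshold probabilities at 0.5
print("AUC:", roc_auc_score(y_valid, probs))
print("Precision:", precision_score(y_valid, preds))
print("Recall:", recall_score(y_valid, preds))
print("F1:", f1_score(y_valid, preds))
```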
To avoid overfitting (https://en.wikipedia.org/wiki/Overfitting), I used 5-fold stratified cross-validation.
The following figure shows how the sampling works: with stratified sampling, the training and validation sets keep the same ratio of positive and negative labels.
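A minimal sketch of the cross-validation loop with scikit-learn's StratifiedKFold, on placeholder data; the exact tuned hyperparameters from the original experiments are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Placeholder data; in the post, X would be the TF-IDF (or word2vec) features.
X, y = make_classification(n_samples=1000, random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, y):
    # GradientBoostingClassifier can be swapped in the same way.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[valid_idx])[:, 1]
    print("fold AUC:", roc_auc_score(y[valid_idx], probs))
```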
After some simple hyperparameter tuning of the models mentioned above, I obtained the following scores:
For Logistic Regression:
- ROC AUC Score = 0.987319021551
- Precision Score = 0.950326797386
- Recall Score = 0.939276485788
- F1 Score = 0.944769330734
And the ROC Curve:
For Gradient Boosting:
- ROC AUC Score = 0.969700677962
- Precision Score = 0.95756718529
- Recall Score = 0.874677002584
- F1 Score = 0.914247130317
And the ROC curve:
We see that these models are good, actually very good. But TF-IDF doesn’t necessarily capture everything all the time. Using the models above, a title like “Barack Obama” got an 80% probability of being a clickbait while “Donald Trump” got 15%, and I instantly knew that this model wasn’t enough to capture everything and that we needed something more powerful. I decided to give word2vec a try, which is discussed in the following subsection.
Method 2: Word2Vec
Word2Vec creates a multi-dimensional vector for every word in the English vocabulary (or rather, in the corpus it has been trained on). Word2Vec embeddings are very popular in natural language processing and often provide great insights. Wikipedia provides a good explanation of what these embeddings are and how they are generated (https://en.wikipedia.org/wiki/Word2vec).
Word2Vec can be used to represent words, and words with similar meanings end up very close to each other in the word2vec space. An example is shown in the following figure:
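The same idea can be explored in code with gensim; the pretrained Google News model below is an assumption for illustration (the original experiments trained their own 200-dimensional vectors):

```python
import gensim.downloader as api

# Downloads pretrained 300-dimensional Google News vectors (large download).
model = api.load("word2vec-google-news-300")
print(model.most_similar("Obama", topn=5))  # nearby words in embedding space
```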
Similarly, we can also represent sentences (here, titles) using word2vec:
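The post does not spell out the exact aggregation used; a common assumption is simple mean pooling of the word vectors, reusing the model loaded above:

```python
import numpy as np

def title_vector(title, model):
    """Average the word2vec vectors of all in-vocabulary words in a title."""
    words = [w for w in title.split() if w in model]
    if not words:
        return np.zeros(model.vector_size)  # fall back for unseen vocabulary
    return np.mean([model[w] for w in words], axis=0)

vec = title_vector("10 things Apple didn't tell you about the new iPhone", model)
```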
We represent each word (and each sentence/title) as a 200-dimensional vector. A good way to visualize word2vec embeddings is to decompose these large vectors into two dimensions using t-SNE (https://lvdmaaten.github.io/tsne/). This visualization is presented in the following figure:
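A sketch of that visualization; X_w2v stands for the matrix of averaged title vectors and y for the 0/1 labels, both filled with random placeholders here:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Placeholders: one averaged word2vec vector per title, plus 0/1 labels.
X_w2v = np.random.rand(200, 200)
y = np.random.randint(0, 2, size=200)

X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_w2v)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10, cmap="coolwarm")
plt.title("t-SNE of word2vec title embeddings")
plt.show()
```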
We see that the word2vec representation alone can distinguish between clickbaits and non-clickbaits, without even using a model on top of it. This means a machine learning model trained on these vectors should improve the classification further. I used the same two machine learning models on the processed data. The evaluation scores are provided below:
For Logistic Regression:
- ROC AUC Score = 0.981149604411
- Precision Score = 0.936280884265
- Recall Score = 0.93023255814
- F1 Score = 0.933246921581
And the ROC Curve:
For Gradient Boosting:
- ROC AUC Score = 0.981312768055
- Precision Score = 0.939947780679
- Recall Score = 0.93023255814
- F1 Score = 0.935064935065
And the ROC Curve:
We can see that the scores have improved quite substantially in the case of the gradient boosting model.
To enhance the results and incorporate both the TF-IDF and Word2Vec features, I used an ensemble of the models from both methods. The results were surprisingly good.
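The exact ensembling scheme is not specified here; a simple assumption is equal-weight averaging of the predicted probabilities of the four models:

```python
import numpy as np

# Placeholder probabilities from the four models: LR and GBM on TF-IDF
# features, and LR and GBM on word2vec features.
lr_tfidf_probs = np.array([0.91, 0.12, 0.78])
gbm_tfidf_probs = np.array([0.88, 0.20, 0.70])
lr_w2v_probs = np.array([0.93, 0.15, 0.66])
gbm_w2v_probs = np.array([0.90, 0.18, 0.72])

# Equal-weight averaging of predicted clickbait probabilities (an assumption).
ensemble_probs = np.mean(
    [lr_tfidf_probs, gbm_tfidf_probs, lr_w2v_probs, gbm_w2v_probs], axis=0
)
```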
Conclusion: Stop using clickbaits. They might give you some extra traffic for now but it’s not going to last long.
P.S.: This article was previously titled as “10 things no one ever told you about clickbaits” ;)
I can be reached at: abhishek4 [at] gmail [dot] com