Maths behind Naive Bayes

Ever thought of the world before computers? Statisticians calculated probabilities by hand to predict classes and outcomes! Computer programming has made our lives much easier: we now have pre-built machine learning libraries. If you are a Python developer, you can implement Naive Bayes easily using scikit-learn, and there are tools like Orange with drag-and-drop functionality where the algorithm can be tested with a single click!

Let's see how Naive Bayes works and the maths behind it with a simple problem statement. Assume Mr X is working on a notification project and has to send news headlines to the right subscribers. He has two sets of subscribers: those who subscribed to political news and those who subscribed to entertainment news. Mr X now has to classify each and every news feed he receives and send the notifications accordingly.

Let's consider the headlines below, which Mr X has already classified:

  1. Rahul Gandhi thanks government for changing FDI norms after his warning
  2. DMK Allies with ruling AIADMK combine in Tamil Nadu rural civic polls
  3. Congress says government is doing injustice to retailers
  4. Amitabh Bachchan shares throwback photo from Sholay premiere, says "How pretty Jaya looks"
  5. IPL postponed further as Indian government extends lockdown
  6. Siddharth Malhotra reacts to "Masakali" controversy.

Please note these are just imaginary headlines for learning purposes and may or may not be associated with any news feeds or current affairs. The first 3 headlines are clearly political and the last 3 are entertainment-related news feeds. Now, based on this data, can we predict or classify a new news headline?
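As a minimal sketch, this toy training set can be held in Python as labelled pairs (punctuation stripped for simplicity), from which the class priors fall out directly:

```python
training = [
    ("Rahul Gandhi thanks government for changing FDI norms after his warning", "Politics"),
    ("DMK Allies with ruling AIADMK combine in Tamil Nadu rural civic polls", "Politics"),
    ("Congress says government is doing injustice to retailers", "Politics"),
    ("Amitabh Bachchan shares throwback photo from Sholay premiere says How pretty Jaya looks", "Entertainment"),
    ("IPL postponed further as Indian government extends lockdown", "Entertainment"),
    ("Siddharth Malhotra reacts to Masakali controversy", "Entertainment"),
]

# Class prior: the fraction of training headlines in each category.
for label in ("Politics", "Entertainment"):
    prior = sum(1 for _, lbl in training if lbl == label) / len(training)
    print(label, prior)  # 0.5 for each class (3 headlines out of 6)
```

With three headlines per class, both priors are 0.5, which is the P(Politics) and P(Entertainment) used in the calculation below.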

Which category does the feed below belong to, based on the above dataset?

"Those who have no work criticise government" Says Mamata

Before jumping into solving the problem, let's understand what Naive Bayes is: a supervised learning algorithm based on Bayes' theorem with strong (naive) independence assumptions between the features. Bayes' theorem states the following relationship, given a class variable y and a dependent feature vector x1 through xn:

P(y | x1, …, xn) = P(y) × P(x1, …, xn | y) / P(x1, …, xn)

This can be decomposed as,

posterior = (prior × likelihood) / evidence, where posterior = P(y | x1, …, xn), prior = P(y), likelihood = P(x1, …, xn | y), and evidence = P(x1, …, xn).
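As a quick numeric sanity check of the theorem, here is a sketch with made-up numbers (these probabilities are hypothetical, not derived from the dataset above): suppose half of all feeds are political, the word "government" appears in 60% of political feeds, and in 40% of all feeds overall.

```python
prior = 0.5        # P(Politics)
likelihood = 0.6   # P("government" | Politics)
evidence = 0.4     # P("government")

# Bayes' theorem: posterior = prior * likelihood / evidence
posterior = prior * likelihood / evidence  # P(Politics | "government")
print(posterior)  # 0.75
```

Seeing the word "government" raises the probability of the political class from 0.5 to 0.75.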

Let's go back to the above dataset, to classify the new news feed, we need to find the best probability comparing both categories. That is we need to calculate

P(Politics | "Those who have no work criticise government" Says Mamata) and

P(Entertainment | "Those who have no work criticise government" Says Mamata)

We can use Bayes theorem to find out the above probabilities.

P(Politics | new feed) = P(new feed | Politics) × P(Politics) / P(new feed), and similarly for Entertainment.

Let's apply the naive assumption to Bayes' theorem: independence among the features. We can then split the evidence into independent parts.

Now, if any two events A and B are independent, then,

P(A ∩ B) = P(A) × P(B). Applying this to the new news feed,

P(Those Who Have No Work Criticise Government Says Mamata) = P(Those) × P(Who) × P(Have) × P(No) × P(Work) × P(Criticise) × P(Government) × P(Says) × P(Mamata)
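Under this assumption, the joint probability of the sentence is just a product of per-word probabilities. A minimal sketch (the per-word probabilities here are placeholders for illustration, not values from the dataset):

```python
from functools import reduce
from operator import mul

# Hypothetical per-word probabilities, for illustration only.
word_probs = {"those": 0.01, "who": 0.02, "have": 0.02, "no": 0.01,
              "work": 0.01, "criticise": 0.005, "government": 0.03,
              "says": 0.02, "mamata": 0.005}

# P(w1 w2 ... wn) = P(w1) * P(w2) * ... * P(wn) under independence.
joint = reduce(mul, word_probs.values(), 1.0)
print(joint)
```

Notice how quickly the product shrinks: nine small factors already give a number around 10^-18, which is why the final scores below are so tiny.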

and,

P(Those Who Have No Work Criticise Government Says Mamata | Politics) = P(Those | Politics) × P(Who | Politics) × P(Have | Politics) × P(No | Politics) × P(Work | Politics) × P(Criticise | Politics) × P(Government | Politics) × P(Says | Politics) × P(Mamata | Politics)

Let's calculate the probability of the word "Those" in the training data labelled as politics. "Those" is not present in our training data, so this probability is zero, and a single zero factor wipes out the entire product: the model can no longer make a sensible prediction. To avoid this kind of issue, we use smoothing. In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth categorical data. To ensure that our posterior probabilities are never zero, we add 1 to the numerator and k to the denominator. So, when "Those" does not appear in the training set, its probability comes out to 1 / (N + k) instead of zero.

P(word | class) = (count of word in class + 1) / (total words in class + k), where k is the number of unique words in the training set.

Total number of words in training set labelled as political = 31

Total number of words in training set labelled as Entertainment = 28

Total unique words in the above training set = 55
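Using these counts, the whole calculation can be scripted in a few lines. This sketch plugs in the word counts from the training set: of the feed's words, only "government" and "says" appear in it (twice and once in the political headlines, once each in the entertainment headlines). Exact figures depend on how punctuation and case are tokenized, so the results may differ slightly from the numbers quoted below.

```python
N_POL, N_ENT, K = 31, 28, 55   # word totals per class, unique-word count
PRIOR = 3 / 6                  # three headlines in each class

feed = ["those", "who", "have", "no", "work",
        "criticise", "government", "says", "mamata"]

# Occurrences of each feed word in the training data, per class.
pol_counts = {"government": 2, "says": 1}   # every other feed word is unseen
ent_counts = {"government": 1, "says": 1}

def score(counts, n_class):
    """Class prior times the smoothed likelihood of the feed."""
    p = PRIOR
    for w in feed:
        # Laplace smoothing: (count + 1) / (class total + unique words)
        p *= (counts.get(w, 0) + 1) / (n_class + K)
    return p

p_pol = score(pol_counts, N_POL)
p_ent = score(ent_counts, N_ENT)
print(f"Politics: {p_pol:.3e}, Entertainment: {p_ent:.3e}")
print("Politics" if p_pol > p_ent else "Entertainment")
```

With these counts the political score comes out slightly higher, in line with the classification below.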


P(Those Who Have No Work Criticise Government Says Mamata | Politics) × P(Politics) = 1.149e-17

P(Those Who Have No Work Criticise Government Says Mamata | Entertainment) × P(Entertainment) = 1.04e-17

That is, P(Politics | New feed) > P(Entertainment | New feed )

From the above results it's clear that the new sentence will be classified as a political news feed.
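Since scikit-learn was mentioned earlier, here is a minimal sketch of the same classification using its MultinomialNB. Note that CountVectorizer's built-in tokenization (lowercasing, dropping out-of-vocabulary words) differs from the hand counts above, so the internal probabilities will not match exactly, though the prediction agrees.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

headlines = [
    "Rahul Gandhi thanks government for changing FDI norms after his warning",
    "DMK Allies with ruling AIADMK combine in Tamil Nadu rural civic polls",
    "Congress says government is doing injustice to retailers",
    "Amitabh Bachchan shares throwback photo from Sholay premiere says How pretty Jaya looks",
    "IPL postponed further as Indian government extends lockdown",
    "Siddharth Malhotra reacts to Masakali controversy",
]
labels = ["Politics"] * 3 + ["Entertainment"] * 3

vec = CountVectorizer()                 # bag-of-words counts
X = vec.fit_transform(headlines)
model = MultinomialNB(alpha=1.0)        # alpha=1.0 gives Laplace smoothing
model.fit(X, labels)

new_feed = ["Those who have no work criticise government says Mamata"]
print(model.predict(vec.transform(new_feed)))
```

The prediction comes out as Politics, matching the manual calculation.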

Please check out the following tutorials if you are interested in trying this out:

Text Classification using Naive Bayes

Classification using Orange

Thank you!


