Naive Bayes Algorithm - Explained

Naive Bayes is a probabilistic algorithm that's typically used for classification problems. It uses conditional probability, which is a measure of the probability of an event occurring given that another event has (by assumption, presumption, assertion, or evidence) occurred. It is simple, intuitive, and yet performs surprisingly well in many cases. It is based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
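For reference, the conditional probability of an event A given an event B is defined as P(A|B) = P(A and B) / P(B) (for P(B) > 0), i.e. the probability of A occurring given that B has occurred.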


Assumptions made by Naive Bayes

The fundamental Naive Bayes assumption is that each feature makes an:

- Independent

- Equal

contribution to the outcome.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple, and that is why it is known as 'Naive'.

Note - The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes can outperform even highly sophisticated classification methods in many cases.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at the equation below:

P(c|x) = [P(x|c) × P(c)] / P(x)

Above,

- P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).

- P(c) is the prior probability of the class.

- P(x|c) is the likelihood, which is the probability of the predictor given the class.

- P(x) is the prior probability of the predictor.

This is a rather simple transformation, but it bridges the gap between what we want to do and what we can do. We can't get P(C|X) directly, but we can get P(X|C) and P(C) from the training data. Here's an example:

[Table: 14 weather records with the attributes Outlook, Temperature, Humidity, Windy and the target Play]

In this case, X = (Outlook, Temperature, Humidity, Windy) and Y = Play. P(X|Y) and P(Y) can be calculated:

[Tables: P(X|Y) and P(Y) estimated by counting over the records; the full joint P(X|Y) needs an entry for every combination of Outlook, Temperature, Humidity and Windy values]

Having this many parameters in the model is impractical. To solve this problem, a naive assumption is made: we pretend all features are independent. What does this mean?

P(X|Y) = P(x1|Y) × P(x2|Y) × … × P(xn|Y), where x1, …, xn are the individual features.

Now, with the help of this naive assumption (naive because features are rarely independent), we can make classifications with far fewer parameters:

P(Y|X) ∝ P(Y) × P(x1|Y) × P(x2|Y) × … × P(xn|Y), and the predicted class is the value of Y that makes this product largest.

This is a big deal. We changed the number of parameters from exponential to linear. This means that Naive Bayes can deal with high-dimensional data well.
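To make the counting concrete, here is a minimal sketch of a categorical Naive Bayes classifier in Python. It is not from the original article; the function names and toy data are invented for illustration. Notice that it only stores per-feature, per-class counts, which is why the number of parameters stays linear in the number of features:

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-feature value counts within each class."""
    class_counts = Counter(labels)
    # feature_counts[(feature_index, value, label)] -> number of occurrences
    feature_counts = defaultdict(int)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, value, label)] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Pick the class maximizing P(c) * product of P(x_i | c)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # P(c)
        for i, value in enumerate(row):
            score *= feature_counts[(i, value, label)] / count  # P(x_i | c)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data in the spirit of the weather example (invented values)
X = [("Sunny", "Hot"), ("Overcast", "Mild"), ("Rainy", "Cool"), ("Sunny", "Mild")]
y = ["No", "Yes", "Yes", "Yes"]
cc, fc = train(X, y)
print(predict(("Sunny", "Mild"), cc, fc))  # -> 'Yes'
```

Note that an unseen (feature value, class) pair gives a zero count here, which is exactly the zero-frequency problem discussed later.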

Another Example with Mathematics -

[Frequency table of Outlook vs. Play: Sunny appears 5 times out of 14 records, 3 of them with Play = Yes]

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the posterior probability method discussed above.

- P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

- Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64

- Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P(No | Sunny), so the prediction is Yes (see the short Python check below).
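As a quick check (a small sketch, not part of the original article), the same arithmetic in Python:

```python
# Counts quoted in the example above
p_sunny_given_yes = 3 / 9   # P(Sunny | Yes)
p_yes = 9 / 14              # P(Yes)
p_sunny = 5 / 14            # P(Sunny)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6
```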

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and in problems with multiple classes.

The Naive Bayes classifier assumes that all the features are unrelated to each other. The presence or absence of one feature does not influence the presence or absence of any other feature.

In real-world datasets, we test a hypothesis given multiple pieces of evidence (features), so the calculations become quite complicated. To simplify the work, the feature-independence assumption is used to uncouple the pieces of evidence and treat each one as independent.

The zero-frequency problem

One of the disadvantages of Naive Bayes is that if a class label and a certain attribute value never occur together in the training data, then the frequency-based probability estimate for that combination will be zero, and the whole product becomes zero when all the probabilities are multiplied.


Solution - An approach to overcome this 'zero-frequency problem' in a Bayesian setting is to add one to the count for every attribute value-class combination (Laplace smoothing), so that no estimated probability is ever exactly zero.
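A small sketch of what add-one (Laplace) smoothing looks like for a single attribute; the counts and names below are made up purely for illustration:

```python
def smoothed_likelihood(value_and_class_count, class_count, n_values, alpha=1):
    """Add-one (Laplace) estimate of P(value | class).

    value_and_class_count: times this attribute value occurs with this class
    class_count:           times this class occurs in the training data
    n_values:              number of distinct values the attribute can take
    """
    return (value_and_class_count + alpha) / (class_count + alpha * n_values)

# Without smoothing the estimate would be 0/9 = 0 and wipe out the whole
# product; with alpha = 1 and, say, 3 possible values it becomes 1/12.
print(smoothed_likelihood(0, 9, 3))  # 0.0833...
```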


There are three types of Naive Bayes model in the scikit-learn library:

Gaussian: It is used for classification and assumes that the features follow a normal distribution.

Multinomial: It is used for discrete counts. For example, in a text classification problem, instead of recording only whether a word occurs in a document (a Bernoulli trial), we count how often the word occurs in the document; you can think of each count as "the number of times outcome x_i is observed over n trials".

Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with a 'bag of words' model where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.
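A short example of the three scikit-learn variants; the tiny arrays below are invented only to show the API shape:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# GaussianNB: continuous features, assumed normally distributed per class
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.1]])
print(GaussianNB().fit(X_cont, y).predict([[5.0, 3.4]]))

# MultinomialNB: discrete counts, e.g. word counts per document
X_counts = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 2], [0, 2, 1]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 0]]))

# BernoulliNB: binary presence/absence features
X_bin = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))
```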

What are the Pros and Cons of Naive Bayes?

Pros:

  • It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
  • When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
  • It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

Cons:

  • If a categorical variable has a category in the test data set that was not observed in the training data set, then the model will assign a 0 (zero) probability and will be unable to make a prediction. This is often known as "zero frequency". To solve this, we can use a smoothing technique; one of the simplest is Laplace estimation.
  • On the other hand, Naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
  • Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.

Tips to improve the power of the Naive Bayes Model

Here are some tips for improving the power of the Naive Bayes Model:

  • If continuous features do not have a normal distribution, we should use a transformation or other methods to convert them into a normal distribution.
  • If the test data set has a zero-frequency issue, apply a smoothing technique such as Laplace correction to predict the class of the test data set.
  • Remove correlated features, as highly correlated features are effectively voted twice in the model, which can lead to over-inflating their importance.
  • Naive Bayes classifiers have limited options for parameter tuning, such as alpha=1 for smoothing and fit_prior=[True|False] to learn class prior probabilities or not, along with a few other options. I would recommend focusing on pre-processing of data and feature selection (a short tuning sketch follows this list).
  • You might think of applying classifier combination techniques like ensembling, bagging, and boosting, but these methods would not help: their purpose is to reduce variance, and Naive Bayes has no variance to minimize.
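For what little tuning there is, a hedged sketch with scikit-learn's GridSearchCV over alpha and fit_prior (the toy data is random and only illustrates the API):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, 10))   # invented count features
y = rng.integers(0, 2, size=60)         # invented binary labels

# The few knobs Naive Bayes exposes can still be searched over;
# everything else comes down to pre-processing and feature selection.
param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0], "fit_prior": [True, False]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```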

Applications of Naive Bayes Algorithms

Real-time Prediction: Naive Bayes is an eager learning classifier, and it is very fast. Thus, it can be used for making predictions in real time.

Multi-class Prediction: This algorithm is also well known for its multi-class prediction capability. Here we can predict the probabilities of multiple classes of the target variable.

Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence assumption) and have a higher success rate compared to many other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).

Recommendation System: A Naive Bayes classifier and collaborative filtering together can build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
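As a small illustration of the text-classification use case above, here is a minimal bag-of-words pipeline with MultinomialNB (the tiny corpus is invented; this is a sketch, not the author's code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus, just to show the shape of the pipeline
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free offer claim your prize", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))  # expected: ['spam']
```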

Thanks for reading. Like, comment, and share if you found it useful.
