Go from Beginner to Pro in Logistic Regression

  • Logistic Regression is a supervised machine-learning classification model.
  • Given a dataset D = {(xᵢ, yᵢ)}, where xᵢ ∈ ℝᵈ and yᵢ ∈ {C₁, C₂, C₃, …, Cₖ}, and a query point x_q, we have to predict which class that point belongs to.

Read till the end and witness yourself becoming a master in Logistic Regression.

I aim to provide you with its underlying mathematics in a simple and easy-to-understand language.

Without any further ado, let's get started.

Assumption of Logistic Regression: It assumes that data is linearly separable or nearly linearly separable.

[Figure: linearly separable vs. nearly linearly separable data]

The aim of Logistic Regression is to find a hyperplane that best separates the classes.

To find the plane, we need to find w and b, where w is the normal to the plane and b is the intercept term.

We know that the (signed) distance of a point xᵢ from the plane is:

dᵢ = (wᵀxᵢ + b) / ||w||

For simplicity, let's assume that

||w|| = 1

and the plane passes through the origin. Therefore,

dᵢ = wᵀxᵢ

Another assumption:

yᵢ ∈ {+1, -1}

i.e. positive points are labelled as +1 and negative points are labelled as -1.

How will our classifier classify a point?

  • If a point lies in the region in the direction of the w vector (i.e. wᵀxᵢ > 0), the point will be labelled as positive.
  • If a point lies in the region in the opposite direction of the w vector (i.e. wᵀxᵢ < 0), the point will be labelled as negative.

Let's consider all the cases for our classifier

Case 1: yᵢ = +1 (point is positive) and wᵀxᵢ > 0 (model predicts positive)

In this case, yᵢ wᵀxᵢ > 0

Case 2: yᵢ = -1 (point is negative) and wᵀxᵢ < 0 (model predicts negative)

In this case, yᵢ wᵀxᵢ > 0

Case 3: yᵢ = +1 (point is positive) and wᵀxᵢ < 0 (model predicts negative)

In this case, yᵢ wᵀxᵢ < 0

Case 4: yᵢ = -1 (point is negative) and wᵀxᵢ > 0 (model predicts positive)

In this case, yᵢ wᵀxᵢ < 0
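
A quick way to see these four cases in code (a minimal NumPy sketch; the points and the weight vector below are made up for illustration):

```python
import numpy as np

# Hypothetical 2-D points with labels in {+1, -1}
X = np.array([[ 2.0,  1.0],   # positive point on the positive side
              [-1.5, -2.0],   # negative point on the negative side
              [-1.0,  0.5],   # positive point on the negative side (misclassified)
              [ 1.0,  2.0]])  # negative point on the positive side (misclassified)
y = np.array([+1, -1, +1, -1])

w = np.array([1.0, 1.0])      # assumed normal vector of a plane through the origin

signed = y * (X @ w)          # y_i * w^T x_i for every point
print(signed)                 # > 0 -> correctly classified (Cases 1 and 2)
print(signed > 0)             # < 0 -> misclassified (Cases 3 and 4)
```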

The mathematical objective function of Logistic Regression

Suppose we have n data points, then we want a hyperplane such that

w* = argmax_w Σᵢ₌₁ⁿ yᵢ wᵀxᵢ   (i.e. maximize the sum of signed distances over all the points)

Problem with the mathematical objective function of Logistic Regression

It is sensitive to outliers. How?

Let’s imagine a scenario like this-

[Figure: two candidate hyperplanes π₁ and π₂, with an outlier point P lying far on the wrong side of π₁]

Here, point P is an outlier w.r.t. plane π₁, lying at a distance of 100 units on the opposite (wrong) side.

Before moving forward, just by looking at the scenario, π₁ appears to be a better hyperplane than π₂, as π₁ has fewer misclassified points.

Now let's see what our mathematical objective function has to say about this.

For π₁:

[Image: Σᵢ yᵢ wᵀxᵢ for π₁ — a large negative value, dominated by the -100 contribution from P]

For π₂:

[Image: Σᵢ yᵢ wᵀxᵢ for π₂ — larger than the sum for π₁]

Even though π₁ appears to be the better option, just because of the outlier P, our objective function declares π₂ to be the better one.
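
A toy numeric check of this effect (the signed distances below are invented to mirror the scenario, not read off the figure):

```python
import numpy as np

# Signed distances y_i * w^T x_i for each training point (assumed values).
# pi_1 classifies every point correctly except the far-away outlier P (-100).
dist_pi1 = np.array([1.0] * 10 + [-100.0])
# pi_2 misclassifies a few points, but none of them is far from the plane.
dist_pi2 = np.array([1.0] * 8 + [-1.0] * 3)

print(dist_pi1.sum())   # -90.0 -> the sum-of-signed-distances objective rejects pi_1
print(dist_pi2.sum())   #   5.0 -> ... and prefers pi_2, purely because of one outlier
```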


So, How to handle outliers?

This is a major concern only if the outlier is very far away.

The idea is that if the distance of a point from the plane is small, use it as is; if the distance is large, shrink (squash) its contribution.

This process of taming large distances is known as SQUASHING. One of the functions that does this for us, and the most commonly preferred one, is the Sigmoid function.

Sigmoid Function

It grows approximately linearly for small values but saturates when the input becomes large.

σ(x) = 1 / (1 + e⁻ˣ)

  • σ(0) = 0.5
  • Distances from the plane can lie in (-∞, +∞), but after passing them through the σ function, we get a value in (0, 1).
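
A minimal sketch of the sigmoid and of how it squashes large signed distances (the input values are arbitrary, for illustration):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                                   # 0.5
print(sigmoid(np.array([-100.0, -1.0, 1.0, 100.0])))
# Small distances stay informative (~0.27 and ~0.73),
# while extreme distances saturate near 0 or 1 instead of dominating the objective.
```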

Threshold:

Typically, in Binary Classification, 0.5 is taken as a threshold to decide the class label of a query point.

If σ(wᵀx_q) ≥ 0.5, predict +1; else predict -1.

Why is the sigmoid function used?

  • Easily differentiable.
  • It gives a probabilistic interpretation.
  • Grows linearly for small values but saturates for large values.

So, our new mathematical objective function will be:

w* = argmax_w Σᵢ₌₁ⁿ σ(yᵢ wᵀxᵢ)

Simplifying the objective function (by taking the negative log of the sigmoid):

w* = argmin_w Σᵢ₌₁ⁿ log(1 + exp(-yᵢ wᵀxᵢ))

Why have we taken a negative log?

  • Taking the negative log gives a convex loss, -log σ(yᵢ wᵀxᵢ) = log(1 + exp(-yᵢ wᵀxᵢ)), which is easier to optimize.
  • The objective function of Logistic Regression can also be derived in two other ways: (1) a probabilistic approach (where the features are assumed to follow a Gaussian distribution and the output label a Bernoulli distribution), and (2) minimizing the logistic-loss function (an approximation to the 0–1 loss function). Both of these contain a logarithmic term. Since log is a monotonic function, it won't affect our optimization problem.
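
The resulting per-point loss, log(1 + exp(-yᵢ wᵀxᵢ)), is easy to compute directly (a minimal sketch with made-up points and weights):

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w^T x_i)) over the data, with y_i in {+1, -1}."""
    z = y * (X @ w)
    return np.sum(np.log1p(np.exp(-z)))   # log1p(x) = log(1 + x)

X = np.array([[2.0, 1.0], [-1.5, -2.0]])
y = np.array([+1, -1])

print(logistic_loss(np.array([ 1.0,  1.0]), X, y))  # small: both points confidently correct
print(logistic_loss(np.array([-1.0, -1.0]), X, y))  # large: both points misclassified
```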

Interpretation of w

Suppose we get an optimal w value, then

w = [w₁, w₂, …, w_d]

i.e. for each feature fⱼ, there is a corresponding weight wⱼ. That's why w is also known as the weight vector.

Case 1: when wⱼ is positive

If wⱼ > 0, then increasing feature fⱼ increases wᵀx_q, so σ(wᵀx_q) increases, i.e. the predicted probability of the positive class increases.

Case 2: when wⱼ is negative

If wⱼ < 0, then increasing feature fⱼ decreases wᵀx_q, so σ(wᵀx_q) decreases, i.e. the predicted probability of the positive class decreases.
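
A tiny numeric check of both cases (the weights and the query point below are hypothetical):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.5, 0.0])   # assumed learned weights for features f1, f2, f3
x = np.array([1.0, 1.0, 1.0])    # a hypothetical query point

base    = sigmoid(w @ x)
bump_f1 = sigmoid(w @ (x + np.array([1.0, 0.0, 0.0])))  # increase f1 (positive weight)
bump_f2 = sigmoid(w @ (x + np.array([0.0, 1.0, 0.0])))  # increase f2 (negative weight)

print(base, bump_f1, bump_f2)
# P(y = +1) rises when a positively-weighted feature grows,
# and falls when a negatively-weighted feature grows.
```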

Regularization

Let zᵢ = yᵢ wᵀxᵢ

w* = argmin_w Σᵢ₌₁ⁿ log(1 + exp(-zᵢ))

The optimal (minimum) value of the loss function is obtained when:

each log(1 + exp(-zᵢ)) is at its minimum, which happens when zᵢ tends to +∞.

This means all zᵢ's must be positive, or in other words, all yᵢ wᵀxᵢ must be positive (these are Case 1 and Case 2 that we discussed earlier).

Classifying all of the training data perfectly (which requires the weights to grow without bound) will often lead to overfitting.

How to avoid that?

We need to control the magnitude of w so that it doesn't grow very large.

Penalizing w value can be done in 3 ways:

  1. L2 regularization

w* = argmin_w [ Σᵢ₌₁ⁿ log(1 + exp(-zᵢ)) + λ ||w||₂² ]

There is a tradeoff between logistic loss and regularization term.

λ is the hyperparameter controlling this tradeoff: a very small λ lets the logistic loss dominate (risking overfitting / high variance), while a very large λ lets the regularization term dominate (risking underfitting / high bias).

2. L1 regularization

w* = argmin_w [ Σᵢ₌₁ⁿ log(1 + exp(-zᵢ)) + λ ||w||₁ ]

Important points :

  • For a less important feature fⱼ, L1 regularization drives wⱼ to exactly 0 (i.e. it creates a sparse weight vector), while L2 regularization only makes wⱼ small.
  • L1 regularization results in faster computation because of the sparse weight vector it generates.

3. Elastic Net

It incorporates the benefits of both L1 and L2 regularization.

w* = argmin_w [ Σᵢ₌₁ⁿ log(1 + exp(-zᵢ)) + λ₁ ||w||₁ + λ₂ ||w||₂² ]
  • It will have 2 hyper-parameters
  • Time-consuming
  • High performance
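
A hedged sketch of how the three penalties are typically exercised with scikit-learn (the solver choices below are one valid combination, not the only one; C is the inverse of the regularization strength λ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Smaller C  =>  stronger regularization (larger lambda).
l2_model = LogisticRegression(penalty="l2", C=1.0)                        # default penalty
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")    # sparse weights
en_model = LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5,
                              solver="saga", max_iter=5000)               # two hyper-parameters

for model in (l2_model, l1_model, en_model):
    model.fit(X, y)
    print(model.penalty, "non-zero weights:", int((model.coef_ != 0).sum()))
```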

Column Standardization

  • Since the model relies on distances from the hyperplane (wᵀx), features on very different scales distort the learned weights, so column standardization is required.
  • Standardization also helps the optimization problem converge faster.
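
In practice this usually means putting a scaler in front of the model (a minimal scikit-learn sketch on synthetic data):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Standardize every column to zero mean / unit variance, then fit Logistic Regression.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
print(clf.score(X, y))
```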

Feature Importance and Model Interpretability

  • If the features are not collinear/multicollinear, then the features fⱼ corresponding to larger |wⱼ| weights are more important.
  • However, if the features are collinear, one may switch to forward feature selection or backward feature elimination, which are standard ways of getting feature importance and work irrespective of the model.
  • Once feature importance is known, the model can give reasoning that it has predicted +1 or -1 based on those features.
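
When the features are standardized and not collinear, the learned weights can be read off directly (a minimal sketch; the feature names are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
feature_names = ["f1", "f2", "f3", "f4", "f5"]   # hypothetical names

model = LogisticRegression().fit(StandardScaler().fit_transform(X), y)

# Larger |w_j| (on standardized features)  =>  feature f_j matters more to the prediction.
for j in np.argsort(-np.abs(model.coef_[0])):
    print(feature_names[j], round(float(model.coef_[0][j]), 3))
```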

Time and Space Complexities

Train:

  • Training Logistic Regression is nothing but optimizing the loss function, which takes O(n·d) time using SGD.
  • We are required to store all n of the d-dimensional points during training, which takes O(n·d) space.

Test:

  • We only need to store the weight vector, which is d-dimensional, hence O(d) space is required.
  • Calculating σ(wᵀx_q) requires d multiplications, hence O(d) time.

Dimensionality and its effect

if d is small:

  • Logistic Regression works very well.
  • Can be incorporated in low-latency systems.
  • Time and Space complexity is less.

if d is large:

  • It gets affected by the Curse of Dimensionality.
  • One can use L1 regularization to remove less important features. However, the λ value should be chosen carefully to balance bias, variance and latency.

Working with Imbalanced Data

Let’s see how Logistic Regression is affected by Imbalanced Data

[Figure: imbalanced classes with two candidate hyperplanes π₁ and π₂]

For π₁:

[Image: signed-distance sum for π₁]

For π₂:

[Image: signed-distance sum for π₂]

  • According to the objective function, π₁ gives a better result than π₂; however, just by looking at the position of the hyperplanes, we know π₂ is the better option.

How to handle it?

Standard ways of handling imbalanced data are to perform Upsampling or Downsampling

  1. Upsampling:
  • Give more weightage to the minority class.
  • Create artificial points of the minority class.

SMOTE (Synthetic Minority Oversampling Technique):

  • Another way of upsampling which uses nearest neighbours to generate new points.


2. Downsampling

  • Randomly remove samples from the majority class.
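
A hedged sketch of two common remedies: re-weighting the minority class inside Logistic Regression, and simple random upsampling (SMOTE itself is provided by the separate imbalanced-learn package and is not shown here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from sklearn.datasets import make_classification

# A toy imbalanced dataset: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Option 1: give more weightage to the minority class via class_weight.
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: randomly upsample the minority class before fitting.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
clf_upsampled = LogisticRegression().fit(X_bal, y_bal)
```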

Dealing with Multi-Class Classification

  • Logistic Regression doesn’t inherently support Multi-Class Classification.
  • However, one can use One vs All strategy to deal with it.
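
A minimal sketch of the One vs All (a.k.a. One vs Rest) strategy with scikit-learn; note that scikit-learn's LogisticRegression can also handle multi-class data internally, so the explicit wrapper below is just for illustration:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # 3 classes

# One binary Logistic Regression per class: "class k" vs "all the other classes".
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
print(len(ovr.estimators_))         # 3 underlying binary classifiers
```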

That’s all for now!

Congrats! You have now successfully added Logistic Regression to your arsenal.







