Go from Beginner to Pro in Logistic Regression

  • Logistic Regression is a supervised machine-learning classification model.
  • Given a dataset D = {(xᵢ, yᵢ)}, where xᵢ ∈ ℝᵈ and yᵢ ∈ {C₁, C₂, C₃, …, Cₖ}, and a query point x_q, we have to predict which class that point belongs to.

Read till the end and witness yourself becoming a master in Logistic Regression.

I aim to provide you with its underlying mathematics in a simple and easy-to-understand language.

Without any further ado, let's get started.

Assumption of Logistic Regression: It assumes that data is linearly separable or nearly linearly separable.

[Figure: linearly separable vs. nearly linearly separable data]

The aim of Logistic Regression is to find a hyperplane that best separates the classes.

To find the plane, we need to find w and b, where w is the normal to the plane and b is the intercept term.

We know that the (signed) distance of a point xᵢ from the plane is:

dᵢ = (wᵀxᵢ + b) / ||w||

For simplicity, let's assume that

||w|| = 1

and the plane passes through the origin. Therefore,

dᵢ = wᵀxᵢ

Another assumption:

yᵢ ∈ {+1, -1}

i.e. positive points are labelled as +1 and negative points are labelled as -1.

How will our classifier classify a point?

  • If a point lies in the region in the direction of the w vector (i.e. wᵀxᵢ > 0), the point will be labelled as positive.
  • If a point lies in the region in the opposite direction of the w vector (i.e. wᵀxᵢ < 0), the point will be labelled as negative.

Let's consider all the cases for our classifier

Case 1: yᵢ = +1 (point is positive) and wᵀxᵢ > 0 (model predicts positive)

In this case, yᵢ wᵀxᵢ > 0

Case 2: yᵢ = -1 (point is negative) and wᵀxᵢ < 0 (model predicts negative)

In this case, yᵢ wᵀxᵢ > 0

Case 3: yᵢ = +1 (point is positive) and wᵀxᵢ < 0 (model predicts negative)

In this case, yᵢ wᵀxᵢ < 0

Case 4: yᵢ = -1 (point is negative) and wᵀxᵢ > 0 (model predicts positive)

In this case, yᵢ wᵀxᵢ < 0
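
A quick way to see these four cases in code (a minimal NumPy sketch; the points and the weight vector below are made up for illustration):

```python
import numpy as np

# Hypothetical 2-D points with labels in {+1, -1}
X = np.array([[ 2.0,  1.0],   # positive point on the positive side
              [-1.5, -2.0],   # negative point on the negative side
              [-1.0,  0.5],   # positive point on the negative side (misclassified)
              [ 1.0,  2.0]])  # negative point on the positive side (misclassified)
y = np.array([+1, -1, +1, -1])

w = np.array([1.0, 1.0])      # assumed normal vector of a plane through the origin

signed = y * (X @ w)          # y_i * w^T x_i for every point
print(signed)                 # > 0 -> correctly classified (Cases 1 and 2)
print(signed > 0)             # < 0 -> misclassified (Cases 3 and 4)
```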

The mathematical objective function of Logistic Regression

Suppose we have n data points, then we want a hyperplane such that

w* = argmax_w Σᵢ₌₁ⁿ yᵢ wᵀxᵢ   (i.e. maximize the sum of signed distances over all the points)

Problem with the mathematical objective function of Logistic Regression

It is sensitive to outliers. How?

Let’s imagine a scenario like this-

[Figure: two candidate hyperplanes π₁ and π₂, with an outlier point P lying far on the wrong side of π₁]

Here, point P is an outlier w.r.t. plane π₁, lying at a distance of 100 units on the opposite (wrong) side.

Before moving forward, just by looking at the scenario, π₁ appears to be a better hyperplane than π₂, as π₁ has fewer misclassified points.

Now let's see what our mathematical objective function has to say about this.

For π₁:

[Image: Σᵢ yᵢ wᵀxᵢ for π₁ — a large negative value, dominated by the -100 contribution from P]

For π₂:

[Image: Σᵢ yᵢ wᵀxᵢ for π₂ — larger than the sum for π₁]

Even though π₁ appears to be the better option, just because of the outlier P, our objective function declares π₂ to be the better one.
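
A toy numeric check of this effect (the signed distances below are invented to mirror the scenario, not read off the figure):

```python
import numpy as np

# Signed distances y_i * w^T x_i for each training point (assumed values).
# pi_1 classifies every point correctly except the far-away outlier P (-100).
dist_pi1 = np.array([1.0] * 10 + [-100.0])
# pi_2 misclassifies a few points, but none of them is far from the plane.
dist_pi2 = np.array([1.0] * 8 + [-1.0] * 3)

print(dist_pi1.sum())   # -90.0 -> the sum-of-signed-distances objective rejects pi_1
print(dist_pi2.sum())   #   5.0 -> ... and prefers pi_2, purely because of one outlier
```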


So, How to handle outliers?

This is a major concern only if the outlier is very far away.

The idea is that if the distance of a point from the plane is small, use it as is; if the distance is large, shrink (squash) its contribution.

This process of taming large distances is known as SQUASHING. One of the functions that does this for us, and the most commonly preferred one, is the Sigmoid function.

Sigmoid Function

It grows approximately linearly for small values but saturates when the input becomes large.

σ(x) = 1 / (1 + e⁻ˣ)

  • σ(0) = 0.5
  • Distances from the plane can lie in (-∞, +∞), but after passing them through the σ function, we get a value in (0, 1).
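
A minimal sketch of the sigmoid and of how it squashes large signed distances (the input values are arbitrary, for illustration):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                                   # 0.5
print(sigmoid(np.array([-100.0, -1.0, 1.0, 100.0])))
# Small distances stay informative (~0.27 and ~0.73),
# while extreme distances saturate near 0 or 1 instead of dominating the objective.
```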

Threshold:

Typically, in Binary Classification, 0.5 is taken as a threshold to decide the class label of a query point.

If σ(wᵀx_q) ≥ 0.5, predict +1; else predict -1.

Why is the sigmoid function used?

  • Easily differentiable.
  • It gives a probabilistic interpretation.
  • Grows linearly for small values but saturates for large values.

So, our new mathematical objective function will be:

w* = argmax_w Σᵢ₌₁ⁿ σ(yᵢ wᵀxᵢ)

Simplifying the objective function (by taking the negative log of the sigmoid):

w* = argmin_w Σᵢ₌₁ⁿ log(1 + exp(-yᵢ wᵀxᵢ))

Why have we taken a negative log?

  • Taking the negative log gives a convex loss, -log σ(yᵢ wᵀxᵢ) = log(1 + exp(-yᵢ wᵀxᵢ)), which is easier to optimize.
  • The objective function of Logistic Regression can also be derived in two other ways: (1) a probabilistic approach (where the features are assumed to follow a Gaussian distribution and the output label a Bernoulli distribution), and (2) minimizing the logistic-loss function (an approximation to the 0–1 loss function). Both of these contain a logarithmic term. Since log is a monotonic function, it won't affect our optimization problem.
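
The resulting per-point loss, log(1 + exp(-yᵢ wᵀxᵢ)), is easy to compute directly (a minimal sketch with made-up points and weights):

```python
import numpy as np

def logistic_loss(w, X, y):
    """Sum of log(1 + exp(-y_i * w^T x_i)) over the data, with y_i in {+1, -1}."""
    z = y * (X @ w)
    return np.sum(np.log1p(np.exp(-z)))   # log1p(x) = log(1 + x)

X = np.array([[2.0, 1.0], [-1.5, -2.0]])
y = np.array([+1, -1])

print(logistic_loss(np.array([ 1.0,  1.0]), X, y))  # small: both points confidently correct
print(logistic_loss(np.array([-1.0, -1.0]), X, y))  # large: both points misclassified
```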

Interpretation of w

Suppose we get an optimal w value, then

w = [w₁, w₂, …, w_d]

i.e. for each feature fⱼ, there is a corresponding weight wⱼ. That's why w is also known as the weight vector.

Case 1: when wⱼ is positive

If wⱼ > 0, then increasing feature fⱼ increases wᵀx_q, so σ(wᵀx_q) increases, i.e. the predicted probability of the positive class increases.

Case 2: when wⱼ is negative

If wⱼ < 0, then increasing feature fⱼ decreases wᵀx_q, so σ(wᵀx_q) decreases, i.e. the predicted probability of the positive class decreases.
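
A tiny numeric check of both cases (the weights and the query point below are hypothetical):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.5, 0.0])   # assumed learned weights for features f1, f2, f3
x = np.array([1.0, 1.0, 1.0])    # a hypothetical query point

base    = sigmoid(w @ x)
bump_f1 = sigmoid(w @ (x + np.array([1.0, 0.0, 0.0])))  # increase f1 (positive weight)
bump_f2 = sigmoid(w @ (x + np.array([0.0, 1.0, 0.0])))  # increase f2 (negative weight)

print(base, bump_f1, bump_f2)
# P(y = +1) rises when a positively-weighted feature grows,
# and falls when a negatively-weighted feature grows.
```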

Regularization

Let zᵢ = yᵢ wᵀxᵢ

w* = argmin_w Σᵢ₌₁ⁿ log(1 + exp(-zᵢ))

The optimal (minimum) value of the loss function is obtained when:

each log(1 + exp(-zᵢ)) is at its minimum, which happens when zᵢ tends to +∞.

This means all zᵢ's must be positive, or in other words, all yᵢ wᵀxᵢ must be positive (these are Case 1 and Case 2 that we discussed earlier).

Classifying all of the training data perfectly (which requires the weights to grow without bound) will often lead to overfitting.

How to avoid that?

We need to control the magnitude of w so that it doesn't grow very large.

Penalizing w value can be done in 3 ways:

  1. L2 regularization

w* = argmin_w [ Σᵢ₌₁ⁿ log(1 + exp(-zᵢ)) + λ ||w||₂² ]

There is a tradeoff between logistic loss and regularization term.

λ is the hyperparameter controlling this tradeoff: a very small λ lets the logistic loss dominate (risking overfitting / high variance), while a very large λ lets the regularization term dominate (risking underfitting / high bias).

2. L1 regularization

w* = argmin_w [ Σᵢ₌₁ⁿ log(1 + exp(-zᵢ)) + λ ||w||₁ ]

Important points :

  • For a less important feature fⱼ, L1 regularization drives wⱼ to exactly 0 (i.e. it creates a sparse weight vector), while L2 regularization only makes wⱼ small.
  • L1 regularization results in faster computation because of the sparse weight vector it generates.

3. Elastic Net

It incorporates the benefits of both L1 and L2 regularization.

w* = argmin_w [ Σᵢ₌₁ⁿ log(1 + exp(-zᵢ)) + λ₁ ||w||₁ + λ₂ ||w||₂² ]
  • It will have 2 hyper-parameters
  • Time-consuming
  • High performance
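
A hedged sketch of how the three penalties are typically exercised with scikit-learn (the solver choices below are one valid combination, not the only one; C is the inverse of the regularization strength λ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Smaller C  =>  stronger regularization (larger lambda).
l2_model = LogisticRegression(penalty="l2", C=1.0)                        # default penalty
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")    # sparse weights
en_model = LogisticRegression(penalty="elasticnet", C=1.0, l1_ratio=0.5,
                              solver="saga", max_iter=5000)               # two hyper-parameters

for model in (l2_model, l1_model, en_model):
    model.fit(X, y)
    print(model.penalty, "non-zero weights:", int((model.coef_ != 0).sum()))
```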

Column Standardization

  • Since the model relies on distances from the hyperplane (wᵀx), features on very different scales distort the learned weights, so column standardization is required.
  • Standardization also helps the optimization problem converge faster.
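
In practice this usually means putting a scaler in front of the model (a minimal scikit-learn sketch on synthetic data):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Standardize every column to zero mean / unit variance, then fit Logistic Regression.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)
print(clf.score(X, y))
```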

Feature Importance and Model Interpretability

  • If the features are not collinear/multicollinear, then the features fⱼ corresponding to larger |wⱼ| weights are more important.
  • However, if the features are collinear, one may switch to forward feature selection or backward feature elimination, which are standard ways of getting feature importance and work irrespective of the model.
  • Once feature importance is known, the model can give reasoning that it has predicted +1 or -1 based on those features.
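
When the features are standardized and not collinear, the learned weights can be read off directly (a minimal sketch; the feature names are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
feature_names = ["f1", "f2", "f3", "f4", "f5"]   # hypothetical names

model = LogisticRegression().fit(StandardScaler().fit_transform(X), y)

# Larger |w_j| (on standardized features)  =>  feature f_j matters more to the prediction.
for j in np.argsort(-np.abs(model.coef_[0])):
    print(feature_names[j], round(float(model.coef_[0][j]), 3))
```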

Time and Space Complexities

Train:

  • Training Logistic Regression is nothing but optimizing the loss function, which takes O(n·d) time using SGD.
  • We are required to store all n of the d-dimensional points during training, which takes O(n·d) space.

Test:

  • We only need to store the weight vector, which is d-dimensional, hence O(d) space is required.
  • Calculating σ(wᵀx_q) requires d multiplications, hence O(d) time.

Dimensionality and its effect

if d is small:

  • Logistic Regression works very well.
  • Can be incorporated in low-latency systems.
  • Time and Space complexity is less.

if d is large:

  • It gets affected by the Curse of Dimensionality.
  • One can use L1 regularization to remove less important features. However, the λ value should be chosen carefully to balance bias, variance and latency.

Working with Imbalanced Data

Let’s see how Logistic Regression is affected by Imbalanced Data

[Figure: imbalanced classes with two candidate hyperplanes π₁ and π₂]

For π₁:

[Image: signed-distance sum for π₁]

For π₂:

[Image: signed-distance sum for π₂]

  • According to the objective function, π₁ gives a better result than π₂; however, just by looking at the position of the hyperplanes, we know π₂ is the better option.

How to handle it?

Standard ways of handling imbalanced data are to perform Upsampling or Downsampling

  1. Upsampling:
  • Give more weightage to the minority class.
  • Create artificial points of the minority class.

SMOTE (Synthetic Minority Oversampling Technique):

  • Another way of upsampling which uses nearest neighbours to generate new points.


2. Downsampling

  • Randomly remove samples from the majority class.
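
A hedged sketch of two common remedies: re-weighting the minority class inside Logistic Regression, and simple random upsampling (SMOTE itself is provided by the separate imbalanced-learn package and is not shown here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from sklearn.datasets import make_classification

# A toy imbalanced dataset: roughly 95% negatives, 5% positives.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

# Option 1: give more weightage to the minority class via class_weight.
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: randomly upsample the minority class before fitting.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
clf_upsampled = LogisticRegression().fit(X_bal, y_bal)
```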

Dealing with Multi-Class Classification

  • Logistic Regression doesn’t inherently support Multi-Class Classification.
  • However, one can use One vs All strategy to deal with it.
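
A minimal sketch of the One vs All (a.k.a. One vs Rest) strategy with scikit-learn; note that scikit-learn's LogisticRegression can also handle multi-class data internally, so the explicit wrapper below is just for illustration:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # 3 classes

# One binary Logistic Regression per class: "class k" vs "all the other classes".
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
print(len(ovr.estimators_))         # 3 underlying binary classifiers
```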

That’s all for now!

Congrats! You have now successfully added Logistic Regression to your arsenal.







