Understanding Logistic Regression in Machine Learning: Sigmoid Function, Log-Likelihood Estimation, Class Imbalance Adjustment, and More
Kay Chansiri, Ph.D.
Research Scientist | ML & GenAI for Social Impacts | Human-Computer Interaction
In my previous post, I discussed linear regression from a machine learning (ML) perspective. Today, let’s delve into a different type of regression used to predict binary outcomes — logistic regression. In this post, I will also discuss important concepts relevant to the algorithm, such as the sigmoid function, log-likelihood estimation, class imbalance adjustment, and more. If you are ready, let's get started!
Why Logistic Regression?
If you're like me, the first time you learn about logistic regression you may wonder, "Why can't we use linear regression to predict a binary outcome (e.g., an outcome whose levels are 0 or 1, "True" or "False", etc.)?"
The answer lies in the limitations of linear regression for binary outcomes. Linear regression predictions can extend beyond the [0, 1] range, leading to unreasonable interpretations when our target outcome is strictly within this boundary. We therefore need a model built for binary decisions, and logistic regression, which relies on the sigmoid function, is that model.
Before we get to the sigmoid function, there are three key terms you should familiarize yourself with in logistic regression: probability (the chance that the event occurs, bounded between 0 and 1), odds (the probability of the event divided by the probability of the non-event, p / (1 - p)), and logit (the natural logarithm of the odds).
Mathematically, you can convert a logit back to odds by applying the exponential function, and you can convert odds to a probability by dividing the odds by 1 plus the odds:
odds = e^(logit), and probability = odds / (1 + odds)
I know these concepts might sound confusing at first, but bear with me. I promise that by the end of this post, you'll have a better understanding. Let’s get back to the question I asked previously: why do we use the sigmoid function and why logistic regression for binary outcomes?
Sigmoid Function
The sigmoid function is defined as:
sigmoid(x) = 1 / (1 + e^(-x))
Whatever value of x we plug into the function, the output is a probability, always bounded between 0 and 1.
Imagine x is your typical linear regression function, which could be represented by B0 + B1X1. When you plug that value into the formula, the probability of detecting the event of interest, usually represented by y = 1, becomes:
P(y = 1) = 1 / (1 + e^(-(B0 + B1X1)))
Let’s take a look at an example. Say you want to predict the probability of a customer subscribing to a streaming service (1) versus no-subscription (0). Your predictor is customers’ age. Keep in mind that in the real world, we will likely have more than one predictor, but I will use only one predictor here for simplicity in demonstration.
Suppose that your intercept is B0 = -3 and the coefficient of age is B1 = 0.1. Assume that customer A is 25 years old. You plug all of the values into the sigmoid function above, and the probability of the customer subscribing to the streaming service is calculated as follows.
The linear regression function X can be represented as:
X = B0 + B1 × Age
X = -3 + 0.1 × 25
X = -0.5
Now, plug this value into the sigmoid function:
P(y = 1) = 1 / (1 + e^(0.5)) ≈ 0.3775
So, the probability of the customer subscribing to the streaming service is approximately 37.75%.
Now imagine doing this with all customers in the dataset. You will get a plot like the one below: an S-shaped curve of the predicted probability of subscription against age.
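To make this concrete, here is a minimal Python sketch (assuming NumPy and Matplotlib are available) that computes the probability for the 25-year-old customer and traces the S-shaped curve across a range of ages. The intercept and coefficient are the toy values from this example, not estimates from real data.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

# Toy coefficients from the example: intercept B0 = -3, age coefficient B1 = 0.1
b0, b1 = -3.0, 0.1

# Probability of subscription for a 25-year-old customer
p_25 = sigmoid(b0 + b1 * 25)
print(round(p_25, 4))  # ~0.3775

# Predicted probabilities across a range of ages -> the S-shaped curve
ages = np.linspace(0, 80, 200)
plt.plot(ages, sigmoid(b0 + b1 * ages))
plt.xlabel("Age")
plt.ylabel("P(subscription)")
plt.title("Predicted probability of subscription by age")
plt.show()
```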
One challenge we face, if we end our logistic regression work here with the sigmoid function, is interpretation. Since the curve is nonlinear, a one-unit increase in X does not correspond to a constant change in the predicted probability. To make our interpretation easier, we convert the sigmoid function into the logit, which is mathematically equivalent and simplifies the interpretation of the results. Still confused?
To better explain: the output of the sigmoid function is a probability, ranging between 0 and 1, according to this formula:
P(y = 1) = 1 / (1 + e^(-(B0 + B1X1)))
The logit function does the opposite: it takes a probability (a number between 0 and 1) and converts it back into a linear combination of predictors. This is done by taking the natural logarithm of the ratio of the probability of success to the probability of failure (i.e., the odds):
logit(P) = ln(P / (1 - P))
Remember I said earlier that X can be represented by your typical linear regression function. Thus, converting the sigmoid function to the logit, for an easier interpretation of the output, results in the following equation:
ln(P / (1 - P)) = B0 + B1X1
By converting the probability to the logit (i.e., log odds), we transform the nonlinear relationship into a linear one, making it easier to interpret. The coefficients in the logit model tell us how a one-unit change in a predictor affects the log odds (i.e., logit) of the outcome.
Even though we have solved the linearity problem, understanding what a "one-unit increase in logit" means is still challenging. Thus, we often convert regression coefficients to something easier to interpret, like odds ratios. This can be done simply by exponentiating the coefficient.
For instance, according to the age and customer subscription example I mentioned previously (intercept = -3 and slope = 0.1), when age is equal to zero the odds of subscribing to the service are e^(-3) ≈ 0.05. For every additional year of age, the odds of subscription are multiplied by e^(0.1) ≈ 1.105, an increase of about 10.5%. Putting it together for customer A, who is 25 years old, x = -3 + 0.1 × 25 = -0.5, and the probability of subscription is 1 / (1 + e^(0.5)) ≈ 0.3775, as calculated earlier.
The odds of subscribing to the service when the customer is 25 years old are e^(-0.5) ≈ 0.607.
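As a quick sanity check of these numbers, here is a small sketch (NumPy only, same toy coefficients) that converts the intercept and slope to odds and an odds ratio, and converts customer A's logit back to odds and probability.

```python
import numpy as np

b0, b1 = -3.0, 0.1       # toy intercept and age coefficient from the example
logit_25 = b0 + b1 * 25  # logit (log odds) for a 25-year-old: -0.5

print(np.exp(b0))        # odds when age = 0: e^(-3) ~ 0.0498
print(np.exp(b1))        # odds ratio per extra year of age: e^(0.1) ~ 1.105
print(np.exp(logit_25))  # odds for customer A: e^(-0.5) ~ 0.607
print(np.exp(logit_25) / (1 + np.exp(logit_25)))  # back to probability: ~0.3775
```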
Note that in the real world, we tend to have more than one predictor. With k predictors, we can write the logit formula of logistic regression as:
ln(P / (1 - P)) = B0 + B1X1 + B2X2 + ... + BkXk
I hope you can now see how probabilities (via the sigmoid function, so the output values are bounded between 0 and 1), logits (i.e., log odds), and odds ratios are used in logistic regression.
In conclusion, we started by applying a sigmoid function to a typical linear regression so that our output values reflect reality by being bounded between 0 and 1. Because it is challenging to interpret how a one-unit increase in X changes Y in a nonlinear function (i.e., the sigmoid), we convert the function to the logit, or log odds, which is linear in the predictors. Nonetheless, a one-unit increase in the logit is still hard for us humans to grasp, so we exponentiate the coefficients to get odds ratios. The process can be mathematically reversed as well, to recover probabilities from odds.
The Concept of Likelihood
In my previous post about linear regression, I showed you a visualization of how a software program that you use to run regression comes up with a set of beta coefficients (e.g., by relying on matrix operations or gradient descent). For logistic regression, things work a bit differently. To find the best set of regression coefficients, we use the concept of likelihood. Let's try to understand the basic idea of this concept first.
Say you work for a streaming service company based in Northern Virginia, where the Asian population is on the rise, and you assume that the probability of Asian customers subscribing to a new streaming service from Korea should be quite high, around 0.8. In other words, we can say p = 0.8, meaning that there is an 80% chance that a customer will subscribe to the service. Then you look at the actual data and observe that 7 out of 10 customers subscribed to the streaming service. This observed data can be represented by the vector below:
Y = (H,H,H,T,H,H,H,T,H,T)
In the vector, H = subscription and T = no subscription. Now, plugging the assumed probability of 0.8 into this sequence of outcomes, you get the following likelihood:
L = 0.8 × 0.8 × 0.8 × 0.2 × 0.8 × 0.8 × 0.8 × 0.2 × 0.8 × 0.2 = 0.8^7 × 0.2^3 ≈ 0.001677
Thus, the likelihood of observing the data, given that the probability of subscription is 0.8, is approximately 0.001677. In other words, if we assume that the probability is 0.8, the likelihood of observing the outcome in our dataset (7 subscriptions and 3 no-subscriptions) is about 0.001677.
If you think the probability of an Asian customer subscribing to the streaming service might be a bit lower, say about 0.5 due to an economic recession, the likelihood would be:
L = 0.5^7 × 0.5^3 = 0.5^10 = 0.0009765625
According to the two likelihood estimates above, an estimate of p = 0.8 (likelihood ≈ 0.001677) fits the data better than an estimate of p = 0.5 (likelihood = 0.0009765625). In other words, our observation of 7 subscriptions and 3 no-subscriptions is more likely if we estimate p as 0.8 rather than 0.5.
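Here is a tiny sketch that reproduces these two likelihoods for the observed sequence of 7 subscriptions and 3 no-subscriptions.

```python
def likelihood(p, n_subscribe=7, n_not=3):
    """Probability of the observed outcomes if each customer subscribes with probability p."""
    return p**n_subscribe * (1 - p)**n_not

print(likelihood(0.8))  # ~0.001677
print(likelihood(0.5))  # 0.0009765625
```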
Maximum Likelihood Estimation (MLE)
Now you may have a question regarding which p you should use such that you get the highest likelihood that best reflects the actual observed data (7 subscriptions and 3 no-subscriptions). The answer is you can try different values of p from 0 to 1 and see which one yields the highest likelihood as seen in the plot below:
In the plot, the x-axis represents the probability of observing a subscription (H), and the y-axis indicates the likelihood of observing 7 subscriptions (H) and 3 no-subscriptions (T). The peak probability value here, about 0.7, yields the highest likelihood (i.e., about 0.0022) of observing 7 subscriptions and 3 no-subscriptions. Therefore, the maximum likelihood estimate of the probability of observing a subscription for this particular dataset is 0.7 given the 10 observations we have made.
Note that the concept of likelihood I described above works fine when you have only 10 observations. In the real world, however, you tend to have many more observations, from thousands to millions. Multiplying the probability term for every observation produces extremely small numbers and quickly becomes computationally unwieldy. This is where the concept of log-likelihood is helpful.
In the end, the value of p that maximizes the likelihood is the same value that maximizes the log-likelihood, because the logarithm is a monotonically increasing function, so nothing is lost by working with the log-likelihood instead. Based on the concept of likelihood, the log-likelihood formula is:
log-likelihood = Σ [ y_i × ln(p_i) + (1 - y_i) × ln(1 - p_i) ]
where y_i is the observed outcome (1 or 0) for observation i and p_i is its predicted probability. For our example of 7 subscriptions and 3 no-subscriptions with a single probability p, this reduces to 7 × ln(p) + 3 × ln(1 - p).
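The following sketch scans candidate values of p and confirms that the likelihood and the log-likelihood peak at the same value, roughly p = 0.7 for our 7 subscriptions out of 10.

```python
import numpy as np

p_grid = np.linspace(0.01, 0.99, 99)  # candidate probabilities in steps of 0.01

# Likelihood and log-likelihood of 7 subscriptions and 3 no-subscriptions
likelihood = p_grid**7 * (1 - p_grid)**3
log_likelihood = 7 * np.log(p_grid) + 3 * np.log(1 - p_grid)

print(p_grid[np.argmax(likelihood)])      # 0.7
print(p_grid[np.argmax(log_likelihood)])  # 0.7 -- same maximizer
print(likelihood.max())                   # ~0.0022
```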
Maximum Likelihood Estimation (MLE) for Logistic Regression
Now that you have learned about the concept of MLE, let's see how we can apply it to logistic regression. Let's take another look at the logistic regression equation I introduced previously:
ln(P / (1 - P)) = B0 + B1X1
Imagine that at first we have a random set of coefficients, say B0 = -0.3 and B1 = 0.1. When we plug these values and each X into the equation above, we get the logit for every observation. Note that unlike the example above, where I simply stated that the probability of an Asian customer subscribing to the streaming service is 0.8, in reality each observation has a different X (such as age) and therefore a different probability.
For instance, say each customer has a different predictor value (such as age) and therefore a different predicted probability, as shown in the data table below, which lists each customer's observed outcome alongside the predicted probability of subscription. You can calculate these probabilities using the logistic regression equation:
P(y = 1) = 1 / (1 + e^(-(B0 + B1 × Age)))
Using the above probabilities, we can calculate the log-likelihood for the entire dataset with the formula I mentioned previously:
log-likelihood = Σ [ y_i × ln(p_i) + (1 - y_i) × ln(1 - p_i) ]
Plugging in the values in the dataset above, we would get the log-likelihood of about -28.29.
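Below is a minimal sketch of that calculation. The ages and outcomes here are made-up placeholder values, not the data from the table above, so the printed log-likelihood will not match -28.29; the point is only the mechanics of summing y × ln(p) + (1 - y) × ln(1 - p) across observations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical data: ages and observed outcomes (1 = subscribed, 0 = did not)
ages = np.array([22, 35, 48, 29, 61, 44, 53, 19, 38, 57], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1], dtype=float)

# Candidate coefficients (the initial guesses from the text)
b0, b1 = -0.3, 0.1

# Predicted probability for each customer, then the log-likelihood of the data
p = sigmoid(b0 + b1 * ages)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)
```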
Just like in linear regression, one question to ask is which set of beta coefficients gives the data the highest likelihood. The answer is much the same: we try different sets of betas and see which one yields the highest log-likelihood. However, unlike linear regression, where you look for the beta coefficients that yield the lowest SSE (Sum of Squared Errors), here you look for the coefficients that yield the highest log-likelihood (see the figure below).
In practice, this is framed in terms of the logistic loss (also called log loss), which is simply the negative of the log-likelihood. Minimizing the logistic loss is therefore equivalent to maximizing the log-likelihood, and it is how software obtains the maximum likelihood estimates of the coefficients for the logistic regression model.
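There is no closed-form solution for these coefficients, so software searches for them iteratively. Here is a bare-bones sketch of one such search: plain gradient ascent on the log-likelihood (equivalently, gradient descent on the logistic loss), using the same hypothetical ages and outcomes as above. Real implementations use more sophisticated optimizers, but the idea is the same.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical data (same placeholder values as in the previous sketch)
ages = np.array([22, 35, 48, 29, 61, 44, 53, 19, 38, 57], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1], dtype=float)

# Standardize the predictor so a single learning rate behaves well
x = (ages - ages.mean()) / ages.std()

b0, b1 = 0.0, 0.0      # start from arbitrary coefficients
learning_rate = 0.1

for step in range(5000):
    p = sigmoid(b0 + b1 * x)                  # current predicted probabilities
    error = y - p                             # gradient of the log-likelihood w.r.t. the logit
    b0 += learning_rate * error.mean()        # move both coefficients uphill
    b1 += learning_rate * (error * x).mean()  # on the log-likelihood surface

p = sigmoid(b0 + b1 * x)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(b0, b1, log_likelihood)  # coefficients that (approximately) maximize the log-likelihood
```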
Regularization and Evaluation Metrics in Logistic Regression
The way that regularization works for logistic regression is quite similar to how it works in linear regression, which involves adding penalty terms to the loss function to avoid large coefficients. We have different types of regularization, including ridge, lasso, and elastic net. You may refer to my previous post to read more about regularization techniques in regression functions.
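If you happen to use scikit-learn, regularization is controlled through the penalty and C arguments of LogisticRegression (C is the inverse of the regularization strength, so a smaller C means a stronger penalty). A quick sketch; the names X_train and y_train are placeholders for your own training data.

```python
from sklearn.linear_model import LogisticRegression

# Ridge-style (L2) penalty is the default
ridge_model = LogisticRegression(penalty="l2", C=1.0)

# Lasso-style (L1) penalty needs a solver that supports it
lasso_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# Elastic net mixes L1 and L2; it requires the saga solver and an l1_ratio
enet_model = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0)

# ridge_model.fit(X_train, y_train)  # fit whichever variant you choose
```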
Although logistic and linear regressions share a similar regularization process, the evaluation metrics are different and emphasize accuracy, precision, recall, F1 scores, and the area under the curve (AUC). Read more in the post I wrote previously regarding evaluation metrics for binary outcomes.
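For completeness, here is a small, self-contained sketch of those metrics using scikit-learn; the synthetic dataset from make_classification is just a stand-in for the streaming-subscription data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Toy data standing in for the subscription dataset
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)              # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))  # AUC uses probabilities, not labels
```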
Class Imbalance
In the real world, we do not always have projects where the outcome is balanced between the two categories. For example, consider fraud detection in banking; the percentage of fraudulent transactions is likely much lower than that of non-fraudulent ones. Imbalance in the target outcome classes can influence the model's performance. I wrote a post about which metrics are better when we have class imbalance here. In addition to selecting the right metric to evaluate model performance, there are some strategies that can help boost the performance of logistic regression when dealing with class imbalance (a code sketch illustrating the first three appears after the list below).
1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Undersampling the Majority Class
3. Weight Adjustment
4. Other Techniques
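Below is a rough sketch of the first three options. It assumes the imbalanced-learn package (imblearn) is installed and uses a synthetic imbalanced dataset as a placeholder for your own features X and labels y; which option works best depends on your data.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: roughly 5% positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))

# 1. SMOTE: synthesize new minority-class examples
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_smote))

# 2. Undersampling: randomly drop majority-class examples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))

# 3. Weight adjustment: penalize mistakes on the minority class more heavily
weighted_model = LogisticRegression(class_weight="balanced").fit(X, y)
```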
Another question you may have when dealing with class imbalance is whether to apply regularization first or the class-adjustment techniques mentioned above first. The answer is to apply the class-adjustment techniques first. This addresses the imbalance in the data, allowing the regularization process to work more effectively on a balanced dataset.
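One convenient way to keep that order straight is imblearn's Pipeline, which applies the resampling step before fitting the (regularized) model and, during cross-validation, resamples only the training folds. A sketch, again assuming imbalanced-learn is available and using synthetic data as a placeholder:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy imbalanced data (placeholder for your real features and labels)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Resample first, then fit an L2-regularized logistic regression
pipeline = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression(penalty="l2", C=1.0)),
])

# SMOTE is applied only to the training folds within each cross-validation split
scores = cross_val_score(pipeline, X, y, scoring="roc_auc", cv=5)
print(scores.mean())
```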
Example
Now that you have a strong foundation in logistic regression, let's apply this knowledge to a real-world scenario. On my GitHub page: https://github.com/KayChansiri/Demo_logistic_regression_ML, I provide a step-by-step guide on how to perform logistic regression, especially when dealing with class imbalance. The dataset used in this guide discusses subscriptions to a streaming service among Asian Americans. This fictional streaming company is focused on detecting all potential customers who may subscribe to their service and wants to reach out to as many of them as possible to boost sales rates. Sounds interesting? If you want to practice and learn more about coding, feel free to visit the page.
I hope this post has been helpful in understanding the foundational concepts of logistic regression from a machine learning perspective. Stay tuned for the next post, where I will discuss Gradient Boosting and AdaBoost, two of the most effective algorithms for boosting model performance.