Understanding Logistic Regression in Machine Learning: Sigmoid Function, Log-Likelihood Estimation, Class Imbalance Adjustment, and More
Kay Chansiri, Ph.D.
Research Scientist | ML & GenAI for Social Impacts | Human-Computer Interaction
In my previous post, I discussed linear regression from a machine learning (ML) perspective. Today, let’s delve into a different type of regression used to predict binary outcomes — logistic regression. In this post, I will also discuss important concepts relevant to the algorithm, such as the sigmoid function, log-likelihood estimation, class imbalance adjustment, and more. If you are ready, let's get started!
Why Logistic Regression?
If you're like me, the first time you learn about logistic regression you may wonder, "Why can't we use linear regression to predict a binary outcome (e.g., an outcome whose levels are 0 or 1, "True" or "False", etc.)?"
The answer lies in the limitations of linear regression for binary outcomes. Linear regression predictions can extend beyond the [0, 1] range, leading to unreasonable interpretations when our target outcome is strictly within this boundary. We therefore need a model built for binary decisions, and logistic regression, which relies on the sigmoid function, is that model.
Before we get to the sigmoid function, there are three key terms you should familiarize yourself with in logistic regression: probability (the chance that the event occurs, bounded between 0 and 1), odds (the probability of the event divided by the probability of the non-event, p / (1 - p)), and logit (the natural logarithm of the odds).
Mathematically, you can convert a logit back to odds by applying the exponential function, and you can convert odds to a probability by dividing the odds by 1 plus the odds:
odds = e^(logit), and probability = odds / (1 + odds)
I know these concepts might sound confusing at first, but bear with me. I promise that by the end of this post, you'll have a better understanding. Let’s get back to the question I asked previously: why do we use the sigmoid function and why logistic regression for binary outcomes?
Sigmoid Function
The sigmoid function is defined as:
sigmoid(x) = 1 / (1 + e^(-x))
Whatever value of x we plug into the function, the output is a probability, always bounded between 0 and 1.
Imagine x is your typical linear regression function, which could be represented by B0 + B1X1. When you plug that value into the formula, the probability of detecting the event of interest, usually represented by y = 1, becomes:
P(y = 1) = 1 / (1 + e^(-(B0 + B1X1)))
Let’s take a look at an example. Say you want to predict the probability of a customer subscribing to a streaming service (1) versus no-subscription (0). Your predictor is customers’ age. Keep in mind that in the real world, we will likely have more than one predictor, but I will use only one predictor here for simplicity in demonstration.
Suppose that your intercept is B0 = -3 and the coefficient of age is B1 = 0.1. Assume that customer A is 25 years old. You plug all of the values into the sigmoid function above, and the probability of the customer subscribing to the streaming service is calculated as follows.
The linear regression function X can be represented as:
X = B0 + B1 × Age
X = -3 + 0.1 × 25
X = -0.5
Now, plug this value into the sigmoid function:
P(y = 1) = 1 / (1 + e^(0.5)) ≈ 0.3775
So, the probability of the customer subscribing to the streaming service is approximately 37.75%.
Now imagine doing this with all customers in the dataset. You will get a plot like the one below: an S-shaped curve of the predicted probability of subscription against age.
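To make this concrete, here is a minimal Python sketch (assuming NumPy and Matplotlib are available) that computes the probability for the 25-year-old customer and traces the S-shaped curve across a range of ages. The intercept and coefficient are the toy values from this example, not estimates from real data.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

# Toy coefficients from the example: intercept B0 = -3, age coefficient B1 = 0.1
b0, b1 = -3.0, 0.1

# Probability of subscription for a 25-year-old customer
p_25 = sigmoid(b0 + b1 * 25)
print(round(p_25, 4))  # ~0.3775

# Predicted probabilities across a range of ages -> the S-shaped curve
ages = np.linspace(0, 80, 200)
plt.plot(ages, sigmoid(b0 + b1 * ages))
plt.xlabel("Age")
plt.ylabel("P(subscription)")
plt.title("Predicted probability of subscription by age")
plt.show()
```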
One challenge we face, if we end our logistic regression work here with the sigmoid function, is interpretation. Since the curve is nonlinear, a one-unit increase in X does not correspond to a constant change in the predicted probability. To make our interpretation easier, we convert the sigmoid function into the logit, which is mathematically equivalent and simplifies the interpretation of the results. Still confused?
To better explain: the output of the sigmoid function is a probability, ranging between 0 and 1, according to this formula:
P(y = 1) = 1 / (1 + e^(-(B0 + B1X1)))
The logit function does the opposite: it takes a probability (a number between 0 and 1) and converts it back into a linear combination of predictors. This is done by taking the natural logarithm of the ratio of the probability of success to the probability of failure (i.e., the odds):
logit(P) = ln(P / (1 - P))
Remember I said earlier that X can be represented by your typical linear regression function. Thus, converting the sigmoid function to the logit, for an easier interpretation of the output, results in the following equation:
ln(P / (1 - P)) = B0 + B1X1
By converting the probability to the logit (i.e., log odds), we transform the nonlinear relationship into a linear one, making it easier to interpret. The coefficients in the logit model tell us how a one-unit change in a predictor affects the log odds (i.e., logit) of the outcome.
Even though we have solved the linearity problem, understanding what a "one-unit increase in logit" means is still challenging. Thus, we often convert regression coefficients to something easier to interpret, like odds ratios. This can be done simply by exponentiating the coefficient.
For instance, according to the age and customer subscription example I mentioned previously (intercept = -3 and slope = 0.1), when age is equal to zero the odds of subscribing to the service are e^(-3) ≈ 0.05. For every additional year of age, the odds of subscription are multiplied by e^(0.1) ≈ 1.105, an increase of about 10.5%. Putting it together for customer A, who is 25 years old, x = -3 + 0.1 × 25 = -0.5, and the probability of subscription is 1 / (1 + e^(0.5)) ≈ 0.3775, as calculated earlier.
The odds of subscribing to the service when the customer is 25 years old are e^(-0.5) ≈ 0.607.
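As a quick sanity check of these numbers, here is a small sketch (NumPy only, same toy coefficients) that converts the intercept and slope to odds and an odds ratio, and converts customer A's logit back to odds and probability.

```python
import numpy as np

b0, b1 = -3.0, 0.1       # toy intercept and age coefficient from the example
logit_25 = b0 + b1 * 25  # logit (log odds) for a 25-year-old: -0.5

print(np.exp(b0))        # odds when age = 0: e^(-3) ~ 0.0498
print(np.exp(b1))        # odds ratio per extra year of age: e^(0.1) ~ 1.105
print(np.exp(logit_25))  # odds for customer A: e^(-0.5) ~ 0.607
print(np.exp(logit_25) / (1 + np.exp(logit_25)))  # back to probability: ~0.3775
```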
Note that in the real world, we tend to have more than one predictor. With k predictors, we can write the logit formula of logistic regression as:
ln(P / (1 - P)) = B0 + B1X1 + B2X2 + ... + BkXk
I hope you can now see how probabilities (via the sigmoid function, so the output values are bounded between 0 and 1), logits (i.e., log odds), and odds ratios are used in logistic regression.
In conclusion, we started by applying a sigmoid function to a typical linear regression so that our output values reflect reality by being bounded between 0 and 1. Because it is challenging to interpret how a one-unit increase in X changes Y in a nonlinear function (i.e., the sigmoid), we convert the function to the logit, or log odds, which is linear in the predictors. Nonetheless, a one-unit increase in the logit is still hard for us humans to grasp, so we exponentiate the coefficients to get odds ratios. The process can be mathematically reversed as well, to recover probabilities from odds.
The Concept of Likelihood
In my previous post about linear regression, I showed you a visualization of how a software program that you use to run regression comes up with a set of beta coefficients (e.g., by relying on matrix operations or gradient descent). For logistic regression, things work a bit differently. To find the best set of regression coefficients, we use the concept of likelihood. Let's try to understand the basic idea of this concept first.
Say you work for a streaming service company based in Northern Virginia, where the Asian population is on the rise, and you assume that the probability of Asian customers subscribing to a new streaming service from Korea should be quite high, around 0.8. In other words, we can say p = 0.8, meaning that there is an 80% chance that a customer will subscribe to the service. Then you look at the actual data and observe that 7 out of 10 customers subscribed to the streaming service. This observed data can be represented by the vector below:
Y = (H,H,H,T,H,H,H,T,H,T)
In the vector, H = subscription and T = no subscription. Now, plugging the assumed probability of 0.8 into this sequence of outcomes, you get the following likelihood:
L = 0.8 × 0.8 × 0.8 × 0.2 × 0.8 × 0.8 × 0.8 × 0.2 × 0.8 × 0.2 = 0.8^7 × 0.2^3 ≈ 0.001677
Thus, the likelihood of observing the data, given that the probability of subscription is 0.8, is approximately 0.001677. In other words, if we assume that the probability is 0.8, the likelihood of observing the outcome in our dataset (7 subscriptions and 3 no-subscriptions) is about 0.001677.
If you think the probability of an Asian customer subscribing to the streaming service might be a bit lower, say about 0.5 due to an economic recession, the likelihood would be:
L = 0.5^7 × 0.5^3 = 0.5^10 = 0.0009765625
According to the two likelihood estimates above, an estimate of p = 0.8 (likelihood ≈ 0.001677) fits the data better than an estimate of p = 0.5 (likelihood = 0.0009765625). In other words, our observation of 7 subscriptions and 3 no-subscriptions is more likely if we estimate p as 0.8 rather than 0.5.
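Here is a tiny sketch that reproduces these two likelihoods for the observed sequence of 7 subscriptions and 3 no-subscriptions.

```python
def likelihood(p, n_subscribe=7, n_not=3):
    """Probability of the observed outcomes if each customer subscribes with probability p."""
    return p**n_subscribe * (1 - p)**n_not

print(likelihood(0.8))  # ~0.001677
print(likelihood(0.5))  # 0.0009765625
```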
Maximum Likelihood Estimation (MLE)
Now you may have a question regarding which p you should use such that you get the highest likelihood that best reflects the actual observed data (7 subscriptions and 3 no-subscriptions). The answer is you can try different values of p from 0 to 1 and see which one yields the highest likelihood as seen in the plot below:
In the plot, the x-axis represents the probability of observing a subscription (H), and the y-axis indicates the likelihood of observing 7 subscriptions (H) and 3 no-subscriptions (T). The peak probability value here, about 0.7, yields the highest likelihood (i.e., about 0.0022) of observing 7 subscriptions and 3 no-subscriptions. Therefore, the maximum likelihood estimate of the probability of observing a subscription for this particular dataset is 0.7 given the 10 observations we have made.
Note that the concept of likelihood I described above works fine when you have only 10 observations. In the real world, however, you tend to have many more observations, from thousands to millions. Multiplying the probability term for every observation produces extremely small numbers and quickly becomes computationally unwieldy. This is where the concept of log-likelihood is helpful.
In the end, the value of p that maximizes the likelihood is the same value that maximizes the log-likelihood, because the logarithm is a monotonically increasing function, so nothing is lost by working with the log-likelihood instead. Based on the concept of likelihood, the log-likelihood formula is:
log-likelihood = Σ [ y_i × ln(p_i) + (1 - y_i) × ln(1 - p_i) ]
where y_i is the observed outcome (1 or 0) for observation i and p_i is its predicted probability. For our example of 7 subscriptions and 3 no-subscriptions with a single probability p, this reduces to 7 × ln(p) + 3 × ln(1 - p).
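The following sketch scans candidate values of p and confirms that the likelihood and the log-likelihood peak at the same value, roughly p = 0.7 for our 7 subscriptions out of 10.

```python
import numpy as np

p_grid = np.linspace(0.01, 0.99, 99)  # candidate probabilities in steps of 0.01

# Likelihood and log-likelihood of 7 subscriptions and 3 no-subscriptions
likelihood = p_grid**7 * (1 - p_grid)**3
log_likelihood = 7 * np.log(p_grid) + 3 * np.log(1 - p_grid)

print(p_grid[np.argmax(likelihood)])      # 0.7
print(p_grid[np.argmax(log_likelihood)])  # 0.7 -- same maximizer
print(likelihood.max())                   # ~0.0022
```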
Maximum Likelihood Estimation (MLE) for Logistic Regression
Now that you have learned about the concept of MLE, let's see how we can apply it to logistic regression. Let's take another look at the logistic regression equation I introduced previously:
ln(P / (1 - P)) = B0 + B1X1
Imagine that at first we have a random set of coefficients, say B0 = -0.3 and B1 = 0.1. When we plug these values and each X into the equation above, we get the logit for every observation. Note that unlike the example above, where I simply stated that the probability of an Asian customer subscribing to the streaming service is 0.8, in reality each observation has a different X (such as age) and therefore a different probability.
For instance, say each customer has a different predictor value (such as age) and therefore a different predicted probability, as shown in the data table below, which lists each customer's observed outcome alongside the predicted probability of subscription. You can calculate these probabilities using the logistic regression equation:
P(y = 1) = 1 / (1 + e^(-(B0 + B1 × Age)))
Using the above probabilities, we can calculate the log-likelihood for the entire dataset with the formula I mentioned previously:
log-likelihood = Σ [ y_i × ln(p_i) + (1 - y_i) × ln(1 - p_i) ]
Plugging in the values in the dataset above, we would get the log-likelihood of about -28.29.
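Below is a minimal sketch of that calculation. The ages and outcomes here are made-up placeholder values, not the data from the table above, so the printed log-likelihood will not match -28.29; the point is only the mechanics of summing y × ln(p) + (1 - y) × ln(1 - p) across observations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical data: ages and observed outcomes (1 = subscribed, 0 = did not)
ages = np.array([22, 35, 48, 29, 61, 44, 53, 19, 38, 57], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1], dtype=float)

# Candidate coefficients (the initial guesses from the text)
b0, b1 = -0.3, 0.1

# Predicted probability for each customer, then the log-likelihood of the data
p = sigmoid(b0 + b1 * ages)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)
```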
Just like in linear regression, one question to ask is which set of beta coefficients gives the data the highest likelihood. The answer is much the same: we try different sets of betas and see which one yields the highest log-likelihood. However, unlike linear regression, where you look for the beta coefficients that yield the lowest SSE (Sum of Squared Errors), here you look for the coefficients that yield the highest log-likelihood (see the figure below).
In practice, this is framed in terms of the logistic loss (also called log loss), which is simply the negative of the log-likelihood. Minimizing the logistic loss is therefore equivalent to maximizing the log-likelihood, and it is how software obtains the maximum likelihood estimates of the coefficients for the logistic regression model.
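There is no closed-form solution for these coefficients, so software searches for them iteratively. Here is a bare-bones sketch of one such search: plain gradient ascent on the log-likelihood (equivalently, gradient descent on the logistic loss), using the same hypothetical ages and outcomes as above. Real implementations use more sophisticated optimizers, but the idea is the same.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical data (same placeholder values as in the previous sketch)
ages = np.array([22, 35, 48, 29, 61, 44, 53, 19, 38, 57], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1], dtype=float)

# Standardize the predictor so a single learning rate behaves well
x = (ages - ages.mean()) / ages.std()

b0, b1 = 0.0, 0.0      # start from arbitrary coefficients
learning_rate = 0.1

for step in range(5000):
    p = sigmoid(b0 + b1 * x)                  # current predicted probabilities
    error = y - p                             # gradient of the log-likelihood w.r.t. the logit
    b0 += learning_rate * error.mean()        # move both coefficients uphill
    b1 += learning_rate * (error * x).mean()  # on the log-likelihood surface

p = sigmoid(b0 + b1 * x)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(b0, b1, log_likelihood)  # coefficients that (approximately) maximize the log-likelihood
```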
Regularization and Evaluation Metrics in Logistic Regression
The way that regularization works for logistic regression is quite similar to how it works in linear regression, which involves adding penalty terms to the loss function to avoid large coefficients. We have different types of regularization, including ridge, lasso, and elastic net. You may refer to my previous post to read more about regularization techniques in regression functions.
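If you happen to use scikit-learn, regularization is controlled through the penalty and C arguments of LogisticRegression (C is the inverse of the regularization strength, so a smaller C means a stronger penalty). A quick sketch; the names X_train and y_train are placeholders for your own training data.

```python
from sklearn.linear_model import LogisticRegression

# Ridge-style (L2) penalty is the default
ridge_model = LogisticRegression(penalty="l2", C=1.0)

# Lasso-style (L1) penalty needs a solver that supports it
lasso_model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)

# Elastic net mixes L1 and L2; it requires the saga solver and an l1_ratio
enet_model = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0)

# ridge_model.fit(X_train, y_train)  # fit whichever variant you choose
```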
Although logistic and linear regressions share a similar regularization process, the evaluation metrics are different and emphasize accuracy, precision, recall, F1 scores, and the area under the curve (AUC). Read more in the post I wrote previously regarding evaluation metrics for binary outcomes.
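For completeness, here is a small, self-contained sketch of those metrics using scikit-learn; the synthetic dataset from make_classification is just a stand-in for the streaming-subscription data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Toy data standing in for the subscription dataset
X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)              # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))  # AUC uses probabilities, not labels
```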
Class Imbalance
In the real world, we do not always have projects where the outcome is balanced between the two categories. For example, consider fraud detection in banking; the percentage of fraudulent transactions is likely much lower than that of non-fraudulent ones. Imbalance in the target outcome classes can influence the model's performance. I wrote a post about which metrics are better when we have class imbalance here. In addition to selecting the right metric to evaluate model performance, there are some strategies that can help boost the performance of logistic regression when dealing with class imbalance (a code sketch illustrating the first three appears after the list below).
1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Undersampling the Majority Class
3. Weight Adjustment
4. Other Techniques
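Below is a rough sketch of the first three options. It assumes the imbalanced-learn package (imblearn) is installed and uses a synthetic imbalanced dataset as a placeholder for your own features X and labels y; which option works best depends on your data.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: roughly 5% positive class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))

# 1. SMOTE: synthesize new minority-class examples
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_smote))

# 2. Undersampling: randomly drop majority-class examples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))

# 3. Weight adjustment: penalize mistakes on the minority class more heavily
weighted_model = LogisticRegression(class_weight="balanced").fit(X, y)
```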
Another question you may have when dealing with class imbalance is whether to apply regularization first or the class-adjustment techniques mentioned above first. The answer is to apply the class-adjustment techniques first. This addresses the imbalance in the data, allowing the regularization process to work more effectively on a balanced dataset.
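One convenient way to keep that order straight is imblearn's Pipeline, which applies the resampling step before fitting the (regularized) model and, during cross-validation, resamples only the training folds. A sketch, again assuming imbalanced-learn is available and using synthetic data as a placeholder:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy imbalanced data (placeholder for your real features and labels)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Resample first, then fit an L2-regularized logistic regression
pipeline = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression(penalty="l2", C=1.0)),
])

# SMOTE is applied only to the training folds within each cross-validation split
scores = cross_val_score(pipeline, X, y, scoring="roc_auc", cv=5)
print(scores.mean())
```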
Example
Now that you have a strong foundation in logistic regression, let's apply this knowledge to a real-world scenario. On my GitHub page: https://github.com/KayChansiri/Demo_logistic_regression_ML, I provide a step-by-step guide on how to perform logistic regression, especially when dealing with class imbalance. The dataset used in this guide discusses subscriptions to a streaming service among Asian Americans. This fictional streaming company is focused on detecting all potential customers who may subscribe to their service and wants to reach out to as many of them as possible to boost sales rates. Sounds interesting? If you want to practice and learn more about coding, feel free to visit the page.
I hope this post has been helpful in understanding the foundational concepts of logistic regression from a machine learning perspective. Stay tuned for the next post, where I will discuss Gradient Boosting and AdaBoost, two of the most effective algorithms for boosting model performance.