Logistic regression: A deep learning approach.


Logistic regression is one of the most fundamental machine learning algorithms, and it is important because if you want to move into neural networks, the basic building block is the perceptron, which does not differ much from logistic regression.

Prerequisite for Logistic Regression:

If you want to apply logistic regression to a machine learning problem, you need to understand its main requirement.

1. The data must be linearly separable.

Linear separability means that you can separate the data into two classes using a straight line. For example, in the image below, the red and blue data points can clearly be separated by a line.

If your data is completely non-linear, for example data that looks like the next image, then you can never separate it with a straight line or hyperplane.

If your data is non-linear, logistic regression will not give you good results, so this is the requirement you must always check. A basic intuition for identifying the line that separates the data is given below.
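As a rough, illustrative check (my own sketch, not part of the original article): if a simple linear classifier reaches close to 100% training accuracy on your data, the data is at least approximately linearly separable. The blob data below is hypothetical.

import numpy as np
from sklearn.linear_model import Perceptron

# Hypothetical 2-D data: two well-separated clusters (class 0 = blue, class 1 = red)
rng = np.random.default_rng(42)
blue = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
red = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
X = np.vstack([blue, red])
y = np.array([0] * 50 + [1] * 50)

clf = Perceptron(max_iter=1000).fit(X, y)
print("Training accuracy:", clf.score(X, y))  # close to 1.0 => linearly separable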



Now, let's discuss how logistic regression works in more detail.

Perceptron with step function:

We will discuss two different approaches to understanding logistic regression. The first, very simple approach is known as the perceptron trick. Before we get to the perceptron trick, we need to understand that our goal is to draw a line through the dataset that linearly separates our data points:

The equation of the line is Ax + By + C = 0,

where x and y are the coordinates of a point, A and B are the line's coefficients, and C is the constant term (which sets the intercept).

Here our objective is to identify the values of A, B, and C in the above equation. A point is then classified by which side of this line it falls on, as the small sketch below shows.
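As a quick illustration (the coefficient values here are made up for the example), a point (x, y) is assigned to one class when Ax + By + C is positive and to the other class when it is negative:

def classify(x, y, A=1.0, B=-1.0, C=0.5):
    # Which side of the line Ax + By + C = 0 does the point fall on?
    return 1 if A * x + B * y + C > 0 else 0

print(classify(2.0, 1.0))  # 1 -> one class, 0 -> the other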

Now, let's see how the perceptron trick works. It's a simple trick where you start with random values for A, B, and C, which means you start with a random line. In the beginning, this line could be any line. The idea is that this line claims everything on one side belongs to one class (say, blue) and everything on the other side belongs to the other class. This is your initial classification line.

Now, what you need to do is make this line converge to the correct classification line. How will you do that? The perceptron trick says you need to run a loop. Let's say you run the loop for a thousand iterations. Each time through the loop, you randomly select a point and check whether the line classifies that point correctly or not. You keep adjusting the line based on the feedback from the points. This cycle continues, and you select points randomly each time.

If a point says it is correctly classified, you make no changes. If it says it is incorrectly classified, you move the line towards that point. This is the idea behind the perceptron trick. You run a loop, select points randomly, and ask them if they are correctly classified. If they are, you do nothing. If not, you move the line towards them. This process continues until convergence.

Convergence can be defined in two ways. Either you run the loop for a fixed number of times, such as a thousand iterations (Epochs), or you run it until no points are misclassified. Each loop cycle checks how many points are misclassified, and you continue until this number reaches zero. Once the number of misclassified points is zero, the loop stops, and your line is correctly placed. This is the general idea behind the perceptron trick.
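Below is a minimal sketch of the perceptron trick described above (the learning rate, epoch count, and starting weights are arbitrary choices for illustration, not the article's exact code):

import numpy as np

def perceptron_trick(X, y, epochs=1000, lr=0.1):
    # X: (n, 2) array of points, y: labels in {0, 1}
    X = np.insert(X, 0, 1, axis=1)               # prepend a column of 1s so the intercept C is learned too
    w = np.ones(X.shape[1])                      # start with an arbitrary line (C, A, B)
    for _ in range(epochs):
        i = np.random.randint(0, X.shape[0])     # pick a point at random
        y_hat = 1 if np.dot(w, X[i]) > 0 else 0  # step function: which side of the line is the point on?
        w = w + lr * (y[i] - y_hat) * X[i]       # correctly classified -> no change; misclassified -> move the line
    return w                                     # coefficients of the separating line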


Comparison of the sklearn implementation with the above implementation.

From the above it is clear that the sklearn implementation is better than the perceptron trick, as it produces a better classification boundary. So, we need to understand what the problem with the perceptron approach is.

Basically, the problem with the perceptron approach is that it adjusts the line based only on misclassified points. If every point contributed to the adjustment, the classification boundary would be more symmetrical, as it is in the sklearn implementation. So how do we correct it?

The problem lies in the step function we used in the perceptron implementation: it outputs a hard 0 or 1, which means the classification is deterministic rather than probabilistic.

Sigmoid function:

Let's talk about the sigmoid function. It takes any real number as input and produces an output between 0 and 1. When you plot the graph of this function, it looks like an S-shaped curve:

It's a special graph: for an input z, the y-value is given by sigmoid(z) = 1 / (1 + e^(-z)).

If the value of z is very large, the sigmoid approaches 1 but never fully reaches it; as z goes to plus infinity, the output approaches 1.

Conversely, if z is very small, as z goes to minus infinity, the output approaches 0.

No matter how large or small the input, the output of this function will always lie between 0 and 1. This is a key feature of this algorithm.

Now, instead of using the step function, we will pass z through the sigmoid function. If z is negative, the sigmoid of z will be less than 0.5, indicating a lower probability.

We first calculate z, then apply the sigmoid function. If z is positive, the sigmoid will be greater than 0.5, indicating a higher probability of belonging to the positive class.
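Here is a sketch of the same loop with the step function swapped for the sigmoid (again illustrative, not the exact code behind the plots below). Every point now nudges the line in proportion to its probability error, not just the misclassified ones.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_sigmoid(X, y, epochs=1000, lr=0.1):
    X = np.insert(X, 0, 1, axis=1)            # add intercept column
    w = np.ones(X.shape[1])
    for _ in range(epochs):
        i = np.random.randint(0, X.shape[0])
        y_hat = sigmoid(np.dot(w, X[i]))      # a probability strictly between 0 and 1
        w = w + lr * (y[i] - y_hat) * X[i]    # even correctly classified points pull the line a little
    return w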

As you compare the decision boundaries in the plot:

Red line: Perceptron with step function

Black line: sklearn logistic regression

Brown line: Perceptron with sigmoid

So, instead of using a step function, we now have a probabilistic interpretation with the sigmoid function. This transforms our entire system into a probabilistic one. However, this is still not as good as the sklearn implementation. So what is that sklearn algorithm doing that is special? Let's explore that in the next section.

Maximum likelihood and Loss function


Initially, we focused only on incorrectly classified points. Then we changed our logic to consider every point, whether correctly or incorrectly classified. But even this solution did not converge to the best line. Machine learning does not solve problems in this manner; it is not a principled solution. We are just running loops, selecting random points, and checking whether they are correctly classified.

So, how can we say that the line we get is the best one? There is nothing in our approach to guarantee it. We need a proper machine learning approach. Machine learning uses a loss function, which quantifies the error made by our model. Once we define this loss function, we aim to minimize it: the solution where the loss function's value is minimum is the optimal one.

One method to derive the loss function is Maximum Likelihood.


In the above image I have plotted two models, each represented by a line. Let's think about which model is better. Visually, Model 2 looks better than Model 1 because it classifies the green and red points more cleanly.

But if I ask exactly how much better one model is than the other, it becomes difficult to decide just by looking. Here, the loss function helps determine how good or bad a model is; it lets us select the best model mathematically. To build it, we look at the probability the model assigns to each point, calculated with the sigmoid function. For Model 1, let's say the probabilities for the green points are 0.7, 0.6, 0.4, and 0.2, and for the red points they are 0.3, 0.4, 0.8, and 0.6.

For the second model, let's say the probabilities for the green points are 0.7, 0.6, 0.4, and 0.3, and for the red points they are 0.3, 0.4, 0.7, and 0.7.

Maximum likelihood involves multiplying, over all points, the probability the model assigns to each point's true class. The model with the higher product of probabilities is the better model.

The maximum likelihood values for both models are given below, and we can clearly see that Model 2 is the better model:

Model 1 = 0.089

Model 2 = 0.176
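In code, the idea looks like this (a sketch; the probability lists below are arbitrary placeholders rather than the exact values from the figure). Each point contributes the probability the model assigns to its true class, and the likelihood is their product.

import numpy as np

def likelihood(p_true_class):
    # p_true_class: probability the model assigned to each point's actual class
    return np.prod(p_true_class)

# Hypothetical per-point probabilities for two candidate models
model_a = [0.9, 0.8, 0.7, 0.85]
model_b = [0.6, 0.5, 0.7, 0.55]
print(likelihood(model_a), likelihood(model_b))  # the model with the larger product is better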

In a real-world dataset with, say, 10,000 rows, you would have to multiply 10,000 such probabilities; the result becomes extremely small, and the comparison no longer holds up numerically. So, you use logs.

But there's a problem: the log of any number between 0 and 1 is always negative. If you calculate the log of 0.5, it comes out to about minus 0.3. To deal with this, we take the negative of the log, meaning we calculate -log instead.

Now, what you get here is called cross-entropy. Whenever you take the negative log of the likelihood, you get cross-entropy: the product of probabilities turns into a sum of negative log probabilities.

So we calculate cross-entropy instead of the maximum likelihood. With maximum likelihood, you select the model whose product of probabilities is the highest, that is, you maximize the likelihood. Cross-entropy works the other way around: if you use cross-entropy as your loss function, you select the model whose cross-entropy is the minimum. In other words, instead of maximizing the likelihood, we now minimize the cross-entropy and look for the model whose cross-entropy is the lowest.
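Written out, the binary cross-entropy loss is -Σ [ y·log(p) + (1 - y)·log(1 - p) ], where y is the true label (0 or 1) and p is the predicted probability of class 1. A small sketch of that calculation (the labels and probabilities here are just placeholders):

import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Sum of negative log probabilities of the true classes; lower is better
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.8]))  # a good model gives a small loss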

Minimizing the Cost with Gradient Descent

Now that we have defined a cost function, the aim is to find the optimal coefficients that minimise this cost function for our dataset. This is where gradient descent comes in. (Explaining gradient descent needs another article; for now, think of it as following the slope of the cost function downhill in a high-dimensional space.) By doing this, the model learns the parameters that reduce its penalty, making much more accurate predictions.
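For completeness, here is a compact, illustrative sketch of what that training loop could look like: plain batch gradient descent on the cross-entropy loss (the learning rate and epoch count are arbitrary, and this is not meant to mirror sklearn's solver).

import numpy as np

def train_logistic_regression(X, y, epochs=1000, lr=0.1):
    X = np.insert(X, 0, 1, axis=1)               # add intercept column
    w = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid of the linear score for every point
        gradient = X.T @ (y_hat - y) / n         # gradient of the cross-entropy loss w.r.t. w
        w -= lr * gradient                       # step downhill
    return w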

