Logistic regression: A deep learning approach.


Logistic regression is one of the most fundamental machine learning algorithms, and it is important because if you want to move into neural networks, the basic building block is the perceptron, which does not differ much from logistic regression.

Prerequisite for Logistic Regression:

If you want to apply logistic regression to a machine learning problem, you need to understand its main requirement.

1. The data must be linearly separable.

Linear separability means that you can separate the data into two classes using a straight line. For example, in the image below, the red and blue data points can clearly be separated by a line.

If your data is completely non-linear, for example data that looks like the next image, then you can never separate it with a straight line or hyperplane.

If your data is non-linear, logistic regression will not give you good results, so this is the requirement you must always check. A basic intuition for identifying the line that separates the data is given below.
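As a rough, illustrative check (my own sketch, not part of the original article): if a simple linear classifier reaches close to 100% training accuracy on your data, the data is at least approximately linearly separable. The blob data below is hypothetical.

import numpy as np
from sklearn.linear_model import Perceptron

# Hypothetical 2-D data: two well-separated clusters (class 0 = blue, class 1 = red)
rng = np.random.default_rng(42)
blue = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
red = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))
X = np.vstack([blue, red])
y = np.array([0] * 50 + [1] * 50)

clf = Perceptron(max_iter=1000).fit(X, y)
print("Training accuracy:", clf.score(X, y))  # close to 1.0 => linearly separable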



Now, let's discuss how logistic regression works in more detail.

Perceptron with step function:

We will discuss two different approaches to understanding logistic regression. The first, very simple approach is known as the perceptron trick. Before we get to the perceptron trick, we need to understand that our goal is to draw a line through the dataset that linearly separates our data points:

The equation of the line is Ax + By + C = 0,

where x and y are the coordinates of a point, A and B are the line's coefficients, and C is the constant term (which sets the intercept).

Here our objective is to identify the values of A, B, and C in the above equation. A point is then classified by which side of this line it falls on, as the small sketch below shows.
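As a quick illustration (the coefficient values here are made up for the example), a point (x, y) is assigned to one class when Ax + By + C is positive and to the other class when it is negative:

def classify(x, y, A=1.0, B=-1.0, C=0.5):
    # Which side of the line Ax + By + C = 0 does the point fall on?
    return 1 if A * x + B * y + C > 0 else 0

print(classify(2.0, 1.0))  # 1 -> one class, 0 -> the other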

Now, let's see how the perceptron trick works. It's a simple trick where you start with random values for A, B, and C, which means you start with a random line. In the beginning, this line could be any line. The idea is that this line claims everything on one side belongs to one class (say, blue) and everything on the other side belongs to the other class. This is your initial classification line.

Now, what you need to do is make this line converge to the correct classification line. How will you do that? The perceptron trick says you need to run a loop. Let's say you run the loop for a thousand iterations. Each time through the loop, you randomly select a point and check whether the line classifies that point correctly or not. You keep adjusting the line based on the feedback from the points. This cycle continues, and you select points randomly each time.

If a point says it is correctly classified, you make no changes. If it says it is incorrectly classified, you move the line towards that point. This is the idea behind the perceptron trick. You run a loop, select points randomly, and ask them if they are correctly classified. If they are, you do nothing. If not, you move the line towards them. This process continues until convergence.

Convergence can be defined in two ways. Either you run the loop for a fixed number of times, such as a thousand iterations (Epochs), or you run it until no points are misclassified. Each loop cycle checks how many points are misclassified, and you continue until this number reaches zero. Once the number of misclassified points is zero, the loop stops, and your line is correctly placed. This is the general idea behind the perceptron trick.
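Below is a minimal sketch of the perceptron trick described above (the learning rate, epoch count, and starting weights are arbitrary choices for illustration, not the article's exact code):

import numpy as np

def perceptron_trick(X, y, epochs=1000, lr=0.1):
    # X: (n, 2) array of points, y: labels in {0, 1}
    X = np.insert(X, 0, 1, axis=1)               # prepend a column of 1s so the intercept C is learned too
    w = np.ones(X.shape[1])                      # start with an arbitrary line (C, A, B)
    for _ in range(epochs):
        i = np.random.randint(0, X.shape[0])     # pick a point at random
        y_hat = 1 if np.dot(w, X[i]) > 0 else 0  # step function: which side of the line is the point on?
        w = w + lr * (y[i] - y_hat) * X[i]       # correctly classified -> no change; misclassified -> move the line
    return w                                     # coefficients of the separating line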


Comparison of the sklearn implementation with the above implementation.

From the above it is clear that the sklearn implementation is better than the perceptron trick, as it produces a better classification boundary. So, we need to understand what the problem with the perceptron approach is.

Basically, the problem with the perceptron approach is that it adjusts the line based only on misclassified points. If every point contributed to the adjustment, the classification boundary would be more symmetrical, as it is in the sklearn implementation. So how do we correct it?

The problem lies in the step function we used in the perceptron implementation: it outputs a hard 0 or 1, which means the classification is deterministic rather than probabilistic.

Sigmoid function:

Let's talk about the sigmoid function. It takes any real number as input and produces an output between 0 and 1. When you plot the graph of this function, it looks like an S-shaped curve:

It's a special graph: for an input z, the y-value is given by sigmoid(z) = 1 / (1 + e^(-z)).

If the value of z is very large, the sigmoid approaches 1 but never fully reaches it; as z goes to plus infinity, the output approaches 1.

Conversely, if z is very small, as z goes to minus infinity, the output approaches 0.

No matter how large or small the input, the output of this function will always lie between 0 and 1. This is a key feature of this algorithm.

Now, instead of using the step function, we will pass z through the sigmoid function. If z is negative, the sigmoid of z will be less than 0.5, indicating a lower probability.

We first calculate z, then apply the sigmoid function. If z is positive, the sigmoid will be greater than 0.5, indicating a higher probability of belonging to the positive class.
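Here is a sketch of the same loop with the step function swapped for the sigmoid (again illustrative, not the exact code behind the plots below). Every point now nudges the line in proportion to its probability error, not just the misclassified ones.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_sigmoid(X, y, epochs=1000, lr=0.1):
    X = np.insert(X, 0, 1, axis=1)            # add intercept column
    w = np.ones(X.shape[1])
    for _ in range(epochs):
        i = np.random.randint(0, X.shape[0])
        y_hat = sigmoid(np.dot(w, X[i]))      # a probability strictly between 0 and 1
        w = w + lr * (y[i] - y_hat) * X[i]    # even correctly classified points pull the line a little
    return w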

As you compare the decision boundaries in the plot:

Red line: Perceptron with step function

Black line: sklearn logistic regression

Brown line: Perceptron with sigmoid

So, instead of using a step function, we now have a probabilistic interpretation with the sigmoid function. This transforms our entire system into a probabilistic one. However, this is still not as good as the sklearn implementation. So what is that sklearn algorithm doing that is special? Let's explore that in the next section.

Maximum likelihood and Loss function


Initially, we focused only on incorrectly classified points. Then we changed our logic to consider every point, whether correctly or incorrectly classified. But even this solution did not converge to the best line. Machine learning does not solve problems in this manner; it is not a principled solution. We are just running loops, selecting random points, and checking whether they are correctly classified.

So, how can we say that the line we get is the best one? There is nothing in our approach to guarantee it. We need a proper machine learning approach. Machine learning uses a loss function, which quantifies the error made by our model. Once we define this loss function, we aim to minimize it: the solution where the loss function's value is minimum is the optimal one.

One method to derive the loss function is Maximum Likelihood.


In the above image I have plotted two models, each represented by a line. Let's think about which model is better. Visually, Model 2 looks better than Model 1 because it classifies the green and red points more cleanly.

But if I ask exactly how much better one model is than the other, it becomes difficult to decide just by looking. Here, the loss function helps determine how good or bad a model is; it lets us select the best model mathematically. To build it, we look at the probability the model assigns to each point, calculated with the sigmoid function. For Model 1, let's say the probabilities for the green points are 0.7, 0.6, 0.4, and 0.2, and for the red points they are 0.3, 0.4, 0.8, and 0.6.

For the second model, let's say the probabilities for the green points are 0.7, 0.6, 0.4, and 0.3, and for the red points they are 0.3, 0.4, 0.7, and 0.7.

Maximum likelihood involves multiplying, over all points, the probability the model assigns to each point's true class. The model with the higher product of probabilities is the better model.

The maximum likelihood values for both models are given below, and we can clearly see that Model 2 is the better model:

Model 1 = 0.089

Model 2 = 0.176
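In code, the idea looks like this (a sketch; the probability lists below are arbitrary placeholders rather than the exact values from the figure). Each point contributes the probability the model assigns to its true class, and the likelihood is their product.

import numpy as np

def likelihood(p_true_class):
    # p_true_class: probability the model assigned to each point's actual class
    return np.prod(p_true_class)

# Hypothetical per-point probabilities for two candidate models
model_a = [0.9, 0.8, 0.7, 0.85]
model_b = [0.6, 0.5, 0.7, 0.55]
print(likelihood(model_a), likelihood(model_b))  # the model with the larger product is better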

In a real-world dataset with, say, 10,000 rows, you would have to multiply 10,000 such probabilities; the result becomes extremely small, and the comparison no longer holds up numerically. So, you use logs.

But there's a problem: the log of any number between 0 and 1 is always negative. If you calculate the log of 0.5, it comes out to about minus 0.3. To deal with this, we take the negative of the log, meaning we calculate -log instead.

Now, what you get here is called cross-entropy. Whenever you take the negative log of the likelihood, you get cross-entropy: the product of probabilities turns into a sum of negative log probabilities.

So we calculate cross-entropy instead of the maximum likelihood. With maximum likelihood, you select the model whose product of probabilities is the highest, that is, you maximize the likelihood. Cross-entropy works the other way around: if you use cross-entropy as your loss function, you select the model whose cross-entropy is the minimum. In other words, instead of maximizing the likelihood, we now minimize the cross-entropy and look for the model whose cross-entropy is the lowest.
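Written out, the binary cross-entropy loss is -Σ [ y·log(p) + (1 - y)·log(1 - p) ], where y is the true label (0 or 1) and p is the predicted probability of class 1. A small sketch of that calculation (the labels and probabilities here are just placeholders):

import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Sum of negative log probabilities of the true classes; lower is better
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.8]))  # a good model gives a small loss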

Minimizing the Cost with Gradient Descent

Now that we have defined a cost function, the aim is to find the optimal coefficients that minimise this cost function for our dataset. This is where gradient descent comes in. (Explaining gradient descent needs another article; for now, think of it as following the slope of the cost function downhill in a high-dimensional space.) By doing this, the model learns the parameters that reduce its penalty, making much more accurate predictions.
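For completeness, here is a compact, illustrative sketch of what that training loop could look like: plain batch gradient descent on the cross-entropy loss (the learning rate and epoch count are arbitrary, and this is not meant to mirror sklearn's solver).

import numpy as np

def train_logistic_regression(X, y, epochs=1000, lr=0.1):
    X = np.insert(X, 0, 1, axis=1)               # add intercept column
    w = np.zeros(X.shape[1])
    n = X.shape[0]
    for _ in range(epochs):
        y_hat = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid of the linear score for every point
        gradient = X.T @ (y_hat - y) / n         # gradient of the cross-entropy loss w.r.t. w
        w -= lr * gradient                       # step downhill
    return w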

