Logistic Regression using Gradient Descent

Introduction

Hello LinkedIn Family!

Welcome to another edition of my series, and I'm excited to share that our future posts will be even more educational and engaging. I'm committed to providing in-depth explanations to help you grasp complex concepts easily.

In my previous post, I introduced you to the fascinating world of machine learning classification. While there's an abundance of resources on this topic, I promise to make my explanations not only informative but also enjoyable. Understanding how these models work can truly elevate your skills as a data scientist!

In this edition, I'll guide you through my manual implementation of Logistic Regression using Gradient Descent. I like to think of machine learning as the smart and free application of mathematics. It's "smart" because it leverages both simple and intricate math through repetition to produce remarkable results. It's "free" because the techniques used are intuitive and accessible to everyone: for example, taking the negative of the partial derivatives in gradient descent, or using the logarithm in the binary cross-entropy (log loss) cost function. Machine learning is all about innovation, and I'm here to demystify it for you.

In my previous post, I mentioned terms like "maximum likelihood," "sigmoid function," "binary cross-entropy," and perhaps some other fancy terms. In this edition, I'll show you, in practical and straightforward terms, what these concepts mean and how we combine them to create powerful models. These ideas weren't conjured by magicians or exclusive to geniuses; they were developed by people like you and me. In fact, in a future post, I'll share an implementation of an algorithm I created to ensure data consistency.

Maximum Likelihood

Take the two words, "maximum" and "likelihood", literally and that's exactly what this is: we're trying to reach a classification state where our model puts as many data points as possible in their correct class, i.e. where the probability (likelihood) it assigns to the true class of each point is as high as possible.

Green dotted lines are the classification boundaries learnt by the model at different epochs and the black line is the final boundary at the last epoch.

Looking at the image above, the left classification boundary doesn't conform very well to the real classes of the dataset. Thus, it's a bad model, or we could say the likelihood of those points belonging to the classes the model assigned them to is low. In the right model, we see a better classification boundary.
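
To make this concrete, here is a minimal sketch (my own illustration with hypothetical probabilities, not data from the post): the likelihood of a labelled dataset is the product of the probabilities the model assigns to each point's true class, and "maximum likelihood" simply means choosing the boundary that makes this product as large as possible.

import numpy as np

# Hypothetical predicted probabilities of the positive class for five points
predicted_p = np.array([0.9, 0.8, 0.3, 0.7, 0.2])
actual_y = np.array([1, 1, 0, 1, 0])  # true classes

# Likelihood: product of the probabilities assigned to each point's true class
likelihood = np.prod(np.where(actual_y == 1, predicted_p, 1 - predicted_p))
print(likelihood)  # roughly 0.28; a better boundary would push this higher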

Sigmoid Function

sigmoid(z) = 1 / (1 + e^(-z)), where z is the linear combination of the features with the weights plus the bias.

Let's unravel the sigmoid function further. The e^(-z) term in the denominator is the secret sauce: it squashes any real-valued z into a probability between 0 and 1.

Here's the breakdown: When data points fall squarely on the decision boundary, z equals 0. For points on one side of the boundary, z becomes positive; for those on the other side, it's negative.

Now, the intriguing part: negative z values yield probabilities below 0.5, while positive z values push them above 0.5 when passed through the sigmoid function.

Ordinarily, we set a threshold of >= 0.5 to classify as the positive class. However, for optimal model performance, you can delve into threshold tuning, adjusting this threshold to achieve the best results.
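
To make the relationship between z, the sigmoid and the 0.5 threshold concrete, here is a minimal sketch in Python (the weights, bias and feature values below are hypothetical, chosen only for illustration):

import numpy as np

def sigmoid(z):
    # Squashes any real-valued z into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and two-feature data points
w = np.array([0.4, -0.2])
b = -0.1
X = np.array([[2.0, 1.0],    # lands on the positive side of the boundary (z > 0)
              [0.5, 3.0]])   # lands on the negative side (z < 0)

z = X @ w + b                # z = 0 exactly on the decision boundary
p = sigmoid(z)               # z > 0 gives p > 0.5, z < 0 gives p < 0.5
predicted_class = (p >= 0.5).astype(int)  # the usual 0.5 threshold
print(z, p, predicted_class)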

Binary Cross Entropy Loss Function

The log loss function is given as:

Loss = -[ y * log(p) + (1 - y) * log(1 - p) ]

where y is the actual class (0 or 1), p = sigmoid(z) is the predicted probability of the positive class, and the total loss for a batch is the sum (or mean) of this quantity over all data points.

The equation above assesses the classification state. In a binary classification scenario, data points can belong to either class 0 or class 1, often referred to as negative or positive classes, respectively. The sigmoid function predicts the probability of being in the positive class, so a low probability indicates a high likelihood of belonging to the negative class (class 0), hence, the probability of being negative is (1 - sigmoid(z)).

Now, let's break down the equation:

  • The first part of the equation, y * log(p), measures how well the prediction aligns with the positive class (class 1). Plugging in y = 1 makes the second part zero, so only this term contributes.

  • The second part of the equation assesses the proximity of predicted values to the negative class (class 0).
  • We multiply by a negative sign because the logarithm of a number less than 1 is negative; this ensures that the loss comes out positive, capturing the deviation between predictions and actual classes (a small code sketch of this loss follows the list).
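
Putting the two parts together, here is a small numpy sketch of the loss as I would write it (a hedged illustration rather than the exact notebook code; the epsilon clipping is my own addition to guard against log(0)):

import numpy as np

def binary_cross_entropy(actual_y, predicted_p, eps=1e-12):
    # Clip so we never take log(0) when a prediction is exactly 0 or 1
    p = np.clip(predicted_p, eps, 1 - eps)
    # First term scores the positive class, second term the negative class;
    # the leading minus sign turns the negative logs into a positive loss
    return -np.mean(actual_y * np.log(p) + (1 - actual_y) * np.log(1 - p))

actual_y = np.array([1, 0, 1, 0])
predicted_p = np.array([0.9, 0.2, 0.6, 0.4])
print(binary_cross_entropy(actual_y, predicted_p))  # about 0.34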

Implementing Gradient Descent

To employ gradient descent effectively, we require a differentiable function. Our chosen loss function, binary cross-entropy (or log loss), indeed satisfies this criterion.

My initial task involved computing the partial derivatives of the loss function with respect to the weight and bias. It's worth noting that my implementation can be viewed as a single-input layer network.

Upon calculation, the following are obtained as the partial derivatives of the loss function with respect to the weight and bias.

The negative of the partial derivative of the log loss with respect to a weight is expressed as:

-∂Loss/∂w_j = Σ (actual_y - predicted_y) * feature_j

where the sum runs over all data points in the batch.

Similarly, the negative of the partial derivative with respect to the bias/intercept can be represented as:

-∂Loss/∂b = Σ (actual_y - predicted_y)

In my Jupyter notebook, the dataset used had two features. So, let's take w1 (the weight coefficient of the first feature) as an example. The update rule for w1 becomes w1 + ((actual_y - predicted_y) * feature1) * learning_rate, and the same principle applies to the other weights.

Note: actual_y, predicted_y and feature_1 are of shape (n x 1), where n is the number of rows, because we're using the full-batch gradient descent approach. I opted for full-batch gradient descent since my dataset had a limited number of rows and a small number of features. In full-batch gradient descent, every row is passed through the current weights and bias at each epoch, the gradient contribution of each point is computed, and at the end of the epoch the weights are updated with the sum of the updates from all data points.
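
Below is a minimal full-batch training loop in the spirit of this description (my own sketch; the toy data and hyperparameters are placeholders, not the values from my notebook):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.01, epochs=155):
    n_rows, n_features = X.shape
    # Small random weights, bias starts at zero
    w = np.random.normal(0.0, 1.0 / np.sqrt(n_features), size=n_features)
    b = 0.0
    for epoch in range(epochs):
        predicted_y = sigmoid(X @ w + b)   # probabilities for every row at once
        error = y - predicted_y            # (actual_y - predicted_y), shape (n,)
        # Full batch: sum the contribution of every data point, then take one step
        w += lr * (X.T @ error)            # matches w1 + sum((actual_y - predicted_y) * feature1) * lr
        b += lr * np.sum(error)
    return w, b

# Hypothetical toy data: two features, binary labels
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.5], [4.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic_regression(X, y)
print(w, b)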

For the initialization of weights, commonly used methods include:

  1. Random Initialization: Weights are initialized with small random values, often drawn from a normal (Gaussian) distribution with mean 0 and a small variance; when that variance is scaled by the number of input units, this is commonly referred to as "Xavier" or "Glorot" initialization. Bias terms are initialized to zeros or small constants.
  2. He Initialization: Commonly used for deep neural networks. Weights are initialized with random values drawn from a normal distribution with mean 0 and a variance of 2 divided by the number of input units in the layer. Bias terms are initialized to zeros or small constants. (A short sketch of both schemes follows this list.)
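
As a rough sketch of both schemes (hedged; exact conventions differ slightly between libraries):

import numpy as np

n_in = 2  # number of input features/units feeding the layer

# Xavier/Glorot-style: standard deviation scaled by the number of input units
w_xavier = np.random.normal(0.0, 1.0 / np.sqrt(n_in), size=n_in)

# He-style: variance of 2 / n_in, commonly paired with ReLU activations
w_he = np.random.normal(0.0, np.sqrt(2.0 / n_in), size=n_in)

b = 0.0  # bias initialized to zero (or a small constant)
print(w_xavier, w_he, b)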

For my implementation, I opted for random initialization: the weights were initialized from a normal distribution with mean 0 and standard deviation 1/√(number of features), i.e. 1/√2 for the two features used; the bias was set to -7 and the learning rate to 0.00000051. These values yielded the separation boundary below. The small learning rate was needed because of the scale of the features, which also points to why feature scaling is important for neural networks.

By iteratively updating the weights using the learning rate, we achieved an effective separation boundary, as illustrated in the plot below.

Good Classification Boundary obtained after 155 epochs; green dotted lines are model learnt boundaries at different epochs. Features aren't scaled in this implementation.

Why Understanding the Model is Good for Data Scientists

As promised, let's discuss why I believe this knowledge is valuable for a Data Scientist. Here are the key reasons:

1. Enhanced EDA Focus: Understanding the inner workings of models like Logistic Regression can significantly improve your Exploratory Data Analysis (EDA) approach. For instance, when preparing to build a Logistic Regression model, your EDA will be more focused. You'll pay special attention to features that exhibit a linear separation boundary when visualized by class, and you'll be vigilant for outliers that might impact the model's learned weights.

2. Empowered Feature Engineering: Feature engineering is a crucial aspect of machine learning. It involves optimizing your features, creating new ones from existing data, and sometimes transforming features to better suit your model. This knowledge equips you with the skills to excel in feature engineering. You'll know when to craft new features to enrich your model's understanding (similar to information gain in Decision Trees). Furthermore, feature engineering can be influenced by domain expertise, but it can also be guided by your comprehension of the model, refined through EDA. For example, you might experiment with feature combinations to assess whether they reduce entropy and provide Trees with more valuable information.

Code Implementation

In my GitHub repository, you'll discover a treasure trove—a meticulously detailed implementation of logistic regression and a lot more. But it's not just code; it's a pathway to practical understanding.

I earnestly implore you to take a closer look. Why? Because:

1. Code is Knowledge in Action: Reading about algorithms is one thing, but seeing them in action is where true understanding flourishes.

2. Learning by Example: My GitHub is an open book of practical examples. You'll find well-commented code that demystifies complex algorithms and makes them accessible.

3. Your Data Science Arsenal: Think of it as your data science arsenal. Every line of code is a tool, and the more tools you have at your disposal, the more effective you become.

4. Continuous Learning: The field of data science is dynamic. By exploring real code, you stay at the forefront of this ever-evolving landscape.

So, I invite you to visit my GitHub repository. It's where theory meets practice, and where knowledge transforms into expertise.

Exploring the Cutting Edge of AI

Now, let's delve into the exciting world of recent AI advancements, where innovation knows no bounds.

Prompt Engineering: You may have come across the term "Prompt Engineering." If not, let me enlighten you. It's a fascinating technique that involves crafting effective prompts to extract the best results from Language Models. It's a game-changer in the realm of AI, enabling us to harness the full potential of these powerful models. Among other benefits, it is opening up new job roles in the AI space, so I encourage you to research it further.

Conclusion

Our journey through the inner workings of the Logistic Regression model, powered by Gradient Descent, has been enlightening. But remember, this is just the tip of the iceberg in the vast ocean of Logistic Regression. The key to mastery is practice and exploration. Dive in, experiment, and embrace the ever-evolving world of AI.

Thank you once again for embarking on this adventure with me. Stay curious, stay innovative, keep pushing the boundaries of what's possible in the realm of AI and machine learning and stay tuned for my next article.
