Logistic Regression


I appreciate all the support you have given my Simple Linear Regression article. Thanks to all the reviewers; your constructive feedback helped me enhance the quality of the content.

I would like to invite you to join me for a cup of coffee this week, as I plan to discuss another interesting algorithm - Logistic Regression.


Logistic Regression

Logistic regression is a commonly used algorithm in Machine Learning that falls under the category of Supervised Learning.

I know it may be difficult for you to accept, but logistic regression, despite its name, is most commonly used to solve classification problems.


Just as a bat has wings and can fly yet is not a bird (it is a proud mammal that happens to be capable of flight), logistic regression carries "regression" in its name yet is used for classification.




Let us understand why.

In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped logistic function whose output is bounded between two limiting values, 0 and 1.


In logistic regression, the independent variables are used to predict the dependent variable, just like in other regression techniques. However, the key difference is that the dependent variable in logistic regression is categorical or binary, whereas in other regression techniques, the dependent variable is continuous.


Why shouldn't we call Logistic Regression "Logistic Classification"?


It is not entirely accurate to call logistic regression "logistic classification" because logistic regression is a statistical method that estimates the probability of a binary outcome, while classification is the task of predicting which category a new data point belongs to based on a set of features. Although logistic regression can be used for classification tasks, it is still a regression method at its core and is primarily used for estimating probabilities rather than making direct classifications.
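
To make the distinction concrete, here is a minimal sketch (assuming a fitted scikit-learn model named logreg and a feature matrix X_new, both placeholders) showing that the model natively outputs probabilities, and a class label only appears once we threshold them:

probs = logreg.predict_proba(X_new)[:, 1]  # the "regression" part: P(y = 1) for each row
labels = (probs >= 0.5).astype(int)        # the "classification" part: thresholding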

Does logistic regression generate only binary outcomes?



Yes, binomial logistic regression is used to predict binary outcomes, that is, dependent variables that have only two possible values or categories. As a result, the output of logistic regression should be categorical or discrete: it could take on values such as Yes or No, 0 or 1, true or false, and so on. However, rather than providing a precise value of 0 or 1, the logistic regression model generates probability scores that fall between 0 and 1.

However, there are extensions of logistic regression, such as multinomial logistic regression and ordinal logistic regression, that can handle dependent variables with more than two categories.
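
As a quick illustration, here is a minimal sketch (on made-up three-class data, not from this article) of how scikit-learn's LogisticRegression handles a multinomial target:

from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy data with a three-class target (0, 1, 2); the values are made up.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 1, 1, 2, 2])

# Recent scikit-learn versions fit a multinomial model automatically
# when the target has more than two classes.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.5]]))        # predicted class
print(clf.predict_proba([[3.5]]))  # one probability per class, summing to 1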

What are the types of Logistic Regression?



  1. Binary Logistic Regression: This is the most common type of logistic regression where the dependent variable has only two possible outcomes, such as yes or no, pass or fail, and so on. For example, predicting whether a customer will buy a product or not.



  2. Multinomial Logistic Regression: This type of logistic regression is used when the dependent variable has three or more categories that are not ordered. For example, predicting whether a person will vote for one of five political parties.


  3. Ordinal Logistic Regression: This type of logistic regression is used when the dependent variable has three or more ordered categories. For example, it can be used to predict the level of customer satisfaction based on their feedback, such as poor, average, and excellent.

In summary, binary logistic regression is used for binary outcomes, multinomial logistic regression is used for non-ordered categorical outcomes, and ordinal logistic regression is used for ordered categorical outcomes.

What happens when we use linear regression for classification problems?


For example, instead of predicting pass or fail, linear regression would predict the marks obtained. Using linear regression for classification problems leads to inaccurate results because it assumes a linear relationship between the variables and cannot handle binary outcomes: its predictions are unbounded and can fall below 0 or above 1, which makes no sense as probabilities.
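
Here is a minimal sketch of the failure (made-up pass/fail data, purely illustrative): a straight line fitted to a 0/1 target happily predicts values outside [0, 1], which cannot be read as probabilities.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: hours studied vs. pass (1) / fail (0).
hours = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])

line = LinearRegression().fit(hours, passed)

# Predictions for extreme inputs fall outside [0, 1]:
# here roughly -0.4 for 0 hours and about 2.2 for 10 hours.
print(line.predict([[0], [10]]))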

A Quick Recap


  1. A category of Supervised Learning, commonly used to solve classification problems.
  2. In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function.
  3. The output of logistic regression is a probability score between 0 and 1, indicating the likelihood of the binary outcome.
  4. Logistic regression uses a sigmoid function to convert the linear regression output to a probability score.
  5. Types of logistic regression: binary (binomial), multinomial, and ordinal. Binary logistic regression is used for binary outcomes, multinomial logistic regression for non-ordered categorical outcomes, and ordinal logistic regression for ordered categorical outcomes.

What exactly does that graph convey to us?


Logistic regression uses a sigmoid function to map the linear regression output to a probability score, which is then used to predict the binary outcome. The sigmoid function is an S-shaped curve that ranges from 0 to 1.
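
For reference, a minimal sketch of the sigmoid itself (plain NumPy, inputs chosen just for illustration):

import numpy as np

def sigmoid(z):
    # Map any real-valued linear output z to a probability in (0, 1).
    return 1 / (1 + np.exp(-z))

# Large negative z -> near 0, zero -> exactly 0.5, large positive z -> near 1.
print(sigmoid(np.array([-6, 0, 6])))  # approximately [0.0025, 0.5, 0.9975]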

When we plot the sigmoid function on a graph, the X-axis represents the linear regression output, while the Y-axis represents the probability score. At the extreme ends of the graph, the probability score approaches 0 or 1, indicating a near-certain binary outcome.

As we move towards the middle of the graph, the probability score approaches 0.5, indicating uncertainty about the binary outcome. The point on the X-axis where the probability score is exactly 0.5 is known as the decision boundary.

The decision boundary separates the two classes in the binary classification problem. Any input data point on one side of the decision boundary is classified as one class (e.g., positive outcome), while any input data point on the other side of the decision boundary is classified as the other class (e.g., negative outcome).

Thus, logistic regression uses the sigmoid function and the decision boundary to predict the binary outcome based on the input variables.

What happens when the decision boundary is at 0.5 in logistic regression?



At the 0.5 decision boundary, the model is equally uncertain about both classes. Therefore, data points that fall close to the decision boundary may be misclassified, and the overall accuracy of the model may suffer. To improve the performance of the model, one could consider adjusting the threshold of the decision boundary or using a different model altogether.
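
A minimal sketch of threshold adjustment (assuming the fitted model logreg and test features X_test from the worked example later in this article): instead of the default 0.5 cut-off, we classify using a custom threshold on the predicted probabilities.

# Move the classification threshold away from the default 0.5.
probs = logreg.predict_proba(X_test)[:, 1]  # probability of class 1

threshold = 0.7  # stricter cut-off: predict 1 only when fairly confident
y_pred_strict = (probs >= threshold).astype(int)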



Come on, let us implement binary logistic regression!

Problem statement:

Suppose we have a data frame containing information about customers of an online store, including their age, gender, income, and whether or not they made a purchase. We want to use binary logistic regression to predict which customers are likely to make a purchase based on their age, gender, and income.

Step 1: Import necessary libraries

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

        

Step 2: Create a dataframe with age, gender, income, and purchase columns

data = {'age': [23, 25, 35, 42, 30, 48, 20, 29, 31, 40],
        'gender': ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F'],
        'income': [50000, 60000, 75000, 90000, 80000, 100000, 45000, 55000, 65000, 85000],
        'purchase': [0, 1, 1, 1, 1, 0, 0, 0, 1, 1]}


df = pd.DataFrame(data)

        

Step 3: Convert the gender column to a binary indicator variable


Next, we can convert the categorical variable "gender" into a numerical indicator using the pandas get_dummies function; this technique is called one-hot encoding.

df['gender'] = pd.get_dummies(df['gender'])['M']  # indicator column: M -> 1, F -> 0

Step 4: Split the data into training and testing sets



X = df[['age', 'gender', 'income']]  # feature columns
y = df['purchase']                   # target column


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 5: Fit a logistic regression model to the training data




logreg = LogisticRegression()  # default settings (L2 penalty, lbfgs solver)
logreg.fit(X_train, y_train)   # learn the coefficients from the training split

Step 6: Make predictions on the testing data

y_pred = logreg.predict(X_test)  # class labels (0 or 1) using the default 0.5 threshold

Step 7: Evaluate the performance of the model

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)  # fraction of test samples classified correctly

We can also evaluate the performance of the model using a confusion matrix:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

This calculates the confusion matrix for the testing data, which shows the number of true positives, true negatives, false positives, and false negatives. In scikit-learn's layout, rows correspond to the actual classes and columns to the predicted classes.
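
For a binary problem, the 2x2 matrix can be unpacked like this (a small convenience on top of the walkthrough; passing labels=[0, 1] guarantees a 2x2 shape even if this tiny test split contains only one class):

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()  # rows = actual classes, columns = predicted classes
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")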

Step 8: Make new predictions



Finally, to predict the outcome for new customers, we can create a new dataframe with the same columns as the original, but with the values for age, gender, and income for the new customers:

new_customers = {'age': [26, 33, 45],
                 'gender': ['F', 'M', 'M'],
                 'income': [70000, 55000, 95000]}


new_df = pd.DataFrame(new_customers)


new_df['gender'] = pd.get_dummies(new_df['gender'])['M']  # same encoding as training: M -> 1, F -> 0


new_predictions = logreg.predict(new_df[['age', 'gender', 'income']])


        

Display the new predictions in a dataframe:

new_predictions_df = pd.DataFrame({'predicted_purchase': new_predictions})
result_df = pd.concat([new_df, new_predictions_df], axis=1)
result_df

Code Summary:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


data = {'age': [23, 25, 35, 42, 30, 48, 20, 29, 31, 40],
        'gender': ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F'],
        'income': [50000, 60000, 75000, 90000, 80000, 100000, 45000, 55000, 65000, 85000],
        'purchase': [0, 1, 1, 1, 1, 0, 0, 0, 1, 1]}


df = pd.DataFrame(data)


df['gender'] = pd.get_dummies(df['gender'])['M']
df


X = df[['age', 'gender', 'income']]
y = df['purchase']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


logreg = LogisticRegression()
logreg.fit(X_train, y_train)


y_pred = logreg.predict(X_test)


from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)


from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)


new_customers = {'age': [26, 33, 45],
                 'gender': ['F', 'M', 'M'],
                 'income': [70000, 55000, 95000]}


new_df = pd.DataFrame(new_customers)
new_df['gender'] = pd.get_dummies(new_df['gender'])['M']
new_predictions = logreg.predict(new_df[['age', 'gender', 'income']])
new_predictions_df = pd.DataFrame({'predicted_purchase': new_predictions})
result_df = pd.concat([new_df, new_predictions_df], axis=1)
result_df
        

Eager to see how it executes? Link to Colab

Finally we made it!

No alt text provided for this image

See you next week for another Cup of Coffee with an ML Algorithm!


Cheers,

Kiruthika.
