Logistic Regression
Kiruthika Subramani
Innovating AI for a Better Tomorrow | AI Engineer | Google Developer Expert | Author | IBM Dual Champion | 200+ Global AI Talks | Master's Student at MILA
I appreciate all the support you have given my Simple Linear Regression article. Thanks to all the reviewers; your constructive feedback helped me enhance the quality of the content.
I would like to invite you to join me for a cup of coffee this week, as I plan to discuss another interesting algorithm - Logistic Regression.
Logistic Regression
Logistic regression is a commonly used algorithm in Machine Learning that falls under the category of Supervised Learning.
I know it may be difficult to accept, but despite its name, logistic regression is most commonly used to solve classification problems.
It is like a bat declaring that it has wings and can fly despite not being a bird; the bat is a proud mammal that is capable of flight.
Let us understand why.
In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped logistic function, whose output is bounded between two extreme values (0 and 1).
In logistic regression, the independent variables are used to predict the dependent variable, just like in other regression techniques. However, the key difference is that the dependent variable in logistic regression is categorical or binary, whereas in other regression techniques, the dependent variable is continuous.
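For reference, the "S" shape comes from the logistic (sigmoid) function. Writing the usual linear regression output as z = b0 + b1*x1 + ... + bn*xn, the predicted probability is p = 1 / (1 + e^(-z)), which always lies strictly between 0 and 1.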
Why shouldn't we call Logistic Regression "Logistic Classification"?
It's not entirely accurate to call logistic regression "logistic classification" because logistic regression is a statistical method that estimates the probability of a binary outcome, while classification is the task of predicting the category or class a new data point belongs to based on a set of features. Although logistic regression can be used for classification tasks, it is still a regression method at its core and is primarily used for estimating probabilities rather than making direct classifications.
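A minimal scikit-learn sketch makes the distinction concrete: predict_proba returns the estimated probabilities (the "regression" part), while predict applies a threshold to turn them into class labels (the "classification" part). The tiny pass/fail dataset here is made up purely for illustration.

from sklearn.linear_model import LogisticRegression

# Toy data: hours studied vs. pass (1) / fail (0) -- illustrative only
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))  # probability estimates for each class
print(model.predict([[3.5]]))        # hard class label after thresholding at 0.5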
Does logistic regression generate only binary outcomes?
Yes. Binomial logistic regression is used to predict binary outcomes, that is, dependent variables with only two possible values or categories. As a result, the output of logistic regression should be categorical or discrete: it could take on values such as Yes or No, 0 or 1, or true or false. However, rather than providing a precise value of 0 or 1, the logistic regression model generates probability scores that fall between 0 and 1.
However, there are extensions of logistic regression, such as multinomial logistic regression and ordinal logistic regression, that can handle dependent variables with more than two categories.
What are the types of Logistic Regression?
1. Binary (Binomial) Logistic Regression: This type is used when the dependent variable has only two possible categories. For example, it can be used to predict whether a student passes or fails an exam.
2. Multinomial Logistic Regression: This type is used when the dependent variable has three or more categories with no natural ordering. For example, it can be used to predict which mode of transport a person prefers, such as car, bus, or train.
3. Ordinal Logistic Regression: This type of logistic regression is used when the dependent variable has three or more ordered categories. For example, it can be used to predict the level of customer satisfaction based on their feedback, such as poor, average, and excellent.
In summary, binary logistic regression is used for binary outcomes, multinomial logistic regression is used for non-ordered categorical outcomes, and ordinal logistic regression is used for ordered categorical outcomes.
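As a quick sketch, scikit-learn's LogisticRegression also covers the multinomial case out of the box (the three-class data below is invented purely for illustration; ordinal logistic regression is not built into scikit-learn and usually calls for a package such as statsmodels):

from sklearn.linear_model import LogisticRegression

# Toy 3-class data -- illustrative only
X = [[1], [2], [3], [10], [11], [12], [20], [21], [22]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

clf = LogisticRegression().fit(X, y)   # handles 3+ classes automatically
print(clf.predict([[11]]))             # predicts one of the three classes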
What happens when we use linear regression for Logistic Regression problems?
Linear regression answers a different kind of question: instead of saying pass or fail, it would predict the exact marks obtained. Using linear regression for classification problems such as pass/fail leads to inaccurate results, because it assumes a linear relationship between the variables and its predictions are unbounded, so it cannot handle binary outcomes.
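Here is a minimal sketch (again with made-up pass/fail data) showing LinearRegression happily predicting values below 0 and above 1, which make no sense as probabilities:

from sklearn.linear_model import LinearRegression

# Toy data: hours studied vs. pass (1) / fail (0) -- illustrative only
X = [[1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 1, 1, 1]

lin = LinearRegression().fit(X, y)
print(lin.predict([[0], [10]]))  # roughly [-0.4, 2.17]: outside the [0, 1] range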
A Quick Recap
What exactly does that graph convey to us?
Logistic regression uses a sigmoid function to map the linear regression output to a probability score, which is then used to predict the binary outcome. The sigmoid function is an S-shaped curve that ranges from 0 to 1.
When we plot the sigmoid function on a graph, the X-axis represents the linear regression output, while the Y-axis represents the probability score. At the extreme ends of the graph, the probability score is either 0 or 1, indicating a certain binary outcome.
As we move towards the middle of the graph, the probability score approaches 0.5, indicating uncertainty about the binary outcome. The point on the X-axis where the probability score is exactly 0.5 is known as the decision boundary.
The decision boundary separates the two classes in the binary classification problem. Any input data point on one side of the decision boundary is classified as one class (e.g., positive outcome), while any input data point on the other side of the decision boundary is classified as the other class (e.g., negative outcome).
Thus, logistic regression uses the sigmoid function and the decision boundary to predict the binary outcome based on the input variables.
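A tiny plain-Python sketch of the sigmoid and the 0.5 threshold:

import math

def sigmoid(z):
    # Maps any real number to a probability between 0 and 1
    return 1 / (1 + math.exp(-z))

print(sigmoid(-6))  # close to 0 -> confident negative class
print(sigmoid(0))   # exactly 0.5 -> sitting on the decision boundary
print(sigmoid(6))   # close to 1 -> confident positive class

def classify(z, threshold=0.5):
    # One side of the boundary -> class 1, the other side -> class 0
    return 1 if sigmoid(z) >= threshold else 0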
What happens when the decision boundary is 0.5 in logistic regression?
When the decision boundary is 0.5 in logistic regression, the model is equally uncertain about both classes. Therefore, data points that fall close to the decision boundary may be misclassified, and the overall accuracy of the model may be lower. To improve the performance of the model, one could consider adjusting the threshold of the decision boundary or using a different model altogether.
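Adjusting the threshold is straightforward once you have probability scores. A sketch, assuming a fitted model called model and a test set X_test (hypothetical names, just for illustration):

# predict_proba gives [P(class 0), P(class 1)] per row; keep the positive class
probs = model.predict_proba(X_test)[:, 1]
custom_threshold = 0.7                      # stricter than the default 0.5
y_pred_custom = (probs >= custom_threshold).astype(int)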
Come on, let us implement binary logistic regression!
Problem statement:
Suppose we have a data frame containing information about customers of an online store, including their age, gender, income, and whether or not they made a purchase. We want to use binary logistic regression to predict which customers are likely to make a purchase based on their age, gender, and income.
Step 1: Import necessary libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
Step 2: Create a dataframe with age, gender, income, and purchase columns
data = {'age': [23, 25, 35, 42, 30, 48, 20, 29, 31, 40],
        'gender': ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F'],
        'income': [50000, 60000, 75000, 90000, 80000, 100000, 45000, 55000, 65000, 85000],
        'purchase': [0, 1, 1, 1, 1, 0, 0, 0, 1, 1]}
df = pd.DataFrame(data)
Step 3: Convert the gender column to a binary indicator variable
Next, we use one-hot encoding to convert the categorical variable "gender" into a numerical variable. The pandas get_dummies function does this for us; selecting the 'M' column gives a single 0/1 indicator.
df['gender'] = pd.get_dummies(df['gender'])['M']
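For a two-category column this is simply a 0/1 flag, so an equivalent and arguably more explicit one-liner would be df['gender'] = df['gender'].map({'M': 1, 'F': 0}).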
Step 4: Split the data into training and testing sets
X = df[['age', 'gender', 'income']]
y = df['purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 5: Fit a logistic regression model to the training data
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
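A side note: with an unscaled feature like income in the tens of thousands, the default solver can struggle to converge on larger datasets. One common remedy, sketched here, is to standardise the features inside a pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scales each feature to zero mean and unit variance before fitting
logreg_scaled = make_pipeline(StandardScaler(), LogisticRegression())
logreg_scaled.fit(X_train, y_train)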
Step 6: Make predictions on the testing data
y_pred = logreg.predict(X_test)
Step 7: Evaluate the performance of the model
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
We can also evaluate the performance of the model using a confusion matrix:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
This calculates the confusion matrix for the testing data, which shows the number of true positives, true negatives, false positives, and false negatives.
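In scikit-learn's layout, rows are the true labels and columns are the predicted labels. Wrapping the matrix in a DataFrame, as sketched below, makes that easier to read (labels=[0, 1] forces a full 2x2 matrix even if our tiny test split happens to contain only one class):

cm_df = pd.DataFrame(confusion_matrix(y_test, y_pred, labels=[0, 1]),
                     index=['actual 0', 'actual 1'],
                     columns=['predicted 0', 'predicted 1'])
print(cm_df)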
Step 8: Make new predictions
Finally, to predict the outcome for new customers, we can create a new dataframe with the same columns as the original, but with the values for age, gender, and income for the new customers:
new_customers = {'age': [26, 33, 45],
                 'gender': ['F', 'M', 'M'],
                 'income': [70000, 55000, 95000]}
new_df = pd.DataFrame(new_customers)
new_df['gender'] = pd.get_dummies(new_df['gender'])['M']
new_predictions = logreg.predict(new_df[['age', 'gender', 'income']])
Display the new predictions in a dataframe:
new_predictions_df = pd.DataFrame({'predicted_purchase': new_predictions})
result_df = pd.concat([new_df, new_predictions_df], axis=1)
result_df
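Since logistic regression really produces probabilities, it can be more informative to report those alongside the hard 0/1 predictions; a short sketch:

# Probability of the positive class (purchase = 1) for each new customer
new_probs = logreg.predict_proba(new_df[['age', 'gender', 'income']])[:, 1]
result_df['purchase_probability'] = new_probs.round(3)
result_df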
Code Summary:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = {'age': [23, 25, 35, 42, 30, 48, 20, 29, 31, 40],
        'gender': ['M', 'F', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'F'],
        'income': [50000, 60000, 75000, 90000, 80000, 100000, 45000, 55000, 65000, 85000],
        'purchase': [0, 1, 1, 1, 1, 0, 0, 0, 1, 1]}
df = pd.DataFrame(data)
df['gender'] = pd.get_dummies(df['gender'])['M']
df
X = df[['age', 'gender', 'income']]
y = df['purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
new_customers = {'age': [26, 33, 45],
                 'gender': ['F', 'M', 'M'],
                 'income': [70000, 55000, 95000]}
new_df = pd.DataFrame(new_customers)
new_df['gender'] = pd.get_dummies(new_df['gender'])['M']
new_predictions = logreg.predict(new_df[['age', 'gender', 'income']])
new_predictions_df = pd.DataFrame({'predicted_purchase': new_predictions})
result_df = pd.concat([new_df, new_predictions_df], axis=1)
result_df
Eager to see how it executes? Link to Colab
Finally we made it!
Meeting you next week for Cup of Coffee with an ML Algorithm!
Cheers,
Kiruthika.