Learn Logistic Regression for Classification with Python: 11 Practical Examples
Logistic regression is a powerful statistical tool for modeling the relationship between a binary dependent variable and one or more independent variables.
This method is widely used in machine learning, statistics, and the social sciences to model the probability of a specific event occurring. Essentially, logistic regression predicts binary outcomes: events that can take only two possible values, such as “yes” or “no”, “true” or “false”, or “success” or “failure”. This makes it a particularly useful tool for analyzing a wide range of phenomena across many fields.
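Under the hood, the model passes a weighted sum of the input features through the logistic (sigmoid) function, which squashes any real number into a probability between 0 and 1. The following minimal sketch, using made-up coefficient values rather than anything learned from data, illustrates the idea:
import numpy as np
def sigmoid(z):
    # Map any real-valued score to a probability between 0 and 1
    return 1 / (1 + np.exp(-z))
# Hypothetical parameters: intercept b0 and a single feature weight b1
b0, b1 = -3.0, 0.8
x = 5.0  # one feature value for one observation
probability = sigmoid(b0 + b1 * x)
print("Predicted probability of the event: {:.2f}".format(probability))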
Whether you are working in data science, social research, or any other field that involves analyzing binary outcomes, logistic regression is an essential tool that you should definitely consider using.
11 Applications of logistic regression
1. Customer Churn Prediction
Logistic regression can be used for churn prediction, which involves identifying customers who are likely to stop using a product or service.
The goal of churn prediction is to proactively retain customers by taking targeted actions such as offering promotions or discounts.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv('customer_churn.csv')
X = data.drop('churn', axis=1)
y = data['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
In this example, we’re using a labeled dataset of customer interactions with a binary label of ‘churned’ or ‘not churned’. We split the dataset into training and test sets, and train a logistic regression model on the training set.
Then, we evaluate the model’s performance on the test set using the score() method, which returns the accuracy of the model on the test set.
Once the model is trained, it can be used to predict the probability of churn for new customers using the predict_proba() method:
new_customer = [25, 'female', 0, 10, 'no']
churn_probability = model.predict_proba([new_customer])[0][1]
print("Probability of churn: {:.2f}%".format(churn_probability * 100))
In this example, we’re predicting the probability of churn for a new customer who is a 25-year-old female, has been using the product for 10 months with 0 customer service interactions. The ‘no’ value in the last feature represents whether they have made a recent purchase or not.
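Note that scikit-learn’s LogisticRegression only accepts numeric inputs, so categorical values such as ‘female’ or ‘no’ must be encoded before fitting and before calling predict_proba. A minimal sketch of one approach, assuming hypothetical column names such as ‘gender’ and ‘recent_purchase’ in the churn dataset:
# One-hot encode the (assumed) categorical columns so every feature is numeric
X_encoded = pd.get_dummies(X, columns=['gender', 'recent_purchase'], drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Encode the new customer the same way and align the columns before predicting
new_customer = pd.DataFrame([{'age': 25, 'gender': 'female', 'service_calls': 0,
                              'tenure_months': 10, 'recent_purchase': 'no'}])  # hypothetical column names
new_customer = pd.get_dummies(new_customer).reindex(columns=X_encoded.columns, fill_value=0)
churn_probability = model.predict_proba(new_customer)[0][1]
print("Probability of churn: {:.2f}%".format(churn_probability * 100))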
2. Credit risk analysis
Logistic regression can be used for credit risk analysis, which involves assessing the creditworthiness of a borrower based on various factors such as income, credit history, and debt-to-income ratio. The goal of credit risk analysis is to determine the probability of a borrower defaulting on a loan.
To use logistic regression for credit risk analysis, a dataset of past loan applications and their outcomes can be used to train a model.
The dataset should include information about the borrower’s credit score, income, employment status, debt-to-income ratio, and other relevant factors, as well as the outcome of the loan application (approved or denied).
Once the model is trained, it can be used to predict the probability of default for new loan applications based on their features.
Here’s an example code snippet for credit risk analysis using logistic regression in Python:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv('credit_risk.csv')
X = data.drop('default', axis=1)
y = data['default']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
In this example, we’re using a labeled dataset of loan applications with a binary label of ‘default’ or ‘non-default’. We split the dataset into training and test sets, and train a logistic regression model on the training set.
Then, we evaluate the model’s performance on the test set using the score() method, which returns the accuracy of the model on the test set.
Once the model is trained, it can be used to predict the probability of default for new loan applications using the predict_proba() method:
new_loan_application = [750, 50000, 'employed', 0.4]
default_probability = model.predict_proba([new_loan_application])[0][1]
print("Probability of default: {:.2f}%".format(default_probability * 100))
In this example, we’re predicting the probability of default for a new loan application with a credit score of 750, an income of 50000, employed status, and a debt-to-income ratio of 0.4.
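A quick note on the indexing used throughout these examples: predict_proba returns one column of probabilities per class, ordered to match model.classes_, so [0][1] picks out the probability of whichever class sorts second (typically the positive class when the labels are 0 and 1). It is worth checking the order explicitly:
# Columns of predict_proba line up with model.classes_, and each row sums to 1
print(model.classes_)                      # e.g. array([0, 1]) or the label strings
print(model.predict_proba(X_test[:1])[0])  # class probabilities for one test row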
3. Fraud detection
Fraud detection involves identifying fraudulent activities or transactions. Logistic regression can help build a model that predicts the likelihood of fraud based on various factors.
The model can be trained on labeled data and used to predict fraud probability for new transactions. If the probability is high, the transaction can be flagged for investigation or declined.
Here’s an example Python code for fraud detection using logistic regression:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv('credit_card_transactions.csv')
X = data.drop('is_fraud', axis=1)
y = data['is_fraud']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predict the probability of a transaction being fraudulent
new_transaction = [1000, 'Internet', 'Monday']
fraud_probability = model.predict_proba([new_transaction])[0][1]
print("Probability of fraud: {:.2f}%".format(fraud_probability * 100))
This code uses logistic regression to build a fraud detection model. It first imports the necessary libraries: pandas for data manipulation and analysis, scikit-learn’s logistic regression model for classification tasks, and the train_test_split function from scikit-learn for splitting the dataset into training and testing sets.
The credit card transactions dataset is then loaded from a CSV file and stored in a pandas dataframe. The features are stored in a new dataframe X by dropping the is_fraud column, while the is_fraud column itself is stored in a pandas series y.
The dataset is then split into training and testing sets using the train_test_split function with a 0.2 test size and a random state of 42 for reproducibility. An instance of the logistic regression model is created and fitted to the training data. The model's performance is then evaluated on the test set using the score method, which calculates the accuracy of the model's predictions.
Finally, a new transaction is created, and the model is used to predict the probability of fraud for this new transaction using the predict_proba method. The probability of fraud is then printed as a percentage.
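In practice, the predicted probability is usually compared against a threshold to decide which transactions to flag; the cutoff is a business decision rather than something the model dictates. A small sketch, continuing from the code above and assuming a hypothetical cutoff of 0.8:
# Flag the transaction if the predicted fraud probability exceeds a chosen cutoff
FRAUD_THRESHOLD = 0.8  # hypothetical value; tune it to your tolerance for false alarms
if fraud_probability >= FRAUD_THRESHOLD:
    print("Transaction flagged for manual review")
else:
    print("Transaction approved")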
4. Email spam classification
Logistic regression can be used to classify emails as spam or not spam based on various factors such as email content, sender information, and subject line.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv('spam_emails.csv')
X = data.drop('is_spam', axis=1)
y = data['is_spam']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predict the probability of an email being spam for new emails
new_email = [0.3, 0.1, 0.8, 0.5, 0.2, 0.6]
spam_probability = model.predict_proba([new_email])[0][1]
print("Probability of spam: {:.2f}%".format(spam_probability * 100))
In this code, a logistic regression model is trained to predict whether an email is spam or not based on various features of the email such as the frequency of certain words or characters.
First, the dataset of spam emails is loaded using Pandas, and the independent variables are separated from the dependent variable. Then, the dataset is split into training and testing sets using the train_test_split function from scikit-learn.
Next, a logistic regression model is created using scikit-learn’s LogisticRegression class, and the model is trained on the training set using the fit method.
After training, the performance of the model is evaluated on the test set using the score method, which calculates the accuracy of the model.
Finally, the model is used to predict the probability of an email being spam for new emails. The features of the new email are represented by a list, and the predict_proba method of the model is used to calculate the probability of the email being spam, which is then printed to the console.
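The numeric features above assume the emails have already been converted into numbers, such as word or character frequencies. If you are starting from raw email text instead, a common approach is to vectorize it first; the sketch below assumes a hypothetical ‘text’ column in the same CSV:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
data = pd.read_csv('spam_emails.csv')  # assumes a raw 'text' column and an 'is_spam' label
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['is_spam'], test_size=0.2, random_state=42)
# Turn raw text into TF-IDF features, then fit logistic regression on top
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Accuracy: {:.2f}%".format(model.score(X_test, y_test) * 100))
# [0][1] is the probability of the class that sorts second (spam when labels are 0/1)
spam_probability = model.predict_proba(["Win a free prize now!!!"])[0][1]
print("Probability of spam: {:.2f}%".format(spam_probability * 100))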
5. Medical diagnosis
Logistic regression can be used to diagnose diseases based on various factors such as symptoms, medical history, and demographics.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv('medical_diagnosis_data.csv')
# Split the dataset into features and labels
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predict the probability of a diagnosis for a new patient
new_patient = [45, 'female', 'no', 'yes', 'yes']
diagnosis_probability = model.predict_proba([new_patient])[0][1]
print("Probability of diagnosis: {:.2f}%".format(diagnosis_probability * 100))
In this example, we have a labeled dataset of medical records with a binary label of ‘diagnosis’ or ‘no diagnosis’.
We split the dataset into training and test sets, and train a logistic regression model on the training set. Then, we evaluate the model’s performance on the test set using the score() method, which returns the accuracy of the model on the test set.
Once the model is trained, it can be used to predict the probability of diagnosis for new patients using the predict_proba() method. We define a new patient with some features (age, gender, symptoms, etc.) and predict the probability of diagnosis for this patient.
Note that the features in the dataset and the new patient are encoded as categorical variables (e.g. ‘female’, ‘no’) which need to be converted to numerical values before training the model. This can be done using techniques such as one-hot encoding or label encoding.
Additionally, some preprocessing steps such as missing value imputation or feature scaling may be necessary to prepare the dataset for modeling.
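A minimal sketch of how those preprocessing steps could be wired together in scikit-learn, assuming hypothetical names for the categorical and numeric columns:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
data = pd.read_csv('medical_diagnosis_data.csv')
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
# Hypothetical column names; replace them with the columns in your dataset
categorical_cols = ['gender', 'smoker', 'family_history', 'chest_pain']
numeric_cols = ['age']
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols),
])
model = Pipeline([('preprocess', preprocess),
                  ('classifier', LogisticRegression(max_iter=1000))])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("Accuracy: {:.2f}%".format(model.score(X_test, y_test) * 100))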
6. Sentiment analysis
Logistic regression can be used to analyze the sentiment of text data such as product reviews or social media posts.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv('sentiment_analysis.csv')
X = data.drop('label', axis=1)
y = data['label']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predict the probability of a sentence being positive for new data
new_sentence = [0.3, 0.1, 0.8, 0.5, 0.2, 0.6]
positive_probability = model.predict_proba([new_sentence])[0][1]
print(
"Probability of positive sentence: {:.2f}%".format(positive_probability * 100))
7. Website conversion optimization
Logistic regression can be used to optimize website conversion rates by predicting the probability of a user converting based on various factors such as website design, user behavior, and demographics.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv('website_conversion.csv')
X = data.drop('converted', axis=1)
y = data['converted']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predict the probability of a website visitor converting for new visitors
new_visitor = [25, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
conversion_probability = model.predict_proba([new_visitor])[0][1]
print("Probability of conversion: {:.2f}%".format(conversion_probability * 100))
This code loads the “website_conversion.csv” dataset using Pandas and splits it into training and test sets with the train_test_split function. It then trains a logistic regression model using LogisticRegression() from the scikit-learn library, and evaluates its performance on the test set using the score() method.
Finally, it uses the trained model to predict the probability of a new website visitor converting with the predict_proba() method, and prints the result as a percentage. The input data for the new visitor is an array of values representing various features of the visitor.
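Because the model outputs probabilities rather than just labels, it can also be used to rank visitors, for example to decide who should see a targeted offer. A small sketch, continuing from the code above:
import numpy as np
# Score every visitor in the test set and pick the ones most likely to convert
conversion_scores = model.predict_proba(X_test)[:, 1]
top_visitors = np.argsort(conversion_scores)[::-1][:10]  # positions of the 10 highest scores
print("Top visitor positions in the test set:", top_visitors)
print("Their predicted conversion probabilities:", conversion_scores[top_visitors])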
8. Predicting heart disease
Logistic regression can be used for predicting heart disease based on various risk factors such as age, gender, blood pressure, cholesterol levels, and smoking habits.
The goal is to identify individuals who are at higher risk of developing heart disease so that appropriate interventions can be taken to prevent or manage the disease.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv('heart_disease_data.csv')
X = data.drop('heart_disease', axis=1)
y = data['heart_disease']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
In this example, we’re using a labeled dataset of medical records with a binary label of ‘heart disease’ or ‘no heart disease’. We split the dataset into training and test sets, and train a logistic regression model on the training set.
Then, we evaluate the model’s performance on the test set using the score() method, which returns the accuracy of the model on the test set.
Once the model is trained, it can be used to predict the probability of heart disease for new individuals using the predict_proba() method:
new_patient = [60, 'male', 140, 80, 210, 'no', 'yes', 'no', 2]
heart_disease_probability = model.predict_proba([new_patient])[0][1]
print("Probability of heart disease: {:.2f}%".format(heart_disease_probability * 100))
In this example, we’re predicting the probability of heart disease for a new patient who is a 60-year-old male with a blood pressure of 140/80 mmHg and a total cholesterol level of 210 mg/dL, who doesn’t smoke, has a family history of heart disease, reports no exercise-induced chest pain, and has an electrocardiogram (ECG) reading indicating possible left ventricular hypertrophy.
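Beyond individual predictions, the fitted coefficients can indicate how each risk factor shifts the odds of heart disease: exponentiating a coefficient gives the multiplicative change in the odds for a one-unit increase in that feature. A brief sketch, assuming the features are already numeric (otherwise pair the coefficients with the encoded feature names instead):
import numpy as np
# Odds ratio per feature: exp(coefficient) > 1 raises the odds, < 1 lowers them
odds_ratios = np.exp(model.coef_[0])
for feature, ratio in zip(X.columns, odds_ratios):
    print("{}: odds ratio {:.2f}".format(feature, ratio))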
9. Spam classification
Logistic regression can be used for spam classification, which involves identifying whether an incoming message is spam or not based on its content.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv('spam_emails.csv')
X = data.drop('is_spam', axis=1)
y = data['is_spam']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predict the probability of an email being spam for new emails
new_email = [0.2, 0.8, 0.5, 0.1, 0.3, 0.9]
spam_probability = model.predict_proba([new_email])[0][1]
print("Probability of spam: {:.2f}%".format(spam_probability * 100))
This code loads the “spam_emails.csv” dataset using Pandas and splits it into training and test sets with the train_test_split function. It then trains a logistic regression model using LogisticRegression() from the scikit-learn library, and evaluates its performance on the test set using the score() method.
Finally, it uses the trained model to predict the probability of a new email being spam with the predict_proba() method, and prints the result as a percentage.
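Accuracy alone can be misleading when spam makes up a small share of the messages, so it is also worth looking at precision and recall. A quick check on the test set using scikit-learn’s classification_report:
from sklearn.metrics import classification_report
# Precision, recall and F1-score for each class on the test set
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))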
10. Breast Cancer Classification
Logistic regression is a binary classification algorithm commonly used in machine learning to predict the probability of an event occurring.
In the case of breast cancer classification, logistic regression can be used to predict whether a tumor is malignant or benign based on various features extracted from medical imaging data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv('breast_cancer.csv')
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predict the probability of a tumor being malignant for new data
new_tumor = [14.68, 20.13, 94.74, 684.5, 0.09867, 0.07200, 0.07395, 0.05259, 0.1586, 0.05922, 0.4727, 1.2400, 3.195, 45.40, 0.005718, 0.01501, 0.01477, 0.00813, 0.01870, 0.002626, 17.12, 30.70, 115.70, 981.7, 0.1411, 0.3542, 0.2779, 0.1383, 0.2589, 0.1030]
malignant_probability = model.predict_proba([new_tumor])[0][1]
print(
    "Probability of malignant tumor: {:.2f}%".format(malignant_probability * 100))
This code starts by importing the necessary libraries: pandas for data manipulation, LogisticRegression from sklearn.linear_model to create the logistic regression model and train_test_split from sklearn.model_selection to split the dataset into training and test sets.
Next, the breast cancer dataset is loaded from a CSV file and split into input features (X) and target variable (y), where ‘diagnosis’ is the target variable that indicates if a tumor is benign or malignant.
The dataset is then split into training and test sets, with 20% of the data being used for testing, and a random_state of 42 is set for reproducibility.
A logistic regression model is then created and trained on the training set, followed by evaluating the model’s performance on the test set.
Finally, a new tumor is created, and its probability of being malignant is predicted using the trained model’s predict_proba method. The predicted probability is printed to the console.
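If you do not have a ‘breast_cancer.csv’ file handy, scikit-learn ships an equivalent dataset with the same 30 tumor features, which can be loaded with load_breast_cancer from sklearn.datasets; the rest of the code stays the same:
import pandas as pd
from sklearn.datasets import load_breast_cancer
# Build X and y directly from the bundled Wisconsin breast cancer dataset
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)
# Note: in this dataset the target is 0 for malignant and 1 for benign,
# so predict_proba(...)[0][0] is the malignant probability with this encoding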
11. Titanic survival prediction
This code performs logistic regression on the Titanic dataset to predict whether a passenger survived or not based on their features.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Load the dataset
data = pd.read_csv('titanic_data.csv')
X = data.drop('Survived', axis=1)
y = data['Survived']
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train a logistic regression model on the training set
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model's performance on the test set
accuracy = model.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Predict the probability of a passenger surviving for new data
new_passenger = [1, 0, 30, 1, 0, 7.25]
survival_probability = model.predict_proba([new_passenger])[0][1]
print(
"Probability of survival: {:.2f}%".format(survival_probability * 100))
In this code, we load the Titanic survival prediction dataset and split it into training and test sets. We then train a logistic regression model on the training set, and evaluate its performance on the test set. Finally, we use the model to predict the probability of survival for a new passenger using their features (e.g. age, gender, ticket fare).
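The raw Titanic CSV usually contains text columns such as Name, Ticket and Cabin, plus missing ages, so some light preprocessing is needed before the model can be fit. A minimal sketch, assuming the standard Kaggle column names:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv('titanic_data.csv')
# Keep a handful of informative columns, encode Sex, and fill missing ages
features = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']].copy()
features['Sex'] = features['Sex'].map({'male': 0, 'female': 1})
features['Age'] = features['Age'].fillna(features['Age'].median())
X_train, X_test, y_train, y_test = train_test_split(
    features, data['Survived'], test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Accuracy: {:.2f}%".format(model.score(X_test, y_test) * 100))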
Conclusion
In conclusion, logistic regression is a powerful machine learning algorithm that can be used for a variety of classification problems, including spam classification, breast cancer classification, survival prediction on the Titanic dataset, and predicting heart disease. It works by fitting a linear decision boundary that separates the classes and passing the result through the logistic function to produce a probability, which is then used to make predictions on new data.
In Python, we can use libraries like pandas, scikit-learn, and matplotlib to perform logistic regression and visualize the results. By splitting the dataset into training and testing sets and evaluating the accuracy of the model on the testing set, we can assess how well the algorithm is performing and make improvements if necessary.
Overall, logistic regression is a valuable tool for data analysis and machine learning, and can be used in a wide range of applications in various fields.