Predicting Telco Customer Churn with Machine Learning: Findings from Data Analysis
Jabo Justin
Technical Support Engineer at Micro Focus and Tek-Experts (Advanced Authentication, Secure Login, Network Security Products Team); Data Analyst, Data Engineer, BI Analyst, and Team Leader/Manager at Azubi Africa
Introduction:
This repository contains the code and resources for analyzing customer attrition (churn) at a telecommunications firm (Telco). The project seeks to forecast customer churn and to understand the demographic differences between churned and non-churned customers.
Summary of Contents: Project Summary, Dataset, Results, Installation and Usage, Contributing, License.
Project Summary: For telecom firms, customer churn is a major concern. Companies can reduce churn by building predictive models and understanding the factors that drive it. This project compares the demographic traits of churned customers against those who did not churn, and uses machine learning techniques to forecast customer churn.
Dataset: The project uses the Telco Customer Churn dataset, which includes details on Telco clients' demographics, the services they subscribe to, and whether they churned. The dataset is provided in CSV format in the data directory.
By examining historical data, identifying trends, and applying statistical techniques, we will use machine learning to identify which customers are most likely to leave. This article explores how to use customer data and behavioural traits to develop a classification model that predicts customer churn, following the CRISP-DM framework.
Plan Scenario:
In this project, we seek to determine the likelihood that a client will leave the business, identify the primary churn indicators, and outline the retention tactics that may be used to avoid this issue.
Project Description:
Customer churn refers to the number of customers who stop doing business with a company during a specific period; in other words, the rate at which customers stop using a company's goods or services. Churn can result from a number of circumstances, including dissatisfaction with the product or service, competing alternatives, changes in customer needs, or external influences. Businesses need to understand and control churn because it can significantly affect revenue, growth, and customer satisfaction. In this project, we determine the likelihood that a client will leave the business, the key churn indicators, and the retention tactics that may be used to prevent this issue.
Objective:
To create a classification model that can reliably predict whether a customer will churn or not.
The deliverables include: customer churn prediction using machine learning algorithms; evaluation metrics (accuracy, precision, recall, and F1-score) for each model; a comparison of the demographic makeup of churned and non-churned customers; and visualizations, such as stacked bar charts, to display the findings.
Resources and Tools:
Steps of the project
The project consists of the following sections:
Exploratory Data Analysis (EDA)
One of the major challenges in developing a classification model is finding the pertinent features in your data. Identifying key features lets you distinguish between churning and non-churning clients; this takes a thorough understanding of the business as well as significant data analysis to find patterns and trends in the data. Posing questions and formulating hypotheses helps in understanding the data. To better comprehend the data, the following hypothesis and questions were developed.
Hypothesis:
Null: The length of time a customer has been with the business has no bearing on customer churn.
Alternative: Customers who have been with the company for a longer time are less likely to leave than clients who have been with the firm for a shorter time.
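One simple way to probe this hypothesis, once the data has been loaded (see the Data Reading section below), is to compare the tenure distributions of churned and retained customers with a non-parametric test. This is an optional sketch that assumes the customer DataFrame used later in this article; it is not part of the original notebook.
from scipy.stats import mannwhitneyu
# Tenure of churned vs. retained customers
tenure_churned = customer.loc[customer['Churn'] == 'Yes', 'tenure']
tenure_retained = customer.loc[customer['Churn'] == 'No', 'tenure']
# Test whether churned customers tend to have shorter tenure
stat, p_value = mannwhitneyu(tenure_churned, tenure_retained, alternative='less')
print(f"Mann-Whitney U statistic: {stat:.1f}, p-value: {p_value:.4f}")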
Business Questions:
Importing Libraries
from google.colab import drive
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import Pipeline
import pickle
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
Data Reading:
The first step of the analysis consists of reading and storing the data in a Pandas data frame using the pandas.read_csv function.
#Loading the dataset from google drive
drive.mount('/content/drive', force_remount=True)
customer = pd.read_csv('/content/drive/MyDrive/Azubi/lp3/Customer_Churn.csv')
Mounted at /content/drive
customer.head()
As shown above, the data set contains 19 independent variables that fall into one of three categories:
(1)?Demographic Information:
(2) Information about customer accounts
(3) Services Information Details:
# view the data with customerID set as the index
# (note: without assigning the result, this does not change the original DataFrame)
customer.set_index('customerID')
2. Exploratory Data Analysis and Data Cleaning
Exploratory data analysis involves examining the key elements of a data set, typically using visualization techniques and summary statistics. The goal is to understand the data, find trends and anomalies, and test assumptions before conducting further analyses.
Missing values and data types
At the start of EDA we want to learn as much as we can about the data, and the pandas DataFrame.info method is useful for this. It outputs a brief summary of the data frame that includes the column names and their data types, the number of non-null values, and the memory consumption.
# check the basic info of the dataset
customer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
# Statistical distribution of the dataset
customer.describe()
The data set has 21 columns and 7,043 observations, as seen above. The data set does not appear to contain any null values, but we notice that the column TotalCharges was incorrectly identified as an object. This column is a numeric variable, since it shows the total cost incurred by the customer, so we need to convert it to a numeric data type before analyzing it further. We can do this with the pd.to_numeric function. By default this function raises an exception when it encounters non-numeric data, but the errors='coerce' argument tells it to skip those cases and substitute a NaN instead.
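As a quick illustration of the errors='coerce' behaviour on a toy series (not part of the dataset), non-numeric entries such as blank strings become NaN instead of raising an error:
# Toy example: the blank string cannot be parsed, so it becomes NaN
pd.to_numeric(pd.Series(['29.85', ' ', '1889.5']), errors='coerce')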
# check the datatype
customer.dtypes
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object
dtype: object
The TotalCharges column is of type object instead of float, so it will be converted to a numeric data type:
# Convert the dtype of TotalCharges column from object to numeric
customer['TotalCharges'] = pd.to_numeric(customer['TotalCharges'], errors='coerce')
customer.dtypes
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges float64
Churn object
dtype: object
#check for missing values
customer.isna().sum()
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 11
Churn 0
dtype: int64
Now we can see that there are 11 missing values in the TotalCharges column. We will drop these rows, since this small number will not affect our analysis.
customer.dropna(subset=['TotalCharges'], inplace=True)
#Univariate Analysis
# Analyze the Churn column
customer["Churn"].value_counts()
No 5163
Yes 1869
Name: Churn, dtype: int64
From the above, a total of 1,869 customers churned.
# Create a histogram of the Tenure column
plt.hist(customer['tenure'], bins=20, color='purple')
plt.xlabel('Tenure (months)')
plt.ylabel('Frequency')
plt.show()
From the chart above we can see that the company gains a lot of new customers signing up for its services, while another sizeable group of customers has stayed for up to about 72 months.
#Bivariate Analysis
Let's look at the relationship between MonthlyCharges, TotalCharges, and customer churn.
sns.scatterplot(x='MonthlyCharges', y='TotalCharges', data=customer, hue="Churn")
plt.xlabel('Monthly Charges')
plt.ylabel('Total Charges')
plt.show()
From the scatter plot we can see that many more customers churned when their monthly charges were between roughly 70 and 105 dollars. However, customers with high total charges, which accumulate over a long tenure, were more likely to stay.
#sns.boxplot(x='Churn', y='MonthlyCharges', data=customer);
Let's check the relationship between contract type and customer churn.
sns.countplot(x='Contract', hue='Churn', data=customer)
plt.title('Contract vs Churn')
plt.xlabel('Contract')
plt.ylabel('Number of Customers')
plt.show()
We noticed that the longer the contract duration, the lower the churn rate. This might be because customers on month-to-month contracts pay more per month and are not tied to a long-term commitment.
In a pair plot of the numeric columns, the diagonal plots show the distribution of each column. For example, the distribution of tenure is skewed to the right, indicating that there are more customers with shorter tenures, while the distribution of MonthlyCharges is roughly bell-shaped, indicating an approximately normal distribution.
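For reference, a pair plot like the one described can be generated with seaborn; this is only a sketch using the numeric columns, since the original plotting code is not shown in this article:
# Pair plot of the numeric columns, coloured by churn status
sns.pairplot(customer[['tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']], hue='Churn')
plt.show()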
#Data Cleaning
#Issues With Data And How They Were Resolved
The customerID column has no impact on our analysis; therefore, we will drop it.
customer.drop("customerID", axis=1, inplace= True )
Answering Business Questions
Question 1. What is the overall churn rate for the company?
churn_rate = customer['Churn'].value_counts(normalize=True)['Yes']
print("Overall Churn Rate: {:.2f}".format(churn_rate))
Overall Churn Rate: 0.27
# Graphical representation of the churn rate
plt.figure(figsize=(6,5))
churn_rate = customer['Churn'].value_counts(normalize=True)['Yes']
plt.bar(['Churned', 'Not Churned'], [churn_rate, 1-churn_rate])
plt.xlabel('Churn')
plt.ylabel('Proportion')
plt.title('Overall Churn Rate')
plt.show()
The overall churn rate of the company is 27%, whereas 73% of customers did not churn.
#Save The Cleaned Dataset Into A New CSV File
customer.to_csv("df.csv", index= False)
#load the new file
df=pd.read_csv("df.csv")
#check the head()
df.head()
Machine Learning and Modelling
Feature Processing and Engineering
Feature engineering involves transforming and creating new features from the existing data to enhance the predictive power of the model.
Drop duplicate values
#Lets check the shape of the dataset first
df.shape[0]
7032
# Let's check for duplicates (count the duplicated rows)
df.duplicated().sum()
22
# drop duplicate values
df.drop_duplicates(inplace=True, keep='first')
# recheck the shape
df.shape[0]
7010
df.isna().sum()
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
Replacing missing values using SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
# Fit the imputer to the data, focusing on the tenure column:
imputer.fit(df[['tenure']])
# Transform the data by replacing any missing values with the imputed values:
df[['tenure']] = imputer.transform(df[['tenure']])
#check again for missing values
df.isna().sum()
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
Creating New Features
# Create a new feature for the ratio of MonthlyCharges to TotalCharges
df['MonthlyChargesRatio'] = df['MonthlyCharges'] / df['TotalCharges']
# Calculate the average MonthlyCharges for each customer
df['AverageMonthlyCharges'] = df['TotalCharges'] / df['tenure']
# Create a new feature indicating whether the customer has both online security and backup
df['HasSecurityAndBackup'] = (df['OnlineSecurity'] == 'Yes') & (df['OnlineBackup'] == 'Yes')
# Create a new feature representing the number of additional services subscribed to
additional_services = ['DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
# These columns contain 'Yes'/'No' strings, so count the 'Yes' values rather than summing the strings
df['AdditionalServices'] = (df[additional_services] == 'Yes').sum(axis=1)
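A quick sanity check of the engineered columns (an optional step, not in the original notebook):
# Inspect the newly created features
df[['MonthlyChargesRatio', 'AverageMonthlyCharges', 'HasSecurityAndBackup', 'AdditionalServices']].head()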
# Define the bin ranges and labels for tenure
bins = [0, 6, 24, float('inf')]
labels = ['New', 'Established', 'Long-term']
# Convert the tenure column into categorical bins
df['TenureCategory'] = pd.cut(df['tenure'], bins=bins, labels=labels, right=False)
# (Alternative) one-hot encode selected columns with pandas.get_dummies;
# this frame is superseded by the OneHotEncoder step below
encoded_df = pd.get_dummies(df, columns=['gender', 'InternetService', 'PaymentMethod'])
df.head()
Feature Encoding
# Create an instance of the OneHotEncoder
encoder = OneHotEncoder()
# Specify the categorical columns to encode
categorical_columns = ['gender', 'InternetService', 'Contract']
# Fit the encoder on the categorical columns and transform the data
encoded_data = encoder.fit_transform(df[categorical_columns])
# Convert the encoded data to a DataFrame, keeping df's row index so the concat below aligns correctly
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(categorical_columns), index=df.index)
# Concatenate the encoded DataFrame with the original data
data_encoded = pd.concat([df, encoded_df], axis=1)
# Drop the original categorical columns
data_encoded.drop(categorical_columns, axis=1, inplace=True)
#Splitting The Dataset Into Training & Testing
# Exclude the 'Churn' column from the feature variables
# Assign the feature variables to 'x'
x = df.drop('Churn', axis=1)
# Assign the target variable to 'y'
y = df['Churn']
# split the data into training and evaluation sets.
x_train, x_eval, y_train, y_eval = train_test_split(x, y, test_size=0.2, random_state=0)
#check the shape of the split dataset
print(x_train.shape, y_train.shape)
print(x_eval.shape, y_eval.shape)
(5608, 24) (5608,)
(1402, 24) (1402,)
Label Encoding
#Impute missing values with mode
x_train = x_train.fillna(x_train.mode().iloc[0])
x_eval = x_eval.fillna(x_eval.mode().iloc[0])
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Perform label encoding on x_train (each column is encoded independently)
le = LabelEncoder()
x_train = x_train.apply(le.fit_transform)
# Perform label encoding on x_eval
# (note: fitting a separate encoder on the evaluation set can map the same category to a different code)
le = LabelEncoder()
x_eval = x_eval.apply(le.fit_transform)
# Perform label encoding on y_train and y_test
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_eval_encoded = le.transform(y_eval)
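As noted above, encoding the evaluation set independently risks inconsistent mappings. A safer pattern, shown here only as a sketch and not as the approach used in this article, is to fit a single OrdinalEncoder on the training data and reuse it for the evaluation data:
# Fit one encoder on the training features and reuse it; unseen categories are mapped to -1
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
x_train_enc = ordinal_encoder.fit_transform(x_train)
x_eval_enc = ordinal_encoder.transform(x_eval)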
# Check head of X_test
x_eval.head()
# Check head of X_train
x_train.head()
# Check y_test
y_eval_encoded
array([0, 0, 0, ..., 1, 0, 1])
# Check y_train
y_train_encoded
array([1, 0, 1, ..., 0, 1, 0])
Feature Scaling
# Create a scaler object
scaler = StandardScaler()
# Fit on the training data
scaler.fit(x_train)
# Apply the scaler transform to both the training and testing sets
x_train_scaled = scaler.transform(x_train)
x_eval_scaled = scaler.transform(x_eval)
#Train set Balancing
ros = RandomOverSampler()
x_train_resampled, y_train_resampled = ros.fit_resample(x_train, y_train)
# Check sample of X_train
x_train_resampled.sample(5, random_state=3)
# Check sample of y_train
y_train.sample(5, random_state=4)
1328 No
3379 Yes
791 No
4087 No
86 No
Name: Churn, dtype: object
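To confirm the effect of the oversampling, the class counts before and after resampling can be compared (output omitted here; after resampling the two classes should be equal in size):
# Class distribution before and after random oversampling
print(y_train.value_counts())
print(pd.Series(y_train_resampled).value_counts())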
Insights from the model coefficients
Features with positive coefficients:
Partner: a positive coefficient indicates that having a partner increases the likelihood of the predicted outcome.
MultipleLines: a positive coefficient means that having multiple phone lines increases the likelihood of the predicted outcome.
DeviceProtection: a positive coefficient indicates that device protection increases the likelihood of the predicted outcome.
StreamingTV: a positive coefficient means that streaming TV increases the likelihood of the predicted outcome.
StreamingMovies: a positive coefficient indicates that streaming movies increases the likelihood of the predicted outcome.
PaymentMethod: a positive coefficient means that specific payment methods increase the likelihood of the predicted outcome.
TotalCharges: a positive coefficient means that higher total charges increase the likelihood of the predicted outcome.
Features with negative coefficients:
gender: a negative coefficient indicates that being female lowers the likelihood of the predicted outcome.
SeniorCitizen: a negative coefficient means the predicted outcome is less likely for senior citizens.
Dependents: a negative coefficient indicates that having dependents reduces the probability of the predicted outcome.
tenure: a negative coefficient means the predicted outcome is less likely the longer the tenure.
PhoneService: a negative coefficient indicates that having phone service decreases the likelihood of the predicted outcome.
InternetService: a negative coefficient means a lower likelihood of the predicted outcome.
OnlineSecurity: a negative coefficient indicates that having online security decreases the likelihood of the predicted outcome.
TechSupport: a negative coefficient indicates that having tech support reduces the likelihood of the predicted outcome.
Contract: a negative coefficient denotes a lower probability of the predicted outcome with a longer contract duration.
PaperlessBilling: a negative coefficient means a lower likelihood of the predicted outcome when paperless billing is used.
MonthlyCharges: a negative coefficient suggests that higher monthly charges make the predicted outcome less likely.
The intercept term (-0.06534620607998345) represents the baseline log-odds of the predicted outcome when all features are zero.
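For reference, coefficients like those interpreted above can be extracted from a fitted logistic regression. The fitting code is not shown in this article, so the following is only a sketch that assumes the scaled training data and encoded labels from the previous steps:
# Fit a logistic regression and pair each coefficient with its feature name
coef_model = LogisticRegression(max_iter=1000)
coef_model.fit(x_train_scaled, y_train_encoded)
coefficients = pd.Series(coef_model.coef_[0], index=x_train.columns).sort_values()
print(coefficients)
print("Intercept:", coef_model.intercept_[0])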
#creating my pipeline
# Define the scaler and encoder
scaler = StandardScaler()
encoder = OrdinalEncoder()
# Define the logistic regression model
logreg_model = LogisticRegression()
# Define the pipeline (encode the features first, then scale them, then fit the model)
pipeline = Pipeline([
    ('encoder', encoder),
    ('scaler', scaler),
    ('model', logreg_model),
])
print(pipeline)
Pipeline(steps=[('encoder', OrdinalEncoder()), ('scaler', StandardScaler()),
                ('model', LogisticRegression())])
# Define the pipeline steps (encode the features before scaling them;
# category codes seen only in the evaluation data are mapped to -1 rather than raising an error)
steps = [
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
]
# Create the pipeline
pipeline = Pipeline(steps)
# Fit the pipeline on the training data
pipeline.fit(x_train, y_train)
# Make predictions on the test data
y_pred = pipeline.predict(x_eval)
y_pred
array(['No', 'No', 'No', ..., 'Yes', 'No', 'No'], dtype=object)
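The evaluation metrics imported at the start can now be applied to these predictions; a brief sketch (the actual scores depend on the split and are not reproduced here):
# Evaluate the pipeline's predictions on the hold-out set
print(confusion_matrix(y_eval, y_pred))
print(classification_report(y_eval, y_pred))
print("Accuracy:", accuracy_score(y_eval, y_pred))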
# Defining the models
knn_model = KNeighborsClassifier()
logreg_model = LogisticRegression()
gbm_model = GradientBoostingClassifier()
# Define the scaler and encoder
scaler = StandardScaler()
encoder = OrdinalEncoder()
# Define the preprocessing pipeline (encode first, then scale)
pipeline = Pipeline([
    ('encoder', encoder),
    ('scaler', scaler),
])
# Save the models, scaler, encoder, and pipeline
models = [knn_model, logreg_model, gbm_model]
components = [scaler, encoder]
file_prefix = 'my_project'
for idx, model in enumerate(models):
    with open(f'{file_prefix}_model_{idx}.pkl', 'wb') as f:
        pickle.dump(model, f)
for idx, component in enumerate(components):
    with open(f'{file_prefix}_{type(component).__name__}_{idx}.pkl', 'wb') as f:
        pickle.dump(component, f)
with open(f'{file_prefix}_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)
print('Project components exported successfully.')
Project components exported successfully.
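To reuse the exported artifacts later, they can be loaded back with pickle; a minimal sketch (note that the models saved above have not yet been fitted, so they would still need training after loading):
# Reload the saved preprocessing pipeline and one of the saved models
with open('my_project_pipeline.pkl', 'rb') as f:
    loaded_pipeline = pickle.load(f)
with open('my_project_model_1.pkl', 'rb') as f:
    loaded_logreg = pickle.load(f)
print(loaded_pipeline, loaded_logreg)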
Summary and Conclusions:
Using the Telco Customer Churn dataset, we have gone through a complete end-to-end machine learning exercise in this post. We began by cleaning the data and analyzing it with visualizations. We then engineered the categorical data into numeric variables in order to build a machine learning model. Once the data had been transformed, we experimented with several machine learning algorithms (including K-nearest neighbours, logistic regression, and gradient boosting) using default parameters. Finally, we optimized the hyperparameters of the Gradient Boosting Classifier, the best performing model, to achieve an accuracy of approximately 80%, about 6% better than the baseline.
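The hyperparameter optimization mentioned above can be carried out with the GridSearchCV class imported earlier; the exact grid used is not shown in this article, so the following is only an illustrative sketch:
# Illustrative hyperparameter search for the Gradient Boosting Classifier
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5],
}
grid_search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train_encoded)
print(grid_search.best_params_, grid_search.best_score_)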
It is crucial to emphasize that each project has a different set of precise machine learning task steps. Even though we followed a linear procedure in this post, machine learning projects typically follow an iterative approach rather than a linear one. As we learn more about the issue we're trying to solve, earlier steps are frequently revisited.
Note: You can find more details about this project on my GitHub repository or visit my Medium account if you're interested in doing so.