Predicting Telco Customer Churn Using Machine Learning: Findings from the Data Analysis

This repository contains code and resources for analyzing customer attrition at a telecoms firm (Telco). The project seeks to forecast customer churn and to understand the demographic differences between churned and non-churned customers.

Summary of Contents: Project Summary, Results of Dataset, Installation and Usage, Contributor’s Permit

Project Summary

For telecom firms, customer churn is a major concern. Companies can reduce churn by building predictive models and understanding the factors that influence it. This project compares the demographic traits of churned customers against those who did not churn and uses machine learning techniques to forecast customer churn.

Dataset

The project uses the Telco Customer Churn dataset, which includes details on Telco clients’ demographics, the services they subscribe to, and whether they churned. The dataset is provided in CSV format in the data directory.

By examining previous data, seeing trends, and using other statistical techniques, we will use machine learning to identify which consumers are most likely to leave. This article will explore how to use customer data and behavioral traits to develop a classification model that can predict customer turnover using the CRISP-DM framework.

Plan Scenario:

In this project, we seek to determine the probability that a customer will leave the business, the primary churn indicators, and the retention tactics that may be used to address the problem.

Project Description:

Customer churn is the number of customers who stop doing business with a company during a specific time period. In other words, it measures how often customers stop using a company’s goods or services. Churn can result from a number of circumstances, including dissatisfaction with the product or service, competing alternatives, changes in customer needs, or outside influences. Businesses need to understand and control customer churn because it can significantly affect sales, growth, and customer satisfaction.

Objective: to create a classification model that can reliably predict whether a customer will churn.

The project delivers customer churn predictions from machine learning algorithms; evaluation measures for each model (such as accuracy, precision, recall, and F1-score); a comparison of the demographic makeup of churned and non-churned customers; and visualizations, such as stacked bar charts, that display the findings.

Resources and Tools:

  1. A dataset.
  2. Jupyter Notebook: scikit-learn, pandas-profiling, pandas, Matplotlib, Seaborn, and other machine learning libraries.

Steps of the project

The project consists of the following sections:

  1. Data Reading
  2. Exploratory Data Analysis and Data Cleaning
  3. Data Visualization
  4. Feature Importance
  5. Feature Engineering
  6. Setting a baseline
  7. Splitting the data in training and testing sets
  8. Assessing multiple algorithms
  9. Algorithm selected: Gradient Boosting
  10. Hyperparameter tuning
  11. Performance of the model
  12. Drawing conclusions — Summary

Exploratory Data Analysis (EDA)

Finding the relevant features in your data can be one of the major challenges in developing a classification model. Identifying key features helps distinguish churning from non-churning customers; this takes a thorough understanding of the business as well as significant data analysis to find patterns and trends in the data. Posing questions and formulating hypotheses helps in understanding the data. The following hypotheses and questions were developed for that purpose.


Hypothesis

Null: The length of time a customer has been with the business has no bearing on customer churn.

Alternative: Customers who have been with the company for a longer time are less likely to leave than customers who have been with the firm for a shorter time.
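One way to check a hypothesis like this is a rank-based two-sample test on tenure. The sketch below uses scipy's mannwhitneyu on invented tenure values; the choice of test, the 0.05 threshold, and the numbers are assumptions for illustration, not the method used in this project.

```python
from scipy.stats import mannwhitneyu

# Hypothetical tenures in months, invented for illustration only
tenure_churned = [1, 2, 3, 5, 8, 4, 7, 6]
tenure_stayed = [24, 36, 48, 60, 72, 30, 45, 55]

# One-sided test: do churned customers tend to have shorter tenures?
result = mannwhitneyu(tenure_churned, tenure_stayed, alternative='less')

# A small p-value is evidence against "tenure has no bearing on churn"
reject_no_effect = result.pvalue < 0.05
print(f"U = {result.statistic}, p = {result.pvalue:.5f}, reject: {reject_no_effect}")
```

On the real data, the two samples would be the tenure column split by the Churn column.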

Business Questions:

  1. What is the overall churn rate for the company?
  2. What are the demographics of customers who churned compared to those who did not?
  3. How can the company reduce churn rate and retain more customers?

Importing Libraries

from google.colab import drive
import pickle
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_curve,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    LabelEncoder,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

%matplotlib inline

Data Reading:

The first step of the analysis consists of reading and storing the data in a Pandas data frame using the pandas.read_csv function.

#Loading the dataset from google drive

drive.mount('/content/drive', force_remount=True)
customer = pd.read_csv('/content/drive/MyDrive/Azubi/lp3/Customer_Churn.csv')
Mounted at /content/drive

The data set consists of 19 independent variables that fall into one of three categories:

(1) Demographic Information:

  • gender: Whether the client is a woman or a man (Female, Male).
  • SeniorCitizen: Indicates whether the client is a senior citizen (0, 1).
  • Partner: Indicates whether or not the client has a partner (Yes, No).
  • Dependents: Indicates whether the client has dependents (Yes, No).

(2) Customer Account Information:

  • tenure: The number of months the customer has been with the business (various numeric values).
  • Contract: The type of contract the customer currently has (Month-to-month, One year, Two year).
  • PaperlessBilling: Whether the client uses paperless billing (Yes, No).
  • PaymentMethod: The payment method chosen by the customer (credit card, bank transfer, electronic check, mailed check).
  • MonthlyCharges: The amount charged to the customer each month (various numeric values).
  • TotalCharges: The total amount charged to the customer (various numeric values).

(3) Services Information:

  • PhoneService: Whether the client has a phone service (Yes, No).
  • MultipleLines: Whether the client has more than one line (No phone service, No, Yes).
  • InternetService: Whether the client subscribes to the company’s Internet service (DSL, Fiber optic, No).
  • OnlineSecurity: Whether the client has online security (No internet service, No, Yes).
  • OnlineBackup: Whether the client has an online backup (No internet service, No, Yes).
  • DeviceProtection: Whether the client has device protection (No internet service, No, Yes).
  • TechSupport: Whether the client has tech support (No internet service, No, Yes).
  • StreamingTV: Whether the client has streaming TV (No internet service, No, Yes).
  • StreamingMovies: Whether the client has streaming movies (No internet service, No, Yes).

# set customerID column as index

2. Exploratory Data Analysis and Data Cleaning

Exploratory data analysis involves examining the key elements of a data set, typically using visualization techniques and summary statistics. The goal is to understand the data, find trends and anomalies, and test assumptions before conducting further analyses.

Missing values and data types

At the start of EDA, we want to learn as much as we can about the data, which is where the pandas.DataFrame.info method is useful. This method outputs a brief summary of the data frame that includes the names of the columns and their data types, the number of non-null values, and the memory consumption.

# check the basic info of the dataset
customer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
# Statistical distribution of the dataset

The data set has 21 columns and 7043 observations, as seen above. It doesn’t appear to contain any null values, but the column TotalCharges was incorrectly identified as an object. This column is a numeric variable, since it shows the total amount charged to the customer. We need to convert it into a numeric data type so that we can analyze it further. We can do this with the pd.to_numeric function. By default, this function raises an exception when it encounters non-numeric data. However, we can pass the errors='coerce' argument so that those values are replaced with NaN instead.
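As a small illustration of that behavior, the sketch below converts a handful of made-up strings. The values are hypothetical; the blank string stands in for whatever non-numeric entries a real column might contain.

```python
import pandas as pd

# Hypothetical values: the blank string stands in for a non-numeric entry
raw = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# errors='coerce' turns entries that cannot be parsed into NaN instead of raising
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)
print(converted.isna().sum())
```

Counting the NaN values afterwards shows how many entries could not be parsed, which is the same check the missing-value count performs on the full data set.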

The dataset contains 21 columns and 7043 rows:

# check the datatype
customer.dtypes

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object        

The TotalCharges column is stored as object instead of float, so it is converted to a numeric data type.

# Convert the dtype of TotalCharges column from object to numeric
customer['TotalCharges'] = pd.to_numeric(customer['TotalCharges'], errors='coerce')
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object        

# check for missing values
customer.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64        

We can now see that there are 11 missing values in the TotalCharges column. We drop these rows, since 11 out of 7043 observations is too few to affect our data analysis.

customer.dropna(subset=['TotalCharges'], inplace=True)        

#Univariate Analysis

# Analyze the Churn column
customer['Churn'].value_counts()

No     5163
Yes    1869
Name: Churn, dtype: int64        

From the above, a total of 1869 customers churned, while 5163 did not.

# Create a histogram of the Tenure column
plt.hist(customer['tenure'], bins=20, color='purple')
plt.xlabel('Tenure (months)')

The chart above shows that the company gets a lot of new customers signing up for services, and that tenures extend to a little over 70 months.

#Bivariate Analysis

Let's look at the relationship between monthly charges, total charges, and customer churn.

sns.scatterplot(x='MonthlyCharges', y='TotalCharges', data=customer, hue="Churn")
plt.xlabel('Monthly Charges')
plt.ylabel('Total Charges')        

The scatter plot shows that many more customers churned when their monthly charges were between about 70 and 105 dollars. However, many more customers stayed when their total charges were high.

#sns.boxplot(x='Churn', y='MonthlyCharges', data=customer);        

Let's check the relationship between contract type and customer churn.

sns.countplot(x='Contract', hue='Churn', data=customer)
plt.title('Contract vs Churn')
plt.ylabel('Number of Customers')        

We noticed that the longer the contract duration, the lower the churn rate. This may be because customers pay more when they are on month-to-month contracts.
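The count plot shows raw counts; a churn rate per contract type makes the same comparison numerically. The sketch below uses a row-normalized cross-tabulation on a few invented rows, so the figures it prints are illustrative only and say nothing about the real data set.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    'Contract': ['Month-to-month'] * 4 + ['One year'] * 3 + ['Two year'] * 3,
    'Churn': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No'],
})

# normalize='index' turns each row of counts into proportions that sum to 1
churn_by_contract = pd.crosstab(toy['Contract'], toy['Churn'], normalize='index')

# The 'Yes' column is the churn rate within each contract type
print(churn_by_contract['Yes'])
```

Applied to the customer data frame, the same call would give the churn rate for each of the three contract types.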

In the pair plot, the diagonal plots show the distribution of each column. For example, the distribution of tenure is skewed to the right, indicating that there are more customers with shorter tenures. The distribution of MonthlyCharges is roughly bell-shaped, suggesting an approximately normal distribution.

#Data Cleaning

#Issues With Data And How They Were Resolved

The customerID column has no impact on our analysis, so we drop it.

customer.drop("customerID", axis=1, inplace=True)

Answering Business Questions

Question 1. What is the overall churn rate for the company?

churn_rate = customer['Churn'].value_counts(normalize=True)['Yes']
print("Overall Churn Rate: {:.2f}".format(churn_rate))
Overall Churn Rate: 0.27
# Graphical representation of the churn rate
# (a bar chart is assumed here for the plot type)
churn_rate = customer['Churn'].value_counts(normalize=True)['Yes']
plt.bar(['Churned', 'Not Churned'], [churn_rate, 1 - churn_rate])
plt.title('Overall Churn Rate')

The overall churn rate of the company is 27%, whereas the share of customers who did not churn is 73%.
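The churn rate also fixes a natural baseline for the models: always predicting the majority class. The sketch below shows this with scikit-learn's DummyClassifier on made-up labels that mirror a 27% churn rate; using DummyClassifier for the baseline is an assumption, since the baseline code is not shown in this post.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels mirroring a 27% churn rate
y = np.array(['No'] * 73 + ['Yes'] * 27)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any trained model has to beat this accuracy to be worth using, which is why accuracy alone can be misleading on imbalanced data.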

#Save The Cleaned Dataset Into A New CSV File

customer.to_csv("df.csv", index= False)

#load the new file
df = pd.read_csv("df.csv")

#check the head()
df.head()

Machine Learning and Modelling

Feature Processing and Engineering

Feature engineering involves transforming and creating new features from the existing data to enhance the predictive power of the model.

Drop duplicate values

# Let's check the shape of the dataset first
df.shape

# Let's check for duplicates
df.duplicated().sum()

# drop duplicate values
df.drop_duplicates(inplace=True, keep='first')

# recheck the shape
df.shape

# check for missing values
df.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64        

Replacing missing values using SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')

# Fit the imputer to the data, focusing on the "tenure" column:
imputer.fit(df[['tenure']])

# Transform the data by replacing the missing values with the imputed values:
df[['tenure']] = imputer.transform(df[['tenure']])

# check again for missing values
df.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64        

Creating New Features

# Create a new feature for the ratio of MonthlyCharges to TotalCharges
df['MonthlyChargesRatio'] = df['MonthlyCharges'] / df['TotalCharges']

# Calculate the average MonthlyCharges for each customer
df['AverageMonthlyCharges'] = df['TotalCharges'] / df['tenure']

# Create a new feature indicating whether the customer has both online security and backup
df['HasSecurityAndBackup'] = (df['OnlineSecurity'] == 'Yes') & (df['OnlineBackup'] == 'Yes')

# Create a new feature representing the number of additional services subscribed to.
# The service columns hold strings, so count the 'Yes' values instead of summing them.
additional_services = ['DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']
df['AdditionalServices'] = (df[additional_services] == 'Yes').sum(axis=1)

# Define the bin ranges and labels
bins = [0, 6, 24, float('inf')]
labels = ['New', 'Established', 'Long-term']

# Convert the tenure column into categorical bins (use df, not customer, so the indexes line up)
df['TenureCategory'] = pd.cut(df['tenure'], bins=bins, labels=labels, right=False)

# One-hot encode selected categorical columns with pandas
encoded_df = pd.get_dummies(df, columns=['gender', 'InternetService', 'PaymentMethod'])
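To see where the tenure bins above draw their boundaries, the sketch below applies the same bins and labels to a few hypothetical tenures placed on and around the bin edges. With right=False, each bin includes its left edge and excludes its right edge.

```python
import pandas as pd

bins = [0, 6, 24, float('inf')]
labels = ['New', 'Established', 'Long-term']

# Hypothetical tenures chosen to sit on and around the bin edges
tenure = pd.Series([0, 5, 6, 23, 24, 72])

# right=False gives the intervals [0, 6), [6, 24) and [24, inf)
category = pd.cut(tenure, bins=bins, labels=labels, right=False)

print(category.tolist())
# ['New', 'New', 'Established', 'Established', 'Long-term', 'Long-term']
```

A customer with exactly 6 months of tenure therefore counts as Established, and one with exactly 24 months as Long-term.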

Feature Encoding

# Create an instance of the OneHotEncoder
encoder = OneHotEncoder()

# Specify the categorical columns to encode
categorical_columns = ['gender', 'InternetService', 'Contract']

# Fit the encoder on the categorical columns and transform the data
encoded_data = encoder.fit_transform(df[categorical_columns])

# Convert the encoded data to a DataFrame, reusing df's index so the rows stay aligned
encoded_df = pd.DataFrame(
    encoded_data.toarray(),
    columns=encoder.get_feature_names_out(categorical_columns),
    index=df.index,
)

# Concatenate the encoded DataFrame with the original data
data_encoded = pd.concat([df, encoded_df], axis=1)

# Drop the original categorical columns
data_encoded.drop(categorical_columns, axis=1, inplace=True)        

#Splitting The Dataset Into Training & Testing

# Exclude the 'Churn' column from the feature variables

# Assign the feature variables to 'x'
x = df.drop('Churn', axis=1)  

# Assign the target variable to 'y'
y = df['Churn']

# split the data into training and evaluation sets. 

x_train, x_eval, y_train, y_eval = train_test_split(x, y, test_size=0.2, random_state=0)        

#check the shape of the split dataset

print(x_train.shape, y_train.shape)
print(x_eval.shape, y_eval.shape)
(5608, 24) (5608,)
(1402, 24) (1402,)

Feature Encoding

# Impute missing values with the training-set mode

train_mode = x_train.mode().iloc[0]
x_train = x_train.fillna(train_mode)
x_eval = x_eval.fillna(train_mode)

# Label-encode the non-numeric feature columns.
# Each encoder is fitted on x_train only and reused on x_eval, so both sets share one mapping.
categorical_features = x_train.select_dtypes(exclude='number').columns
for col in categorical_features:
    le = LabelEncoder()
    x_train[col] = le.fit_transform(x_train[col])
    x_eval[col] = le.transform(x_eval[col])

# Perform label encoding on y_train and y_eval
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_eval_encoded = le.transform(y_eval)

# Check head of x_eval
x_eval.head()

# Check head of x_train
x_train.head()

# Check y_eval
y_eval_encoded

array([0, 0, 0, ..., 1, 0, 1])

# Check y_train
y_train_encoded

array([1, 0, 1, ..., 0, 1, 0])

Feature Scaling

# Create a scaler object
scaler = StandardScaler()

# Fit on the training data
scaler.fit(x_train)

# Apply the scaler transform to both the training and evaluation sets
x_train_scaled = scaler.transform(x_train)
x_eval_scaled = scaler.transform(x_eval)

#Train set Balancing

ros = RandomOverSampler()
x_train_resampled, y_train_resampled = ros.fit_resample(x_train, y_train)
# Check sample of X_train

x_train_resampled.sample(5, random_state=3)
# Check sample of y_train

y_train.sample(5, random_state=4)
1328     No
3379    Yes
791      No
4087     No
86       No
Name: Churn, dtype: object        
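To make the idea behind random oversampling concrete, the sketch below rebalances a small invented training set with pandas alone. It is an illustration of the technique under simple assumptions (every original row is kept and minority rows are repeated at random), not a reproduction of how RandomOverSampler is implemented.

```python
import pandas as pd

# Hypothetical imbalanced training data, invented for illustration only
train = pd.DataFrame({
    'tenure': [1, 3, 40, 55, 60, 72, 12, 30],
    'Churn': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No'],
})

target_size = train['Churn'].value_counts().max()

parts = []
for _, group in train.groupby('Churn'):
    parts.append(group)  # keep every original row
    n_extra = target_size - len(group)
    if n_extra > 0:
        # Repeat randomly chosen rows of the smaller class until it matches the largest
        parts.append(group.sample(n=n_extra, replace=True, random_state=0))

balanced = pd.concat(parts, ignore_index=True)
print(balanced['Churn'].value_counts().to_dict())
```

Only the training set should be rebalanced this way; the evaluation set keeps its original class proportions so that the reported scores reflect real conditions.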

Insights on the interpreter

Here the predicted outcome is churn (the 'Yes' class). A positive coefficient increases its likelihood and a negative coefficient decreases it.

Features with positive coefficients:

  • Partner: The likelihood of the predicted outcome increases when the customer has a partner.
  • MultipleLines: The likelihood increases when the customer has several phone lines.
  • DeviceProtection: The likelihood increases with device protection.
  • StreamingTV: The likelihood increases when streaming TV is present.
  • StreamingMovies: The likelihood increases with streaming movies.
  • PaymentMethod: The likelihood increases with specific payment methods.
  • TotalCharges: The likelihood increases with higher total charges.

Features with negative coefficients:

  • gender: A woman has a lower chance of the predicted outcome.
  • SeniorCitizen: The predicted outcome is less likely for senior citizens.
  • Dependents: The presence of dependents reduces the probability of the predicted outcome.
  • tenure: The predicted outcome is less likely with a longer tenure.
  • PhoneService: Having phone service decreases the likelihood of the predicted outcome.
  • InternetService: There is a lower chance that the predicted outcome will occur.
  • OnlineSecurity: Having online security decreases the likelihood of the predicted outcome.
  • TechSupport: The presence of tech support reduces the chance of the predicted outcome.
  • Contract: A longer contract duration decreases the probability of the predicted outcome.
  • PaperlessBilling: The predicted outcome is less likely when paperless billing is used.
  • MonthlyCharges: Higher monthly charges make the predicted outcome less likely.

When all features are zero, the intercept term (-0.06534620607998345) gives the baseline log-odds of the predicted outcome.
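In logistic regression, the intercept and coefficients live on the log-odds scale, so they are passed through the logistic (sigmoid) function to obtain a probability. As a worked example, the sketch below converts the intercept quoted above; it assumes nothing beyond that reported value.

```python
import math

# Intercept quoted in the text above
intercept = -0.06534620607998345

# The logistic (sigmoid) function maps log-odds to a probability
baseline_probability = 1.0 / (1.0 + math.exp(-intercept))

print(f"{baseline_probability:.3f}")
```

The result is just under 0.5, which is what a slightly negative log-odds value implies: with every feature at zero, the model leans marginally towards the negative class.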

#creating my pipeline

# The features were already encoded above, so the pipeline only scales them and fits the model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

# Fit the pipeline on the training data
pipeline.fit(x_train, y_train)

# Make predictions on the evaluation data
y_pred = pipeline.predict(x_eval)
y_pred

array(['No', 'No', 'No', ..., 'Yes', 'No', 'No'], dtype=object)
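Predictions like these are scored with the evaluation measures named at the start of this post. The sketch below computes accuracy, precision, recall, and F1-score on a few invented labels; on the real data, y_eval and y_pred would take the place of the toy lists.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels and predictions, invented for illustration only
y_true = ['No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No']
y_hat = ['No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No']

# pos_label='Yes' makes churn the positive class for precision, recall, and F1
scores = {
    'accuracy': accuracy_score(y_true, y_hat),
    'precision': precision_score(y_true, y_hat, pos_label='Yes'),
    'recall': recall_score(y_true, y_hat, pos_label='Yes'),
    'f1': f1_score(y_true, y_hat, pos_label='Yes'),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because churners are the minority class, precision and recall for the 'Yes' class usually say more about a churn model than accuracy does.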
# Defining the models
knn_model = KNeighborsClassifier()
logreg_model = LogisticRegression()
gbm_model = GradientBoostingClassifier()

# Define the scaler and encoder
scaler = StandardScaler()
encoder = OrdinalEncoder()

# Define the pipeline
pipeline = Pipeline([
    ('scaler', scaler),
    ('encoder', encoder),
])

# Save the models, scaler, encoder, and pipeline
models = [knn_model, logreg_model, gbm_model]
components = [scaler, encoder]
file_prefix = 'my_project'

for idx, model in enumerate(models):
    with open(f'{file_prefix}_model_{idx}.pkl', 'wb') as f:
        pickle.dump(model, f)

for idx, component in enumerate(components):
    with open(f'{file_prefix}_{type(component).__name__}_{idx}.pkl', 'wb') as f:
        pickle.dump(component, f)

with open(f'{file_prefix}_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

print('Project components exported successfully.')
Project components exported successfully.        

Drawing conclusions (Summary):

Using the Telco Customer Churn dataset, we have gone through an end-to-end machine learning exercise in this post. We began by cleaning the data and analyzing it with visualizations. After that, we encoded the categorical data as numeric variables so that it could be fed to a machine learning model. Once the data had been transformed, we experimented with six different machine learning algorithms using default parameters. Finally, we tuned the hyperparameters of the Gradient Boosting Classifier (the best-performing model) to achieve an accuracy of approximately 80% (about 6 percentage points better than the baseline).
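The hyperparameter tuning step can be sketched with GridSearchCV. The example below runs on synthetic data from make_classification so that it is self-contained; the parameter grid, the number of folds, and the data are all assumptions for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real features would come from the churn dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical grid: these values are illustrative, not the ones used in the project
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [2, 3],
}

# Try every combination with 3-fold cross-validation on the training set
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring='accuracy',
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Evaluation accuracy: {search.score(X_eval, y_eval):.2f}")
```

After fitting, the search object behaves like the best model refitted on the whole training set, so it can be scored on the held-out evaluation data directly.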

It is important to emphasize that the precise steps of a machine learning task differ from project to project. Even though we followed a linear procedure in this post, machine learning projects typically follow an iterative approach rather than a linear one. As we learn more about the problem we’re trying to solve, earlier steps are frequently revisited.

Note: You can find more details about this project in my GitHub repository or on my Medium account.


