Telco Churn Prediction With Machine Learning
Luis Fernando Torres
AML/FT Intelligence Analyst @ CloudWalk, Inc. | Microsoft Certified AI Engineer
Introduction
The churn rate is an important metric that measures how many customers a business has lost over a given period.
A high churn rate spells trouble for growth and affects a company's profitability in the long run, so it's crucial to keep track of a business's ability to retain clients and keep the churn rate from rising.
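To make the metric concrete, here's a toy churn-rate calculation with hypothetical numbers (not taken from the dataset used below):
# Churn rate = customers lost in the period / customers at the start
customers_at_start = 1000
customers_lost = 50
print(f'Churn rate: {customers_lost / customers_at_start:.1%}')  # Churn rate: 5.0%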
For this project, I'll do an exploratory data analysis on a Telco Customer Churn database to identify patterns among clients who discontinue their subscriptions, helping the company develop strategies to retain them. Lastly, I'll build a machine learning model for churn prediction, intended to identify clients who are more likely to churn.
About the Dataset
The dataset used for this project is the Telco Customer Churn dataset posted on Kaggle by user Blastchar.
Each row represents a customer, while each column represents some information about that customer.
. CustomerID => The unique identification for each client.
. Gender => Contains "Male" and "Female" values for each client.
. SeniorCitizen => Binary feature, where 0 represents "no" and 1 represents "yes".
. Partner => Whether the client has a partner, containing "Yes" and "No" values.
. Dependents => Whether the customer has dependents or not. Also a binary variable (Yes or No).
. Tenure => How many months the customer has been a client of the company.
. PhoneService => Another binary feature, indicating whether the customer has signed up for phone service.
. MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies => All multicategory features based on additional services.
. PaperlessBilling => Another binary feature, indicating whether the customer has opted for paperless billing.
. PaymentMethod => Multicategory feature for the payment method chosen by the client.
. MonthlyCharges => The amount charged to the customer monthly.
. TotalCharges => The total amount charged to the customer.
. Churn => Finally, our target variable. It indicates whether the customer churned, as a binary variable containing "No" and "Yes" values.
Exploratory Data Analysis
Let's start then!
Disclaimer: I'll only run through the most relevant parts of this project in this article. If you wish to see my full notebook, you can check it out on Kaggle or GitHub. I'd love to get some upvotes and some comments on Kaggle!
The first step is importing all the libraries we will need.
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV, ShuffleSplit
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)
We then load our data and drop the customerID column right away. We won't need it for this project.
df = pd.read_csv('data.csv')
df.drop(columns = ['customerID'], inplace=True)
df
We have a dataset with 7,043 rows, or clients, and 20 columns, representing the features for each client.
The next step is looking deep into our data to find inconsistencies that we have to work on.
I proceeded to check the data types.
df.dtypes
TotalCharges is a feature containing the total amount charged to each customer. Since it holds financial amounts, it doesn't make sense for it to be of "object" type. I'll change it to float, just like the MonthlyCharges feature, and check for any missing data.
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors = 'coerce')
# Checking missing data
df.isnull().sum()
We have 11 missing values in TotalCharges.
I've seen some folks working on this dataset simply drop these clients from the data frame, but I decided to investigate and found something interesting in this data. Let's take a look.
null = df[df.isnull().any(axis=1)]
null
If we look closely, we can see that all customers with missing TotalCharges values have a tenure of 0, meaning that they're new clients who just started their subscription, so they don't have a total amount paid yet.
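We can confirm this with a quick check (a sketch; this line isn't in the original notebook):
# Every row with a missing TotalCharges should have tenure equal to 0
print(df[df.TotalCharges.isnull()].tenure.unique())  # expected output: [0]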
For this reason, it wouldn't make sense to drop these rows. Instead, I filled these values with 0, keeping these clients in our dataset.
# Filling NaN TotalCharges with a 0
df.TotalCharges = df.TotalCharges.fillna(0)
When we use the .isnull() method to check for missing data, blank values may go undetected. For this reason, it's also important to check the value counts of our features to identify any blank data we couldn't previously spot.
# Printing unique value counts for each variable
for i in df.columns:
    print(f"------ {i} ------\n")
    print(df[i].value_counts())
    print('='*35)
Ok! It seems no blank values were found.
Now, we're all set to work on some data visualization to extract information!
Visualizing Data
The first thing I want to see is how the two classes of our target variable are distributed throughout the dataset.
# Visualizing Churn proportion
fig = px.pie(df, names = 'Churn', template = 'seaborn',
             title = 'Churn')
fig.update_traces(rotation=90, pull = [0.1], textinfo = "percent+label")
fig.show('png')
Only 26.5% of the customers in this dataset have left the company, which means we have some class imbalance: clients who are still with the company are far better represented. We will have to address this issue later on.
Let's see how the churn variable relates to other features.
# Statistics on churn x tenure
print(df.groupby('Churn').tenure.describe().round(0))
# Visualizing how churn interacts with tenure
fig = px.box(df, x = 'tenure', y = 'Churn', template = 'seaborn',
             title = 'Churn x Tenure')
fig.show('png')
Obviously, customers who have left have shorter tenures than those who've stayed. But my intention in printing this is to figure out how long these clients stay with the company before leaving.
75% of churned clients discontinue their subscriptions within the first 29 months of service, while half of them leave by the 10th month, not staying even for a year with the company!
Let's find some more information on these churned customers.
I'll refrain from reposting the code for the following graphics because it's very similar to the code above, so let's go straight to the visualizations.
This one is very interesting!
Clearly, customers who are leaving are paying much higher amounts than those who are staying! On average, the monthly cost for these customers was 21.31% higher, and even among the customers who paid less, the ones who left still paid 124% more than those who stayed!
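For reference, these figures come from comparing the MonthlyCharges statistics between the two groups, just as we did for tenure; a minimal sketch of the omitted code:
# Statistics on churn x monthly charges
print(df.groupby('Churn').MonthlyCharges.describe().round(2))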
Let's see a plot comparing Tenure and Monthly Charges distribution among clients who left and stayed.
# Tenure x Monthly Charges distribution
g = sns.lmplot(height = 7, data = df, x = 'tenure', y = 'MonthlyCharges',
               hue = 'Churn', scatter_kws={'s': 200}, palette = 'muted')
g.set(xlabel = 'Tenure',
      ylabel = 'Monthly Charges')
plt.title('Tenure x Monthly Charges Distribution')
Not only can we see a large concentration of clients who have left in the upper left part of the plot, but we can also see a higher trendline for these clients, confirming that they have been charged much more monthly than those who stayed.
I'll create a new data frame containing only clients who have churned and I'll check how they relate to the categorical variables.
# Creating a Dataframe containing only clients who left
churn = df.query("Churn == 'Yes'")
churn
We have a lot of features in this dataset, and I used all of them to plot different kinds of charts, but not all of them provided useful information. For this reason, I decided to show only the most relevant ones in this article; don't forget you can see all of them by clicking on this link for my notebook on Kaggle.
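All of these charts follow the same plotly pattern used earlier. Here's a minimal sketch of how they can be generated from the churn dataframe (the column list and loop are my sketch, not the notebook's exact code):
# Pie charts for categorical features among churned clients
cat_cols = ['SeniorCitizen', 'Partner', 'Dependents', 'PaperlessBilling',
            'PaymentMethod', 'Contract', 'TechSupport', 'DeviceProtection',
            'InternetService']
for col in cat_cols:
    fig = px.pie(churn, names = col, template = 'seaborn', title = f'Churn x {col}')
    fig.update_traces(textinfo = 'percent+label')
    fig.show('png')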
There are a lot of indications that these clients are young, probably between 18 and 35 years old, considering that most of them don't fall into the senior citizen category and also have no dependents and no partners, as shown below.
Also, most of these clients opted for paperless billing, and 57.3% of them chose electronic checks as a payment method.
This makes me question whether there's an issue with bill delivery. Are these clients receiving their bills on the right platform, be it email or SMS? Are they receiving their bills at the right time? Are they facing issues when trying to pay? Is anything getting in the way and making it harder for them to pay? These questions are fundamental to understanding what kind of improvements must be made to avoid losing these customers.
Another interesting piece of information can be observed when we look at the churn per contract graph.
88.6% of these customers opted for the month-to-month contract, which makes it easier for them to cancel their subscriptions. The company should study which benefits could make the one-year and two-year contracts much more appealing than the month-to-month contract, since this may help retain these customers.
Also, a big portion of churned clients didn't have additional services that I consider to be important, such as Tech Support and Device Protection.
Clients may be leaving the company because they're facing issues without tech support! Or maybe their device broke and they hadn't signed up for the device protection service.
How can the company convince these clients to sign up for these services? Maybe by giving them a free trial period?
69.4% of churned clients had fiber-optic services. This makes me question if everything is going fine with this service. For instance, if we had data on the region, we could investigate if these clients are all from a certain area where the company could be facing issues regarding fiber-optic service.
Just by gathering data, putting it on charts, and looking at them, we found a lot of information that can help the company draw a profile of the customers who are leaving.
With this in mind, the company may now be able to track down each one of the issues these clients may be facing and help increase satisfaction with their services.
Churn Prediction Model
Our job isn't finished yet! We still have to develop a machine learning model to identify the customers most likely to leave.
We must split our data into a dependent variable (y) and independent variables (X), with Churn as the dependent variable.
# Let's split our dataset into independent (X) and dependent (y) variables
X = df.drop('Churn', axis = 1)
y = df.Churn
After doing that, we split our data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)
Machine learning models require numeric values for all variables, meaning we must encode all categorical features.
# Encoding categorical variables
encoder = OrdinalEncoder()
categorical_columns_train = [col for col in X_train.columns if X_train[col].dtype == 'object']
X_train[categorical_columns_train] = encoder.fit_transform(X_train[categorical_columns_train])
categorical_columns_test = [col for col in X_test.columns if X_test[col].dtype == 'object']
X_test[categorical_columns_test] = encoder.transform(X_test[categorical_columns_test])
OrdinalEncoder assigns a different integer to each class of the categorical variables.
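For illustration, here's what the encoder does to a small standalone column (this snippet is not part of the pipeline; the values are real categories from the dataset):
# OrdinalEncoder maps each category to an integer, in sorted order
demo = pd.DataFrame({'InternetService': ['DSL', 'Fiber optic', 'No']})
print(OrdinalEncoder().fit_transform(demo))  # -> [[0.], [1.], [2.]]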
When we look at the X_train set, for instance, we now have only numeric inputs in our dataset.
Tenure, MonthlyCharges, and TotalCharges are on a different scale from the other variables, so we must standardize them to keep the algorithms from assigning greater importance to these features at the expense of the others.
# Let's standardize tenure, MonthlyCharges and TotalCharges
num_col = ['tenure', 'MonthlyCharges', 'TotalCharges']
scaler = StandardScaler()

X_train[num_col] = scaler.fit_transform(X_train[num_col])
X_test[num_col] = scaler.transform(X_test[num_col])
When we look at the X_test dataset, this is what we have now.
Just as a reminder, it is crucial to fit the encoder and the scaler on the training set only, transforming the test set with the already-fitted objects, because this avoids data leakage, an effect in which our model receives privileged information about the test set during training, overestimating its predictive abilities.
I also replaced the 'Yes' and 'No' values from the y_train and y_test sets with 1 and 0, respectively.
# Transform Churn into binary values
y_train.replace({'Yes': 1, 'No': 0}, inplace=True)
y_test.replace({'Yes': 1, 'No': 0}, inplace=True)
Remember, we have some class imbalance to deal with here!
If we print the value counts for the y_train set, this is what we have.
y_train.value_counts()
I'll use SMOTE to synthetically increase the number of churned clients in the training set, so the model doesn't become biased toward predicting every client as non-churned because of the class imbalance. We only do this on the training set, leaving the test set intact with its original proportions, since it's supposed to represent a real scenario.
X_train, y_train = SMOTE().fit_resample(X_train, y_train)
# Printing new value counts for y_train
y_train.value_counts()
Now we have a 50/50 proportion where each class occupies half of the training set.
Let's run our models then!
# Running Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_y_predict = rf.predict(X_test)

# Running Ada Boost
ab = AdaBoostClassifier()
ab.fit(X_train, y_train)
ab_y_predict = ab.predict(X_test)

# Running Gradient Boosting
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb_y_predict = gb.predict(X_test)

# Running Extra Trees Classifier
et = ExtraTreesClassifier()
et.fit(X_train, y_train)
et_y_predict = et.predict(X_test)
Random Forest Scores
# Random Forest
print('Accuracy: %.2f%%' % (accuracy_score(y_test, rf_y_predict) * 100 ))
print('Precision: %.2f%%' % (precision_score(y_test, rf_y_predict) * 100))
print('Recall: %.2f%%' % (recall_score(y_test, rf_y_predict) * 100))
print('F1_Score: %.2f%%' % (f1_score(y_test, rf_y_predict) * 100))
confusion_matrix_rf = confusion_matrix(y_test, rf_y_predict)
plt.figure(figsize=(12,8))
ax = plt.subplot()
sns.heatmap(confusion_matrix_rf, annot=True, fmt='g', ax = ax)
ax.set_xlabel('Predicted Label')
ax.set_ylabel('Actual Label')
ax.set_title('Confusion Matrix - Random Forest')
ax.xaxis.set_ticklabels(['0','1'])
ax.yaxis.set_ticklabels(['0','1'])
Ada Boost Scores
Gradient Boosting Scores
Extra Trees Scores
y_test.value_counts()
Considering that recall tells us how many of the actual churners our model correctly identifies, and that we have 528 customers who left in the y_test set, Ada Boost and Gradient Boosting were the best models.
Ada Boost Classifier correctly identified 400 out of 528 clients more likely to churn, with a 75.76% recall. Gradient Boosting Classifier correctly identified 381 out of 528, with a 72.16% recall.
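As a quick sanity check, recall is the ratio of true positives to all actual positives:
# Recall = TP / (TP + FN); Ada Boost flagged 400 of the 528 actual churners
print('%.2f%%' % (400 / 528 * 100))  # 75.76%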
These results aren't bad, but I'm not satisfied yet and want to achieve better results. I'll use RandomizedSearchCV to test different parameters for both classifiers and identify the best ones to increase recall on both models.
# Tuning Ada Boost
grid = {'n_estimators': [50, 100, 500, 1500, 2000],
        'learning_rate': [0.05, 0.1, 1.0, 0.15, 0.2, 1.5, 2.0]}

cv = ShuffleSplit()
adaboost = RandomizedSearchCV(AdaBoostClassifier(),
                              param_distributions = grid,
                              cv = cv,
                              n_iter = 10,
                              scoring = 'recall')
adaboost.fit(X_train, y_train)
I then get the best parameters.
adaboost.best_params_
We run the Ada Boost Classifier again with these parameters.
tune_adaboost = AdaBoostClassifier(**adaboost.best_params_)
tune_adaboost.fit(X_train, y_train)
y_pred = tune_adaboost.predict(X_test)
Let's see how it performs!
# Tuned Ada Boost
print('Accuracy: %.2f%%' % (accuracy_score(y_test, y_pred) * 100 ))
print('Precision: %.2f%%' % (precision_score(y_test, y_pred) * 100))
print('Recall: %.2f%%' % (recall_score(y_test, y_pred) * 100))
print('F1_Score: %.2f%%' % (f1_score(y_test, y_pred) * 100))
confusion_matrix_tuned_adaboost = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(12,8))
ax = plt.subplot()
sns.heatmap(confusion_matrix_tuned_adaboost, annot=True, fmt='g', ax = ax)
ax.set_xlabel('Predicted Label')
ax.set_ylabel('Actual Label')
ax.set_title('Confusion Matrix - Tuned Ada Boost')
ax.xaxis.set_ticklabels(['0','1'])
ax.yaxis.set_ticklabels(['0','1'])
It looks like it didn't help. We achieved worse results than before, correctly predicting 388 churned clients, while the former Ada Boost model predicted 400.
I'll do the same thing with the Gradient Boosting Classifier, trying to find better parameters to achieve better results. Let's see if we're successful this time.
# Tuning Gradient Boosting
grid2 = {'n_estimators': [500, 1000, 1500, 2000, 2500],
         'learning_rate': [0.01, 0.05, 0.2, 0.1, 0.15, 0.3, 0.5],
         'min_samples_split': [2, 5, 6, 10],
         'min_samples_leaf': [3, 5, 8, 12]}

gradientboosting = RandomizedSearchCV(GradientBoostingClassifier(),
                                      param_distributions = grid2,
                                      cv = cv,
                                      n_iter = 10,
                                      scoring = 'recall')
gradientboosting.fit(X_train, y_train)
gradientboosting.best_params_
tune_gradientboosting = GradientBoostingClassifier(**gradientboosting.best_params_)
tune_gradientboosting.fit(X_train, y_train)
y_pred_gb = tune_gradientboosting.predict(X_test)
# Tuned Gradient Boosting scores
print('Accuracy: %.2f%%' % (accuracy_score(y_test, y_pred_gb) * 100 ))
print('Precision: %.2f%%' % (precision_score(y_test, y_pred_gb) * 100))
print('Recall: %.2f%%' % (recall_score(y_test, y_pred_gb) * 100))
print('F1_Score: %.2f%%' % (f1_score(y_test, y_pred_gb) * 100))
confusion_matrix_tuned_gb = confusion_matrix(y_test, y_pred_gb)
plt.figure(figsize=(12,8))
ax = plt.subplot()
sns.heatmap(confusion_matrix_tuned_gb, annot=True, fmt='g', ax = ax)
ax.set_xlabel('Predicted Label')
ax.set_ylabel('Actual Label')
ax.set_title('Confusion Matrix - Tuned Gradient Boosting')
ax.xaxis.set_ticklabels(['0','1'])
ax.yaxis.set_ticklabels(['0','1'])
Now we have it! An optimized Gradient Boosting model with a 77.84% recall, correctly predicting 411 out of 528 churned clients!
This is an improvement from the former models tested above!
Conclusion
After testing four different ensemble algorithms, which are considered more robust thanks to their approach of combining the predictions of multiple "weak" models into a final result, we found that the Ada Boost Classifier and the Gradient Boosting Classifier achieved the best recall scores.
Scikit-learn's RandomizedSearchCV helped us find the best parameters for an optimized Gradient Boosting model with a 77.84% recall score, correctly predicting 411 of the 528 clients more likely to churn.
Tuning the Gradient Boosting model resulted in a slight improvement, and applying some feature engineering techniques could help us achieve even better results with a higher recall. However, I'm still a data scientist in training, and I'll soon take a feature engineering course on Kaggle to learn more about the subject and use it to improve my future machine learning projects.
I hope you liked the behind-the-scenes making of this project. Once again, I'll leave this link for you to access my notebook on Kaggle. I'd love to get some upvotes and comments over there!
I'm always open to suggestions and recommendations, especially from data scientists that have more experience than me and could give me tips on how to improve my work.
Thank you so much!
Luís Fernando Torres