Telco Churn Prediction With Machine Learning
Luis Fernando Torres
AML/FT Intelligence Analyst @ CloudWalk, Inc. | Microsoft Certified AI Engineer
Introduction
The churn rate is an important metric that measures how many customers a business has lost over a given period.
A high churn rate spells trouble for growth and affects a company's profitability in the long run, so it's crucial to keep track of a business's ability to retain clients and keep the churn rate from rising.
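To make the metric concrete, here's a toy churn-rate calculation with hypothetical numbers (not taken from the dataset used below):
# Churn rate = customers lost in the period / customers at the start
customers_at_start = 1000
customers_lost = 50
print(f'Churn rate: {customers_lost / customers_at_start:.1%}')  # Churn rate: 5.0%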
For this project, I'll do an exploratory data analysis on a Telco Customer Churn database to identify patterns among clients who discontinue their subscriptions, helping the company develop strategies to retain them. Lastly, I'll build a machine learning model for churn prediction, intended to identify clients who are more likely to churn.
About the Dataset
The dataset used for this project is the Telco Customer Churn dataset posted on Kaggle by user Blastchar.
Each row represents a customer, while each column represents some information about that customer.
. CustomerID => The unique identification for each client.
. Gender => Contains "Male" and "Female" values for each client.
. SeniorCitizen => Binary feature, where 0 represents "no" and 1 represents "yes".
. Partner => Whether the client has a partner, containing "Yes" and "No" values.
. Dependents => Whether the customer has dependents or not. Also a binary variable (Yes or No).
. Tenure => How many months the customer has been a client of the company.
. PhoneService => Another binary feature, indicating whether the customer has signed up for phone service.
. MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies => All multicategory features based on additional services.
. PaperlessBilling => Another binary feature, indicating whether the customer has opted for paperless billing.
. PaymentMethod => Multicategory feature for the payment method chosen by the client.
. MonthlyCharges => The amount charged to the customer monthly.
. TotalCharges => The total amount charged to the customer.
. Churn => Finally, our target variable. It indicates whether the customer churned, as a binary variable containing "No" and "Yes" values.
Exploratory Data Analysis
Let's start then!
Disclaimer: I'll only run through the most relevant parts of this project in this article. If you wish to see my full notebook, you can check it out on Kaggle or GitHub. I'd love to get some upvotes and some comments on Kaggle!
The first step is importing all the libraries we will need.
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV, ShuffleSplit
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)
We then load our data and drop the customerID column right away. We won't need it for this project.
df = pd.read_csv('data.csv')
df.drop(columns = ['customerID'], inplace=True)
df
We have a dataset with 7,043 rows, or clients, and 20 columns, representing the features for each client.
The next step is looking deep into our data to find inconsistencies that we have to work on.
I proceeded to check the data types.
df.dtypes
TotalCharges is a feature containing the total amount charged to each customer. Since it holds financial amounts, it doesn't make sense for it to be of "object" type. I'll change it to float, just like the MonthlyCharges feature, and check for any missing data.
df.TotalCharges = pd.to_numeric(df.TotalCharges, errors = 'coerce')
# Checking missing data
df.isnull().sum()
We have 11 missing values in TotalCharges.
I've seen some folks working on this dataset simply drop these clients from the data frame, but I decided to investigate and found something interesting in this data. Let's take a look.
null = df[df.isnull().any(axis=1)]
null
If we look closely, we can see that all customers with missing TotalCharges values have a tenure of 0, meaning that they're new clients who just started their subscription, so they don't have a total amount paid yet.
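We can confirm this with a quick check (a sketch; this line isn't in the original notebook):
# Every row with a missing TotalCharges should have tenure equal to 0
print(df[df.TotalCharges.isnull()].tenure.unique())  # expected output: [0]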
For this reason, it wouldn't make sense to drop these rows. Instead, I filled these values with 0, keeping these clients in our dataset.
# Filling NaN TotalCharges with a 0
df.TotalCharges = df.TotalCharges.fillna(0)
When we use the .isnull() method to check for missing data, blank values may go undetected. For this reason, it's also important to check the value counts of our features to identify any blank data we couldn't previously spot.
# Printing unique value counts for each variable
for i in df.columns:
    print(f"------ {i} ------\n")
    print(df[i].value_counts())
    print('='*35)
Ok! It seems no blank values were found.
Now, we're all set to work on some data visualization to extract information!
Visualizing Data
The first thing I want to see is how the two classes of our target variable are distributed throughout the dataset.
# Visualizing Churn proportion
fig = px.pie(df, names = 'Churn', template = 'seaborn',
             title = 'Churn')
fig.update_traces(rotation=90, pull = [0.1], textinfo = "percent+label")
fig.show('png')
Only 26.5% of the customers in this dataset have left the company, which means we have some class imbalance: clients who are still with the company are far better represented. We will have to address this issue later on.
Let's see how the churn variable relates to other features.
# Statistics on churn x tenure
print(df.groupby('Churn').tenure.describe().round(0))
# Visualizing how churn interacts with tenure
fig = px.box(df, x = 'tenure', y = 'Churn', template = 'seaborn',
             title = 'Churn x Tenure')
fig.show('png')
Obviously, customers who have left have shorter tenures than those who've stayed. But my intention in printing this is to figure out how long these clients stay with the company before leaving.
75% of churned clients discontinue their subscriptions within the first 29 months of service, while half of them leave by the 10th month, not staying even for a year with the company!
Let's find some more information on these churned customers.
I'll refrain from reposting the code for the following graphics because it's very similar to the code above, so let's go straight to the visualizations.
This one is very interesting!
Clearly, customers who are leaving are paying much higher amounts than those who are staying! On average, the monthly cost for these customers was 21.31% higher, and even among the customers who paid less, the ones who left still paid 124% more than those who stayed!
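For reference, these figures come from comparing the MonthlyCharges statistics between the two groups, just as we did for tenure; a minimal sketch of the omitted code:
# Statistics on churn x monthly charges
print(df.groupby('Churn').MonthlyCharges.describe().round(2))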
Let's see a plot comparing Tenure and Monthly Charges distribution among clients who left and stayed.
# Tenure x Monthly Charges distribution
g = sns.lmplot(height = 7, data = df, x = 'tenure', y = 'MonthlyCharges',
               hue = 'Churn', scatter_kws={'s': 200}, palette = 'muted')
g.set(xlabel = 'Tenure',
      ylabel = 'Monthly Charges')
plt.title('Tenure x Monthly Charges Distribution')
Not only can we see a large concentration of clients who have left in the upper left part of the plot, but we can also see a higher trendline for these clients, confirming that they have been charged much more monthly than those who stayed.
I'll create a new data frame containing only clients who have churned and I'll check how they relate to the categorical variables.
# Creating a Dataframe containing only clients who left
churn = df.query("Churn == 'Yes'")
churn
We have a lot of features in this dataset, and I used all of them to plot different kinds of charts, but not all of them provided useful information. For this reason, I decided to show only the most relevant ones in this article; don't forget you can see all of them by clicking on this link for my notebook on Kaggle.
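All of these charts follow the same plotly pattern used earlier. Here's a minimal sketch of how they can be generated from the churn dataframe (the column list and loop are my sketch, not the notebook's exact code):
# Pie charts for categorical features among churned clients
cat_cols = ['SeniorCitizen', 'Partner', 'Dependents', 'PaperlessBilling',
            'PaymentMethod', 'Contract', 'TechSupport', 'DeviceProtection',
            'InternetService']
for col in cat_cols:
    fig = px.pie(churn, names = col, template = 'seaborn', title = f'Churn x {col}')
    fig.update_traces(textinfo = 'percent+label')
    fig.show('png')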
There are a lot of indications that these clients are young, probably between 18 and 35 years old, considering that most of them don't fall into the senior citizen category and also have no dependents and no partners, as shown below.
Also, most of these clients opted for paperless billing, and 57.3% of them chose electronic checks as a payment method.
This makes me question whether there's an issue with bill delivery. Are these clients receiving their bills on the right platform, be it email or SMS? Are they receiving their bills at the right time? Are they facing issues when trying to pay? Is anything getting in the way and making it harder for them to pay? These questions are fundamental to understanding what kind of improvements must be made to avoid losing these customers.
Another interesting piece of information can be observed when we look at the churn per contract graph.
88.6% of these customers opted for the month-to-month contract, which makes it easier for them to cancel their subscriptions. The company should study which benefits could make the one-year and two-year contracts much more appealing than the month-to-month contract, since this may help retain these customers.
Also, a big portion of churned clients didn't have additional services that I consider to be important, such as Tech Support and Device Protection.
Clients may be leaving the company because they're facing issues without tech support! Or maybe their device broke and they hadn't signed up for the device protection service.
How can the company convince these clients to sign up for these services? Maybe by giving them a free trial period?
69.4% of churned clients had fiber-optic services. This makes me question if everything is going fine with this service. For instance, if we had data on the region, we could investigate if these clients are all from a certain area where the company could be facing issues regarding fiber-optic service.
Just by gathering data, putting it on charts, and looking at them, we found a lot of information that can help the company draw a profile of the customers who are leaving.
With this in mind, the company may now be able to track down each one of the issues these clients may be facing and help increase satisfaction with their services.
Churn Prediction Model
Our job isn't finished yet! We still have to develop a machine learning model to identify the customers most likely to leave.
We must split our data into a dependent variable (y) and independent variables (X), with Churn as the dependent variable.
# Let's split our dataset into independent (X) and dependent (y) variables
X = df.drop('Churn', axis = 1)
y = df.Churn
After doing that, we split our data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)
Machine learning models require numeric values for all variables, meaning we must encode all categorical features.
# Encoding categorical variables
encoder = OrdinalEncoder()
categorical_columns_train = [col for col in X_train.columns if X_train[col].dtype == 'object']
X_train[categorical_columns_train] = encoder.fit_transform(X_train[categorical_columns_train])
categorical_columns_test = [col for col in X_test.columns if X_test[col].dtype == 'object']
X_test[categorical_columns_test] = encoder.transform(X_test[categorical_columns_test])
OrdinalEncoder assigns a different integer to each class of the categorical variables.
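For illustration, here's what the encoder does to a small standalone column (this snippet is not part of the pipeline; the values are real categories from the dataset):
# OrdinalEncoder maps each category to an integer, in sorted order
demo = pd.DataFrame({'InternetService': ['DSL', 'Fiber optic', 'No']})
print(OrdinalEncoder().fit_transform(demo))  # -> [[0.], [1.], [2.]]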
When we look at the X_train set, for instance, we now have only numeric inputs in our dataset.
Tenure, MonthlyCharges, and TotalCharges are on a different scale from the other variables, so we must standardize them to keep the algorithms from assigning greater importance to these features at the expense of the others.
# Let's standardize tenure, MonthlyCharges and TotalCharges
num_col = ['tenure', 'MonthlyCharges', 'TotalCharges']
scaler = StandardScaler()

X_train[num_col] = scaler.fit_transform(X_train[num_col])
X_test[num_col] = scaler.transform(X_test[num_col])
When we look at the X_test dataset, this is what we have now.
Just as a reminder, it is crucial to fit the encoder and the scaler on the training set only, transforming the test set with the already-fitted objects, because this avoids data leakage, an effect in which our model receives privileged information about the test set during training, overestimating its predictive abilities.
I also replaced the 'Yes' and 'No' values from the y_train and y_test sets with 1 and 0, respectively.
# Transform Churn into binary values
y_train.replace({'Yes': 1, 'No': 0}, inplace=True)
y_test.replace({'Yes': 1, 'No': 0}, inplace=True)
Remember, we have some class imbalance to deal with here!
If we print the value counts for the y_train set, this is what we have.
y_train.value_counts()
I'll use SMOTE to synthetically increase the number of churned clients in the training set, so the model doesn't become biased toward predicting every client as non-churned because of the class imbalance. We only do this on the training set, leaving the test set intact with its original proportions, since it's supposed to represent a real scenario.
X_train, y_train = SMOTE().fit_resample(X_train, y_train)
# Printing new value counts for y_train
y_train.value_counts()
Now we have a 50/50 proportion where each class occupies half of the training set.
Let's run our models then!
# Running Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_y_predict = rf.predict(X_test)

# Running Ada Boost
ab = AdaBoostClassifier()
ab.fit(X_train, y_train)
ab_y_predict = ab.predict(X_test)

# Running Gradient Boosting
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb_y_predict = gb.predict(X_test)

# Running Extra Trees Classifier
et = ExtraTreesClassifier()
et.fit(X_train, y_train)
et_y_predict = et.predict(X_test)
Random Forest Scores
# Random Forest
print('Accuracy: %.2f%%' % (accuracy_score(y_test, rf_y_predict) * 100 ))
print('Precision: %.2f%%' % (precision_score(y_test, rf_y_predict) * 100))
print('Recall: %.2f%%' % (recall_score(y_test, rf_y_predict) * 100))
print('F1_Score: %.2f%%' % (f1_score(y_test, rf_y_predict) * 100))
confusion_matrix_rf = confusion_matrix(y_test, rf_y_predict)
plt.figure(figsize=(12,8))
ax = plt.subplot()
sns.heatmap(confusion_matrix_rf, annot=True, fmt='g', ax = ax)
ax.set_xlabel('Predicted Label')
ax.set_ylabel('Actual Label')
ax.set_title('Confusion Matrix - Random Forest')
ax.xaxis.set_ticklabels(['0','1'])
ax.yaxis.set_ticklabels(['0','1'])
Ada Boost Scores
Gradient Boosting Scores
Extra Trees Scores
y_test.value_counts()
Considering that recall tells us how many of the actual churners our model correctly identifies, and that we have 528 customers who left in the y_test set, Ada Boost and Gradient Boosting were the best models.
Ada Boost Classifier correctly identified 400 out of 528 clients more likely to churn, with a 75.76% recall. Gradient Boosting Classifier correctly identified 381 out of 528, with a 72.16% recall.
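As a quick sanity check, recall is the ratio of true positives to all actual positives:
# Recall = TP / (TP + FN); Ada Boost flagged 400 of the 528 actual churners
print('%.2f%%' % (400 / 528 * 100))  # 75.76%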
These results aren't bad, but I'm not satisfied yet and want to achieve better results. I'll use RandomizedSearchCV to test different parameters for both classifiers and identify the best ones to increase recall on both models.
# Tuning Ada Boost
grid = {'n_estimators': [50, 100, 500, 1500, 2000],
        'learning_rate': [0.05, 0.1, 1.0, 0.15, 0.2, 1.5, 2.0]}

cv = ShuffleSplit()
adaboost = RandomizedSearchCV(AdaBoostClassifier(),
                              param_distributions = grid,
                              cv = cv,
                              n_iter = 10,
                              scoring = 'recall')
adaboost.fit(X_train, y_train)
I then get the best parameters.
adaboost.best_params_
We run the Ada Boost Classifier again with these parameters.
tune_adaboost = AdaBoostClassifier(**adaboost.best_params_)
tune_adaboost.fit(X_train, y_train)
y_pred = tune_adaboost.predict(X_test)
Let's see how it performs!
# Tuned Ada Boost
print('Accuracy: %.2f%%' % (accuracy_score(y_test, y_pred) * 100 ))
print('Precision: %.2f%%' % (precision_score(y_test, y_pred) * 100))
print('Recall: %.2f%%' % (recall_score(y_test, y_pred) * 100))
print('F1_Score: %.2f%%' % (f1_score(y_test, y_pred) * 100))
confusion_matrix_tuned_adaboost = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(12,8))
ax = plt.subplot()
sns.heatmap(confusion_matrix_tuned_adaboost, annot=True, fmt='g', ax = ax)
ax.set_xlabel('Predicted Label')
ax.set_ylabel('Actual Label')
ax.set_title('Confusion Matrix - Tuned Ada Boost')
ax.xaxis.set_ticklabels(['0','1'])
ax.yaxis.set_ticklabels(['0','1'])
It looks like it didn't help. We achieved worse results than before, correctly predicting 388 churned clients, while the former Ada Boost model predicted 400.
I'll do the same thing with the Gradient Boosting Classifier, trying to find better parameters to achieve better results. Let's see if we're successful this time.
# Tuning Gradient Boosting
grid2 = {'n_estimators': [500, 1000, 1500, 2000, 2500],
         'learning_rate': [0.01, 0.05, 0.2, 0.1, 0.15, 0.3, 0.5],
         'min_samples_split': [2, 5, 6, 10],
         'min_samples_leaf': [3, 5, 8, 12]}

gradientboosting = RandomizedSearchCV(GradientBoostingClassifier(),
                                      param_distributions = grid2,
                                      cv = cv,
                                      n_iter = 10,
                                      scoring = 'recall')
gradientboosting.fit(X_train, y_train)
gradientboosting.best_params_
tune_gradientboosting = GradientBoostingClassifier(**gradientboosting.best_params_)
tune_gradientboosting.fit(X_train, y_train)
y_pred_gb = tune_gradientboosting.predict(X_test)
# Tuned Gradient Boosting scores
print('Accuracy: %.2f%%' % (accuracy_score(y_test, y_pred_gb) * 100 ))
print('Precision: %.2f%%' % (precision_score(y_test, y_pred_gb) * 100))
print('Recall: %.2f%%' % (recall_score(y_test, y_pred_gb) * 100))
print('F1_Score: %.2f%%' % (f1_score(y_test, y_pred_gb) * 100))
confusion_matrix_tuned_gb = confusion_matrix(y_test, y_pred_gb)
plt.figure(figsize=(12,8))
ax = plt.subplot()
sns.heatmap(confusion_matrix_tuned_gb, annot=True, fmt='g', ax = ax)
ax.set_xlabel('Predicted Label')
ax.set_ylabel('Actual Label')
ax.set_title('Confusion Matrix - Tuned Gradient Boosting')
ax.xaxis.set_ticklabels(['0','1'])
ax.yaxis.set_ticklabels(['0','1'])
Now we have it! An optimized Gradient Boosting model with a 77.84% recall, correctly predicting 411 out of 528 churned clients!
This is an improvement from the former models tested above!
Conclusion
After testing four different ensemble algorithms, which are considered more robust thanks to their approach of combining the predictions of multiple "weak" models into a final result, we found that the Ada Boost Classifier and the Gradient Boosting Classifier achieved the best recall scores.
Scikit-learn's RandomizedSearchCV helped us find the best parameters for an optimized Gradient Boosting model with a 77.84% recall score, correctly predicting 411 of the 528 clients more likely to churn.
Tuning the Gradient Boosting model resulted in a slight improvement, and applying some feature engineering techniques could help us achieve even better results with a higher recall. However, I'm still a data scientist in training, and I'll soon take a feature engineering course on Kaggle to learn more about the subject and use it to improve my future machine learning projects.
I hope you liked the behind-the-scenes making of this project. Once again, I'll leave this link for you to access my notebook on Kaggle. I'd love to get some upvotes and comments over there!
I'm always open to suggestions and recommendations, especially from data scientists that have more experience than me and could give me tips on how to improve my work.
Thank you so much!
Luís Fernando Torres