Predicting Customer Churn in Telecommunications: A Machine Learning Approach
Felix Kwemoi
Certified Data analyst || Tech maniac || Statistician || Data Visualizations || SQL || Power BI || Tableau || python || advocate for environmental sustainability || Poetry lover || Student of Life
Introduction
In today’s highly competitive telecommunications industry, retaining customers is crucial for business success. Customer churn, the phenomenon where customers switch from one provider to another, poses a significant challenge for telecom companies. To address this challenge, machine learning (ML) techniques have become increasingly essential. In this article, we dive into the application of ML in predicting customer churn, exploring how advanced analytics can help telecom companies proactively identify and retain at-risk customers. With data-driven insights, telecom providers can enhance customer satisfaction, optimize retention strategies, and ultimately drive business growth.
Project structure
The steps involved in this project are as follows:
Technical content
1. Business Understanding
Description: This project aims to develop a machine learning model to predict customer churn in a telecommunications company. By leveraging historical customer data, including usage patterns, demographics, and service subscriptions, the model will identify customers at risk of churning. This predictive capability will enable the company to implement targeted retention strategies and improve customer retention rates.
Problem Statement: Telecommunications companies face significant challenges in retaining customers due to the competitive nature of the industry and the ease with which customers can switch providers. Customer churn, the rate at which customers discontinue their services, can have a substantial impact on revenue and profitability. Therefore, it is crucial for telecommunications companies to proactively identify customers at risk of churning and implement effective retention strategies.
Objective: The objective of the project is to develop a machine learning model that accurately predicts customer churn for a telecommunications company.
Business Success Criteria: The success of the project will be measured by the model’s ability to accurately identify customers at risk of churning, thereby allowing the company to implement proactive retention strategies and minimize customer attrition. Specifically, high accuracy, precision, and recall in predicting churn will be the key success metrics.
Select Technologies and Tools: Choose appropriate machine learning frameworks (e.g., TensorFlow, scikit-learn) and data processing tools (e.g., pandas, SQL) for model development. Decide on visualization libraries (e.g., Matplotlib, Seaborn) for result interpretation.
Risks and Contingencies: Identify potential risks such as data quality issues, model overfitting, or regulatory compliance. Develop contingency plans to address these risks and mitigate their impact on project timelines and outcomes.
Cost Benefit Analysis: Conduct a cost-benefit analysis to determine the financial implications of implementing the churn prediction model compared to the potential revenue losses resulting from customer churn.
Hypothesis:
(H0) Null Hypothesis: There is no significant relationship between MonthlyCharges and whether a customer churns
(H1) Alternative Hypothesis: There is a significant relationship between MonthlyCharges and whether a customer churns
Business Questions:
1. What is the relationship between TotalCharges and customers churning?
2. What is the relationship between MonthlyCharges and customers churning?
3. Which customer gender churned the most?
4. Between the male and female gender, who was charged the most on a monthly basis?
5. Which type of InternetService were the customers that churned mostly using?
2. Data Understanding
It is imperative to gain a comprehensive understanding of the underlying data before predicting customer churn. Data was retrieved from different sources, including SQL databases, a OneDrive file, and a GitHub repository.
Before loading data we had to import the necessary packages into the notebook:
We then loaded the dataset from the different sources.
After collecting, loading, and concatenating the datasets, we conducted Exploratory Data Analysis (EDA) to get an overview of our data. We discovered the following:
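The loading and concatenation step can be sketched roughly as follows. This is a minimal illustration, not the project's actual loading code: the two in-memory CSVs stand in for the real SQL, OneDrive, and GitHub sources, and the frame names and sample rows are made up.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the real sources. In the project these frames
# would come from calls such as pd.read_sql_query(...), pd.read_excel(...)
# or pd.read_csv(<url>); in-memory CSVs keep the sketch runnable offline.
sql_part = io.StringIO(
    "customerID,gender,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "0001-A,Male,1,29.85,29.85,No\n"
    "0002-B,Female,34,56.95,1889.5,No\n"
)
github_part = io.StringIO(
    "customerID,gender,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "0003-C,Male,2,53.85,108.15,Yes\n"
)

df_sql = pd.read_csv(sql_part)
df_github = pd.read_csv(github_part)

# Stack the frames row-wise; ignore_index rebuilds a clean 0..n-1 index.
train_df = pd.concat([df_sql, df_github], ignore_index=True)
print(train_df.shape)
```

Concatenating only works cleanly when the sources share column names and compatible types, so it is worth comparing the columns of each frame before stacking them.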
Issues with the dataset:
1. The TotalCharges column and the Tenure column have the wrong datatype
2. There are missing values in the TotalCharges column
3. The customerID is not necessary in building the ML model
Course of Action
1. Correct the TotalCharges column datatype, i.e. using pd.to_numeric(df2['TotalCharges'], errors='coerce')
2. Leave the missing values for now and handle them when building the pipelines
3. Drop the customerID column as it is not used or necessary
Findings:
1. The train-df dataset has 21 columns and 5043 rows
2. Most customers were male, were not senior citizens, and lacked dependents and partners
3. The average MonthlyCharges was around 65, and the majority of customers did not churn
4. The majority of customers preferred Fiber optic internet service and a month-to-month contract
5. Customers were charged 64.7 monthly on average, with the highest monthly charge being 118.65
6. Customers that churned had monthly charges of about 80, while those that did not churn had monthly charges of around 60
7. Customers that churned had total charges of around 702.2, while those that did not churn had total charges of around 1730
8. More males churned (279) than females (277)
9. Males that churned had monthly charges of around 75, while those that did not churn had monthly charges of around 65
10. Females that churned had monthly charges of around 65, while those that did not churn had monthly charges of around 70
11. Tenure is highly correlated with the TotalCharges column
12. MonthlyCharges is also highly correlated with the TotalCharges column
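The correlation findings in the list above could be checked with a correlation matrix over the numeric columns. The sketch below uses made-up values in which TotalCharges is simply tenure times MonthlyCharges; the lowercase column name tenure is an assumption.

```python
import pandas as pd

# Made-up data: TotalCharges is built as tenure * MonthlyCharges, so a
# strong positive correlation with both columns is expected by construction.
df = pd.DataFrame({
    "tenure": [1, 12, 24, 48, 60, 72],
    "MonthlyCharges": [30.0, 55.0, 70.0, 85.0, 95.0, 110.0],
})
df["TotalCharges"] = df["tenure"] * df["MonthlyCharges"]

# Pearson correlation between every pair of numeric columns.
corr = df[["tenure", "MonthlyCharges", "TotalCharges"]].corr()
print(corr.round(2))
```

Highly correlated inputs carry overlapping information, which is worth keeping in mind when interpreting the coefficients of linear models.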
Univariate analysis
Univariate analysis is the foundational step in uncovering the patterns hidden within a dataset. It focuses solely on exploring and understanding individual variables in isolation.
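As a rough sketch of what univariate analysis looks like in pandas, a numeric column can be summarized with describe() and a categorical column with value_counts(). The values below are invented for illustration.

```python
import pandas as pd

# Invented sample; the column names follow those used in the article.
df = pd.DataFrame({
    "MonthlyCharges": [20.0, 45.5, 70.0, 80.0, 118.65],
    "Contract": [
        "Month-to-month", "Month-to-month", "One year",
        "Two year", "Month-to-month",
    ],
})

# Numeric variable: count, mean, spread, and quartiles in one call.
numeric_summary = df["MonthlyCharges"].describe()

# Categorical variable: share of rows in each category.
contract_share = df["Contract"].value_counts(normalize=True)

print(numeric_summary)
print(contract_share)
```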
Bivariate Analysis
Understanding the intricate relationships between variables is paramount. Bivariate analysis emerges as a powerful tool to unravel these connections by examining the interplay between pairs of variables within a dataset.
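A minimal sketch of bivariate analysis, assuming a frame with gender, MonthlyCharges, and Churn columns: a group-wise mean relates a numeric variable to the target, and a cross-tabulation relates two categorical variables. The rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "MonthlyCharges": [80.0, 60.0, 85.0, 55.0, 65.0, 75.0],
    "Churn": ["Yes", "No", "Yes", "No", "No", "Yes"],
})

# Numeric vs. target: average monthly charge for each churn outcome.
mean_charge_by_churn = df.groupby("Churn")["MonthlyCharges"].mean()

# Categorical vs. target: customer counts per gender and churn outcome.
churn_by_gender = pd.crosstab(df["gender"], df["Churn"])

print(mean_charge_by_churn)
print(churn_by_gender)
```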
3. Data preparation
Data cleaning is a critical step in preparing the dataset for analysis. It involves identifying and addressing issues such as missing values, duplicates, and outliers. By removing or correcting these discrepancies, we ensure the integrity and quality of the data, laying a solid foundation for subsequent analysis and modeling. Through data cleaning, we aim to create a clean and reliable dataset that accurately reflects the underlying patterns and relationships within the data, enabling us to derive meaningful insights and build robust predictive models.
We changed the datatype of the TotalCharges column to float and dropped the customerID column, as it was not useful for building machine learning models to predict customer churn.
4. Answering Business Questions and Visualizations
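These two cleaning steps can be sketched as below. The sample rows are invented; the blank TotalCharges entry is an assumed example of a value that cannot be parsed as a number.

```python
import pandas as pd

# Invented rows; the blank string mimics an unparseable TotalCharges value.
df = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# errors="coerce" turns anything that is not a number into NaN instead of
# raising, so the column becomes float and the gaps show up as missing values.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# The identifier carries no predictive signal, so it is dropped.
df = df.drop(columns=["customerID"])

print(df.dtypes)
print(df["TotalCharges"].isna().sum())
```

The NaN values produced by the coercion are left in place here, in line with the decision to handle missing values later in the pipelines.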
Answering key business questions and creating insightful visualizations play a pivotal role in driving strategic decision-making and fostering business growth. Through advanced analytics and data visualization techniques, we extracted actionable insights from the dataset, empowering telecom companies to make informed decisions and optimize their operations.
Question one: What is the relationship between TotalCharges and customers churning?
This question was answered using the following code, which produced the visual below.
The findings were as follows:
1. Most customers that churned had TotalCharges below 2000.
2. Most of those that did not churn had TotalCharges above 2000, compared to those that churned.
Question two: What is the relationship between MonthlyCharges and customers churning?
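A numeric check of this kind of finding, used here in place of the original plot, is the share of customers in each churn group whose TotalCharges fall below the 2000 mark. The data is made up and the helper column below_2000 is introduced only for this sketch.

```python
import pandas as pd

# Made-up values; four churned and four retained customers.
df = pd.DataFrame({
    "TotalCharges": [150.0, 700.0, 1900.0, 2500.0, 900.0, 3200.0, 4100.0, 1500.0],
    "Churn": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})

# Hypothetical helper column: True when total charges are under 2000.
df["below_2000"] = df["TotalCharges"] < 2000

# The mean of a boolean column is the fraction of True values per group.
share_below = df.groupby("Churn")["below_2000"].mean()
print(share_below)
```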
Findings:
Customers that churned were mostly charged around 80 monthly, while those that did not churn were charged around 65 monthly
Question three: Which customer gender churned the most?
Findings:
More males churned (396) than females (384)
Question four: Between the male and female gender, who was charged the most on a monthly basis?
Findings:
The male gender was charged the most on a monthly basis (99194.50), while the female gender was charged 9684.70
Question five: Which type of InternetService were the customers that churned mostly using?
Findings:
2. Most of those that did not churn were using DSL (835)
5. Summary and Recommendations
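When comparing charges between groups, it matters whether totals or averages are compared, because a total also grows with the number of customers in the group. The article does not state which aggregation was used, so the sketch below, on made-up data, simply shows both side by side.

```python
import pandas as pd

# Made-up data: three male and two female customers.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female"],
    "MonthlyCharges": [70.0, 80.0, 60.0, 90.0, 100.0],
})

# Sum, mean, and group size in one table.
by_gender = df.groupby("gender")["MonthlyCharges"].agg(["sum", "mean", "count"])
print(by_gender)
```

In this toy example the male group has the larger total while the female group has the larger average, which is why reporting the group size alongside the figure helps the reader.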
Our analysis of customer churn in the telecommunications industry has revealed key insights into factors influencing churn and enabled the development of accurate predictive models. To reduce churn rates and foster customer loyalty, we recommend implementing proactive retention strategies, enhancing the customer experience, optimizing subscription plans, investing in data-driven decision-making, and fostering a culture of innovation within the organization.
Since customers that churned were mostly charged around 80 on a monthly basis, while those that did not churn were charged around 65, we recommend that the company review its pricing with the aim of keeping monthly charges closer to 65 in order to retain more customers.
The male gender churned more than the female gender. This may be because they were charged more, so reducing the amount charged to male customers could perhaps help retain them.
The majority of customers using DSL did not churn. Thus the company should encourage more of its customers to use the DSL internet service.
By following these recommendations, telecom companies can mitigate churn rates, enhance customer satisfaction, and drive sustainable business growth.
6. Hypothesis Testing
In the pursuit of understanding and drawing insights from data, hypothesis testing emerges as a fundamental technique in statistical analysis. It enables analysts to make informed decisions by assessing the validity of assumptions and drawing conclusions based on empirical evidence.
We set our null hypothesis to be: There is no significant relationship between MonthlyCharges and whether a customer churns
and the alternative hypothesis to be: There is a significant relationship between MonthlyCharges and whether a customer churns
We used the Mann-Whitney U test to perform the hypothesis testing, since the data was not normally distributed.
Our finding was to reject the null hypothesis, as seen below:
7. Data Preparation
Before embarking on predictive modeling, it’s imperative to prepare the data carefully to ensure the success and accuracy of the models. Data preparation involves a series of steps aimed at cleaning, transforming, and optimizing the raw data for modeling purposes.
Check if the data is balanced or not:
Balanced data is when the target column contains roughly equal numbers of negative and positive values. Some models work better when the dataset is balanced.
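The project's own test output is not reproduced here. As an illustration only, a Mann-Whitney U test comparing the monthly charges of two groups can be run with SciPy as sketched below. The two samples are randomly generated with assumed means, and the 0.05 significance level is also an assumption.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic samples, NOT the project's data: monthly charges for customers
# that churned and for customers that stayed, with assumed means and spreads.
rng = np.random.default_rng(42)
churned = rng.normal(loc=80, scale=15, size=200)
stayed = rng.normal(loc=62, scale=20, size=200)

# Two-sided test: do the two groups differ in either direction?
stat, p_value = mannwhitneyu(churned, stayed, alternative="two-sided")

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"U = {stat:.1f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

The Mann-Whitney U test compares the ranks of the two samples rather than their means, which is why it does not require the data to be normally distributed.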
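A quick way to check the balance of the target is to count the classes and look at their shares. The target values below are made up, and imbalance_ratio is a name introduced only for this sketch.

```python
import pandas as pd

# Made-up target column: 7 customers stayed, 3 churned.
y = pd.Series(["No"] * 7 + ["Yes"] * 3, name="Churn")

class_counts = y.value_counts()
class_share = y.value_counts(normalize=True)

# Hypothetical helper: how many majority-class rows exist per minority-class row.
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(class_share.round(2))
print(round(imbalance_ratio, 2))
```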
Determining the input and output data
We set X to be our input data and y our output variable before splitting the dataset into train and test sets. The train set is used to train the models, while the test set is used to evaluate them.
Preparing pipelines
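A minimal sketch of this step with scikit-learn follows. The data is made up, and the 80/20 split ratio, the stratification on y, and the random seed are assumptions rather than the project's actual settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up dataset with 20 customers, 6 of whom churned.
df = pd.DataFrame({
    "tenure": list(range(1, 21)),
    "MonthlyCharges": [20.0 + 5 * i for i in range(20)],
    "Churn": ["No"] * 14 + ["Yes"] * 6,
})

X = df.drop(columns=["Churn"])  # input features
y = df["Churn"]                 # target variable

# stratify=y keeps the churn/no-churn proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying is a common choice when the target is imbalanced, because a purely random split can leave very few churned customers in the test set.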
Pipelines are a powerful tool in the machine learning arsenal, streamlining the end-to-end process from data preprocessing to model evaluation. They encapsulate a sequence of data processing steps into a single entity, facilitating reproducibility, scalability, and efficiency in machine learning workflows. They help in:
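As a hedged sketch of such a pipeline in scikit-learn, numeric columns can be imputed and scaled while categorical columns are one-hot encoded, all inside one ColumnTransformer. The column lists, the median imputation strategy, and the sample rows are assumptions, not the project's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column groups.
numeric_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = ["Contract", "InternetService"]

# Numeric branch: fill missing values with the median, then standardize.
numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    # Categorical branch: one indicator column per category.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Made-up rows, including one missing TotalCharges value.
X = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, np.nan, 108.15, 1840.75],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "InternetService": ["DSL", "DSL", "Fiber optic", "DSL"],
})

X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)
```

Because the imputer and scaler are fitted inside the pipeline, the same learned statistics are reused on new data, which avoids leaking information from the test set into preprocessing.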
8. Modeling
Modeling is the heart of the machine learning process, where algorithms learn patterns from data to make predictions or decisions. It encompasses selecting, training, and evaluating machine learning models to solve specific tasks or make informed decisions based on data.
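The select, train, and evaluate loop can be sketched as below. The article does not list which algorithms were trained, so logistic regression and a random forest are assumed examples, and the data is synthetic with a made-up rule linking charges and tenure to churn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, NOT the project's dataset.
rng = np.random.default_rng(0)
n = 400
monthly = rng.uniform(20, 120, size=n)
tenure = rng.integers(1, 73, size=n)

# Made-up rule: higher charges and shorter tenure raise the churn probability.
logit = 0.04 * (monthly - 70) - 0.05 * (tenure - 30)
churn = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([monthly, tenure])
X_train, X_test, y_train, y_test = train_test_split(
    X, churn, test_size=0.2, stratify=churn, random_state=0
)

# Assumed candidate models; each is wrapped in the same preprocessing step.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

fitted = {}
for name, model in models.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipe.fit(X_train, y_train)
    fitted[name] = pipe
    print(name, round(pipe.score(X_test, y_test), 3))
```

Keeping every candidate inside the same pipeline structure makes the comparison fair, since all models see identically preprocessed data.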
Key Components of Modeling:
Common Machine Learning Algorithms:
Best Practices in Modeling:
9. Evaluation
Model evaluation is essential for assessing the effectiveness and reliability of predictive models. It involves quantifying the performance of trained models using various evaluation metrics and techniques to ensure they generalize well to unseen data and fulfill the desired objectives.
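The accuracy, precision, and recall named in the success criteria can be computed with scikit-learn as sketched below. The labels and predictions are made up, with 1 standing for churn and 0 for no churn.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up ground truth and predictions (1 = churn, 0 = no churn).
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # predicted churners that really churned
recall = recall_score(y_true, y_pred)        # actual churners that were caught

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

print(accuracy, precision, recall)
print(cm)
```

For churn prediction, recall is often watched closely, because a missed churner (a false negative) means a customer who receives no retention offer.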
Best Practices in Model Evaluation:
REFERENCES
APPRECIATION
I would like to express my gratitude to the following individuals and organizations whose contributions were instrumental in the completion of this project:
TAGS
Azubi Data Science