Predicting the Unpredictable: A Data-Driven Approach to Arresting Customer Churn in Banking
Tazkera Sharifi
The banking industry is going through a seismic shift, characterized by changing customer expectations and an increasingly competitive landscape. Customer churn—defined as the loss of customers to competitors—poses a significant and immediate threat to long-term profitability. Bank XYZ, like many others in the sector, has found itself grappling with this challenge. Over the past few quarters, there has been a troubling increase in the number of customers closing their accounts and switching to competitor banks. This exodus has had a domino effect, significantly impacting quarterly revenues and threatening to derail the bank's financial projections for the ongoing year. If unaddressed, this could lead to a drastic drop in stock prices and market capitalization.
Understanding customer churn and finding actionable insights to mitigate it has thus become a strategic imperative for Bank XYZ. A multidisciplinary team of business analysts, product managers, engineers, and data scientists has been assembled to address this critical issue. The goal is clear: leverage data analytics and predictive modeling to understand the patterns of customer churn and develop targeted interventions to retain at-risk customers.
This project aims to delve deep into this challenge, offering a model that not only predicts which customers are likely to churn but also estimates when this churn will happen. The insights generated through this project will serve as a roadmap for targeted customer retention strategies, ultimately helping to stabilize and potentially increase Bank XYZ's revenue streams.
Data Science Metrics:
The objective of our data science efforts is to create a predictive model that performs robustly in identifying potential churn customers. Specifically, we aim to achieve the following benchmarks:
Business Metrics:
For the business side, the goal is to enable targeted interventions that would result in a tangible decrease in customer churn rates. Drawing from our data science metric targets:
By setting these metrics, we strive to create a model that is not only statistically sound but also has a real and significant business impact. Our multi-disciplinary team will use these metrics as the north star for performance, ensuring alignment across business and data science initiatives.
Data Import and Initial Exploration
Reading the Dataset
Our journey starts by obtaining the data. For this project, the dataset is hosted on an S3 bucket and can be directly accessed using its URL. We use the pandas library to read the CSV file and load it into a DataFrame, which is essentially a table-like data structure that makes data manipulation and analysis more efficient.
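A minimal sketch of this step (the S3 URL below is a placeholder, not the project's actual location):

import pandas as pd

# Placeholder URL; substitute the actual S3 object URL for the project
DATA_URL = "https://example-bucket.s3.amazonaws.com/churn.csv"
df = pd.read_csv(DATA_URL)
df.head()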
Dataset Dimensions
The DataFrame contains 10,000 rows and 14 columns, offering a sufficiently large dataset for meaningful data analysis and model training.
Data Overview:
The dataset contains various customer details like 'CustomerId', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', etc., along with a target variable 'Exited', which tells us whether the customer has churned or not.
Features:
The 14 columns are RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, and the target Exited.
Basic Statistical Summary
A quick look at the summary statistics provides some valuable insights:
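The statistics referenced here come straight from pandas; a quick sketch:

# Count, mean, standard deviation, min, quartiles, and max for each numeric column
df.describe()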
Column Unique Values
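The per-column cardinality can be inspected with a one-liner:

# Number of distinct values per column, useful for spotting ID-like and categorical fields
df.nunique()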
By understanding the nature of our dataset, we lay the foundation for the subsequent data cleaning, feature engineering, and predictive modeling steps.
Data Preprocessing and Feature Engineering
Identifying Unique Customers and Non-Essential Columns
First and foremost, we verify the integrity of the dataset by ensuring that each row represents a unique customer.
df.shape[0], df.CustomerId.nunique() # Output: (10000, 10000)
Since both the number of rows and the number of unique CustomerIds are 10,000, we confirm that each row corresponds to a distinct customer. Given this, the RowNumber and CustomerId columns can be removed, as they don't contribute any meaningful information to our analysis.
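A sketch of the removal:

# Identifier columns carry no predictive signal, so we drop them
df = df.drop(columns=['RowNumber', 'CustomerId'])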
Categorical and Numerical Features
We categorize the features into different types to make it easier for subsequent preprocessing steps.
Here, it's worth noting that 'Tenure' and 'NumOfProducts' can be considered as ordinal variables, whereas 'HasCrCard' and 'IsActiveMember' are binary categorical variables.
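One plausible grouping, assuming the standard column names of this churn dataset:

cat_vars = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']  # categorical (incl. binary)
ord_vars = ['Tenure', 'NumOfProducts']                             # ordinal
num_vars = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']    # continuous numerical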
Separating Target Variable
Finally, we isolate the target variable, which is 'Exited' in this case, into a separate array for use in model training.
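A sketch:

y = df['Exited'].values           # target array for model training
df = df.drop(columns=['Exited'])  # df now holds only the features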
Critical Questions and Considerations for Data Understanding
Before diving into data modeling, it's crucial to question the data we have at hand and understand its limitations, as well as potential avenues for enrichment. Here are some key considerations:
Date/Time Column Missing
Snapshot or Time Series?
Interpreting Churn and Activity
Objectives and Goals
The ultimate aim is to distill the problem statement further, ideally into quantifiable metrics. More context or data can significantly impact the performance of the eventual model. Without knowing the answers to the above questions, any model we build will have inherent limitations, affecting its reliability and applicability.
By addressing these questions upfront, we are setting the stage for a more informed and effective data analysis and predictive modeling process.
Data Splitting Strategy and Evaluation Metrics
Data Splitting Approach
In the absence of a temporal variable or the possibility of time-series analysis, the data is randomly partitioned into three distinct sets (training, validation, and test) to ensure a comprehensive evaluation of the predictive models.
Here is how the split looks in terms of data shape and target mean:
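The original output isn't reproduced here; a minimal sketch of a stratified two-step split (the 80/10/10 proportions are an assumption), printing each set's shape and churn rate:

from sklearn.model_selection import train_test_split

# First carve out the test set, then split the remainder into train and validation
df_trainval, df_test, y_trainval, y_test = train_test_split(
    df, y, test_size=0.1, stratify=y, random_state=42)
df_train, df_val, y_train, y_val = train_test_split(
    df_trainval, y_trainval, test_size=1/9, stratify=y_trainval, random_state=42)

for name, X_, y_ in [('train', df_train, y_train),
                     ('val', df_val, y_val),
                     ('test', df_test, y_test)]:
    print(name, X_.shape, round(y_.mean(), 3))  # similar target means across sets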
Why This Strategy?
The dataset is split in a manner that ensures all sets are representative of the overall data distribution, as indicated by the similar means of the target variable (Exited) across the three sets. This helps in mitigating overfitting and provides a realistic estimate of how the model will perform on unseen data.
Considerations
By adhering to this robust data splitting strategy, we aim to develop a machine learning model that generalizes well to new data. This also sets the stage for the upcoming phases of model selection, tuning, and evaluation.
Exploratory Data Analysis - Univariate Plots of Numerical Variables
Analytical Approach
The univariate plots provide us with a first look at the data, helping us to understand the distribution of individual numerical variables. The following plots were constructed:
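The specific plot list isn't reproduced here; a sketch of how such univariate plots are typically produced, assuming matplotlib and seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram with a KDE overlay for each continuous variable
for col in ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']:
    sns.histplot(df_train[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()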
Key Observations
Label Encoding for Binary Variables in Our Dataset
During the data preprocessing phase of our project, it became evident that some of our categorical variables, particularly binary ones, needed encoding to be compatible with machine learning algorithms. We employed the 'Label Encoding' technique for this purpose. Here's a breakdown:
1. Direct Method Using Pandas:
Before employing any sophisticated tools, we attempted a direct approach using pandas:
# Convert 'Gender' column to category type and then map to codes (0 or 1)
df_train['Gender_cat'] = df_train.Gender.astype('category').cat.codes
# Displaying a sample for verification
df_train.sample(10)
# Dropping the temporarily created 'Gender_cat' column
df_train.drop('Gender_cat', axis=1, inplace=True)
This method provided a quick look at how encoding can be done without external libraries.
2. Using Scikit-learn's LabelEncoder:
For a more scalable and robust solution, we turned to scikit-learn's LabelEncoder:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# We fit only on the training dataset, treating validation and test sets as unseen data.
df_train['Gender'] = le.fit_transform(df_train['Gender'])
# Mapping the encoding for reference
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_name_mapping) # {'Female': 0, 'Male': 1}
3. Handling Unseen Categorical Levels:
We need to anticipate situations where new categorical values appear in the validation or test set that were not present in the training set.
# Encoding a value seen during training works as expected:
print(le.transform(['Male']))  # array([1])
# An unseen value like 'ABC' would make le.transform raise a ValueError,
# so we map via pandas instead, which yields NaN for unknown categories:
pd.Series(['ABC']).map(le_name_mapping)  # NaN
4. Encoding Gender for Validation and Test Sets:
After handling possible unseen values, we applied encoding to the validation and test sets:
df_val['Gender'] = df_val.Gender.map(le_name_mapping)
df_test['Gender'] = df_test.Gender.map(le_name_mapping)
# Filling any missing/NaN values that might arise due to new categorical levels
df_val['Gender'] = df_val['Gender'].fillna(-1)
df_test['Gender'] = df_test['Gender'].fillna(-1)
We used a placeholder value of -1 for any new categories that didn't exist in the training data.
Through label encoding, we transformed our binary categorical variables into a format suitable for machine learning models. By combining the direct pandas method with scikit-learn's more robust tooling, we ensured that our data preprocessing was both thorough and scalable.
Bivariate Analysis and Correlation Matrix:
A correlation matrix helps identify the linear relationships between features in a dataset. It aids in understanding which variables might be influencing each other and provides insights into potential multicollinearity, which can affect model performance. Understanding these relationships is crucial when making decisions about feature selection and interpretation of model results.
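A sketch of how such a matrix is commonly computed and visualized:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between the numeric columns
corr = df_train.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()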
Observations in Our Case:
In summary, while the correlations observed were not strong, they provide valuable insights into which features might be influential in predicting the target variable in the given dataset.
Individual Features Versus Their Distribution Across Target Variable Values
Upon examining the data, several key insights emerge regarding customer behavior:
The age distribution reveals that older customers are more inclined to leave the service than their younger counterparts, possibly indicating that the service caters better to younger demographics. The balance data further shows that customers with higher balances are more likely to exit, suggesting they may be seeking better value or superior services elsewhere.
In terms of gender, females exhibit a higher exit rate (24.8%) compared to males (16.5%), hinting at potential disparities in the service's appeal between the genders. Moreover, inactive members showcase an elevated propensity to exit (26.6%) compared to their active peers (14.4%), emphasizing the importance of customer engagement for retention.
Geographically, customers based in Germany present a significantly higher exit rate (32.5%) compared to those in France (16.1%), pointing towards possible regional challenges or discrepancies in offerings.
Interestingly, while holding two products correlates with customer loyalty, possessing three or more leads to starkly increased exit rates. This could signify that customers find managing numerous products cumbersome or not beneficial. As we move forward, these insights can pave the way for devising targeted strategies to bolster customer satisfaction and retention.
Feature Engineering:
In the initial phase of data analysis, we undertook feature engineering to create new variables based on existing data. This is a crucial step as it can enhance the predictive power of a model by introducing new information derived from current variables.
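The exact engineered variables aren't enumerated in the text; features of the following kind are typical for this dataset (the names below are hypothetical illustrations, not necessarily the project's actual features):

# Hypothetical derived features combining existing columns
df_train['BalanceSalaryRatio'] = df_train['Balance'] / (df_train['EstimatedSalary'] + 1)
df_train['TenureByAge'] = df_train['Tenure'] / df_train['Age']
df_train['BalancePerProduct'] = df_train['Balance'] / df_train['NumOfProducts']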
The heatmap visually depicts the correlation matrix of these newly engineered features with the target variable, 'Exited'. From this:
Feature Selection Analysis: Recursive Feature Elimination (RFE)
The primary objective is to identify the most significant variables or features that contribute to the prediction of customer churn. An initial set of features was shortlisted based on Exploratory Data Analysis (EDA) and bivariate analysis. The intent was to validate and potentially refine this list using the RFE methodology.
Methodology:
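A minimal RFE sketch with scikit-learn, assuming X_train is the fully encoded, numeric training DataFrame (the choice of 10 retained features is illustrative):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest feature until the desired number remains
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_train, y_train)
print(X_train.columns[rfe.support_])  # the features RFE retains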
The dataset shows a notable class imbalance. The majority class '0' (customers who did not churn) has a significantly larger number of samples compared to the minority class '1' (customers who churned). This imbalance is common in many real-world datasets, especially in scenarios where the event of interest, like churning, is less frequent. Addressing this imbalance is crucial for building a robust predictive model, as it ensures the minority class is not overshadowed by the majority one.
The class imbalance ratio showed that class '1' had roughly a quarter as many samples as class '0'. To compensate for this imbalance, we assigned class weights: class '1' received a weight of approximately 3.93, while class '0' retained a weight of 1.
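The ~3.93 weight is simply the inverse class ratio; a sketch of how it can be derived and used:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Inverse-frequency weight for the minority class
neg, pos = np.bincount(y_train)
class_weight = {0: 1.0, 1: neg / pos}  # roughly 3.93 given this imbalance

logreg = LogisticRegression(class_weight=class_weight, max_iter=1000)
logreg.fit(X_train, y_train)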
With these weights in place, a Logistic Regression model was defined and trained. The model parameters showed varying influence from different features on the model's predictions. Subsequently, the model's training performance was assessed using several metrics. The ROC-AUC score stood at around 0.71, and the recall for the minority class (1) was approximately 0.70, which indicates the model's enhanced sensitivity to the minority class after weighting. However, the precision for the minority class remained low, causing the F1-score for the minority class to be around 0.50. This suggests that while the model is identifying the positive class more frequently, it's also making some false positive predictions.
The model's performance on the validation set was similarly evaluated. The ROC-AUC score remained consistent at 0.70, and the recall was also at 0.70. The precision for the minority class in the validation set was 0.40, leading to an F1-score of 0.51. This confirmed the model's behavior observed in the training set, implying that while the weighting improved the recall, there's a trade-off in precision, especially for the minority class.
In summary, addressing the class imbalance improved the model's sensitivity to the minority class but at the expense of precision. Such trade-offs need to be carefully considered, especially in applications where false positives have significant consequences.
In addressing the customer churn prediction challenge, a Support Vector Machine (SVM) model was utilized, emphasizing its capacity to handle class imbalances by assigning differential class weights. Specifically, non-churned customers (class '0') were given a weight of 1.0, while churned customers (class '1') were weighted at 3.92. Post-training, the SVM displayed a consistent performance across training and validation datasets, achieving around 72% accuracy on the training set and 70% on the validation set, with a notable ability to detect customers likely to churn.
The visualization showcases the decision boundaries of two linear classification models: Logistic Regression (LogReg) and Support Vector Machine (SVM) on a 2-dimensional dataset. To produce this plot, a dimensionality reduction technique, Principal Component Analysis (PCA), was employed to transform the original dataset (which had more than two features) into a 2-dimensional space. After applying PCA, the first two principal components explained approximately 26.03% and 18.79% of the variance, respectively. Upon training both the LogReg and SVM models on this reduced dataset, their respective decision boundaries were plotted. The background shading differentiates the regions classified by each model, with the contour lines representing the SVM's boundary. Points in the scatterplot represent individual data instances, color-coded based on their true labels, with blue circles representing class '0' and orange circles representing class '1'. The visualization assists in understanding how each model classifies data in this 2D space, providing insights into their linear separation capabilities.
The visualization presents the decision boundary of a Decision Tree classifier trained on a 2-dimensional dataset. To generate this 2D dataset, the original data, which had more than two features, underwent a dimensionality reduction using Principal Component Analysis (PCA). The first two principal components from the PCA accounted for about 51.07% and 48.93% of the variance, respectively. The plotted data points represent individual data samples, with blue circles indicating class '0' and orange circles indicating class '1'. The shaded regions reflect the classifications made by the Decision Tree: the peach area denotes class '0' and the grey area denotes class '1'. The non-linear boundaries dividing these regions exemplify the tree's ability to capture intricate patterns in the data. This plot provides an intuitive visual representation of how the Decision Tree model differentiates between the two classes in this reduced feature space.
The two visualizations depict decision boundaries for linear models (Logistic Regression and SVM) and a non-linear model (Decision Tree) trained on a 2-dimensional dataset derived via PCA. The linear models exhibit a straight-line boundary, indicating a simplistic distinction between the two classes. In contrast, the Decision Tree presents a more intricate, non-linear decision boundary, capturing more complex patterns within the data. While linear models rely on a straightforward distinction, often yielding generalized interpretations, the Decision Tree's ability to form segmented areas showcases its adaptability to underlying data structures, but might suggest a higher susceptibility to overfitting. The choice between linear and non-linear models depends on the data's nature and the desired trade-off between interpretability and model flexibility.
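A sketch of how such a boundary plot can be generated (shown for the weighted logistic regression; the SVM and decision tree variants follow the same pattern):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Project the scaled features onto the first two principal components
X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_train))
clf = LogisticRegression(class_weight=class_weight).fit(X2, y_train)

# Evaluate the classifier over a grid covering the 2D plane
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 200),
                     np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                # shaded decision regions
plt.scatter(X2[:, 0], X2[:, 1], c=y_train, s=10)  # points colored by true label
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.show()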
Machine learning pipeline for a Decision Tree classifier
Through this pipeline, categorical data is first transformed into numeric format, new features are generated, specific columns are scaled for normalization, and finally, the Decision Tree model is trained. The model considers class imbalance by assigning different weights to classes. When evaluated on validation data, the classifier shows good recall for predicting customers who exited the bank, albeit at the cost of precision, indicating it might be identifying too many false positives for that category.
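A sketch of such a pipeline with scikit-learn (the feature-generation step is omitted for brevity, and the column lists assume the grouping defined earlier):

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

preprocess = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Geography', 'Gender']),
        ('num', StandardScaler(), ['CreditScore', 'Age', 'Balance']),
    ],
    remainder='passthrough')

pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', DecisionTreeClassifier(class_weight=class_weight, random_state=42)),
])
pipe.fit(df_train, y_train)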
Next, we do a spot-check and evaluate the performance of several machine learning models using k-fold cross-validation on a banking dataset. These models are housed within a "model zoo" which includes models like Random Forest (RF), Light Gradient Boosting Machine (LGBM), XGBoost (XGB), k-Nearest Neighbors (kNN), and various Naive Bayes models (Gaussian, Multinomial, Complement, and Bernoulli).
Initially, the training data is prepared and class weights are calculated to account for any class imbalance in the target variable. Certain features like 'CreditScore', 'Age', and 'Balance' are identified for scaling later on.
The models selected predominantly fall under the tree model category. Tree models, including RF, LGBM, and XGB, are often chosen for their ability to capture non-linear relationships and provide feature importance. However, the code also considers kNN and Naive Bayes models.
Automated pipelines are set up for each model. If a model requires feature scaling, as kNN does, the pipeline includes a scaling step; for tree-based and Naive Bayes models, which do not require scaling, this step is omitted.
Subsequently, a cross-validation approach is employed to evaluate each model's performance based on recall and F1-score metrics. The results are then printed out for each model under both metrics.
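A sketch of the spot-check loop over a subset of the zoo (kNN gets a scaling step inside its own pipeline; the other models run on the unscaled features):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

model_zoo = {
    'RF':   RandomForestClassifier(class_weight=class_weight, random_state=42),
    'LGBM': LGBMClassifier(class_weight=class_weight, random_state=42),
    'XGB':  XGBClassifier(scale_pos_weight=class_weight[1], random_state=42),
    'kNN':  Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())]),
    'GNB':  GaussianNB(),
}

for name, model in model_zoo.items():
    for metric in ['recall', 'f1']:
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring=metric)
        print(f'{name} {metric}: {scores.mean():.3f} (+/- {scores.std():.3f})')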
Upon evaluating the models, LightGBM emerges as the preferred choice for further hyperparameter tuning. This decision is grounded in its performance: it showcased the highest recall and was a close second in terms of F1-score. Recall, which measures the proportion of actual positives correctly identified, is especially crucial in contexts where missing a positive instance can have significant implications, like in banking scenarios where predicting customer churn is vital.
Hyperparameter Tuning:
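The tuning details aren't reproduced in the text; a minimal randomized-search sketch over a few common LightGBM parameters (the search space is illustrative):

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMClassifier

param_dist = {
    'num_leaves': randint(16, 128),
    'learning_rate': uniform(0.01, 0.2),
    'n_estimators': randint(100, 600),
    'min_child_samples': randint(10, 60),
}
search = RandomizedSearchCV(
    LGBMClassifier(class_weight=class_weight, random_state=42),
    param_distributions=param_dist, n_iter=30, scoring='recall', cv=5,
    random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)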
Training Final Best Model and Saving for Deployment:
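A sketch of refitting the tuned model and persisting it with joblib (the file name is arbitrary):

import joblib

best_model = search.best_estimator_
best_model.fit(X_train, y_train)  # refit on the full training data
joblib.dump(best_model, 'lgbm_churn_model.joblib')

# At deployment time, reload and score new customers:
model = joblib.load('lgbm_churn_model.joblib')
churn_proba = model.predict_proba(X_test)[:, 1]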
The boxplot visualizes the predicted churn probabilities generated by the final LightGBM model for two categories: 0 (customers who did not churn) and 1 (customers who churned). For customers in category '0', the predicted probabilities cluster around the lower end, indicating a lower likelihood of churning. For the '1' category, the probabilities are notably higher, reflecting the model's prediction of elevated churn risk. The boxplot's range and interquartile spread indicate the variance in predictions, and the presence of outliers, especially in the '0' category, hints at instances where the model is less certain.
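A sketch of this diagnostic on the validation set:

import seaborn as sns
import matplotlib.pyplot as plt

# Predicted churn probability for each validation customer, grouped by true label
proba = best_model.predict_proba(X_val)[:, 1]
sns.boxplot(x=y_val, y=proba)
plt.xlabel('True class (Exited)'); plt.ylabel('Predicted churn probability')
plt.show()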
SHAP:
This plot represents the average impact of different features on the model's predictions using SHAP (SHapley Additive exPlanations) values. Each row corresponds to a feature, and the length of the bar indicates the average magnitude of that feature's impact on the model's output. The colors differentiate between the two classes: Class 0 (customers who did not churn) and Class 1 (customers who churned).
For example, the feature "Age" has a significant impact on the model's predictions. The longer red bar indicates that as age increases, it tends to push the model's prediction towards classifying a customer as likely to churn (Class 1). Conversely, the blue bar indicates the influence of the feature in predicting that a customer will not churn (Class 0).
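A sketch of producing this plot with the shap library (for binary LightGBM models, older shap versions return one SHAP array per class):

import shap

# TreeExplainer supports LightGBM natively
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_val)

# Bar chart of mean |SHAP value| per feature, split by class
shap.summary_plot(shap_values, X_val, plot_type='bar')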
Some key takeaways from the plot:
In essence, this visualization provides a ranked overview of which features are most impactful in the model's decision-making process, aiding in understanding and interpreting the model's behavior.
Model's performance on unseen data:
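A sketch of scoring the held-out test set:

from sklearn.metrics import classification_report, roc_auc_score

y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
print('ROC-AUC:', roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))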
Observations:
Comparison with Model's Performance:
In conclusion, the model seems to perform well on unseen future data as it can effectively distinguish between churners and non-churners. However, there are instances where the model was less confident or possibly made misclassifications. This insight is valuable as it can guide further fine-tuning of the model or influence business decisions based on the model's predictions.
Model Evaluation based on initial business goals:
After our diligent data science efforts, we've made promising strides in developing a predictive model for identifying potential churn customers. Evaluating our model against the initial benchmarks:
Strategic User Segmentation:
Segmentation allows for a personalized approach. For example, if data indicates that males from Germany with active memberships and credit cards are prone to churn, prioritizing this segment can lead to a higher ROI. Tailoring strategies according to data-driven insights results in more effective customer retention efforts.
Key Takeaways:
Conclusion:
Throughout this portfolio, we've delved deep into the intricacies of customer churn prediction, providing a comprehensive look at data analysis, modeling, and actionable business insights. We've recognized the importance of precision, recall, and F1-score in determining a model's effectiveness, ensuring our strategies align with key business objectives. Our journey underscored the significance of data-driven decision-making, and how it can lead to tangible, positive business outcomes when combined with a deep understanding of user segmentation and business priorities. By also highlighting potential challenges in deploying models and the necessity of continuous monitoring and adaptation, we've provided a holistic view of the entire data science process.
As we wrap up this project, it stands as a testament to the power of data science in driving business forward, showcasing a perfect blend of technical prowess and business acumen. Here's to leveraging these insights for future endeavors and to the relentless pursuit of excellence in the dynamic field of data science!