Predicting the Unpredictable: A Data-Driven Approach to Arresting Customer Churn in Banking

Customer Churn Analysis using Ensemble Techniques

The banking industry is going through a seismic shift, characterized by changing customer expectations and an increasingly competitive landscape. Customer churn—defined as the loss of customers to competitors—poses a significant and immediate threat to long-term profitability. Bank XYZ, like many others in the sector, has found itself grappling with this challenge. Over the past few quarters, there has been a troubling increase in the number of customers closing their accounts and switching to competitor banks. This exodus has had a domino effect, significantly impacting quarterly revenues and threatening to derail the bank's financial projections for the ongoing year. If unaddressed, this could lead to a drastic drop in stock prices and market capitalization.

Understanding customer churn and finding actionable insights to mitigate it has thus become a strategic imperative for Bank XYZ. A multidisciplinary team of business analysts, product managers, engineers, and data scientists has been assembled to address this critical issue. The goal is clear: leverage data analytics and predictive modeling to understand the patterns of customer churn and develop targeted interventions to retain at-risk customers.

This project aims to delve deep into this challenge, offering a model that not only predicts which customers are likely to churn but also estimates when this churn will happen. The insights generated through this project will serve as a roadmap for targeted customer retention strategies, ultimately helping to stabilize and potentially increase Bank XYZ's revenue streams.

Data Science Metrics:

The objective of our data science efforts is to create a predictive model that performs robustly in identifying potential churn customers. Specifically, we aim to achieve the following benchmarks:

  • Recall: Achieve a recall rate of greater than 70%. This ensures that we are correctly identifying at least 70% of all the customers who are likely to churn.
  • Precision: Target a precision rate of greater than 70%. This will mean that at least 70% of the customers our model flags as potential churn risks are indeed at risk.
  • F1-Score: Aim for an F1-score of greater than 70% to ensure a balanced trade-off between Precision and Recall, which is critical in a business context where both false positives and false negatives have significant implications.

Business Metrics:

For the business side, the goal is to enable targeted interventions that would result in a tangible decrease in customer churn rates. Drawing from our data science metric targets:

  • Improvement in Churn Rate: If our model successfully identifies 70% of customers likely to churn (based on our Recall target), our business interventions should aim to retain at least half of these identified customers. This would translate into a 35% improvement in churn rate through various strategies like targeted offers, personalized communication, and addressing specific grievances.

By setting these metrics, we strive to create a model that is not only statistically sound but also has a real and significant business impact. Our multi-disciplinary team will use these metrics as the north star for performance, ensuring alignment across business and data science initiatives.

Data Import and Initial Exploration

Reading the Dataset

Our journey starts by obtaining the data. For this project, the dataset is hosted on an S3 bucket and can be directly accessed using its URL. We use the pandas library to read the CSV file and load it into a DataFrame, which is essentially a table-like data structure that makes data manipulation and analysis more efficient.
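A minimal sketch of this step; the URL below is a placeholder, since the actual S3 path is project-specific:

import pandas as pd

# Hypothetical S3 URL; substitute the project's actual bucket path
DATA_URL = "https://example-bucket.s3.amazonaws.com/bank_churn.csv"
df = pd.read_csv(DATA_URL)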

Dataset Dimensions

The DataFrame contains 10,000 rows and 14 columns, offering a sufficiently large dataset for meaningful data analysis and model training.

Data Overview:

The dataset contains various customer details like 'CustomerId', 'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', etc., along with a target variable 'Exited', which tells us whether the customer has churned or not.

Features:

  1. RowNumber: Index or identifier for each row.
  2. CustomerId: Unique identifier for each customer.
  3. Surname: Last name of the customer.
  4. CreditScore: Credit score of the customer.
  5. Geography: Country where the customer resides.
  6. Gender: Gender of the customer.
  7. Age: Age of the customer.
  8. Tenure: How long the customer has been with the bank.
  9. Balance: Current balance of the customer's account.
  10. NumOfProducts: Number of products the customer uses.
  11. HasCrCard: Whether the customer has a credit card or not (1 for Yes, 0 for No).
  12. IsActiveMember: Whether the customer is active (1 for Yes, 0 for No).
  13. EstimatedSalary: The estimated salary of the customer.
  14. Exited: Whether the customer has exited (churned) or not (1 for Yes, 0 for No).

Basic Statistical Summary

A quick look at the summary statistics provides some valuable insights:

  • Credit Score: Ranges from 350 to 850, with a mean of approximately 650.
  • Age: The customers range from 18 to 92 years old, with a mean age of approximately 39.
  • Balance: Account balance varies widely, with a mean of around $76,485.
  • Estimated Salary: Mean estimated salary is around $100,090.

Column Unique Values

  • Geography: Data from three countries.
  • Gender: Information for two genders, Male and Female.
  • Surname: Contains 2,932 unique surnames.
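Both the summary statistics and the unique-value counts above come from standard pandas calls; a minimal sketch:

# Summary statistics for the numerical columns
print(df.describe().T[['min', 'max', 'mean']])

# Unique-value counts for selected categorical columns
for col in ['Geography', 'Gender', 'Surname']:
    print(col, df[col].nunique())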

By understanding the nature of our dataset, we lay the foundation for the subsequent data cleaning, feature engineering, and predictive modeling steps.

Data Preprocessing and Feature Engineering

Identifying Unique Customers and Non-Essential Columns

First and foremost, we verify the integrity of the dataset by ensuring that each row represents a unique customer.

df.shape[0], df.CustomerId.nunique()  # Output: (10000, 10000)

Since the number of rows and the number of unique CustomerIds both are 10,000, we confirm that each row corresponds to a distinct customer. Given this, RowNumber and CustomerId columns can be removed as they don't contribute any meaningful information for our analysis.
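The corresponding cleanup is a one-liner:

# Neither column carries predictive signal: RowNumber is just an index
# and CustomerId is unique per row
df = df.drop(columns=['RowNumber', 'CustomerId'])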

Categorical and Numerical Features

We categorize the features into different types to make it easier for subsequent preprocessing steps.

  • Numerical Features: 'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary'
  • Categorical Features: 'Surname', 'Geography', 'Gender', 'HasCrCard', 'IsActiveMember'
  • Target Variable: 'Exited'

Here, it's worth noting that 'Tenure' and 'NumOfProducts' can be considered as ordinal variables, whereas 'HasCrCard' and 'IsActiveMember' are binary categorical variables.

Separating Target Variable

Finally, we isolate the target variable, which is 'Exited' in this case, into a separate array for use in model training.
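A minimal sketch:

# Separate the target from the feature columns
y = df['Exited'].values
df = df.drop(columns=['Exited'])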

Critical Questions and Considerations for Data Understanding

Before diving into data modeling, it's crucial to question the data we have at hand and understand its limitations, as well as potential avenues for enrichment. Here are some key considerations:

Date/Time Column Missing

  1. No date/time column: A temporal feature could provide insights into customer behavior over time and seasonality effects. Without it, the dataset lacks this important contextual clue.

Snapshot or Time Series?

  1. Snapshot Date: The data appears to be a snapshot. Features like 'Balance', 'Tenure', 'NumOfProducts', and 'EstimatedSalary' would indeed have different values at different times. Knowing the date can help understand the economic context.
  2. Single or Multiple Dates: Are all these features measured on the same date for all customers? This is important for ensuring that comparisons are fair.
  3. Frequency of Updates: How frequently is each feature updated? This could be crucial for real-time prediction models.
  4. Time Series Possibility: Features over a period of time could provide a richer dataset and enable more accurate models.

Interpreting Churn and Activity

  1. Exited but still active?: Some customers have 'Exited' set to 1 but still have a balance or multiple products. This needs clarification: have they exited from a particular service or the bank entirely? Or is this data from just before their exit?
  2. IsActiveMember: This binary feature is overly simplified. Transaction frequency, kind, and amount could give a more nuanced view of a customer's activity level.
  3. Transaction Patterns: Transaction frequency and types could be more indicative of customer churn compared to a static snapshot of features. For instance, a customer who transacts daily is likely to be more loyal than someone transacting annually.

Objectives and Goals

The ultimate aim is to distill the problem statement further, ideally into quantifiable metrics. More context or data can significantly impact the performance of the eventual model. Without knowing the answers to the above questions, any model we build will have inherent limitations, affecting its reliability and applicability.

By addressing these questions upfront, we are setting the stage for a more informed and effective data analysis and predictive modeling process.


Data Splitting Strategy and Evaluation Metrics

Data Splitting Approach

In the absence of a temporal variable or the possibility of time-series analysis, the data is randomly partitioned into three distinct sets to ensure a comprehensive evaluation of the predictive models:

  1. Training Set (df_train, y_train): This set comprises 79.2% of the data and is used for training the models.
  2. Validation Set (df_val, y_val): Making up 10.8% of the data, this set is used to tune the hyperparameters and for initial evaluation of the model performance.
  3. Test Set (df_test, y_test): This is the holdout set that contains 10% of the data. It's used to estimate the model performance on unseen/new data.

Here is how the split looks in terms of data shape and target mean:

  • Training set: 7920 samples, Target mean ~ 20.30%
  • Validation set: 1080 samples, Target mean ~ 22.04%
  • Test set: 1000 samples, Target mean ~ 19.10%
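A sketch of this two-stage split with scikit-learn; the random_state value is illustrative. Splitting off 10% as the test set and then 12% of the remainder as validation yields exactly the 79.2/10.8/10 proportions above (0.9 × 0.12 = 10.8%):

from sklearn.model_selection import train_test_split

# Stage 1: hold out 10% as the test set
df_rest, df_test, y_rest, y_test = train_test_split(
    df, y, test_size=0.10, random_state=42)
# Stage 2: take 12% of the remaining 90% as validation (10.8% overall)
df_train, df_val, y_train, y_val = train_test_split(
    df_rest, y_rest, test_size=0.12, random_state=42)
print(len(df_train), len(df_val), len(df_test))  # 7920 1080 1000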

Why This Strategy?

The dataset is split in a manner that ensures all sets are representative of the overall data distribution, as indicated by the similar means of the target variable (Exited) across the three sets. This helps in mitigating overfitting and provides a realistic estimate of how the model will perform on unseen data.

Considerations

  1. Random State: Setting a random state ensures that the splits you generate are reproducible. This is critical for tracking changes and for collaborative work.
  2. Stratification: Given that the target variable classes are imbalanced, stratified sampling could be considered in future experiments for ensuring each set has a similar distribution of the target variable.
  3. Test Set Isolation: The test set is only used at the very end of the model development process, providing an unbiased evaluation metric.

By adhering to this robust data splitting strategy, we aim to develop a machine learning model that generalizes well to new data. This also sets the stage for the upcoming phases of model selection, tuning, and evaluation.


Exploratory Data Analysis - Univariate Plots of Numerical Variables

Analytical Approach

The univariate plots provide us with a first look at the data, helping us to understand the distribution of individual numerical variables. Boxplots, violin plots, histograms, and KDE plots were constructed for the numerical features.

Key Observations

  1. CreditScore: The boxplot does not show any significant outliers, indicating that the data for this feature is relatively well-behaved.
  2. Age: A similar case is made for age, where the boxplot shows the data spread with no extreme outliers.
  3. Tenure: The violin plot shows a near-uniform distribution, suggesting that all tenures are almost equally likely.
  4. Balance: The violin plot of this feature shows a bimodal distribution, suggesting that there are two common states for this variable.
  5. NumOfProducts: The histogram shows that most customers tend to hold one or two banking products, with very few venturing beyond that.
  6. EstimatedSalary: The KDE shows a fairly uniform distribution, indicating that it may not be a strong predictor of customer churn.
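A sketch of how these univariate plots can be produced with seaborn; the plot types match those described above:

import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
sns.boxplot(x=df_train['CreditScore'], ax=axes[0, 0])
sns.boxplot(x=df_train['Age'], ax=axes[0, 1])
sns.violinplot(x=df_train['Tenure'], ax=axes[0, 2])
sns.violinplot(x=df_train['Balance'], ax=axes[1, 0])
sns.histplot(df_train['NumOfProducts'], ax=axes[1, 1])
sns.kdeplot(df_train['EstimatedSalary'], ax=axes[1, 2])
plt.tight_layout()
plt.show()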

Label Encoding for Binary Variables in Our Dataset

During the data preprocessing phase of our project, it became evident that some of our categorical variables, particularly binary ones, needed encoding to be compatible with machine learning algorithms. We employed the 'Label Encoding' technique for this purpose. Here's a breakdown:

1. Direct Method Using Pandas:

Before employing any sophisticated tools, we attempted a direct approach using pandas:

# Convert 'Gender' column to category type and then map to codes (0 or 1)
df_train['Gender_cat'] = df_train.Gender.astype('category').cat.codes
# Displaying a sample for verification
df_train.sample(10)
# Dropping the temporarily created 'Gender_cat' column
df_train.drop('Gender_cat', axis=1, inplace=True)

This method provided a quick look at how encoding can be done without external libraries.

2. Using Scikit-learn's LabelEncoder:

For a more scalable and robust solution, we turned to scikit-learn's LabelEncoder:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# We fit only on the training dataset, treating validation and test sets as unseen data.
df_train['Gender'] = le.fit_transform(df_train['Gender'])

# Mapping the encoding for reference
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_name_mapping)  # {'Female': 0, 'Male': 1}

3. Handling Unseen Categorical Levels:

We need to anticipate situations where new categorical values appear in the validation or test set that were not present in the training set.

# Values seen during training transform cleanly:
print(le.transform(['Male']))  # array([1])
# An unseen value like 'ABC' would make le.transform raise a ValueError,
# so we map via pandas instead, which yields NaN for unseen levels:
print(pd.Series(['ABC']).map(le_name_mapping))  # NaN

4. Encoding Gender for Validation and Test Sets:

After handling possible unseen values, we applied encoding to the validation and test sets:

df_val['Gender'] = df_val.Gender.map(le_name_mapping)
df_test['Gender'] = df_test.Gender.map(le_name_mapping)
# Fill NaNs arising from categorical levels unseen during training
df_val['Gender'] = df_val['Gender'].fillna(-1)
df_test['Gender'] = df_test['Gender'].fillna(-1)

We used a placeholder value of -1 for any new categories that didn't exist in the training data.

Through the employment of Label Encoding, we transformed our binary categorical variables into a format suitable for machine learning models. By considering both direct methods and leveraging the robustness of scikit-learn, we ensured that our data preprocessing was both thorough and scalable.

Bivariate Analysis and Correlation Matrix:

A correlation matrix helps identify the linear relationships between features in a dataset. It aids in understanding which variables might be influencing each other and provides insights into potential multicollinearity, which can affect model performance. Understanding these relationships is crucial when making decisions about feature selection and interpretation of model results.
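A minimal sketch of building such a matrix, assuming the categorical columns (including the one-hot country indicators referenced below) have already been encoded to numeric form:

import seaborn as sns
import matplotlib.pyplot as plt

# Append the target so its correlations with each feature appear too
corr = df_train.assign(Exited=y_train).corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()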

Observations in Our Case:

  1. No features displayed a high correlation with the target variable, suggesting that no single feature dominates the prediction.
  2. 'Age' and 'Balance' showed positive correlations, indicating they might have a role in influencing the target variable.
  3. Categorical variables, especially 'IsActiveMember' and countries 'Germany' and 'France', exhibited noticeable correlations, suggesting that these categories might impact the outcome to some degree.

In summary, while the correlations observed were not strong, they provide valuable insights into which features might be influential in predicting the target variable in the given dataset.

Individual features versus their distribution across target variable values

Upon examining the data, several key insights emerge regarding customer behavior:

The age distribution reveals that older customers are more inclined to leave the service than their younger counterparts, possibly indicating that the service caters better to younger demographics. The balance data further shows that customers with higher balances are more likely to exit, suggesting they might be seeking enhanced value or superior services elsewhere.

In terms of gender, females exhibit a higher exit rate (24.8%) compared to males (16.5%), hinting at potential disparities in the service's appeal between the genders. Moreover, inactive members showcase an elevated propensity to exit (26.6%) compared to their active peers (14.4%), emphasizing the importance of customer engagement for retention.

Geographically, customers based in Germany present a significantly higher exit rate (32.5%) compared to those in France (16.1%), pointing towards possible regional challenges or discrepancies in offerings.

Interestingly, while holding two products correlates with customer loyalty, possessing three or more leads to starkly increased exit rates. This could signify that customers find managing numerous products cumbersome or not beneficial. As we move forward, these insights can pave the way for devising targeted strategies to bolster customer satisfaction and retention.


Feature Engineering:

In the initial phase of data analysis, we undertook feature engineering to create new variables based on existing data. This is a crucial step as it can enhance the predictive power of a model by introducing new information derived from current variables.

The heatmap visually depicts the correlation matrix of these newly engineered features with the target variable, 'Exited'. From this:

  • bal_per_product has a correlation of 0.1 with 'Exited'. This implies that an increase in balance per product might slightly increase the chance of a customer exiting the bank.
  • bal_by_est_salary shows a weak correlation of 0.03 with 'Exited', indicating that the relationship between balance relative to estimated salary and the likelihood of exiting is not very strong.
  • tenure_age_ratio has a negative correlation of -0.12 with 'Exited'. This suggests that customers with a higher tenure to age ratio are somewhat less likely to leave the bank.
  • age_surname_mean_churn has a very weak correlation of 0.014 with 'Exited', suggesting that this feature, as of now, might not have a significant linear relationship with our target.
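A sketch of how these features can be derived. The construction of age_surname_mean_churn is an assumption here, since only its name and correlation are given above:

# Ratio features from existing columns
df_train['bal_per_product'] = df_train['Balance'] / df_train['NumOfProducts']
df_train['bal_by_est_salary'] = df_train['Balance'] / df_train['EstimatedSalary']
df_train['tenure_age_ratio'] = df_train['Tenure'] / df_train['Age']

# Hypothetical sketch: mean churn rate per surname (computed on
# training data only to avoid leakage), combined with age
surname_churn = df_train.assign(Exited=y_train).groupby('Surname')['Exited'].mean()
df_train['age_surname_mean_churn'] = df_train['Age'] * df_train['Surname'].map(surname_churn)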

Feature Selection Analysis: Recursive Feature Elimination (RFE)

The primary objective is to identify the most significant variables or features that contribute to the prediction of customer churn. An initial set of features was shortlisted based on Exploratory Data Analysis (EDA) and bivariate analysis. The intent was to validate and potentially refine this list using the RFE methodology.

Methodology:

  1. Initial Features: Based on EDA and bivariate analysis, the following features were highlighted: Age, Gender, Balance, Number of Products, Active Membership status, geographical variables, balance per product, and tenure-age ratio.
  2. RFE with Logistic Regression: Model used: Logistic Regression, a linear predictive model primarily used for binary classification tasks. Features selected: Gender, HasCrCard, IsActiveMember, country indicators (France, Germany, Spain), Age, NumOfProducts, Surname (encoded), and tenure-age ratio.
  3. RFE with Decision Trees: Model used: Decision Tree Classifier, which uses a tree structure to make decisions based on feature values. Features selected: IsActiveMember, country (especially Germany), Age, NumOfProducts, EstimatedSalary, Surname (encoded), balance per product, balance divided by estimated salary, tenure-age ratio, and age-surname mean churn rate. A code sketch of the RFE step follows below.
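A minimal sketch of the RFE runs, assuming X_train is the encoded training DataFrame containing the candidate features; the number of features to keep (10) is illustrative:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

for estimator in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    rfe = RFE(estimator=estimator, n_features_to_select=10)
    rfe.fit(X_train, y_train)
    # Columns flagged as important by this estimator
    print(type(estimator).__name__, list(X_train.columns[rfe.support_]))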

Baseline Model:

In our analysis, we established a baseline using the Logistic Regression model, training it on features determined vital through RFE. The chosen features were split into categorical and numerical types, and the numerical data was scaled to improve the logistic regression's performance. For evaluation, we selected the ROC-AUC score, F1-score, recall, confusion matrix, and classification report to provide a holistic view of the model's efficacy. This baseline model acts as our benchmark against which we can assess the performance of subsequent, potentially more intricate, models.

Solving Class Imbalance:

The dataset shows a notable class imbalance. The majority class '0' (customers who did not churn) has significantly more samples than the minority class '1' (customers who churned). This imbalance is common in many real-world datasets, especially where the event of interest, like churning, is infrequent. Addressing this imbalance is crucial for building a robust predictive model, as it ensures the minority class is not overshadowed by the majority one.

The class imbalance ratio revealed that class '1' had roughly one-quarter as many samples as class '0'. To compensate, we assigned class weights: class '1' received a weight of approximately 3.93, while class '0' retained a weight of 1.
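The weight is simply the inverse class-frequency ratio; a minimal sketch:

import numpy as np

# ~79.7% negatives / ~20.3% positives ≈ 3.93
n_pos, n_neg = np.sum(y_train == 1), np.sum(y_train == 0)
class_weights = {0: 1.0, 1: n_neg / n_pos}
print(class_weights)  # {0: 1.0, 1: ~3.93}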

With these weights in place, a Logistic Regression model was defined and trained. The model parameters showed varying influence from different features on the model's predictions. Subsequently, the model's training performance was assessed using several metrics. The ROC-AUC score stood at around 0.71, and the recall for the minority class (1) was approximately 0.70, which indicates the model's enhanced sensitivity to the minority class after weighting. However, the precision for the minority class remained low, causing the F1-score for the minority class to be around 0.50. This suggests that while the model is identifying the positive class more frequently, it's also making some false positive predictions.

The model's performance on the validation set was similarly evaluated. The ROC-AUC score remained consistent at 0.70, and the recall was also at 0.70. The precision for the minority class in the validation set was 0.40, leading to an F1-score of 0.51. This confirmed the model's behavior observed in the training set, implying that while the weighting improved the recall, there's a trade-off in precision, especially for the minority class.

In summary, addressing the class imbalance improved the model's sensitivity to the minority class but at the expense of precision. Such trade-offs need to be carefully considered, especially in applications where false positives have significant consequences.


In addressing the customer churn prediction challenge, a Support Vector Machine (SVM) model was utilized, emphasizing its capacity to handle class imbalances by assigning differential class weights. Specifically, non-churned customers (class '0') were given a weight of 1.0, while churned customers (class '1') were weighted at 3.92. Post-training, the SVM displayed a consistent performance across training and validation datasets, achieving around 72% accuracy on the training set and 70% on the validation set, with a notable ability to detect customers likely to churn.
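A sketch of the weighted SVM, assuming a scaled feature matrix X_train_scaled (SVMs are sensitive to feature scale):

from sklearn.svm import SVC

svm = SVC(kernel='linear', class_weight={0: 1.0, 1: 3.92})
svm.fit(X_train_scaled, y_train)
print(svm.score(X_val_scaled, y_val))  # ~0.70 accuracy on validation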

The visualization showcases the decision boundaries of two linear classification models: Logistic Regression (LogReg) and Support Vector Machine (SVM) on a 2-dimensional dataset. To produce this plot, a dimensionality reduction technique, Principal Component Analysis (PCA), was employed to transform the original dataset (which had more than two features) into a 2-dimensional space. After applying PCA, the first two principal components explained approximately 26.03% and 18.79% of the variance, respectively. Upon training both the LogReg and SVM models on this reduced dataset, their respective decision boundaries were plotted. The background shading differentiates the regions classified by each model, with the contour lines representing the SVM's boundary. Points in the scatterplot represent individual data instances, color-coded based on their true labels, with blue circles representing class '0' and orange circles representing class '1'. The visualization assists in understanding how each model classifies data in this 2D space, providing insights into their linear separation capabilities.

The visualization presents the decision boundary of a Decision Tree classifier trained on a 2-dimensional dataset. To generate this 2D dataset, the original data, which had more than two features, underwent a dimensionality reduction using Principal Component Analysis (PCA). The first two principal components from the PCA accounted for about 51.07% and 48.93% of the variance, respectively. The plotted data points represent individual data samples, with blue circles indicating class '0' and orange circles indicating class '1'. The shaded regions reflect the classifications made by the Decision Tree: the peach area denotes class '0' and the grey area denotes class '1'. The non-linear boundaries dividing these regions exemplify the tree's ability to capture intricate patterns in the data. This plot provides an intuitive visual representation of how the Decision Tree model differentiates between the two classes in this reduced feature space.

The two visualizations depict decision boundaries for linear models (Logistic Regression and SVM) and a non-linear model (Decision Tree) trained on a 2-dimensional dataset derived via PCA. The linear models exhibit a straight-line boundary, indicating a simplistic distinction between the two classes. In contrast, the Decision Tree presents a more intricate, non-linear decision boundary, capturing more complex patterns within the data. While linear models rely on a straightforward distinction, often yielding generalized interpretations, the Decision Tree's ability to form segmented areas showcases its adaptability to underlying data structures, but might suggest a higher susceptibility to overfitting. The choice between linear and non-linear models depends on the data's nature and the desired trade-off between interpretability and model flexibility.
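A condensed sketch of how such decision-boundary plots can be generated, reusing the class_weights and X_train_scaled from above; the grid resolution and model settings are illustrative:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Project the scaled features onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_train_scaled)
print(pca.explained_variance_ratio_)

# Fit a model on the 2D projection and shade its decision regions
clf = LogisticRegression(class_weight=class_weights).fit(X_2d, y_train)
xx, yy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train, s=10)
plt.show()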


Machine learning pipeline for a Decision Tree classifier


Through this pipeline, categorical data is first transformed into numeric format, new features are generated, specific columns are scaled for normalization, and finally, the Decision Tree model is trained. The model considers class imbalance by assigning different weights to classes. When evaluated on validation data, the classifier shows good recall for predicting customers who exited the bank, albeit at the cost of precision, indicating it might be identifying too many false positives for that category.
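A skeleton of such a pipeline; the feature-generation step (a custom transformer in the full project) is omitted here, and max_depth is illustrative:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

num_cols = ['CreditScore', 'Age', 'Balance']
pipe = Pipeline([
    # Scale the selected numeric columns; pass the remaining
    # (already-encoded) columns through unchanged
    ('scale', ColumnTransformer([('num', StandardScaler(), num_cols)],
                                remainder='passthrough')),
    ('tree', DecisionTreeClassifier(class_weight=class_weights, max_depth=5)),
])
pipe.fit(X_train, y_train)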

Next, we do a spot-check and evaluate the performance of several machine learning models using k-fold cross-validation on a banking dataset. These models are housed within a "model zoo" which includes models like Random Forest (RF), Light Gradient Boosting Machine (LGBM), XGBoost (XGB), k-Nearest Neighbors (kNN), and various Naive Bayes models (Gaussian, Multinomial, Complement, and Bernoulli).

Initially, the training data is prepared and class weights are calculated to account for any class imbalance in the target variable. Certain features like 'CreditScore', 'Age', and 'Balance' are identified for scaling later on.

The models selected predominantly fall under the tree model category. Tree models, including RF, LGBM, and XGB, are often chosen for their ability to capture non-linear relationships and provide feature importance. However, the code also considers kNN and Naive Bayes models.

Automated pipelines are set up for each model. If a model requires feature scaling, like kNN, then the pipeline will include a scaling step; otherwise, this step is omitted for models like tree-based or Naive Bayes models which do not require feature scaling.

Subsequently, a cross-validation approach is employed to evaluate each model's performance based on recall and F1-score metrics. The results are then printed out for each model under both metrics.
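A sketch of the spot-check loop over a subset of the model zoo (Multinomial and Complement Naive Bayes are omitted here since they require non-negative inputs):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

model_zoo = {
    'RF': RandomForestClassifier(class_weight=class_weights),
    'LGBM': LGBMClassifier(class_weight=class_weights),
    'XGB': XGBClassifier(scale_pos_weight=class_weights[1]),
    # kNN is distance-based, so its pipeline includes a scaling step
    'kNN': Pipeline([('scale', StandardScaler()),
                     ('knn', KNeighborsClassifier())]),
    'GNB': GaussianNB(),
    'BNB': BernoulliNB(),
}
for name, model in model_zoo.items():
    for metric in ('recall', 'f1'):
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring=metric)
        print(f'{name} {metric}: {scores.mean():.3f} (+/- {scores.std():.3f})')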


Upon evaluating the models, LightGBM emerges as the preferred choice for further hyperparameter tuning. This decision is grounded in its performance: it showcased the highest recall and was a close second in terms of F1-score. Recall, which measures the proportion of actual positives correctly identified, is especially crucial in contexts where missing a positive instance can have significant implications, like in banking scenarios where predicting customer churn is vital.

Hyperparameter Tuning:

  1. We initially performed hyperparameter tuning using RandomizedSearchCV over multiple hyperparameters of the classifier.
  2. The objective was to optimize for the F1 metric.
  3. The randomized search used 20 iterations (n_iter = 20) with 5-fold cross-validation (cv = 5).
  4. The best parameters obtained from the search were: reg_lambda: 5, reg_alpha: 1, num_leaves: 31, n_estimators: 201, max_depth: 4, learning_rate: 0.5, colsample_bytree: 0.3, class_weight: {0: 1, 1: 1.96}.
  5. The best F1-score achieved with these parameters was approximately 0.6822.

We then ran a grid search with cross-validation (GridSearchCV) to further tune the hyperparameters:

  1. Best Parameters: class_weight: {0: 1, 1: 3.0}, colsample_bytree: 0.6, learning_rate: 0.1, max_depth: 6, n_estimators: 201, num_leaves: 63, reg_alpha: 1, reg_lambda: 1.
  2. Best Score: The F1-score of the best model after the grid search was approximately 0.6827.
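A sketch of the randomized search; the parameter grids below are assumptions, while the search settings (n_iter=20, cv=5, scoring='f1') match the text:

from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMClassifier

param_dist = {
    'n_estimators': [101, 201, 301],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.5],
    'num_leaves': [31, 63, 127],
    'colsample_bytree': [0.3, 0.6, 1.0],
    'reg_alpha': [0, 1, 5],
    'reg_lambda': [0, 1, 5],
    'class_weight': [{0: 1, 1: w} for w in (1.0, 1.96, 3.0, 3.93)],
}
search = RandomizedSearchCV(LGBMClassifier(), param_dist, n_iter=20,
                            scoring='f1', cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)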

Training Final Best Model and Saving for Deployment:

The boxplot visualizes the predicted churn probabilities generated by the final LightGBM model for two categories: 0 (customers who did not churn) and 1 (customers who churned). For customers in category '0', the predicted probabilities are mostly clustered around the lower end, indicating a lower likelihood of churning. For the '1' category, the probabilities are notably higher, reflecting the model's prediction of a higher churn risk. The boxplot's range and interquartile spread further indicate the variance in predictions, and the presence of outliers, especially in the '0' category, hints at instances where the model is less certain.
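Training the final model with the grid-search parameters and persisting it for deployment is then straightforward; a sketch using joblib (the filename is illustrative):

import joblib
from lightgbm import LGBMClassifier

best_model = LGBMClassifier(class_weight={0: 1, 1: 3.0}, colsample_bytree=0.6,
                            learning_rate=0.1, max_depth=6, n_estimators=201,
                            num_leaves=63, reg_alpha=1, reg_lambda=1)
best_model.fit(X_train, y_train)
joblib.dump(best_model, 'final_churn_model.joblib')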

SHAP:

This plot represents the average impact of different features on the model's predictions using SHAP (SHapley Additive exPlanations) values. Each row corresponds to a feature, and the length of the bar indicates the average magnitude of that feature's impact on the model's output. The colors differentiate between the two classes: Class 0 (customers who did not churn) and Class 1 (customers who churned).

For example, the feature "Age" has a significant impact on the model's predictions. The longer red bar indicates that as age increases, it tends to push the model's prediction towards classifying a customer as likely to churn (Class 1). Conversely, the blue bar indicates the influence of the feature in predicting that a customer will not churn (Class 0).
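A minimal sketch of producing this summary plot with the shap library (depending on the shap version, shap_values may be a list with one array per class):

import shap

explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
shap.summary_plot(shap_values, X_train, plot_type='bar')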

Some key takeaways from the plot:

  1. Age is the most influential feature, suggesting older customers are more likely to churn.
  2. NumOfProducts and Surname_enc also have strong influences on the model's predictions.
  3. IsActiveMember has a noticeable impact, especially in predicting customers who will not churn.
  4. Features like country_Germany, country_France, and country_Spain show the model's sensitivity to geographical information, potentially indicating different churn behaviors in various countries.
  5. The bottom features, such as Tenure and HasCrCard, have relatively minimal influence on the model's predictions, suggesting they might be less critical in determining customer churn.

In essence, this visualization provides a ranked overview of which features are most impactful in the model's decision-making process, aiding in understanding and interpreting the model's behavior.


Model's performance on unseen data:

Observations:

  1. Class 0 (No Churn):

  • The median predicted probability for Class 0 (blue box) is quite low, close to 0. This indicates that for most of the actual non-churners, the model is confident in predicting a low churn probability.
  • The interquartile range (IQR), represented by the box's height, is small, suggesting a tight distribution of predictions around the median for this class.
  • There are some outliers above the box, which implies that for a few instances the model predicted higher probabilities of churning even though those customers did not churn.

  2. Class 1 (Churn):

  • The median predicted probability for Class 1 (orange box) is closer to 1, meaning the model, on average, is confidently predicting higher churn probabilities for those who did churn.
  • The IQR for Class 1 is broader than that of Class 0. This indicates a wider variability in the predicted probabilities for actual churners.
  • The whisker below the box stretches down, suggesting that for some churners, the model was less confident (predicted lower probabilities).

Comparison with Model's Performance:

  • Accuracy: The separation between the two boxes is relatively clear, which indicates that the model can differentiate between the two classes effectively. This suggests good accuracy.
  • Recall: Considering that there's a whisker stretching down in the Class 1 box, the model might have misclassified some of the actual churners as non-churners. This could be a concern if ensuring high recall (identifying all churners) is a priority.
  • Precision: The outliers in the Class 0 box indicate that the model might have predicted a few non-churners as churners, affecting precision.

In conclusion, the model seems to perform well on unseen future data as it can effectively distinguish between churners and non-churners. However, there are instances where the model was less confident or possibly made misclassifications. This insight is valuable as it can guide further fine-tuning of the model or influence business decisions based on the model's predictions.

Model Evaluation based on initial business goals:

After our diligent data science efforts, we've made promising strides in developing a predictive model for identifying potential churn customers. Evaluating our model against the initial benchmarks:

  1. Recall: Our model achieved a recall rate of 68%, nearing our target of 70%. This is a strong indication that we're on the right track, capturing a significant majority of customers who are likely to churn.
  2. Precision: With a precision rate of 53%, we've made considerable progress, though there's a gap from our 70% target. However, it's important to note that more than half of the customers flagged by our model as potential churn risks are indeed at risk.
  3. F1-Score: Our model's F1-score stands at 60%, a commendable achievement, signifying a reasonably balanced trade-off between precision and recall.

Strategic User Segmentation:

Segmentation allows for a personalized approach. For example, if data indicates that males from Germany with active memberships and credit cards are prone to churn, prioritizing this segment can lead to a higher ROI. Tailoring strategies according to data-driven insights results in more effective customer retention efforts.

Key Takeaways:

  1. Addressing Data Drift: Over time, data changes. Ensuring regular updates to the model can counteract discrepancies between training data and real-world data.
  2. The Power of Incremental Updates: Constantly evolving the model with new data, rather than full retraining, ensures its relevance and efficiency.
  3. Consistency Across Environments: Maintain a similar environment for both training and deployment to prevent potential model discrepancies.
  4. Focus on Core Metrics: Keeping an eye on vital metrics is paramount to gauge the model's impact on business outcomes.
  5. Granular Performance Monitoring: Track the model's performance not just overall, but also for specific segments to refine strategies effectively.
  6. Visual Communication: Illustrate the potential of the model to stakeholders. A compelling visualization indicating a 30-40% reduction in churn rates, for instance, emphasizes its significance.


Conclusion:

Throughout this portfolio, we've delved deep into the intricacies of customer churn prediction, providing a comprehensive look at data analysis, modeling, and actionable business insights. We've recognized the importance of precision, recall, and F1-score in determining a model's effectiveness, ensuring our strategies align with key business objectives. Our journey underscored the significance of data-driven decision-making, and how it can lead to tangible, positive business outcomes when combined with a deep understanding of user segmentation and business priorities. By also highlighting potential challenges in deploying models and the necessity of continuous monitoring and adaptation, we've provided a holistic view of the entire data science process.

As we wrap up this project, it stands as a testament to the power of data science in driving business forward, showcasing a perfect blend of technical prowess and business acumen. Here's to leveraging these insights for future endeavors and to the relentless pursuit of excellence in the dynamic field of data science!

