How to explore a dataset with correlationfunnel and gather relevant insights to prepare data for analysis
Although an integral part of any Data Science project and crucial to the success of the analysis, Exploratory Data Analysis (EDA) can be an incredibly labour-intensive and time-consuming process. Recent years have seen a proliferation of approaches and libraries aimed at speeding up the process, and in this article I’m going to sample one of the “new kids on the block” (Matt Dancho’s correlationfunnel) and combine its results with a more “traditional” EDA.
Project Structure
This article is the first part of a project where I’m analysing the results of a bank direct marketing campaign to sell term deposits in order to identify what type of customer is more likely to respond. The marketing campaigns were based on phone calls and more than one contact to the same client was required at times.
First, I am going to carry out an extensive data exploration and use the results and insights to prepare the data for analysis.
Then, I’ll estimate a number of models and assess their performance and fit to the data using a model-agnostic methodology that makes it possible to compare traditional “glass-box” models and “black-box” models.
Last, I’ll fit one final model that combines findings from the exploratory analysis and insight from models’ selection and use it to run a revenue optimisation.
The data
library(tidyverse)
library(data.table)
library(skimr)
library(correlationfunnel)
library(GGally)
library(ggmosaic)
library(knitr)
The Data is the Portuguese Bank Marketing set from the UCI Machine Learning Repository and describes the direct marketing campaigns carried out by a Portuguese banking institution aimed at selling term deposits/certificate of deposits to their customers. The marketing campaigns were based on phone calls to potential buyers from May 2008 to November 2010.
Of the four variants of the dataset available on the UCI repository, I’ve chosen bank-additional-full.csv, which contains 41,188 examples with 21 different variables (10 continuous, 10 categorical, plus the target variable). A full description of the variables is provided in the appendix.
In particular, the target subscribed is a binary response variable indicating whether the client subscribed (‘yes’, or numeric value 1) to a term deposit or not (‘no’, or numeric value 0), which makes this a binary classification problem.
Loading data and initial inspection
The data I’m using (bank-direct-marketing.csv) is a modified version of the full set mentioned earlier and can be found on my GitHub profile. As it contains lots of double quotation marks, some manipulation is required to get it into a usable format.
First, I load each row into one string
data_raw <- data.table::fread(
  file = "../01_data/bank_direct_marketing_modified.csv",
  # use character NOT present in data so each row collapses to a string
  sep = '~',
  quote = '',
  # include headers as first row
  header = FALSE
)
Then, I clean the data by removing the double quotation marks, splitting the row strings into the 21 individual variables, and moving the target variable subscribed to the left-hand side as the first variable in the data set.
data_clean <-
  # remove all double quotation marks "
  as_tibble(sapply(data_raw, function(x) gsub("\"", "", x))) %>%
  # split out into 21 variables
  separate(col = V1,
           into = c('age', 'job', 'marital', 'education', 'default',
                    'housing', 'loan', 'contact', 'month', 'day_of_week',
                    'duration', 'campaign', 'pdays', 'previous', 'poutcome',
                    'emp_var_rate', 'cons_price_idx', 'cons_conf_idx',
                    'euribor3m', 'nr_employed', 'subscribed'),
           # using semicolon as separator
           sep = ";",
           # to drop original field
           remove = TRUE) %>%
  # drop first row, which contains the original headers
  slice((nrow(.) - 41187):nrow(.)) %>%
  # move target variable subscribed to be first variable in data set
  select(subscribed, everything())
Initial Data Manipulation
Let’s have a look!
All variables are set as character and some need adjusting.
data_clean %>% glimpse()
## Observations: 41,188
## Variables: 21
## $ subscribed     <chr> "no", "no", "no", "no", "no", "no", ...
## $ age            <chr> "56", "57", "37", "40", "56", "45", ...
## $ job            <chr> "housemaid", "services", "services", ...
## $ marital        <chr> "married", "married", "married", ...
## $ education      <chr> "basic.4y", "high.school", "high.school", ...
## $ default        <chr> "no", "unknown", "no", "no", "no", "no", ...
## $ housing        <chr> "no", "no", "yes", "no", "no", "no", ...
## $ loan           <chr> "no", "no", "no", "no", "yes", "no", ...
## $ contact        <chr> "telephone", "telephone", "telephone", ...
## $ month          <chr> "may", "may", "may", "may", "may", "may", ...
## $ day_of_week    <chr> "mon", "mon", "mon", "mon", "mon", "mon", ...
## $ duration       <chr> "261", "149", "226", "151", "307", "198", ...
## $ campaign       <chr> "1", "1", "1", "1", "1", "1", "1", "1", ...
## $ pdays          <chr> "999", "999", "999", "999", "999", "999", ...
## $ previous       <chr> "0", "0", "0", "0", "0", "0", "0", "0", ...
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexi...
## $ emp_var_rate   <chr> "1.1", "1.1", "1.1", "1.1", "1.1", "1....
## $ cons_price_idx <chr> "93.994", "93.994", "93.994", "93.994", ...
## $ cons_conf_idx  <chr> "-36.4", "-36.4", "-36.4", "-36.4", ...
## $ euribor3m      <chr> "4.857", "4.857", "4.857", "4.857", "4.857", ...
## $ nr_employed    <chr> "5191", "5191", "5191", "5191", "5191", ...
I’ll start by setting the variables that are continuous in nature to numeric and changing pdays 999 to 0 (999 means the client was not previously contacted). I’m also shortening the level names of some categorical variables to ease visualisations.
Note that, although numeric in nature, campaign is more of a categorical variable so I am leaving it as a character.
data_clean <- data_clean %>%
  # recode the majority class as 0 and the minority class as 1
  mutate(subscribed = case_when(subscribed == 'no' ~ 0,
                                TRUE ~ 1) %>% as_factor()) %>%
  # change continuous variables to type double
  mutate_at(c('age', 'duration', 'pdays', 'previous', 'emp_var_rate',
              'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed'),
            as.double) %>%
  # change pdays 999 to 0 (zero)
  mutate(pdays = case_when(pdays == 999 ~ 0,
                           TRUE ~ pdays),
         # shorten level names of some categorical vars to ease visualisations
         job = case_when(
           job == 'housemaid'     ~ 'maid',
           job == 'services'      ~ 'svcs',
           job == 'admin.'        ~ 'adm',
           job == 'blue-collar'   ~ 'bcol',
           job == 'technician'    ~ 'tech',
           job == 'retired'       ~ 'ret',
           job == 'management'    ~ 'mgmt',
           job == 'unemployed'    ~ 'uemp',
           job == 'self-employed' ~ 'self',
           job == 'unknown'       ~ 'unk',
           job == 'entrepreneur'  ~ 'entr',
           TRUE                   ~ 'stdn'),
         marital = case_when(
           marital == 'married'  ~ 'mar',
           marital == 'single'   ~ 'sig',
           marital == 'divorced' ~ 'div',
           TRUE                  ~ 'unk'),
         education = case_when(
           education == 'basic.4y'            ~ '4y',
           education == 'basic.6y'            ~ '6y',
           education == 'basic.9y'            ~ '9y',
           education == 'high.school'         ~ 'hs',
           education == 'professional.course' ~ 'crse',
           education == 'unknown'             ~ 'unk',
           education == 'university.degree'   ~ 'uni',
           TRUE                               ~ 'ilt'),
         default = case_when(
           default == 'unknown' ~ 'unk',
           default == 'yes'     ~ 'yes',
           TRUE                 ~ 'no'),
         contact = case_when(
           contact == 'telephone' ~ 'tel',
           contact == 'cellular'  ~ 'mob'),
         poutcome = case_when(
           poutcome == 'nonexistent' ~ 'non',
           poutcome == 'failure'     ~ 'fail',
           TRUE                      ~ 'scs'),
         housing = case_when(
           housing == 'unknown' ~ 'unk',
           housing == 'yes'     ~ 'yes',
           TRUE                 ~ 'no'),
         loan = case_when(
           loan == 'unknown' ~ 'unk',
           loan == 'yes'     ~ 'yes',
           TRUE              ~ 'no')
  )
There are no missing values in any of the variables (continuous or categorical) in this data set. For that reason, no imputation is necessary.
data_clean %>% skimr::skim()

-- Data Summary ------------------------
                           Values
Name                       Piped data
Number of rows             41188
Number of columns          21
_______________________
Column type frequency:
  character                11
  factor                   1
  numeric                  9
________________________
Group variables            None

-- Variable type: character ---------------------------------------------
skim_var     n_missing  complete  min  max  empty  n_unique  whitespace
job          0          1         3    4    0      12        0
marital      0          1         3    3    0      4         0
education    0          1         2    4    0      8         0
default      0          1         2    3    0      3         0
housing      0          1         2    3    0      3         0
loan         0          1         2    3    0      3         0
contact      0          1         3    3    0      2         0
month        0          1         3    3    0      10        0
day_of_week  0          1         3    3    0      5         0
campaign     0          1         1    2    0      42        0
poutcome     0          1         3    4    0      3         0

-- Variable type: factor ------------------------------------------------
skim_var    n_missing  complete  ordered  n_unique  top_counts
subscribed  0          1         FALSE    2         0: 36548, 1: 4640

-- Variable type: numeric -----------------------------------------------
skim_var        n_missing  complete  mean     sd      p0      p25     p50     p75     p100    hist
age             0          1         40.02    10.42   17.00   32.00   38.00   47.00   98.00   ▅▇▃▁▁
duration        0          1         258.29   259.28  0.00    102.00  180.00  319.00  4918.0  ▇▁▁▁▁
pdays           0          1         0.22     1.35    0.00    0.00    0.00    0.00    27.00   ▇▁▁▁▁
previous        0          1         0.17     0.49    0.00    0.00    0.00    0.00    7.00    ▇▁▁▁▁
emp_var_rate    0          1         0.08     1.57    -3.40   -1.80   1.10    1.40    1.40    ▁▃▁▁▇
cons_price_idx  0          1         93.58    0.58    92.20   93.08   93.75   93.99   94.77   ▁▆▃▇▂
cons_conf_idx   0          1         -40.50   4.63    -50.80  -42.70  -41.80  -36.40  -26.90  ▅▇▁▇▁
euribor3m       0          1         3.62     1.73    0.63    1.34    4.86    4.96    5.04    ▅▁▁▁▇
nr_employed     0          1         5167.04  72.25   4963.6  5099.1  5191.0  5228.1  5228.1  ▁▁▃▁▇
NOTE: I’ve left all categorical variables as unordered because h2o does not support ordered categorical variables.
correlationfunnel
correlationfunnel is a package developed with the aim of speeding up Exploratory Data Analysis (EDA), a process that can be very time consuming even for small data sets.
With 3 simple steps we can produce a graph that arranges predictors top to bottom in descending order of absolute correlation with the target variable. Features at the top of the funnel are expected to have stronger predictive power in a model.
This approach offers a quick way to identify a hierarchy of expected predictive power for all variables and gives an early indication of which predictors should feature strongly/weakly in any model.
data_clean %>%
  # turn numeric and categorical features into binary data
  binarize(n_bins = 4,           # bins for converting continuous features to discrete
           thresh_infreq = 0.01  # threshold for lumping categ. levels into "Other"
  ) %>%
  # correlate target variable to all other features in data set
  correlate(target = subscribed__1) %>%
  # correlation funnel visualisation
  plot_correlation_funnel()
Zooming in on the top 5 features we can see that certain characteristics have a greater correlation with the target variable (subscribing to the term deposit product) when:
- The duration of the last phone contact with the client is 319 seconds or longer
- The number of days that have passed since the client was last contacted is greater than 6
- The outcome of the previous marketing campaign was success
- The number of people employed (nr_employed) is 5,099 thousand or higher
- The value of the euribor3m rate is 1.344 or higher
Conversely, variables at the bottom of the funnel, such as day_of_week, housing, and loan, show very little variation compared to the target variable (i.e. they sit very close to the zero-correlation point for the response). For that reason, I’m not expecting these features to impact the response.
Features exploration
Guided by the results of this visual correlation analysis, I will continue to explore the relationship between the target and each of the predictors in the next section. For this I will enlist the help of the brilliant GGally library to visualise a modified version of the correlation matrix with ggpairs, and plot mosaic charts with the ggmosaic package, a great way to examine the relationship among two or more categorical variables.
Target Variable
First things first, the target variable: subscribed shows a strong class imbalance, with nearly 89% of responses in the ‘no’ category and about 11% in the ‘yes’ category.
data_clean %>%
  select(subscribed) %>%
  group_by(subscribed) %>%
  count() %>%
  ungroup() %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = subscribed, y = n, fill = subscribed)) +
  geom_col() +
  geom_text(aes(label = scales::percent(perc, accuracy = 0.1)),
            nudge_y = -2000,
            size = 4.5) +
  theme_minimal() +
  theme(legend.position = 'none',
        plot.title = element_text(hjust = 0.5)) +
  labs(title = 'Target Variable',
       x = 'Subscribed',
       y = 'Number of Responses')
I am going to address the class imbalance during the modelling phase by enabling re-sampling in h2o. This will rebalance the dataset by “shrinking” the prevalent class (‘no’, or 0) and ensure that the model adequately detects what variables are driving the ‘yes’ and ‘no’ responses.
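For reference, h2o estimators expose a balance_classes argument that takes care of this re-sampling. Here is a minimal sketch of what enabling it could look like; the GBM and its parameters are placeholders for illustration, not the final model specification.

library(h2o)
h2o.init()

# move the cleaned data onto the h2o cluster
train_h2o <- as.h2o(data_clean)

# balance_classes = TRUE re-samples the training data so that the
# two levels of subscribed are more evenly represented during training
fit_h2o <- h2o.gbm(
  y               = "subscribed",
  training_frame  = train_h2o,
  balance_classes = TRUE,
  seed            = 1975   # illustrative seed for reproducibility
)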
Predictors
Let’s continue with some of the numerical features:
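The pairs plots in this section are built with GGally::ggpairs, colouring by the response; a sketch of the call (the subset of columns shown here is my own selection for illustration):

data_clean %>%
  # a hand-picked subset of the numerical features
  select(subscribed, age, duration, pdays, previous) %>%
  ggpairs(mapping = aes(colour = subscribed, alpha = 0.5),
          title = 'Numerical features against subscribed')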
Although the correlation funnel analysis revealed that duration has the strongest expected predictive power, it is unknown before a call (it’s obviously known afterwards) and offers very little actionable insight or predictive value. Therefore, it should be discarded from any realistic predictive model and will not be used in this analysis.
age’s density plots, when split by the target variable, have very similar variance and are centred around the same area. For these reasons, it should not have a great impact on subscribed.
Despite being continuous in nature, pdays and previous behave in fact like categorical features and are both strongly right skewed. For these reasons, they will need to be discretised into groups, as sketched below. The two variables are also moderately correlated with each other, suggesting that they may capture the same behaviour.
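A possible discretisation with case_when could look like this; the cut points are an assumption of mine and would need validating at the modelling stage:

data_clean %>%
  mutate(
    # pdays: 0 means the client was not previously contacted
    # (after the earlier 999-to-0 recoding)
    pdays_bin = case_when(pdays == 0 ~ 'never',
                          pdays <= 6 ~ 'within_6_days',
                          TRUE       ~ 'over_6_days'),
    # previous: collapse the long right tail into a single level
    previous_bin = case_when(previous == 0 ~ 'none',
                             previous == 1 ~ 'once',
                             TRUE          ~ 'more_than_once')
  ) %>%
  # inspect how the observations fall into the new groups
  count(pdays_bin, previous_bin, sort = TRUE)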
Next, I visualise the bank client data with the mosaic charts:
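Each mosaic chart follows the same ggmosaic pattern; here is a sketch for job (the same template applies to the other client attributes):

data_clean %>%
  ggplot() +
  # product() defines the variable spanning the mosaic's x-axis
  geom_mosaic(aes(x = product(job), fill = subscribed)) +
  theme_minimal() +
  labs(title = 'Job vs Subscribed', x = 'Job', y = 'Subscribed')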
In line with the correlationfunnel findings, job, education, marital and default all show a good level of variation compared to the target variable, indicating that they would impact the response. In contrast, housing and loan sat at the very bottom of the funnel and are expected to have little influence on the target, given the small variation when split by “subscribed” response.
default has only 3 observations in the ‘yes’ level, which will be rolled into the ‘unknown’ level as they’re not enough to make a proper inference. Level ‘unknown’ of the housing and loan variables has a small number of observations and will be rolled into the second smallest category. Lastly, job and education would also benefit from grouping up of their least common levels, as sketched below.
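The forcats package makes this kind of regrouping straightforward; a sketch of how the lumping could be done (the number of levels I keep here is an assumption to be refined against the actual counts):

library(forcats)

data_clean %>%
  mutate(
    # fold the 3 'yes' observations of default into the 'unk' level
    default   = fct_other(default, keep = c('no', 'unk'),
                          other_level = 'unk'),
    # keep the most frequent levels, lump the rest into 'other'
    job       = fct_lump(job, n = 7, other_level = 'other'),
    education = fct_lump(education, n = 5, other_level = 'other')
  )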
Moving on to the other campaign attributes:
Although continuous in principle, campaign is more categorical in nature and strongly right skewed, and will need to be discretised into groups. However, we have learned from the earlier correlation analysis that it is not expected to be a strong driver of variation in any model.
On the other hand, poutcome is one of the attributes expected to have strong predictive power. The uneven distribution of levels would normally suggest rolling the least common level (success, or scs) into another category. However, contacting a client who previously purchased a term deposit is one of the characteristics with the highest predictive power, so this level needs to be left ungrouped.
Then, I’m looking at last contact information:
contact and month should impact the response variable as they both have a good level of variation compared to the target. month would also benefit from grouping up of least common levels.
In contrast, day_of_week does not appear to impact the response as there is not enough variation between the levels.
Last but not least, the social and economic attributes:
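The density plots for these attributes can be drawn by reshaping the five variables into long format and faceting; a sketch of the call, assuming tidyr’s pivot_longer:

data_clean %>%
  select(subscribed, emp_var_rate, cons_price_idx,
         cons_conf_idx, euribor3m, nr_employed) %>%
  # reshape to long format so all attributes can share one plot
  pivot_longer(cols = -subscribed,
               names_to = 'attribute', values_to = 'value') %>%
  ggplot(aes(x = value, fill = subscribed)) +
  geom_density(alpha = 0.5) +
  facet_wrap(~ attribute, scales = 'free') +
  theme_minimal() +
  labs(title = 'Social and Economic Attributes', x = NULL, y = 'Density')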
All social and economic attributes show a good level of variation compared to the target variable, which suggests that they should all impact the response. They all display a high degree of multi-modality and an uneven spread through the density plot, and will need to be binned.
It is also worth noting that, with the exception of the consumer confidence index, all other social and economic attributes are strongly correlated to each other, indicating that only one of them may need to be included in the model as they are all “picking up” a similar economic trend.
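A quick numerical check backs this up; GGally’s ggcorr gives a compact view of the pairwise correlations (a sketch):

data_clean %>%
  select(emp_var_rate, cons_price_idx, cons_conf_idx,
         euribor3m, nr_employed) %>%
  # pairwise Pearson correlations, printed on the tiles
  ggcorr(label = TRUE, label_round = 2)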
Summary of Exploratory Data Analysis
- Correlation analysis with correlationfunnel helped identify a hierarchy of expected predictive power for all variables
- duration has the strongest correlation with the target variable, whereas some of the bank client data, like housing and loan, shows the weakest correlation
- However, duration will NOT be used in the analysis as it is unknown before a call. As such it offers very little actionable insight or predictive value and should be discarded from any realistic predictive model
- The target variable subscribed shows a strong class imbalance, with nearly 89% of responses in the ‘no’ category, which will need to be addressed before the modelling analysis can begin
- Most predictors would benefit from grouping up of their least common levels
- Further feature exploration revealed that most social and economic context attributes are strongly correlated to each other, suggesting that only a selection of them could be considered in a final model
Code Repository
The full R code and all relevant data files can be found on my GitHub profile @ Propensity Modelling
References
- For the original paper that used the data set see: S. Moro, P. Cortez and P. Rita, A Data-Driven Approach to Predict the Success of Bank Telemarketing, Decision Support Systems (2014)
- To Speed Up Exploratory Data Analysis see: correlationfunnel Package Vignette
Originally published at https://diegousai.io on 14th January 2020