Bank marketing campaigns analysis with Machine Learning.
Abstract
This data-set describes the results of a Portuguese bank's marketing campaigns. The campaigns were based mostly on direct phone calls offering the bank's clients a term deposit. If, after all marketing efforts, the client agreed to place a deposit, the target variable is marked 'yes', otherwise 'no'.
Source of the data https://archive.ics.uci.edu/ml/datasets/bank+marketing
Data-set description https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-data-set-description
Citation Request:
This data-set is publicly available for research. The details are described in S. Moro, P. Cortez and P. Rita. "A Data-Driven Approach to Predict the Success of Bank Telemarketing." Decision Support Systems, Elsevier, 62:22-31, June 2014.
Task
- Predict the future results of marketing campaigns based on available statistics and, accordingly, formulate recommendations for such campaigns.
- Build a profile of a consumer of banking services (deposits).
- Make recommendations for future campaigns.
Approach
The following steps will be performed to complete the task:
- Loading the data and holding a short Exploratory Data Analysis (EDA).
- Formulating hypotheses regarding individual factors (features) for conducting correct data cleaning and data preparation for modeling.
- Choosing the evaluation metrics.
- Building a pipeline for Cross Validation and Grid Search procedures (search for optimal parameters of the model)
- Choosing the most effective model ** and building its learning curve.
- Formulation of conclusions.
** We intentionally use the most basic machine learning models to keep the solution easy to understand.
Feature description
Bank client data:
- 1 - age (numeric)
- 2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
- 3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
- 4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
- 5 - default: has credit in default? (categorical: 'no','yes','unknown')
- 6 - housing: has housing loan? (categorical: 'no','yes','unknown')
- 7 - loan: has personal loan? (categorical: 'no','yes','unknown')
Related with the last contact of the current campaign:
- 8 - contact: contact communication type (categorical: 'cellular','telephone')
- 9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
- 10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
- 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other attributes:
- 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
- 14 - previous: number of contacts performed before this campaign and for this client (numeric)
- 15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
Social and economic context attributes:
- 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
- 17 - cons.price.idx: consumer price index - monthly indicator (numeric)
- 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
- 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
- 20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
- 21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
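The notes in the feature description (the 999 sentinel in pdays, the warning about duration, the 'yes'/'no' target) can be turned into a small preparation step. The following is a minimal sketch on a hand-made sample, not the code of the original analysis; the helper name `prepare_basic`, the `previously_contacted` column and the sample values are assumptions made for illustration. Note that the analysis described here keeps a transformed duration, whereas this sketch follows the data-set note and drops it when `realistic=True`.

```python
import pandas as pd

# Tiny hand-made sample using column names from the feature description;
# the values are illustrative and not taken from the real data-set.
sample = pd.DataFrame({
    "age": [56, 37, 40, 25],
    "duration": [261, 0, 151, 307],
    "pdays": [999, 999, 6, 999],
    "poutcome": ["nonexistent", "nonexistent", "success", "nonexistent"],
    "y": ["no", "no", "yes", "yes"],
})


def prepare_basic(df: pd.DataFrame, realistic: bool = True) -> pd.DataFrame:
    """Encode the target and handle the special cases noted in the description."""
    out = df.copy()
    # Target: 'yes' -> 1, 'no' -> 0.
    out["y"] = out["y"].map({"no": 0, "yes": 1})
    # pdays == 999 is a sentinel meaning "not previously contacted".
    out["previously_contacted"] = (out["pdays"] != 999).astype(int)
    # duration is only known after the call, so drop it for a realistic model.
    if realistic:
        out = out.drop(columns=["duration"])
    return out


prepared = prepare_basic(sample)
print(prepared)
```

Passing `realistic=False` keeps duration, which matches the benchmark use mentioned in the feature description.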
1. Explore categorical features (EDA)
A primary analysis of several categorical features reveals:
- Administrative staff and technical specialists opened the most deposits in absolute terms. In relative terms, a high proportion of pensioners and students is also worth mentioning.
- Although in absolute terms married consumers agreed to the service more often, in relative terms single consumers responded better.
- The best communication channel is the mobile phone.
- There is an evident difference in response for consumers who already use the bank's services and have received a loan.
- Home ownership does not greatly affect marketing campaign performance.
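The distinction between absolute and relative response used in the findings above can be sketched as a group-by over one categorical feature. This is an illustrative example on made-up rows, not the original notebook code; the helper name `response_by_category` and the sample values are assumptions.

```python
import pandas as pd

# Made-up rows: married clients are more numerous, single clients respond better.
sample = pd.DataFrame({
    "marital": ["married"] * 8 + ["single"] * 3,
    "y": ["yes"] * 3 + ["no"] * 5 + ["yes", "yes", "no"],
})


def response_by_category(df: pd.DataFrame, feature: str, target: str = "y") -> pd.DataFrame:
    """Return absolute subscriptions and the subscription rate per category."""
    flags = df.assign(_yes=df[target].eq("yes"))
    summary = flags.groupby(feature)["_yes"].agg(["sum", "count"])
    summary.columns = ["subscribed", "contacted"]
    summary["rate"] = summary["subscribed"] / summary["contacted"]
    return summary.sort_values("rate", ascending=False)


summary = response_by_category(sample, "marital")
print(summary)
```

In this sample, married clients have more subscriptions in absolute terms, while single clients have the higher rate, which is the pattern described for the marital feature.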
Explore numerical features (EDA)
From the correlation matrix we observe the following:
- The feature most correlated with the target is call duration, so we need to transform it to reduce its influence.
- Highly correlated features (employment rate, consumer confidence index, consumer price index) may describe the clients' state from different socio-economic angles. Their variance might support the model's capacity for generalization.
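A correlation check of this kind can be sketched with pandas. The data below is synthetic and generated only so that the example runs on its own; the relationships between the columns (a target driven by duration, emp.var.rate tied to euribor3m) are assumptions built into the generator, not results from the real data-set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic columns named after features in the data-set description.
duration = rng.exponential(250, n)
euribor3m = rng.uniform(0.6, 5.0, n)
emp_var_rate = 0.8 * euribor3m - 2.5 + rng.normal(0, 0.2, n)
age = rng.integers(18, 80, n)
# Assumed target: longer calls are more likely to end with 'yes' (1).
y = (duration + rng.normal(0, 100, n) > 400).astype(int)

df = pd.DataFrame({
    "age": age,
    "duration": duration,
    "euribor3m": euribor3m,
    "emp.var.rate": emp_var_rate,
    "y": y,
})

corr = df.corr()
# Rank features by the absolute value of their correlation with the target.
target_corr = corr["y"].drop("y").sort_values(key=lambda s: s.abs(), ascending=False)
print(target_corr)
print(corr.loc["euribor3m", "emp.var.rate"])
```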
2. Formulating hypotheses regarding individual factors (features) for conducting correct data cleaning and data preparation for modeling
Data cleaning strategy
Since categorical variables dominate the data-set and there are no more than 4 weakly correlated numeric variables, we need to transform the categorical variables to increase the model's ability to generalize (we cannot drop them).
Particular attention should be paid to the duration feature and to categories that can be treated as binary. This suggests using binning for the former and a simple 0/1 transformation for the latter.
For categories with more than 3 possible options (marital and education), target encoding is proposed: it relates the category values to the target variable and lets us use these categories in numerical form.
In some cases, re-scaling is proposed to normalize the data.
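The four transformations named in this strategy (0/1 mapping, binning, target encoding, re-scaling) might look roughly as follows. This is a sketch on a hand-made training sample; the bin edges, the new column names (`duration_bin`, `marital_te`, `age_scaled`) and the choice of `StandardScaler` are assumptions for illustration and are not taken from the original code.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hand-made training sample; values are illustrative only.
train = pd.DataFrame({
    "contact": ["cellular", "telephone", "cellular", "cellular", "telephone", "cellular"],
    "marital": ["married", "single", "single", "married", "divorced", "single"],
    "duration": [30, 95, 410, 220, 60, 780],
    "age": [41, 29, 33, 52, 47, 25],
    "y": [0, 0, 1, 0, 0, 1],
})

# 1) Binary category -> 0/1.
train["contact"] = train["contact"].map({"telephone": 0, "cellular": 1})

# 2) Bin duration into ordinal buckets (edges in seconds are assumed here).
bins = [-1, 100, 300, 600, float("inf")]
train["duration_bin"] = pd.cut(train["duration"], bins=bins, labels=False)

# 3) Target encoding: replace each category with the mean target observed
#    for it in the training data. To avoid leakage, these means should be
#    computed on training folds only and then applied to validation data.
marital_means = train.groupby("marital")["y"].mean()
train["marital_te"] = train["marital"].map(marital_means)

# 4) Re-scale a numeric feature to zero mean and unit variance.
train["age_scaled"] = StandardScaler().fit_transform(train[["age"]]).ravel()

print(train)
```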
3. The choice of evaluation metrics
It is proposed to use the roc_auc* metric to evaluate the different models, while also monitoring how the accuracy metric changes.
This approach allows us to explore the models from different angles.
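A small example may show why the two metrics are worth watching together. With an imbalanced target, a model that always answers 'no' can reach high accuracy while its roc_auc stays at chance level. The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced toy target: 9 negatives, 1 positive.
y_true = np.array([0] * 9 + [1])

# Model A gives the same score to everyone (effectively "always no").
scores_a = np.zeros(10)
# Model B ranks the positive client highest, but never crosses 0.5.
scores_b = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3, 0.45])

# Accuracy uses hard predictions at a 0.5 threshold; roc_auc uses the ranking.
acc_a = accuracy_score(y_true, (scores_a >= 0.5).astype(int))
acc_b = accuracy_score(y_true, (scores_b >= 0.5).astype(int))
auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

print(f"model A: accuracy={acc_a:.2f}, roc_auc={auc_a:.2f}")
print(f"model B: accuracy={acc_b:.2f}, roc_auc={auc_b:.2f}")
```

Both models have the same accuracy here, yet only model B separates the classes, which roc_auc makes visible.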
4. Building a pipeline for Cross Validation and Grid Search procedures (search for optimal parameters of the model)
See code here https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-dataset-analysis#4.-Building-a-pipline-for-Cross-Validation-and-Grid-Search-procedures-(search-for-optimal-parameters-of-the-model)
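For readers who do not follow the link, the general shape of such a pipeline in scikit-learn is sketched below. It is not the linked notebook's code: the synthetic data from `make_classification`, the step names, the parameter grid and the 3-fold split are all assumptions chosen to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the prepared bank data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {
    "model__n_estimators": [20, 40],
    "model__max_depth": [3, None],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means it is re-fitted on each training fold, so the validation folds do not influence the scaling.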
5. The choice of the most effective model
Our best performing model by the roc_auc* metric (0.9269) is Random Forest. This classifier achieves an accuracy of 0.903, which is about the average accuracy among all classifiers (0.904).
We can plot the Random Forest Classifier performance against the OOB** score to make sure that the critical hyper-parameter was selected correctly during Grid Search. As you can see, the results are almost the same: 80 estimators give the best roc_auc score and 90 estimators give the maximum OOB score.
* https://en.wikipedia.org/wiki/Receiver_operating_characteristic
** https://en.wikipedia.org/wiki/Out-of-bag_error
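The OOB check described above can be sketched as a loop over the number of estimators. The data is synthetic and the range of 30 to 100 trees is an assumption; the numbers this produces are unrelated to the 80 and 90 estimators reported for the real data-set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared bank data.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

oob_scores = {}
for n_estimators in range(30, 101, 10):
    forest = RandomForestClassifier(
        n_estimators=n_estimators,
        oob_score=True,   # score each sample using only trees that did not see it
        bootstrap=True,   # OOB estimates require bootstrap sampling
        random_state=0,
    )
    forest.fit(X, y)
    oob_scores[n_estimators] = forest.oob_score_

best_n = max(oob_scores, key=oob_scores.get)
print(oob_scores)
print("n_estimators with the highest OOB score:", best_n)
```

Comparing `best_n` with the value chosen by Grid Search gives the kind of sanity check described in the text.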
Let's look at the roc_auc graph.
The curve is well shaped, with a slight tendency toward the False Positive Rate axis. The roc_auc value of the best model, 0.9269, is high enough to support later assumptions about the data.
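The points of such a curve and the area under it can be computed as in the sketch below. The labels and scores are a made-up toy example, so the resulting area is not the 0.9269 reported for the real model.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9])

# Each threshold yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print("area under the curve:", area)
```

The `fpr` and `tpr` arrays are what a plotting library would draw; `roc_auc_score(y_true, y_score)` returns the same area directly.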
We can plot the feature importance of the Random Forest Classifier with the best roc_auc score.
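Reading feature importance from a fitted forest can be sketched as follows. The data is synthetic, and the names `feature_0` to `feature_5` are hypothetical placeholders for the real column names; only the first three columns are generated as informative. The 80 estimators mirror the value reported above for the best roc_auc score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative features are the first three columns.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=80, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking.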
6. Conclusions and recommendations.
This analysis can be carried out at the level of individual bank branches, as it does not require significant resources or special knowledge (the model itself can be launched automatically at regular intervals).
Similar micro-targeting could potentially increase the overall effectiveness of the entire marketing campaign.
What general recommendations can be offered for a successful marketing campaign in the future?
1. Take the timing of the campaign into account (May is the most effective month).
2. Increase the time of contact with customers (perhaps by formulating the goal of the campaign differently). Other means of communication can also be used.
3. Focus on specific categories. The model shows that students and senior citizens respond better to this proposal.
4. It is imperative to form target groups based on socio-economic categories. Age, income level (not always high) and profession can accurately determine the marketing profile of a potential client.
Given these factors, it is recommended to concentrate on those consumer groups that are potentially more promising.
Concentrating the bank's efforts in this way will direct the campaign's resources to the main factor, the bank's contact time with the client, which affects conversion most of all.
--------------------------------------------------------------------------------------------------------------
This study could be continued by forming a clear client profile (by age, gender, income and other factors) and by adapting the product itself (the deposit) to a specific category.
------------------------------------------------------------------------------------------------------------
See all code on Kaggle https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-dataset-analysis