An Example of Applied Machine Learning: Will the client subscribe?

An Example of Applied Machine Learning: Will the client subscribe?

Introduction

Hello, everyone! Today I'm going to perform data analysis on a dataset related to bank's marketing and build a Random Forest classifier. You can find it on the UCI Machine Learning Repository.

The dataset?

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed.

There are four datasets:

  1. bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
  2. bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
  3. bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs).
  4. bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs).

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).

In this post, I will be using bank-full.csv.

Attribute Information

Although, UCI provided attribute information, they aren't accurate. I will list them here along with an explanation of each attribute and the possible values.

Bank client data:

  1. age (numeric)
  2. job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
  3. marital : marital status (categorical: 'divorced','married','single'; note: 'divorced' means divorced or widowed)
  4. education (categorical: 'primary', 'secondary', 'tertiary', 'unknown')
  5. default: has credit in default? (categorical: 'no','yes')
  6. housing: has housing loan? (categorical: 'no','yes')
  7. loan: has personal loan? (categorical: 'no','yes') ## Related with the last contact of the current campaign:
  8. contact: contact communication type (categorical: 'cellular','telephone')
  9. month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  10. day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. ## Other attributes:
  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

We notice that the attribute jobs and education have missing values (unknown). I'm going to fix these values eventually.

Importing libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings; warnings.simplefilter('ignore')
import seaborn as sns

%matplotlib inline

Viewing the dataset?

data = pd.read_csv("bank-full.csv", delimiter=";")
data.head()

Removing some columns

  1. As mentioned on the UCI site, 'duration' should be disregard if we want to build a good predictive model.
  2. The day and month aren't relevant anymore.
  3. Nowadays, the contact method (telephone, mobile) isn't going to affect anything.
  4. Poutcome should be removed since it contains many unknown values and is strongly related to the specific campaigns.
data.drop(['contact','day','month','duration','poutcome'], inplace=True, axis=1)

A few data insights

Balance, age, jobs and y?

It's good to know if the balance, the age and the job matters if the client will subscribe:

from scipy import stats, integrate
sns.distplot(data['age'], kde=False, fit=stats.gamma);


sns.jointplot(x="age", y="balance", data=data);

In [15]:

sns.boxplot(x="y", y="age", data=data);


In [18]:

sns.violinplot(x="y", y="balance", data=data);


In [22]:

sns.factorplot(x="y", y="age",col="job", data=data, kind="box", size=4, aspect=.5);


In [23]:

sns.factorplot(x="y", y="age",col="default", data=data, kind="box", aspect=.5);


A few conclusions

  1. Younger managers are more likely to subscribe.
  2. Older retired people are more likely to subscribe.
  3. Younger self-employed are more likely to subscribe.
  4. Older housemaids are more likely to subscribe.
  5. Younger students are more likely to subscribe.
  6. People with more balance are more likely to subscribe.
  7. In general, older people are more likely to subscribe. Although this depends on the job.
  8. People with no credit are more likely to subscribe.

Correlation heat map

What is correlation?

The term "correlation" refers to a mutual relationship or association between quantities. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example, the sales of a given product can increase if the company spends more money on advertisements. Now in order to deduce such relationships, I will build a heatmap of the correlation among all the vectors in the dataset.

I will use Pearson's method as it is the most popular method.

Seaborn's library give us perfect heatmaps to visualize the correlation.

The formula that is used is very simple:


where: n is the sample size, xi and yi are the samples and x (bar) is the mean.

In [24]:

#Plot correlation heat map
correlation = data.corr(method='pearson')
plt.figure(figsize=(25,10))
sns.heatmap(correlation, vmax=1, square=True,  annot=True ) 
plt.show()


A few conclusions:

Before anything, please note that this matrix is symmetric and the diagonals are all 1 because it's the correlation between the vector and itself (not to be confused with autocorrelation which is used in signals).

  1. There is a strong positive correlation between the age and the balance which makes sense.
  2. A strong correlation between the number of days that passed by after the client was last contacted from a previous campaign and the number of contacts before this campaign.
  3. There is an obvious correlation among the campaign, pdays and previous vectors.

Cleaning the data

This specific dataset doesn't have NaN values. However, it has 'unknown' values which is the same thing.

There are two columns that contain unknown values:

  1. Job
  2. Education

What I'm going to do is check the percentage of each class (yes or no) having unknown values in either the job or the eduction field (or both).

In [25]:

no = data.loc[data['y'] == 'no']
yes = data.loc[data['y'] == 'yes']
unknown_no = data.loc[((data['job'] == 'unknown')|(data['education'] == 'unknown'))&(data['y'] == 'no')]
unknown_yes = data.loc[((data['job'] == 'unknown')|(data['education'] == 'unknown'))&(data['y'] == 'yes')]

In [26]:

print('The percentage of unknown values in class no: ', float(unknown_no.count()[0]/float(no.count()[0]))*100)
print('The percentage of unknown values in class yes: ', float(unknown_yes.count()[0]/float(yes.count()[0]))*100)
('The percentage of unknown values in class no: ', 4.38354791844096)
('The percentage of unknown values in class yes: ', 5.067120438646247)

Since the percentage is roughly the same among both classes and it's 5%, the best method is to just drop the values to prevent false model and predictions.

In [27]:

data.drop(no, axis=1)
data.drop(yes, axis=1)

Encoding categorical values

Since classification algorithms (RF for example) take numerical values as input, we need to encode the categorical columns. The following columns need to be encoded:

  1. Marital
  2. Job
  3. Education
  4. Default
  5. Housing
  6. Loan
  7. y

This could be done using the LabelEncoder by scikit-learn.

In [28]:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# Label encoder
data['marital'] = encoder.fit_transform(data['marital'])
data['job'] = encoder.fit_transform(data['job'])
data['education'] = encoder.fit_transform(data['education'])
data['default'] = encoder.fit_transform(data['default'])
data['housing'] = encoder.fit_transform(data['housing'])
data['loan'] = encoder.fit_transform(data['loan'])
data['y'] = encoder.fit_transform(data['y'])

Data normalization

The normalization of the data is very important when dealing with parameters of different units and scales. For example, some data mining techniques use the Euclidean distance. Therefore, all parameters should have the same scale for a fair comparison between them.

Again, scikit-learn provides preprocessing to normalize the vectors between 0 and 1.

In [29]:

from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
data_scaled = pd.DataFrame(min_max_scaler.fit_transform(data), columns=data.columns)
data_scaled

Generating samples using SMOTEENN

As previously calculated, the data is unbalanced, therefore we need to fix this. We could use resampling techniques such as SMOTEEN.

Preparing the dataset and importing the imblearn library which can be installed using pip and git: "pip install -U git+https://github.com/scikit-learn-contrib/imbalanced-learn.git"

SMOTEENN which is a combination of oversampling and cleaning is the algorithm that is going to balance our dataset.

You can read more about SMOTEENN here: https://contrib.scikit-learn.org/imbalanced-learn/stable/combine.html

In [30]:

from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X = data_scaled.drop('y', axis=1)
y = data_scaled['y']
X_res, y_res = smote_enn.fit_sample(X, y)

Using cross validation to split between training and testing

In [31]:

from sklearn.cross_validation import train_test_split


X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_res
                                                                                                   ,y_res
                                                                                                   ,test_size = 0.3
                                                                                                   ,random_state = 0)
print("")
print("Train: ", len(X_train_resampled))
print("Test: ", len(X_test_resampled))
print("Total: ", len(X_train_resampled)+len(X_test_resampled))
('Train: ', 37176)
('Test: ', 15933)
('Total: ', 53109)
/home/boudi/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Random Forest & Tuning

Random forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. More about them here.

Scikit-learn provides us the Random Forest Classifier so we can easily import it.

However, the main challenge is to tune this classifier (finding the best parameters) in order to get the best results.

GridSearchCV is an important method to estimate these parameters. However, we need to first train the model.

GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

In [32]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

clf = RandomForestClassifier(n_jobs=-1, random_state=7, max_features= 'sqrt', n_estimators=50)
clf.fit(X_train_resampled, y_train_resampled)

param_grid = { 
    'n_estimators': [50, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
}

CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 5)
/home/boudi/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

In [33]:

CV_clf.fit(X_train_resampled, y_train_resampled)

Out[33]:

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
            oob_score=False, random_state=7, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [50, 500], 'max_features': ['auto', 'sqrt', 'log2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [34]:

y_pred = clf.predict(X_test_resampled)
CV_clf.best_params_

Out[34]:

{'max_features': 'auto', 'n_estimators': 500}

In [35]:

import itertools
from sklearn.metrics import accuracy_score, f1_score, precision_score, confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    else:
        1#print('Confusion matrix, without normalization')

    #print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [36]:

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_resampled,y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()


The values that reflect a positive model are the ones on the diagonal (7063 and 8070). This means that the real label and the predicted one are the same (correct classification).

In [37]:

print("F1 Score: ", f1_score(y_test_resampled, y_pred, average="macro"))
print("Precision: ", precision_score(y_test_resampled, y_pred, average="macro"))
print("Recall: ", recall_score(y_test_resampled, y_pred, average="macro"))  
('F1 Score: ', 0.94895414320193083)
('Precision: ', 0.94844923274221049)
('Recall: ', 0.94967770445588928)

Receiver Operating Characteristic

This is a curve that plots the true positive rate with respect to the false positive rate. AUC is the area under the curve and to analyze the results we could refer to this table:

A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:

.90-1 = excellent (A) .80-.90 = good (B) .70-.80 = fair (C) .60-.70 = poor (D) .50-.60 = fail (F)

In our case AUC = 0.95 which means that the model is excellent.

In [38]:

fpr, tpr, thresholds = roc_curve(y_test_resampled,y_pred)
roc_auc = auc(fpr,tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

For more tutorials and examples, check my website: https://www.aeid.me.


要查看或添加评论,请登录

Abdullatif Eid的更多文章

社区洞察

其他会员也浏览了