登录查看更多内容

An Example of Applied Machine Learning: Will the client subscribe?

Abdullatif Eid

发布日期: 2018年9月2日

Introduction

Hello, everyone! Today I'm going to perform data analysis on a dataset related to bank's marketing and build a Random Forest classifier. You can find it on the UCI Machine Learning Repository.

The dataset?

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed.

There are four datasets:

bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs).
bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs).

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).

In this post, I will be using bank-full.csv.

Attribute Information

Although, UCI provided attribute information, they aren't accurate. I will list them here along with an explanation of each attribute and the possible values.

Bank client data:

age (numeric)
job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
marital : marital status (categorical: 'divorced','married','single'; note: 'divorced' means divorced or widowed)
education (categorical: 'primary', 'secondary', 'tertiary', 'unknown')
default: has credit in default? (categorical: 'no','yes')
housing: has housing loan? (categorical: 'no','yes')
loan: has personal loan? (categorical: 'no','yes') ## Related with the last contact of the current campaign:
contact: contact communication type (categorical: 'cellular','telephone')
month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. ## Other attributes:
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

We notice that the attribute jobs and education have missing values (unknown). I'm going to fix these values eventually.

Importing libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings; warnings.simplefilter('ignore')
import seaborn as sns

%matplotlib inline

Viewing the dataset?

data = pd.read_csv("bank-full.csv", delimiter=";")
data.head()

Removing some columns

As mentioned on the UCI site, 'duration' should be disregard if we want to build a good predictive model.
The day and month aren't relevant anymore.
Nowadays, the contact method (telephone, mobile) isn't going to affect anything.
Poutcome should be removed since it contains many unknown values and is strongly related to the specific campaigns.

data.drop(['contact','day','month','duration','poutcome'], inplace=True, axis=1)

A few data insights

Balance, age, jobs and y?

It's good to know if the balance, the age and the job matters if the client will subscribe:

from scipy import stats, integrate
sns.distplot(data['age'], kde=False, fit=stats.gamma);

sns.jointplot(x="age", y="balance", data=data);

In [15]:

sns.boxplot(x="y", y="age", data=data);

In [18]:

sns.violinplot(x="y", y="balance", data=data);

In [22]:

sns.factorplot(x="y", y="age",col="job", data=data, kind="box", size=4, aspect=.5);

In [23]:

sns.factorplot(x="y", y="age",col="default", data=data, kind="box", aspect=.5);

A few conclusions

Younger managers are more likely to subscribe.
Older retired people are more likely to subscribe.
Younger self-employed are more likely to subscribe.
Older housemaids are more likely to subscribe.
Younger students are more likely to subscribe.
People with more balance are more likely to subscribe.
In general, older people are more likely to subscribe. Although this depends on the job.
People with no credit are more likely to subscribe.

Correlation heat map

What is correlation?

The term "correlation" refers to a mutual relationship or association between quantities. In almost any business, it is useful to express one quantity in terms of its relationship with others. For example, the sales of a given product can increase if the company spends more money on advertisements. Now in order to deduce such relationships, I will build a heatmap of the correlation among all the vectors in the dataset.

I will use Pearson's method as it is the most popular method.

Seaborn's library give us perfect heatmaps to visualize the correlation.

The formula that is used is very simple:

where: n is the sample size, xi and yi are the samples and x (bar) is the mean.

In [24]:

#Plot correlation heat map
correlation = data.corr(method='pearson')
plt.figure(figsize=(25,10))
sns.heatmap(correlation, vmax=1, square=True,  annot=True ) 
plt.show()

A few conclusions:

Before anything, please note that this matrix is symmetric and the diagonals are all 1 because it's the correlation between the vector and itself (not to be confused with autocorrelation which is used in signals).

There is a strong positive correlation between the age and the balance which makes sense.
A strong correlation between the number of days that passed by after the client was last contacted from a previous campaign and the number of contacts before this campaign.
There is an obvious correlation among the campaign, pdays and previous vectors.

Cleaning the data

This specific dataset doesn't have NaN values. However, it has 'unknown' values which is the same thing.

There are two columns that contain unknown values:

Job
Education

What I'm going to do is check the percentage of each class (yes or no) having unknown values in either the job or the eduction field (or both).

In [25]:

no = data.loc[data['y'] == 'no']
yes = data.loc[data['y'] == 'yes']
unknown_no = data.loc[((data['job'] == 'unknown')|(data['education'] == 'unknown'))&(data['y'] == 'no')]
unknown_yes = data.loc[((data['job'] == 'unknown')|(data['education'] == 'unknown'))&(data['y'] == 'yes')]

In [26]:

print('The percentage of unknown values in class no: ', float(unknown_no.count()[0]/float(no.count()[0]))*100)
print('The percentage of unknown values in class yes: ', float(unknown_yes.count()[0]/float(yes.count()[0]))*100)
('The percentage of unknown values in class no: ', 4.38354791844096)
('The percentage of unknown values in class yes: ', 5.067120438646247)

Since the percentage is roughly the same among both classes and it's 5%, the best method is to just drop the values to prevent false model and predictions.

In [27]:

data.drop(no, axis=1)
data.drop(yes, axis=1)

Encoding categorical values

Since classification algorithms (RF for example) take numerical values as input, we need to encode the categorical columns. The following columns need to be encoded:

Marital
Job
Education
Default
Housing
Loan
y

This could be done using the LabelEncoder by scikit-learn.

In [28]:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
# Label encoder
data['marital'] = encoder.fit_transform(data['marital'])
data['job'] = encoder.fit_transform(data['job'])
data['education'] = encoder.fit_transform(data['education'])
data['default'] = encoder.fit_transform(data['default'])
data['housing'] = encoder.fit_transform(data['housing'])
data['loan'] = encoder.fit_transform(data['loan'])
data['y'] = encoder.fit_transform(data['y'])

Data normalization

The normalization of the data is very important when dealing with parameters of different units and scales. For example, some data mining techniques use the Euclidean distance. Therefore, all parameters should have the same scale for a fair comparison between them.

Again, scikit-learn provides preprocessing to normalize the vectors between 0 and 1.

In [29]:

from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
data_scaled = pd.DataFrame(min_max_scaler.fit_transform(data), columns=data.columns)
data_scaled

Generating samples using SMOTEENN

As previously calculated, the data is unbalanced, therefore we need to fix this. We could use resampling techniques such as SMOTEEN.

Preparing the dataset and importing the imblearn library which can be installed using pip and git: "pip install -U git+https://github.com/scikit-learn-contrib/imbalanced-learn.git"

SMOTEENN which is a combination of oversampling and cleaning is the algorithm that is going to balance our dataset.

You can read more about SMOTEENN here: https://contrib.scikit-learn.org/imbalanced-learn/stable/combine.html

In [30]:

from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X = data_scaled.drop('y', axis=1)
y = data_scaled['y']
X_res, y_res = smote_enn.fit_sample(X, y)

Using cross validation to split between training and testing

In [31]:

from sklearn.cross_validation import train_test_split


X_train_resampled, X_test_resampled, y_train_resampled, y_test_resampled = train_test_split(X_res
                                                                                                   ,y_res
                                                                                                   ,test_size = 0.3
                                                                                                   ,random_state = 0)
print("")
print("Train: ", len(X_train_resampled))
print("Test: ", len(X_test_resampled))
print("Total: ", len(X_train_resampled)+len(X_test_resampled))
('Train: ', 37176)
('Test: ', 15933)
('Total: ', 53109)
/home/boudi/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Random Forest & Tuning

Random forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set. More about them here.

Scikit-learn provides us the Random Forest Classifier so we can easily import it.

However, the main challenge is to tune this classifier (finding the best parameters) in order to get the best results.

GridSearchCV is an important method to estimate these parameters. However, we need to first train the model.

GridSearchCV implements a “fit” and a “score” method. It also implements “predict”, “predict_proba”, “decision_function”, “transform” and “inverse_transform” if they are implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

In [32]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

clf = RandomForestClassifier(n_jobs=-1, random_state=7, max_features= 'sqrt', n_estimators=50)
clf.fit(X_train_resampled, y_train_resampled)

param_grid = { 
    'n_estimators': [50, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
}

CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 5)
/home/boudi/anaconda2/lib/python2.7/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

In [33]:

CV_clf.fit(X_train_resampled, y_train_resampled)

Out[33]:

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=-1,
            oob_score=False, random_state=7, verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [50, 500], 'max_features': ['auto', 'sqrt', 'log2']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [34]:

y_pred = clf.predict(X_test_resampled)
CV_clf.best_params_

Out[34]:

{'max_features': 'auto', 'n_estimators': 500}

In [35]:

import itertools
from sklearn.metrics import accuracy_score, f1_score, precision_score, confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    else:
        1#print('Confusion matrix, without normalization')

    #print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [36]:

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_resampled,y_pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

The values that reflect a positive model are the ones on the diagonal (7063 and 8070). This means that the real label and the predicted one are the same (correct classification).

In [37]:

print("F1 Score: ", f1_score(y_test_resampled, y_pred, average="macro"))
print("Precision: ", precision_score(y_test_resampled, y_pred, average="macro"))
print("Recall: ", recall_score(y_test_resampled, y_pred, average="macro"))  
('F1 Score: ', 0.94895414320193083)
('Precision: ', 0.94844923274221049)
('Recall: ', 0.94967770445588928)

Receiver Operating Characteristic

This is a curve that plots the true positive rate with respect to the false positive rate. AUC is the area under the curve and to analyze the results we could refer to this table:

A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system:

.90-1 = excellent (A) .80-.90 = good (B) .70-.80 = fair (C) .60-.70 = poor (D) .50-.60 = fail (F)

In our case AUC = 0.95 which means that the model is excellent.

In [38]:

fpr, tpr, thresholds = roc_curve(y_test_resampled,y_pred)
roc_auc = auc(fpr,tpr)

# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

For more tutorials and examples, check my website: https://www.aeid.me.

要查看或添加评论，请登录

Abdullatif Eid的更多文章

How to Deploy Autonomous AI Agents on Kubernetes Without Breaking a Sweat

2025年2月7日

How to Deploy Autonomous AI Agents on Kubernetes Without Breaking a Sweat

Autonomous AI agents are revolutionizing industries, from managing workflows to automating customer interactions. But…
Computer Vision Problem: from Processing to Convolutional Neural Networks

2022年4月1日

Computer Vision Problem: from Processing to Convolutional Neural Networks

A typical Computer Vision problem: Part 1 In this series of posts, I will tackle a typical computer vision problem. The…

An Example of Applied Machine Learning: Will the client subscribe?

Abdullatif Eid

Introduction

The dataset?

Attribute Information

Bank client data:

Importing libraries

Viewing the dataset?

Removing some columns

A few data insights

Balance, age, jobs and y?

A few conclusions

Correlation heat map

What is correlation?

A few conclusions:

Cleaning the data

Encoding categorical values

Data normalization

Generating samples using SMOTEENN

Using cross validation to split between training and testing

Random Forest & Tuning

Receiver Operating Characteristic

Abdullatif Eid的更多文章

社区洞察

其他会员也浏览了

Don’t Let Data Drift Derail Your Machine Learning Success

The costliest mistake your business can make in machine learning

Machine Learning Monitoring, Part 5: Why You Should Care About Data and Concept Drift

Top Machine Learning Algorithms for Actionable Insights in Real-Time

MLflow: a better way to track your models

Machine Learning Monitoring, Part 2: Who Should Care, and What We Are Missing

ML IN STOCK MARKET TRADING

Foundation models will kill the traditional moat of ML applications. What new ones will emerge?

F1 Score and Matthews Correlation Coefficient (MCC): Comparison and Analysis

How Machines Learn to See Similarities

Introduction

The dataset?

Attribute Information

Bank client data:

Importing libraries

Viewing the dataset?

Removing some columns

A few data insights

Balance, age, jobs and y?

A few conclusions

Correlation heat map

What is correlation?

A few conclusions:

Cleaning the data

Encoding categorical values

Data normalization

Generating samples using SMOTEENN

Using cross validation to split between training and testing

Random Forest & Tuning

Receiver Operating Characteristic

Abdullatif Eid的更多文章

How to Deploy Autonomous AI Agents on Kubernetes Without Breaking a Sweat

Computer Vision Problem: from Processing to Convolutional Neural Networks

社区洞察

其他会员也浏览了

Don’t Let Data Drift Derail Your Machine Learning Success

The costliest mistake your business can make in machine learning

Machine Learning Monitoring, Part 5: Why You Should Care About Data and Concept Drift

Top Machine Learning Algorithms for Actionable Insights in Real-Time

MLflow: a better way to track your models

Machine Learning Monitoring, Part 2: Who Should Care, and What We Are Missing

ML IN STOCK MARKET TRADING

Foundation models will kill the traditional moat of ML applications. What new ones will emerge?

F1 Score and Matthews Correlation Coefficient (MCC): Comparison and Analysis

How Machines Learn to See Similarities