Part 3: Predicting results, and working with Command Boards using Machine Learning

Analyzing the Employees TurnOver: Part 3 - Descriptive & Predictive analysis

If you haven't seen Part 1 or Part 2, please refer to Part 1: Introduction & Part 2: Exploratory Analysis


In this section, we will start answering the concerns of the Company, which are divided into three questions based on three big analytic ideas:

  1. Descriptive Analysis: Which department has the highest employee turnover? Which one has the lowest?
  2. Predictive Analysis: Investigate which variables seem to be better predictors of employee departure.
  3. Prescriptive Analysis: What recommendations would you make regarding ways to reduce employee turnover?

In the first point, Descriptive Analysis, we answer the question 'What has happened?'. This is what we did throughout the Exploratory Analysis: understanding what happened in the company through the data. The value we gain here is the so-called 'hindsight'.

In the second point, Predictive Analysis, we answer the question 'What could happen in the future, based on previous trends and patterns?'. We will identify the features (attributes/columns/characteristics) that are most relevant for defining the model's outcome, and then we will build the model. The value we gain here is the so-called 'insight'.

In the third point, Prescriptive Analysis, we answer the question 'What should the company do?'. We know what happened in the past, and we have the key features that drive the outcome. Now it is in our hands to anticipate the future and steer it towards the outcomes we are looking for. Here we apply the model and bring some business strategies into the 'game'; the support of the Company's management is critical to get the most out of the model's output. The value we gain here is the so-called 'foresight'.


1. Which department has the highest employee turnover? Which one has the lowest?

We list all the departments.

departmentsDF        
departmentsDF['absoluteRatio'].plot(label = 'Internal Ratio: Department Left / Department Original', figsize = (15,7))
departmentsDF['relativeRatio'].plot(label = "Internal TurnOver: Department Left / (Department Original+Department Final)/2")

plt.title('Ratios vs TurnOvers')
plt.legend()
xticks=[i for i in range(len(departmentsDF['department']))]
xlabelsNames=[i for i in departmentsDF['department']]
plt.xticks(xticks, xlabelsNames)
plt.show()        

As we can see in this chart, the result changes depending on which ratio we take as the reference.

Departments' Analysis

  • The Sales, Retail, and Engineering departments had the top 3 numbers of employee turnover in absolute terms.
  • In terms of turnover ratio, IT comes first, followed by Logistics and Marketing.
  • Why is it important to also consider the ratio? Sales is the biggest department in the company (1,883 employees) and had 537 employees leave (turnover ratio: 0.285), which is already more people than the entire IT department (356 employees). IT, in turn, had 110 employees leave, giving a turnover ratio of 0.308, higher than that of Sales (see the quick check after this list).
  • The Finance department had the smallest turnover, both in absolute and relative terms.
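
As a quick sanity check, we can recompute the internal ratios quoted above directly from the headcounts mentioned in the text (the exact table values may differ slightly due to rounding):

# Quick check of the internal turnover ratios quoted above (headcounts taken from the text)
sales_left, sales_headcount = 537, 1883
it_left, it_headcount = 110, 356

print(round(sales_left / sales_headcount, 3))   # ~0.285
print(round(it_left / it_headcount, 3))         # ~0.31: higher than Sales, despite far fewer leavers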

1.1. Department with the Highest Number of Employee Turnover

print('The department with the highest number of Employees Internal TurnOver is (the department comparing against itself): {}'.format(departmentsDF.iloc[departmentsDF.relativeRatio.idxmax(),0]))
print('The department with the highest number of Employees Total TurnOver is (the department comparing against the whole company): {}'.format(departmentsDF.iloc[departmentsDF.absoluteRatio.idxmax(),0]))
The department with the highest number of Employees Internal TurnOver is (the department comparing against itself): IT
The department with the highest number of Employees Total TurnOver is (the department comparing against the whole company): sales        

  • The department with the highest number of Employees Internal TurnOver is (the department comparing against itself): IT
  • The department with the highest number of Employees Total TurnOver is (the department comparing against the whole company): sales

1.2. Department with the Lowest Number of Employee Turnover

print('The department with the lowest number of Employees Internal TurnOver is (the department comparing against itself): {}'.format(departmentsDF.iloc[departmentsDF.relativeRatio.idxmin(),0]))
print('The department with the lowest number of Employees Total TurnOver is (the department comparing against the whole company): {}'.format(departmentsDF.iloc[departmentsDF.absoluteRatio.idxmin(),0]))
The department with the lowest number of Employees Internal TurnOver is (the department comparing against itself): finance
The department with the lowest number of Employees Total TurnOver is (the department comparing against the whole company): finance        

  • The department with the lowest number of Employees Internal TurnOver is (the department comparing against itself): finance
  • The department with the lowest number of Employees Total TurnOver is (the department comparing against the whole company): finance

2. Investigate which variables seem to be better predictors of employee departure.

For selecting the Variables/Features that would explain our model, we have two options:

  1. We select the features by using a Classification Model and then apply these best features in a Logistic Regression Model.
  2. We use several other models and compare their scores. Then we pick the best model, extract the most important features (according to that model), tune its hyperparameters, and see if the final score beats the previous one.

2.1. Option 1: We will select the Most Important Features by applying the Decision Tree Classifier Model (CART), and then apply the Logistic Regression Model in order to obtain the equation (model) that predicts the outcome.

For modeling a Machine Learning algorithm, first of all we need to identify the type of problem we are facing: in this case, it is a classification problem (the outcome we expect is one specific value from a defined set of possible values). Here it is 'yes' or 'no': the employee leaves the company or not.

Then, what algorithm should we use? Well, no one has the answer. It depends on many factors. That is why, in point 2.2, we will try a set of them and see which one scores the best.

For modeling ML algorithms, the data also has to fulfill some conditions. One of them is that all the data must be numeric. That is why we need to convert all the non-numeric values into numbers. This applies to the target/outcome ('yes' or 'no' will be converted into 1 or 0) and to the departments ('sales', 'IT', 'finance', ... which will be converted into 0, 1, 2, 3, ...). On top of this, the dataset must not contain any NULL values; however, we have already checked that, and we are fine.
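
If we want to know which numeric code will be assigned to each category, we can inspect the mapping before encoding. This is just an optional sketch (the actual encoding is done in the code block below); the codes follow the alphabetical order of the labels, so the target 'no' becomes 0 and 'yes' becomes 1.

import pandas as pd

df_raw = pd.read_csv('./data/employee_churn_data.csv')

# Mapping of numeric code -> department name
print(dict(enumerate(df_raw['department'].astype('category').cat.categories)))

# Mapping for the target: {0: 'no', 1: 'yes'}
print(dict(enumerate(df_raw['left'].astype('category').cat.categories)))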

In the ML modeling, we will have to split our dataset into train and test sets. The train set will be composed of 85% of the total dataset, while the remaining 15% will be the test set. Why? Because we will have to evaluate how well our model did during the training phase. After splitting our dataset, we will have to 'fit' a model (also known as an 'estimator') to the training set. Then, once we have tuned the model's parameters to make it score better, we apply the fitted model to the data we want to predict on. Once we have the predictions, we compare them with the actual results of the test set to see how well we did. We repeat this procedure (tuning, predicting & scoring) until we obtain the best possible result.

Let's start.

Data Preparation & Label Encoding

# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns

#Read the analytics csv file and store our dataset into a dataframe called "df"
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve

df = pd.read_csv('./data/employee_churn_data.csv')

#We label-encode the target
labelencoder_y = LabelEncoder()
df['left'] = labelencoder_y.fit_transform(df.left)


# Convert these variables into categorical variables
df["department"] = df["department"].astype('category').cat.codes
df["salary"] = df["salary"].astype('category').cat.codes
#We create a Validation set - Split-out validation dataset
#The columns are removed and only that data is taken

#We take the name of the features, excluding the target.
target_name = 'left'
namesFeatures=df.drop(columns=target_name).columns.values
array = df.values

X = array[:,0:-1]
y = array[:,-1]

# split into an 85:15 ratio
validation_size = 0.15
seed = 7

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed,stratify=y)        

Now we have our train set (X_train & y_train) and our test set (X_test & y_test)
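
As a quick optional check that the split came out as expected (since we passed stratify=y, the proportion of employees who left should be similar in both sets):

# Sanity check of the 85:15 split
print(X_train.shape, X_test.shape)       # roughly 85% / 15% of the rows
print(y_train.mean(), y_test.mean())     # proportion of leavers in each split (should be close)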

Feature Selection with CART (Decision Tree)

Here we will select the most important Features using the Decision Tree Classifier estimator.

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Create train and test splits
target_name = 'left'


model = tree.DecisionTreeClassifier(
    #max_depth=3,
    class_weight="balanced",
    min_weight_fraction_leaf=0.01
    )
dtree = model.fit(X_train,y_train)

## plot the importances ##
importances = dtree.feature_importances_
feat_names = df.drop(['left'],axis=1).columns


 # Get the model's feature importances (top and bottom 10)
coeff = pd.DataFrame({'feature_name': feat_names, 'model_coefficient': importances.transpose().flatten()})
coeff = coeff.sort_values('model_coefficient',ascending=False)
coeff_top = coeff.head(10)
coeff_bottom = coeff.tail(10)

# Plot the top 10 feature importances
plt.figure().set_size_inches(10, 6)
fg3 = sns.barplot(x='feature_name', y='model_coefficient',data=coeff_top, palette="Blues_d")
fg3.set_xticklabels(rotation=35, labels=coeff_top.feature_name)

plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.show()        


theDf = {'Name':feat_names, 'Ranking':importances}
tableFeatures = pd.DataFrame(theDf)
tableFeatures.sort_values(by='Ranking',ascending=False,inplace=True)

selectedFeatures=tableFeatures.Name.head(3)

tableFeatures.head(5)        

According to the features' ranking obtained by applying the Decision Tree Classifier, these are the Top 3 features:

  • avg_hrs_month
  • review
  • satisfaction

We will select and apply them in the next Logistic Regression model to create the equation to predict future outcomes.

Logistic Regression using only the selected features

# Create an intercept term for the logistic regression equation
target_name = 'left'

namesFeatures=df.drop(columns=target_name).columns.values

#We create the Intercept 'dummy' variable now.
df['intercept'] = 1

indep_var = [i for i in selectedFeatures]
df = df[indep_var+['intercept',target_name]]

# Create train and test splits
X = df.drop(target_name, axis=1)

y=df[target_name]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)
import statsmodels.api as sm
model = sm.Logit(y_train, X_train[indep_var+['intercept']])
answer = model.fit()

print(answer.summary())
answer.params        

The equation would be:

Employee Turnover Score = avg_hrs_month*(0.061653) + review*(11.104666) + satisfaction*(2.488504) - 20.897861

This linear score is then converted into a probability of leaving with the logistic function, p = exp(score) / (1 + exp(score)), which is exactly what the getTurnOver function below does. An example of the information we could get from this is the following (we will go deeper in the last part):

# Create function to compute coefficients
coef = answer.params

def theAlarm(value):
    
    if (value >=0) & (value <0.25):
        toReturn='\n The Employee is in the 1st Quadrant. \x1b[6;30;42m' + 'No actions should be taken.' + '\x1b[0m'
        return (toReturn)
    elif (value >=0.25) & (value <0.50):
        toReturn='\n The Employee is in the 2nd Quadrant. \x1b[0;30;46m' + 'Pay attention to the employee.' + '\x1b[0m'
        return (toReturn)
    elif (value >=0.50) & (value <0.75):
        toReturn='\n The Employee is in the 3rd Quadrant. \x1b[0;30;43m' + 'Actions should be taken.' + '\x1b[0m'
        return (toReturn)
    else:
        toReturn='\n The Employee is in the 4th Quadrant. \x1b[0;37;41m' + 'Urgent Actions must be taken!' + '\x1b[0m'
        return (toReturn)

def getTurnOver (coef, avg_hrs_month, review, satisfaction) : 
    y = coef[3] + coef[0]*avg_hrs_month + coef[1]*review + coef[2]*satisfaction
    p = np.exp(y) / (1+np.exp(y))
    quadrant=theAlarm(p)
    print ('The Employee is working: {} Hours in Average per Month, has Review of: {}%, and has a Satisfaction level of: {}%. \nThis Employee has {}% chances of leaving the company. {}'.format(avg_hrs_month,review*100,satisfaction*100,np.round(p*100,1),quadrant))
        

Now, let's try to predict what the outcome will be for an employee with 80% satisfaction, a 50% review score, and who works 170 hours per month on average.

# An Employee with 80% of Satisfaction, 50% of Review, who worked 170 hours in average per month.
averageOverHours=170
review=0.5
satisfaction=0.8

getTurnOver(coef, averageOverHours, review, satisfaction)        

As we can see, the model predicts that this employee is more likely to stay in the company.


2.2. Option 2: We will rank a set of Machine Learning classification algorithms by their ROC-AUC on the train set, pick the best one, get its most important features, tune its hyperparameters, and build the final model.

For this process, we will use:

  • Cross-Validation: a technique for evaluating a machine learning model and testing its performance.
  • With K-Fold: k-fold CV minimizes the disadvantages of the simple hold-out method.
  • ROC-AUC Score: ROC curves summarize the trade-off between the true positive rate and the false positive rate of a predictive model across different probability thresholds.
  • We cannot rely on Accuracy alone, because the binary target is imbalanced: False Positives and False Negatives must be taken into account, and Accuracy by itself does not capture them (see the small illustration below).
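
A minimal illustration of that last point, with made-up numbers that roughly mimic our class balance: a 'model' that predicts that nobody leaves gets a decent-looking Accuracy, while its ROC-AUC of 0.5 reveals that it has no discriminating power at all.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical target with ~29% leavers (similar to the balance in our dataset)
y_true = np.array([0]*71 + [1]*29)

# A naive "model" that always predicts 'stays'
y_pred = np.zeros(100, dtype=int)    # predicted labels
y_score = np.zeros(100)              # predicted risk scores

print(accuracy_score(y_true, y_pred))   # 0.71 -> looks acceptable
print(roc_auc_score(y_true, y_score))   # 0.5  -> no better than random ranking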

# Load libraries
from sklearn import linear_model

#Cross Validation Techniques
from sklearn.model_selection import KFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV


from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

#Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

#Ensemble
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from xgboost import XGBClassifier
df = pd.read_csv('./data/employee_churn_data.csv')

#We label-encode the target
labelencoder_y = LabelEncoder()
df['left'] = labelencoder_y.fit_transform(df.left)


# Convert these variables into categorical variables
df["department"] = df["department"].astype('category').cat.codes
df["salary"] = df["salary"].astype('category').cat.codes

#We create a Validation set - Split-out validation dataset
#The columns are removed and only that data is taken

#We take the name of the features, excluding the target.
target_name = 'left'
namesFeatures=df.drop(columns=target_name).columns.values
array = df.values

X = array[:,0:-1]
y = array[:,-1]

# split into an 85:15 ratio
validation_size = 0.15
seed = 7

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed,stratify=y)
num_folds = 5
seed = 7
number_repeats = 3
scoring='roc_auc'

#Select 1 for KFold, 2 for RepeatedStratifiedKFold
cvToUse=1

if cvToUse==1:
    cv = KFold( n_splits=num_folds, 
                shuffle=True,
                random_state=seed
                )
else:
    cv = RepeatedStratifiedKFold(   n_splits=num_folds, 
                                    n_repeats=number_repeats, 
                                    random_state=seed
                                    )
# Spot-Check Algorithms
models = []
models.append(('LR', LogisticRegression(random_state=seed)))
models.append(('LDA', LinearDiscriminantAnalysis())) #Doesn't allow random_state
models.append(('KNN', KNeighborsClassifier())) #Doesn't allow random_state
models.append(('CART', DecisionTreeClassifier(random_state=seed)))
models.append(('NB', GaussianNB())) #Doesn't allow random_state
models.append(('SVM', SVC(random_state=seed)))
# ensembles
models.append(('BDT-Ensemble', BaggingClassifier(random_state=seed)))
models.append(('AB-Ensemble', AdaBoostClassifier(random_state=seed)))
models.append(('GBC-Ensemble', GradientBoostingClassifier(random_state=seed)))
models.append(('XGB-Ensemble', XGBClassifier(random_state=seed,eval_metric='logloss'))) #I set this eval_metric for avoiding warning messages.
models.append(('RF-Ensemble', RandomForestClassifier(random_state=seed)))
models.append(('ET-Ensemble', ExtraTreesClassifier(random_state=seed)))

# evaluate each model in turn
resultsSimpler = []
namesSimpler = []


# Create DataFrame  
tableResults = pd.DataFrame(columns=['Name', 'ROC-AUC(Train)', 'STD'])


print("Scoring used: ROC-AUC")
for name, model in models:
    
    cv_results = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)
    resultsSimpler.append(cv_results)
    namesSimpler.append(name)
    
    new_row = {'Name':name, 'ROC-AUC(Train)':cv_results.mean(), 'STD':cv_results.std()}
    tableResults = pd.concat([tableResults, pd.DataFrame([new_row])], ignore_index=True)
    
    msg = "{}: {} ({})".format(name, cv_results.mean(), cv_results.std())
    print(msg)
tableResults=tableResults.sort_values(by='ROC-AUC(Train)',ascending=False)
tableResults
    
Scoring used: ROC-AUC
LR: 0.6939307985118056 (0.015222599204236527)
LDA: 0.7189424885547057 (0.015509742665182055)
KNN: 0.7308053397060327 (0.0077046622782576835)
CART: 0.7855039148935751 (0.009134661911484994)
NB: 0.7127441184325531 (0.012938366031042732)
SVM: 0.61157105872254 (0.01786056670182524)
BDT-Ensemble: 0.9055489950151159 (0.004911097629818087)
AB-Ensemble: 0.8482198674988943 (0.0037887496497835365)
GBC-Ensemble: 0.9204620061188449 (0.004513217863409023)
XGB-Ensemble: 0.9219059521732735 (0.006389556574612421)
RF-Ensemble: 0.9250010593808922 (0.00569482185845855)
ET-Ensemble: 0.9154081004321297 (0.005482901806565513)        

This table shows us that the Random Forest Classifier scored the best, followed by the Extreme Gradient Boosting and the Gradient Boosting Classifier.

We can see that the Logistic Regression model reaches a ROC-AUC of around 0.69 on the train set, which is not a great score but is acceptable. When testing the model against the test set, the score might drop; however, with some tuning of the algorithm's hyperparameters, we could push it back up. In any case, with this model our final predictions would stay around that level, so we will try to improve the score with other models.

We will select the three best-scoring algorithms and tune their hyperparameters.

tableResults.head(3)        
tunedAlgorithmTable = pd.DataFrame(columns=['Name', 'ROC-AUC(Test)','Model','BestEstimator'])        


2.2.1. Algorithm 1: Random Forest Classifier (RFC)

rfc = RandomForestClassifier(random_state=seed)

rfc.fit(X_train, y_train)

# estimate accuracy on validation dataset
predictions = rfc.predict(X_test)

print('The Initial Model ROC-AUC on the Test Set is:')
rfc_roc_auc = roc_auc_score(y_test, predictions)
print(rfc_roc_auc)

## Take the important Features ##
importances = rfc.feature_importances_

cm=confusion_matrix(y_test, predictions)
plt.figure(figsize=(6,3))
plt.title("Confusion Matrix")
sns.heatmap(cm, annot=True,fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()        


The Initial Model ROC-AUC on the Test Set is:
0.8296546805405329        

This model gives us a ROC-AUC score of almost 83%. Let's see if we can improve it.


Selecting Most Important Features

According to RFC, the most important features to explain the model are:

from sklearn.feature_selection import SelectFromModel

numbers=pd.Series(np.arange(0,len(importances)))
values=pd.Series(importances)

newDF=pd.DataFrame({'id':numbers,'value':values})
newDF.sort_values(by='value',ascending=False,inplace=True)


selectModel = SelectFromModel(rfc, prefit=True)
X_train_new = selectModel.transform(X_train)
X_test_new = selectModel.transform(X_test)

newQuantityOfFeatures=X_train_new.shape[1]
newDF=newDF.iloc[0:newQuantityOfFeatures,:]
theIds = newDF['id']   # ids (column positions) of the selected features

selectedFeatures=theIds.values

theDf = {'Name':namesFeatures, 'Ranking':importances}
tableFeatures = pd.DataFrame(theDf)
tableFeatures.sort_values(by='Ranking',ascending=False,inplace=True)

tableFeatures.head(len(theIds))        
# Get the model's feature importances
coeff = pd.DataFrame({'feature_name': namesFeatures, 'model_coefficient': importances.transpose().flatten()})
coeff = coeff.sort_values('model_coefficient',ascending=False)
coeff_top = coeff.head(10)
coeff_bottom = coeff.tail(10)


plt.figure().set_size_inches(10, 6)
fg3 = sns.barplot(x='feature_name', y='model_coefficient',data=coeff_top, palette="Blues_d")
fg3.set_xticklabels(rotation=35, labels=coeff_top.feature_name)

plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.subplots_adjust(bottom=0.4)        

As we can see, the Random Forest model returns the same three features we used in the Logistic Regression, but weighted differently. In this case, the order is:

  • Satisfaction
  • Average Working Hours per Month
  • Review


Tuning RFC

# define models
rfc = RandomForestClassifier(random_state=seed)

param_grid = {'n_estimators' : [1100],
                "min_samples_split" : [11],
                'class_weight':["balanced"],
                'max_depth': [None],
                'random_state':[seed],
#               'max_features':['sqrt', 'log2'],  
               'min_samples_leaf': [1]              
                    }
                    

grid_search = GridSearchCV( estimator=rfc, 
                            param_grid=param_grid, 
                            n_jobs=-1, 
                            cv=cv, 
                            scoring=scoring
                            )

grid_result = grid_search.fit(X_train, y_train)


# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print('Best Estimator: ',grid_result.best_estimator_)

best_hyperparams=grid_result.best_params_
best_cv_score=grid_result.best_score_
Best: 0.928198 using {'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 11, 'n_estimators': 1100, 'random_state': 7}
Best Estimator:  RandomForestClassifier(class_weight='balanced', min_samples_split=11,
                       n_estimators=1100, random_state=7)
#model
rfc = RandomForestClassifier(**grid_result.best_params_)

# Estimate accuracy on validation dataset
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)


print('The FINAL Model ROC-AUC score on the Test Set is: ')
rfc_roc_auc = roc_auc_score(y_test, predictions)
print(rfc_roc_auc)
print(confusion_matrix(y_test, predictions))

new_row = { 'Name':'RF-Ensemble', 
            'ROC-AUC(Test)':roc_auc_score(y_test, predictions),
            'Model':rfc,
            'BestEstimator':grid_result.best_estimator_}
tunedAlgorithmTable = pd.concat([tunedAlgorithmTable, pd.DataFrame([new_row])], ignore_index=True)        


The FINAL Model ROC-AUC score on the Test Set is: 
0.8505989126994999
[[926  87]
 [ 89 329]]        

The final score is 85%. We have improved it!


2.2.2. Algorithm 2: Extreme Gradient Boosting (XGB)

from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix

xgb = XGBClassifier(random_state=seed,eval_metric='logloss')

xgb.fit(X_train, y_train)

# estimate accuracy on validation dataset
predictions = xgb.predict(X_test)

print('The Initial Model ROC-AUC on the Test Set is:')
xgb_roc_auc = roc_auc_score(y_test, predictions)
print(xgb_roc_auc)


## Take the important Features ##
importances = xgb.feature_importances_

cm=confusion_matrix(y_test, predictions)
plt.figure(figsize=(6,3))
plt.title("Confusion Matrix")
sns.heatmap(cm, annot=True,fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()        


The Initial Model ROC-AUC on the Test Set is:
0.8280983577133626        

This model gives us a ROC-AUC score of almost 83%. Let's see if we can improve it.

Selecting Most Important Features

According to XGB, the most important features to explain the model are:

from sklearn.feature_selection import SelectFromModel

numbers=pd.Series(np.arange(0,len(importances)))
values=pd.Series(importances)

newDF=pd.DataFrame({'id':numbers,'value':values})
newDF.sort_values(by='value',ascending=False,inplace=True)


selectModel = SelectFromModel(xgb, prefit=True)
X_train_new = selectModel.transform(X_train)
X_test_new = selectModel.transform(X_test)

newQuantityOfFeatures=X_train_new.shape[1]
newDF=newDF.iloc[0:newQuantityOfFeatures,:]
theIds = newDF['id']   # ids (column positions) of the selected features

selectedFeatures=theIds.values

theDf = {'Name':namesFeatures, 'Ranking':importances}
tableFeatures = pd.DataFrame(theDf)
tableFeatures.sort_values(by='Ranking',ascending=False,inplace=True)

tableFeatures.head(len(theIds))        
# Get the model's feature importances
coeff = pd.DataFrame({'feature_name': namesFeatures, 'model_coefficient': importances.transpose().flatten()})
coeff = coeff.sort_values('model_coefficient',ascending=False)
coeff_top = coeff.head(10)
coeff_bottom = coeff.tail(10)


plt.figure().set_size_inches(10, 6)
fg3 = sns.barplot(x='feature_name', y='model_coefficient',data=coeff_top, palette="Blues_d")
fg3.set_xticklabels(rotation=35, labels=coeff_top.feature_name)

plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.subplots_adjust(bottom=0.4)        

As we can see, the Extreme Gradient Boosting model returns the same three features as the previous models, but weighted differently. In this case, the order is:

  • Average Working Hours per Month
  • Satisfaction
  • Review

Tuning XGB

# define models and parameters
xgb = XGBClassifier(random_state=seed)


# As configured, this grid takes about 1 minute to execute.
param_grid = {  'n_estimators':[70,80,90,100,110], 
                'subsample':[1.0],          # must lie in (0, 1]
                'max_depth':[5,6,7],
                'learning_rate':[0.300000012],
                'min_child_weight':[1,2],
                'gamma':[0],
                'colsample_bytree':[1.0],   # must lie in (0, 1]
                'eval_metric':['logloss'],

            }
                    

grid_search = GridSearchCV( estimator=xgb, 
                            param_grid=param_grid, 
                            n_jobs=-1, 
                            cv=cv, 
                            scoring=scoring
                            )


grid_result = grid_search.fit(X_train, y_train)
#grid_result = grid_search.fit(X_train_new, y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print('Best Estimator: ',grid_result.best_estimator_)

best_hyperparams=grid_result.best_params_
best_cv_score=grid_result.best_score_        


#model
xgb = XGBClassifier(**grid_result.best_params_)

# Estimate accuracy on validation dataset
xgb.fit(X_train, y_train)
predictions = xgb.predict(X_test)

print('The FINAL Model ROC-AUC score on the Test Set is: ')
xgb_roc_auc = roc_auc_score(y_test, predictions)
print(xgb_roc_auc)
print(confusion_matrix(y_test, predictions))

new_row = { 'Name':'XGB-Ensemble',
            'ROC-AUC(Test)':roc_auc_score(y_test, predictions),
            'Model':xgb,
            'BestEstimator':grid_result.best_estimator_}
tunedAlgorithmTable = pd.concat([tunedAlgorithmTable, pd.DataFrame([new_row])], ignore_index=True)        


The FINAL Model ROC-AUC score on the Test Set is: 
0.8334522027045537
[[947  66]
 [112 306]]        

The final score is 83.3%. We have improved it!


2.2.3. Algorithm 3: Gradient Boosting Classifier (GBC)
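
The initial (untuned) fit of the Gradient Boosting Classifier follows the same pattern as the two previous models; a minimal sketch of that step, assuming the same train/test split and the default estimator with random_state=seed, is shown below (it also stores the feature importances used right after).

gbc = GradientBoostingClassifier(random_state=seed)

gbc.fit(X_train, y_train)

# Estimate performance on the validation dataset
predictions = gbc.predict(X_test)

print('The Initial Model ROC-AUC on the Test Set is:')
gbc_roc_auc = roc_auc_score(y_test, predictions)
print(gbc_roc_auc)

## Take the important Features ##
importances = gbc.feature_importances_

cm=confusion_matrix(y_test, predictions)
plt.figure(figsize=(6,3))
plt.title("Confusion Matrix")
sns.heatmap(cm, annot=True,fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()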


The Initial Model ROC-AUC on the Test Set is:
0.8286675137093383        

This model gives us a ROC-AUC score of almost 83%. Let's see if we can improve it.


Selecting Most Important Features

According to the GBC, the most important features to explain the model are:

from sklearn.feature_selection import SelectFromModel

numbers=pd.Series(np.arange(0,len(importances)))
values=pd.Series(importances)

newDF=pd.DataFrame({'id':numbers,'value':values})
newDF.sort_values(by='value',ascending=False,inplace=True)


selectModel = SelectFromModel(gbc, prefit=True)
X_train_new = selectModel.transform(X_train)
X_test_new = selectModel.transform(X_test)

newQuantityOfFeatures=X_train_new.shape[1]
newDF=newDF.iloc[0:newQuantityOfFeatures,:]
theIds = newDF['id']   # ids (column positions) of the selected features

selectedFeatures=theIds.values

theDf = {'Name':namesFeatures, 'Ranking':importances}
tableFeatures = pd.DataFrame(theDf)
tableFeatures.sort_values(by='Ranking',ascending=False,inplace=True)

tableFeatures.head(len(theIds))        
# Get the model's feature importances
coeff = pd.DataFrame({'feature_name': namesFeatures, 'model_coefficient': importances.transpose().flatten()})
coeff = coeff.sort_values('model_coefficient',ascending=False)
coeff_top = coeff.head(10)
coeff_bottom = coeff.tail(10)


plt.figure().set_size_inches(10, 6)
fg3 = sns.barplot(x='feature_name', y='model_coefficient',data=coeff_top, palette="Blues_d")
fg3.set_xticklabels(rotation=35, labels=coeff_top.feature_name)

plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.subplots_adjust(bottom=0.4)        

As we can see, the Gradient Boosting model returns the same three features as the previous models, but weighted differently. In this case, the order is:

  • Average Working Hours per Month
  • Satisfaction
  • Review

Tuning the GBC

# define models and parameters
gbc = GradientBoostingClassifier()

param_grid = {  'n_estimators' : [80,90],
                "learning_rate" : [0.08,0.1],
                'subsample':[1.0],
                'max_depth': [6,7],
                'random_state':[seed],
                'loss': ['deviance'],
                'max_features':[None],  
                'min_samples_leaf': [2],              
                    }
                    
grid_search = GridSearchCV( estimator=gbc, 
                            param_grid=param_grid, 
                            n_jobs=-1, 
                            cv=cv, 
                            scoring=scoring
                            )

grid_result = grid_search.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print('Best Estimator: ',grid_result.best_estimator_)

best_hyperparams=grid_result.best_params_
best_cv_score=grid_result.best_score_
Best: 0.927736 using {'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 6, 'max_features': None, 'min_samples_leaf': 2, 'n_estimators': 80, 'random_state': 7, 'subsample': 1.0}
Best Estimator:  GradientBoostingClassifier(max_depth=6, min_samples_leaf=2, n_estimators=80,
                           random_state=7)
#Model
gbc = GradientBoostingClassifier(**grid_result.best_params_)

# Estimate accuracy on validation dataset
gbc.fit(X_train, y_train)
predictions = gbc.predict(X_test)

print('The FINAL Model ROC-AUC score on the Test Set is: ')
gbc_roc_auc = roc_auc_score(y_test, predictions)
print(gbc_roc_auc)
print(confusion_matrix(y_test, predictions))

new_row = { 'Name':'GBC-Ensemble',
            'ROC-AUC(Test)':roc_auc_score(y_test, predictions),
            'Model':gbc,
            'BestEstimator':grid_result.best_estimator_}
tunedAlgorithmTable = pd.concat([tunedAlgorithmTable, pd.DataFrame([new_row])], ignore_index=True)        


The FINAL Model ROC-AUC score on the Test Set is: 
0.8430215806949843
[[947  66]
 [104 314]]        

The final score is 84.3%. We have improved it!


2.2.4. Voting

We rank all the tuned algorithms by their final ROC-AUC score on the test set.

#From this table we will select the most promising algorithms
tunedAlgorithmTable.sort_values(by='ROC-AUC(Test)',inplace=True,ascending=False)
tunedAlgorithmTable.head(10)        

We select the top 2 algorithms for the Voting Ensemble

In this step, we combine the best two algorithms using soft voting (averaging their predicted probabilities) and check whether, combined, they beat the best estimator's predictions on their own.

#Select the Number of the TOP Algorithms to use
n_Algorithms=2

# Voting Ensemble for Classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

nameVoting=[]
bestEstimatorVoting=[]

for i in range(n_Algorithms):
    nameVoting.append(tunedAlgorithmTable.iloc[i,:].Name)
    bestEstimatorVoting.append(tunedAlgorithmTable.iloc[i,:].BestEstimator)

theEstimators = list(zip(nameVoting, bestEstimatorVoting))

VotingPredictor = VotingClassifier( estimators = theEstimators,
                                    voting='soft', 
                                    n_jobs = -1)

VotingPredictor = VotingPredictor.fit(X_train, y_train)


scores = cross_val_score(   VotingPredictor, 
                            X_train, 
                            y_train, 
                            cv = cv,
                            n_jobs = -1, 
                            scoring = scoring)
print("The algorithms used are:")
for i in range(n_Algorithms):                            
    print("{}".format(nameVoting[i]))

print('\nThe Summary')    
print(round(np.mean(scores)*100, 2))        


The algorithms used are:
RF-Ensemble
GBC-Ensemble

The Summary
92.97        


predictions = VotingPredictor.predict(X_test)
voting_roc_auc=roc_auc_score(y_test, predictions)
print('The FINAL Model ROC-AUC on the Test Set is: ',voting_roc_auc)



if voting_roc_auc>tunedAlgorithmTable.iloc[0,1]:
    print('\nThe top {} combination of models ({}) do better than the best model ({}) alone.'.format(n_Algorithms,theEstimators,tunedAlgorithmTable.iloc[0,0]))
    predictorToUse=VotingPredictor
else:
    print('\nThe best model ({}) alone does better than the Top {} combination of models ({})'.format(tunedAlgorithmTable.iloc[0,0],n_Algorithms,nameVoting))
    predictorToUse=bestEstimatorVoting[0]        


The FINAL Model ROC-AUC on the Test Set is:  0.8466679576982482

The best model (RF-Ensemble) alone does better than the Top 2 combination of models (['RF-Ensemble', 'GBC-Ensemble'])        


ROC Graph

# Create ROC Graph
from sklearn.metrics import roc_curve
rfc_fpr, rfc_tpr, rfc_thresholds = roc_curve(y_test, rfc.predict_proba(X_test)[:,1])
xgb_fpr, xgb_tpr, xgb_thresholds = roc_curve(y_test, xgb.predict_proba(X_test)[:,1])
gbc_fpr, gbc_tpr, gbc_thresholds = roc_curve(y_test, gbc.predict_proba(X_test)[:,1])
voting_fpr, voting_tpr, voting_thresholds = roc_curve(y_test, VotingPredictor.predict_proba(X_test)[:,1])


plt.figure()

# Plot RFC ROC
plt.plot(rfc_fpr, rfc_tpr, label='Random Forest Classifier (area = %0.4f)' % rfc_roc_auc)

# Plot XGB ROC
plt.plot(xgb_fpr, xgb_tpr, label='XGBoost (area = %0.4f)' % xgb_roc_auc)

# Plot GBC ROC
plt.plot(gbc_fpr, gbc_tpr, label='Gradient Boost (area = %0.4f)' % gbc_roc_auc)

# Plot Voting ROC
plt.plot(voting_fpr, voting_tpr, label='Voting Ensemble (area = %0.4f)' % voting_roc_auc)

# Plot Base Rate ROC
plt.plot([0,1], [0,1], 'k--', label='Base Rate')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()        

As we can see, the best estimator alone (Random Forest) does better than the best two combined, or any other single estimator.


2.2.5. Getting Information from the best model (Random Forest)

As we have seen, the three models developed rank the same three features at the top, the same ones used in the Logistic Regression model from the previous point, although in a different order of importance. Nevertheless, taking into account that the Random Forest model obtained the best ROC-AUC score, let's use it to forecast some possible outcomes under different scenarios.

> It's important to mention that in this case we won't provide a turnover percentage, only whether the employee is predicted to leave the company or not.

#As the three main features explain most of the model, we will get the means for the rest of the features.

department=df.department.mean()
promoted=df.promoted.mean()
projects=df.projects.mean()
salary=df.salary.mean()
tenure=df.tenure.mean()
bonus=df.bonus.mean()

#Scenario 1
averageOverHours1=170
review1=0.6
satisfaction1=0.8

#Scenario 2
averageOverHours2=180
review2=0.7
satisfaction2=0.8

#Scenario 3
averageOverHours3=188
review3=0.8
satisfaction3=0.8

#Each row follows the same column order used for training (namesFeatures, i.e. all columns except 'left')
dataSetToTry=[ [department,promoted,review1,projects,salary,tenure, satisfaction1,bonus,averageOverHours1 ],
               [department,promoted,review2,projects,salary,tenure, satisfaction2,bonus,averageOverHours2 ],
               [department,promoted,review3,projects,salary,tenure, satisfaction3,bonus,averageOverHours3 ] ]        

2.2.5.1. Forecasting with Logistic Model equation

Applying the equation we got in point 2.1, we forecast:

getTurnOver(coef, averageOverHours1, review1, satisfaction1)
getTurnOver(coef, averageOverHours2, review2, satisfaction2)
getTurnOver(coef, averageOverHours3, review3, satisfaction3)        


2.2.5.2. Forecasting with Random Forest Model

def getMessage(thePred):
    for i in range(len(thePred)):
        if thePred[i]==0:
            alert='stays in the company'
        else:
            alert='leaves the company'
        message='The outcome for employee {} is: {}'.format(i,alert)
        print(message)
thePrediction=predictorToUse.predict(dataSetToTry)

getMessage(thePrediction)        


Summary

Both approaches are good enough but, according to the ROC-AUC score, the Random Forest one scores better. Nevertheless, at the end of this report we will show how the Logistic Regression equation can be used to build a Command Board.

Part 4 will be the last one, where we will do the Prescriptive Analysis, close all the ideas, and propose a Command Board.

Go to Part 4: Prescriptive Analysis


---------------------------------------------------------------------------------------------------------------

This case is part of a DataCamp competition: Employees TurnOver - DataCamp

The full report is available in my GitHub repository: GitHub - vascoarizna

Here you will find not only the full explanation with the graphics and the solution, but also the Jupyter Notebook code in case you want to take anything for yourself.

Author: Ignacio Ariznabarreta - JIAF Consulting
