Part 3: Predicting results and working with Command Boards using Machine Learning
Analyzing the Employees TurnOver: Part 3 - Descriptive & Predictive analysis
If you haven't seen Part 1 or Part 2, please refer to Part 1: Introduction & Part 2: Exploratory Analysis.
In this section, we will start answering the Company's concerns, which we approach through three big analytics ideas:
In the first point, Descriptive Analysis, we answer the question 'What has happened?'. This is what we did throughout the Exploratory Analysis: understanding what happened in the company through its data. The value we obtain here is the so-called 'hindsight'.
In the second point, Predictive Analysis, we answer the question 'What could happen in the future, based on previous trends and patterns?'. We will identify the features (attributes/columns/characteristics) that are most relevant to the outcome, and then build the model. The value we obtain here is the so-called 'insight'.
In the third point, Prescriptive Analysis, we answer the question 'What should the company do?'. We know what happened in the past, and we know the key features that drive the outcome. Now it is in our hands to anticipate the future and steer it towards the outcomes we are looking for. Here we apply the model and bring business strategies into the game; the support of the Company's Direction is critical to get the most out of the model's output. The value we obtain here is the so-called 'foresight'.
1. Which department has the highest employee turnover? Which one has the lowest?
We list all the departments.
departmentsDF
departmentsDF['absoluteRatio'].plot(label = 'Internal Ratio: Department Left / Department Original', figsize = (15,7))
departmentsDF['relativeRatio'].plot(label = "Internal TurnOver: Department Left / (Department Original+Department Final)/2")
plt.title('Ratios vs TurnOvers')
plt.legend()
xticks=[i for i in range(len(departmentsDF['department']))]
xlabelsNames=[i for i in departmentsDF['department']]
plt.xticks(xticks, xlabelsNames)
plt.show()
As we can see in this chart, the answer changes depending on which ratio we take as the reference.
Departments' Analysis
1.1. Department with the Highest Employee Turnover
print('The department with the highest number of Employees Internal TurnOver is (the department comparing against itself): {}'.format(departmentsDF.iloc[departmentsDF.relativeRatio.idxmax(),0]))
print('The department with the highest number of Employees Total TurnOver is (the department comparing against the whole company): {}'.format(departmentsDF.iloc[departmentsDF.absoluteRatio.idxmax(),0]))
The department with the highest number of Employees Internal TurnOver is (the department comparing against itself): IT
The department with the highest number of Employees Total TurnOver is (the department comparing against the whole company): sales
1.2. Department with the Lowest Employee Turnover
print('The department with the lowest number of Employees Internal TurnOver is (the department comparing against itself): {}'.format(departmentsDF.iloc[departmentsDF.relativeRatio.idxmin(),0]))
print('The department with the lowest number of Employees Total TurnOver is (the department comparing against the whole company): {}'.format(departmentsDF.iloc[departmentsDF.absoluteRatio.idxmin(),0]))
The department with the lowest number of Employees Internal TurnOver is (the department comparing against itself): finance
The department with the lowest number of Employees Total TurnOver is (the department comparing against the whole company): finance
2. Investigate which variables seem to be better predictors of employee departure.
For selecting the Variables/Features that would explain our model, we have two options:
2.1. Option 1: We will select the Most Important Features by applying the Decision Tree Classifier Model (CART), and then apply the Logistic Regression Model in order to obtain the equation (model) that predicts the outcome.
For modeling a Machine Learning algorithm, first of all, we need to identify the type of problem we are facing: in this case, it is a classification problem (the expected outcome is one value from a defined set of possible values). Here it is 'yes' or 'no': the employee leaves the company or not.
Then, what algorithm should we use? Well, no one has the answer; it depends on many factors. That is why, in point 2.2, we will try a set of them and see which one scores best.
For modeling ML algorithms, the data also has to fulfill some conditions. One of them is that all the data must be numeric. That is why we need to convert all the non-numerical values into numbers. This is the case for the target/outcome ('yes' or 'no', which will be converted into 1 or 0) and the departments ('sales', 'it', 'finance', ... which will be converted into 0, 1, 2, 3, ...). On top of this, the dataset must not have any NULL value; however, we have already checked that, and we are OK.
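If you want to re-run the missing-values check yourself, a quick way (a small sketch) is:
# Quick re-check for missing values (already verified during the Exploratory Analysis)
import pandas as pd
df = pd.read_csv('./data/employee_churn_data.csv')
print(df.isnull().sum())  # every column should show 0 nulls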
In the ML modeling, we will have to split our dataset into train and test sets. The train set will be composed of 85% of the total dataset, while the remaining 15% will be the test set. Why? Because we will have to evaluate how well our model did in the training phase. After splitting our dataset, we will have to 'fit' a model (also known as an 'estimator') to the training dataset. Then, once we have tuned the model's parameters to make it score better, we apply the fitted model to the dataset we will use to predict. Once we have the predictions, we will compare them with the actual results of the test set to see how well we did. We will repeat this procedure (tuning, predicting & scoring) until we have the best possible result.
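Before applying this to the real dataset, here is a minimal sketch of that fit / predict / score cycle on synthetic data, purely to illustrate the workflow (the names and values below are illustrative only):
# Minimal illustration of the fit / predict / score cycle on synthetic data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(7)
X_demo = rng.rand(200, 3)                      # 200 fake employees, 3 numeric features
y_demo = (X_demo[:, 0] > 0.5).astype(int)      # fake target: 1 = left, 0 = stayed

# 85/15 split, as we will use for the real dataset
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.15, random_state=7)

estimator = DecisionTreeClassifier(random_state=7)
estimator.fit(X_tr, y_tr)                      # 1) fit the estimator on the train set
predictions = estimator.predict(X_te)          # 2) predict on the test set
print(accuracy_score(y_te, predictions))       # 3) score the predictions against the actual values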
Let's start.
Data Preparation & Label Encoding
# Import the necessary modules for data manipulation and visual representation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
# Scikit-learn modules for preprocessing, splitting, modeling and evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, confusion_matrix, precision_recall_curve
# Read the analytics csv file and store our dataset into a dataframe called "df"
df = pd.read_csv('./data/employee_churn_data.csv')
#We label-encode the target
labelencoder_y = LabelEncoder()
df['left'] = labelencoder_y.fit_transform(df.left)
# Convert these variables into categorical variables
df["department"] = df["department"].astype('category').cat.codes
df["salary"] = df["salary"].astype('category').cat.codes
#We create a Validation set - Split-out validation dataset
#We work with the underlying values array: the last column is the target, the rest are the features
#We take the name of the features, excluding the target.
target_name = 'left'
namesFeatures=df.drop(columns=target_name).columns.values
array = df.values
X = array[:,0:-1]
y = array[:,-1]
# split into an 85:15 ratio
validation_size = 0.15
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed,stratify=y)
Now we have our train set (X_train & y_train) and our test set (X_test & y_test)
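Optionally, we can sanity-check the split sizes and the class balance preserved by the stratify option (a quick sketch):
# Sanity check: shapes of the splits and class balance preserved by stratify
print(X_train.shape, X_test.shape)
print(pd.Series(y_train).value_counts(normalize=True))  # proportion of 'left' = 1 vs 0 in the train set
print(pd.Series(y_test).value_counts(normalize=True))   # proportion of 'left' = 1 vs 0 in the test set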
Feature Selection with CART (Decision Tree)
Here we will select the most important Features using the Decision Tree Classifier estimator.
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Create train and test splits
target_name = 'left'
model = tree.DecisionTreeClassifier(
#max_depth=3,
class_weight="balanced",
min_weight_fraction_leaf=0.01
)
dtree = model.fit(X_train,y_train)
## plot the importances ##
importances = dtree.feature_importances_
feat_names = df.drop(['left'],axis=1).columns
# Get the model's feature importances (top and bottom 10)
coeff = pd.DataFrame({'feature_name': feat_names, 'model_coefficient': importances.transpose().flatten()})
coeff = coeff.sort_values('model_coefficient',ascending=False)
coeff_top = coeff.head(10)
coeff_bottom = coeff.tail(10)
# Plot the top 10 importances
plt.figure().set_size_inches(10, 6)
fg3 = sns.barplot(x='feature_name', y='model_coefficient',data=coeff_top, palette="Blues_d")
fg3.set_xticklabels(rotation=35, labels=coeff_top.feature_name)
plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.show()
theDf = {'Name':feat_names, 'Ranking':importances}
tableFeatures = pd.DataFrame(theDf)
tableFeatures.sort_values(by='Ranking',ascending=False,inplace=True)
selectedFeatures=tableFeatures.Name.head(3)
tableFeatures.head(5)
According to the feature ranking obtained by applying the Decision Tree Classifier, the Top 3 features are avg_hrs_month, review, and satisfaction:
We will select and apply them in the next Logistic Regression model to create the equation to predict future outcomes.
Logistic Regression using only the selected features
# Create an intercept term for the logistic regression equation
target_name = 'left'
namesFeatures=df.drop(columns=target_name).columns.values
#We create the Intercept 'dummy' variable now.
df['intercept'] = 1
indep_var = [i for i in selectedFeatures]
df = df[indep_var+['intercept',target_name]]
# Create train and test splits
X = df.drop(target_name, axis=1)
y=df[target_name]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.15, random_state=123, stratify=y)
import statsmodels.api as sm
model = sm.Logit(y_train, X_train[indep_var+['intercept']])
answer = model.fit()
print(answer.summary())
answer.params
The resulting equation (the linear predictor, or 'Employee Turnover Score') would be:
Employee Turnover Score = avg_hrs_month*(0.061653) + review*(11.104666) + satisfaction*(2.488504) - 20.897861
The probability of leaving is obtained by passing this score through the logistic (sigmoid) function, p = e^score / (1 + e^score), which is exactly what the helper function below does. An example of the information we could get from this is the following (in the last chapter we will go deeper):
# Extract the fitted coefficients and define helper functions to interpret a prediction
coef = answer.params
def theAlarm(value):
    if (value >= 0) & (value < 0.25):
        toReturn = '\n The Employee is in the 1st Quadrant. \x1b[6;30;42m' + 'No actions should be taken.' + '\x1b[0m'
        return (toReturn)
    elif (value >= 0.25) & (value < 0.50):
        toReturn = '\n The Employee is in the 2nd Quadrant. \x1b[0;30;46m' + 'Pay attention to the employee.' + '\x1b[0m'
        return (toReturn)
    elif (value >= 0.50) & (value < 0.75):
        toReturn = '\n The Employee is in the 3rd Quadrant. \x1b[0;30;43m' + 'Actions should be taken.' + '\x1b[0m'
        return (toReturn)
    else:
        toReturn = '\n The Employee is in the 4th Quadrant. \x1b[0;37;41m' + 'Urgent Actions must be taken!' + '\x1b[0m'
        return (toReturn)

def getTurnOver(coef, avg_hrs_month, review, satisfaction):
    # Linear predictor (intercept + weighted features), then the logistic transformation
    y = coef[3] + coef[0]*avg_hrs_month + coef[1]*review + coef[2]*satisfaction
    p = np.exp(y) / (1 + np.exp(y))
    quadrant = theAlarm(p)
    print('The Employee is working: {} Hours on Average per Month, has a Review of: {}%, and has a Satisfaction level of: {}%. \nThis Employee has {}% chances of leaving the company. {}'.format(avg_hrs_month, review*100, satisfaction*100, np.round(p*100, 1), quadrant))
Now, let's try to predict the outcome for an employee with 80% satisfaction, a 50% review score, and who works 170 hours per month on average.
# An Employee with 80% Satisfaction, 50% Review, working 170 hours per month on average
averageOverHours=170
review=0.5
satisfaction=0.8
getTurnOver(coef, averageOverHours, review, satisfaction)
As we see, the model is predicting that it is more likely that the employee will stay in the company.
2.2. Option 2: We will rank several Machine Learning classification algorithms by their ROC-AUC score on the train set, pick the best ones, get their coefficients (most important features), tune their hyperparameters, and build the final model.
For this process, we will use:
# Load libraries
from sklearn import linear_model
#Cross Validation Techniques
from sklearn.model_selection import KFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
#Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
#Ensemble
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier
from xgboost import XGBClassifier
df = pd.read_csv('./data/employee_churn_data.csv')
#We label-encode the target
labelencoder_y = LabelEncoder()
df['left'] = labelencoder_y.fit_transform(df.left)
# Convert these variables into categorical variables
df["department"] = df["department"].astype('category').cat.codes
df["salary"] = df["salary"].astype('category').cat.codes
#We create a Validation set - Split-out validation dataset
#We work with the underlying values array: the last column is the target, the rest are the features
#We take the name of the features, excluding the target.
target_name = 'left'
namesFeatures=df.drop(columns=target_name).columns.values
array = df.values
X = array[:,0:-1]
y = array[:,-1]
# split into an 85:15 ratio
validation_size = 0.15
seed = 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=validation_size, random_state=seed,stratify=y)
num_folds = 5
seed = 7
number_repeats = 3
scoring='roc_auc'
#Select 1 for KFold, 2 for RepeatedStratifiedKFold
cvToUse=1
if cvToUse==1:
    cv = KFold(n_splits=num_folds,
               shuffle=True,
               random_state=seed
               )
else:
    cv = RepeatedStratifiedKFold(n_splits=num_folds,
                                 n_repeats=number_repeats,
                                 random_state=seed
                                 )
# Spot-Check Algorithms
models = []
models.append(('LR', LogisticRegression(random_state=seed)))
models.append(('LDA', LinearDiscriminantAnalysis())) #Doesn't allow random_state
models.append(('KNN', KNeighborsClassifier())) #Doesn't allow random_state
models.append(('CART', DecisionTreeClassifier(random_state=seed)))
models.append(('NB', GaussianNB())) #Doesn't allow random_state
models.append(('SVM', SVC(random_state=seed)))
# ensembles
models.append(('BDT-Ensemble', BaggingClassifier(random_state=seed)))
models.append(('AB-Ensemble', AdaBoostClassifier(random_state=seed)))
models.append(('GBC-Ensemble', GradientBoostingClassifier(random_state=seed)))
models.append(('XGB-Ensemble', XGBClassifier(random_state=seed,eval_metric='logloss'))) #I set this eval_metric for avoiding warning messages.
models.append(('RF-Ensemble', RandomForestClassifier(random_state=seed)))
models.append(('ET-Ensemble', ExtraTreesClassifier(random_state=seed)))
# evaluate each model in turn
resultsSimpler = []
namesSimpler = []
# Create DataFrame
tableResults = pd.DataFrame(columns=['Name', 'ROC-AUC(Train)', 'STD'])
print("Scoring used: ROC-AUC")
for name, model in models:
    cv_results = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)
    resultsSimpler.append(cv_results)
    namesSimpler.append(name)
    new_row = {'Name':name, 'ROC-AUC(Train)':cv_results.mean(), 'STD':cv_results.std()}
    tableResults = tableResults.append(new_row, ignore_index=True)
    msg = "{}: {} ({})".format(name, cv_results.mean(), cv_results.std())
    print(msg)
tableResults=tableResults.sort_values(by='ROC-AUC(Train)',ascending=False)
tableResults
Scoring used: ROC-AUC
LR: 0.6939307985118056 (0.015222599204236527)
LDA: 0.7189424885547057 (0.015509742665182055)
KNN: 0.7308053397060327 (0.0077046622782576835)
CART: 0.7855039148935751 (0.009134661911484994)
NB: 0.7127441184325531 (0.012938366031042732)
SVM: 0.61157105872254 (0.01786056670182524)
BDT-Ensemble: 0.9055489950151159 (0.004911097629818087)
AB-Ensemble: 0.8482198674988943 (0.0037887496497835365)
GBC-Ensemble: 0.9204620061188449 (0.004513217863409023)
XGB-Ensemble: 0.9219059521732735 (0.006389556574612421)
RF-Ensemble: 0.9250010593808922 (0.00569482185845855)
ET-Ensemble: 0.9154081004321297 (0.005482901806565513)
This table shows that the Random Forest Classifier scored best, followed by Extreme Gradient Boosting and the Gradient Boosting Classifier.
We can see that the Logistic Regression model reaches a ROC-AUC of around 0.69 on the train set, which is not a great score but is acceptable. When testing the model against the test set, the score might drop; however, with some tuning of the algorithm's hyperparameters, we could make it rise again. In any case, we can expect our predictions in the last step to be around that score. Here, we will try to improve on it with other models.
We will select the three best-scoring algorithms and tune their hyperparameters.
tableResults.head(3)
tunedAlgorithmTable = pd.DataFrame(columns=['Name', 'ROC-AUC(Test)','Model','BestEstimator'])
2.2.1. Algorithm 1: Random Forest Classifier (RFC)
rfc = RandomForestClassifier(random_state=seed)
rfc.fit(X_train, y_train)
# estimate accuracy on validation dataset
predictions = rfc.predict(X_test)
print('The Initial Model ROC-AUC on the Test Set is:')
rfc_roc_auc = roc_auc_score(y_test, predictions)
print(rfc_roc_auc)
## Take the important Features ##
importances = rfc.feature_importances_
cm=confusion_matrix(y_test, predictions)
plt.figure(figsize=(6,3))
plt.title("Confusion Matrix")
sns.heatmap(cm, annot=True,fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()
The Initial Model ROC-AUC on the Test Set is:
0.8296546805405329
This model gives us a ROC-AUC score of almost 83%. Let's see if we can improve it.
Selecting Most Important Features
According to RFC, the most important features to explain the model are:
from sklearn.feature_selection import SelectFromModel
numbers=pd.Series(np.arange(0,len(importances)))
values=pd.Series(importances)
newDF=pd.DataFrame({'id':numbers,'value':values})
newDF.sort_values(by='value',ascending=False,inplace=True)
selectModel = SelectFromModel(rfc, prefit=True)
X_train_new = selectModel.transform(X_train)
X_test_new = selectModel.transform(X_test)
newQuantityOfFeatures=X_train_new.shape[1]
newDF=newDF.iloc[0:newQuantityOfFeatures,:]
theIds = newDF.id  # ids of the features kept by SelectFromModel
selectedFeatures=theIds.values
theDf = {'Name':namesFeatures, 'Ranking':importances}
tableFeatures = pd.DataFrame(theDf)
tableFeatures.sort_values(by='Ranking',ascending=False,inplace=True)
tableFeatures.head(len(theIds))
# Get the model's coefficients
coeff = pd.DataFrame({'feature_name': namesFeatures, 'model_coefficient': importances.transpose().flatten()})
coeff = coeff.sort_values('model_coefficient',ascending=False)
coeff_top = coeff.head(10)
coeff_bottom = coeff.tail(10)
plt.figure().set_size_inches(10, 6)
fg3 = sns.barplot(x='feature_name', y='model_coefficient',data=coeff_top, palette="Blues_d")
fg3.set_xticklabels(rotation=35, labels=coeff_top.feature_name)
plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.subplots_adjust(bottom=0.4)
As we can see, the Random Forest model returns the same three features as the Logistic Regression did, but weighted differently. In this case, the order is:
Tuning RFC
# define models
rfc = RandomForestClassifier(random_state=seed)
param_grid = {'n_estimators' : [1100],
"min_samples_split" : [11],
'class_weight':["balanced"],
'max_depth': [None],
'random_state':[seed],
# 'max_features':['sqrt', 'log2'],
'min_samples_leaf': [1]
}
grid_search = GridSearchCV( estimator=rfc,
param_grid=param_grid,
n_jobs=-1,
cv=cv,
scoring=scoring
)
grid_result = grid_search.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print('Best Estimator: ',grid_result.best_estimator_)
best_hyperparams=grid_result.best_params_
best_cv_score=grid_result.best_score_
Best: 0.928198 using {'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 11, 'n_estimators': 1100, 'random_state': 7}
Best Estimator: RandomForestClassifier(class_weight='balanced', min_samples_split=11,
n_estimators=1100, random_state=7)
#model
rfc = RandomForestClassifier(**grid_result.best_params_)
# Estimate accuracy on validation dataset
rfc.fit(X_train, y_train)
predictions = rfc.predict(X_test)
print('The FINAL Model ROC-AUC score on the Test Set is: ')
rfc_roc_auc = roc_auc_score(y_test, predictions)
print(rfc_roc_auc)
print(confusion_matrix(y_test, predictions))
new_row = { 'Name':'RF-Ensemble',
'ROC-AUC(Test)':roc_auc_score(y_test, predictions),
'Model':rfc,
'BestEstimator':grid_result.best_estimator_}
tunedAlgorithmTable = tunedAlgorithmTable.append(new_row, ignore_index=True)
The FINAL Model ROC-AUC score on the Test Set is:
0.8505989126994999
[[926 87]
[ 89 329]]
The final score is 85%. We have improved it!
2.2.2. Algorithm 2: Extreme Gradient Boosting (XGB)
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix
xgb = XGBClassifier(random_state=seed,eval_metric='logloss')
xgb.fit(X_train, y_train)
# estimate accuracy on validation dataset
predictions = xgb.predict(X_test)
print('The Initial Model ROC-AUC on the Test Set is:')
xgb_roc_auc = roc_auc_score(y_test, predictions)
print(xgb_roc_auc)
## Take the important Features ##
importances = xgb.feature_importances_
cm=confusion_matrix(y_test, predictions)
plt.figure(figsize=(6,3))
plt.title("Confusion Matrix")
sns.heatmap(cm, annot=True,fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()
The Initial Model ROC-AUC on the Test Set is:
0.8280983577133626
This model gives us a ROC-AUC score of almost 83%. Let's see if we can improve it.
Selecting Most Important Features
According to XGB, the most important features to explain the model are:
from sklearn.feature_selection import SelectFromModel
numbers=pd.Series(np.arange(0,len(importances)))
values=pd.Series(importances)
newDF=pd.DataFrame({'id':numbers,'value':values})
newDF.sort_values(by='value',ascending=False,inplace=True)
selectModel = SelectFromModel(xgb, prefit=True)
X_train_new = selectModel.transform(X_train)
X_test_new = selectModel.transform(X_test)
newQuantityOfFeatures=X_train_new.shape[1]
newDF=newDF.iloc[0:newQuantityOfFeatures,:]
theIds = newDF.id  # ids of the features kept by SelectFromModel
selectedFeatures=theIds.values
theDf = {'Name':namesFeatures, 'Ranking':importances}
tableFeatures = pd.DataFrame(theDf)
tableFeatures.sort_values(by='Ranking',ascending=False,inplace=True)
tableFeatures.head(len(theIds))
# Get the model's coefficients
coeff = pd.DataFrame({'feature_name': namesFeatures, 'model_coefficient': importances.transpose().flatten()})
coeff = coeff.sort_values('model_coefficient',ascending=False)
coeff_top = coeff.head(10)
coeff_bottom = coeff.tail(10)
plt.figure().set_size_inches(10, 6)
fg3 = sns.barplot(x='feature_name', y='model_coefficient',data=coeff_top, palette="Blues_d")
fg3.set_xticklabels(rotation=35, labels=coeff_top.feature_name)
plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.subplots_adjust(bottom=0.4)
As we can see, the Extreme Gradient Boosting model returns the same three features as the Logistic Regression did, but weighted differently. In this case, the order is:
Tuning XGB
# define models and parameters
xgb = XGBClassifier(random_state=seed)
# As configured, this grid takes around 1 minute to execute.
param_grid = { 'n_estimators':[70,80,90,100,110],
'subsample':[0.8, 1],
'max_depth':[5,6,7],
'learning_rate':[0.300000012],
'min_child_weight':[1,2],
'gamma':[0],
'colsample_bytree':[0.8, 1],
'eval_metric':['logloss'],
}
grid_search = GridSearchCV( estimator=xgb,
param_grid=param_grid,
n_jobs=-1,
cv=cv,
scoring=scoring
)
grid_result = grid_search.fit(X_train, y_train)
#grid_result = grid_search.fit(X_train_new, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print('Best Estimator: ',grid_result.best_estimator_)
best_hyperparams=grid_result.best_params_
best_cv_score=grid_result.best_score_
#model
xgb = XGBClassifier(**grid_result.best_params_)
# Estimate accuracy on validation dataset
xgb.fit(X_train, y_train)
predictions = xgb.predict(X_test)
print('The FINAL Model ROC-AUC score on the Test Set is: ')
xgb_roc_auc = roc_auc_score(y_test, predictions)
print(xgb_roc_auc)
print(confusion_matrix(y_test, predictions))
new_row = { 'Name':'XGB-Ensemble',
            'ROC-AUC(Test)':roc_auc_score(y_test, predictions),
            'Model':xgb,
            'BestEstimator':grid_result.best_estimator_}
tunedAlgorithmTable = tunedAlgorithmTable.append(new_row, ignore_index=True)
The FINAL Model ROC-AUC score on the Test Set is:
0.8334522027045537
[[947 66]
[112 306]]
The final score is 83.3%. We have improved it!
2.2.3. Algorithm 3: Gradient Boosting Classifier (GBC)
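As a baseline, we fit a GradientBoostingClassifier with default parameters, following the same pattern as the RFC and XGB baselines above (a sketch; it defines the gbc model and the importances used in the next steps):
# Baseline Gradient Boosting Classifier, mirroring the RFC and XGB baselines above
gbc = GradientBoostingClassifier(random_state=seed)
gbc.fit(X_train, y_train)
# estimate the score on the validation dataset
predictions = gbc.predict(X_test)
print('The Initial Model ROC-AUC on the Test Set is:')
gbc_roc_auc = roc_auc_score(y_test, predictions)
print(gbc_roc_auc)
## Take the important Features ##
importances = gbc.feature_importances_
cm = confusion_matrix(y_test, predictions)
plt.figure(figsize=(6,3))
plt.title("Confusion Matrix")
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.ylabel("Actual Values")
plt.xlabel("Predicted Values")
plt.show()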
The Initial Model ROC-AUC on the Test Set is:
0.8286675137093383
This model gives us a ROC-AUC score of almost 83%. Let's see if we can improve it.
Selecting Most Important Features
According to GBC, the most important features to explain the model are:
from sklearn.feature_selection import SelectFromModel
numbers=pd.Series(np.arange(0,len(importances)))
values=pd.Series(importances)
newDF=pd.DataFrame({'id':numbers,'value':values})
newDF.sort_values(by='value',ascending=False,inplace=True)
selectModel = SelectFromModel(gbc, prefit=True)
X_train_new = selectModel.transform(X_train)
X_test_new = selectModel.transform(X_test)
newQuantityOfFeatures=X_train_new.shape[1]
newDF=newDF.iloc[0:newQuantityOfFeatures,:]
theIds = newDF.id  # ids of the features kept by SelectFromModel
selectedFeatures=theIds.values
theDf = {'Name':namesFeatures, 'Ranking':importances}
tableFeatures = pd.DataFrame(theDf)
tableFeatures.sort_values(by='Ranking',ascending=False,inplace=True)
tableFeatures.head(len(theIds))
# Get the model's coefficients
coeff = pd.DataFrame({'feature_name': namesFeatures, 'model_coefficient': importances.transpose().flatten()})
coeff = coeff.sort_values('model_coefficient',ascending=False)
coeff_top = coeff.head(10)
coeff_bottom = coeff.tail(10)
plt.figure().set_size_inches(10, 6)
fg3 = sns.barplot(x='feature_name', y='model_coefficient',data=coeff_top, palette="Blues_d")
fg3.set_xticklabels(rotation=35, labels=coeff_top.feature_name)
plt.xlabel('Feature')
plt.ylabel('Coefficient')
plt.subplots_adjust(bottom=0.4)
As we can see, the Gradient Boosting model returns the same three features as the Logistic Regression did, but weighted differently. In this case, the order is:
Tuning the GBC
# define models and parameters
gbc = GradientBoostingClassifier()
param_grid = { 'n_estimators' : [80,90],
"learning_rate" : [0.08,0.1],
'subsample':[1.0],
'max_depth': [6,7],
'random_state':[seed],
'loss': ['deviance'],
'max_features':[None],
'min_samples_leaf': [1, 2],
}
grid_search = GridSearchCV( estimator=gbc,
param_grid=param_grid,
n_jobs=-1,
cv=cv,
scoring=scoring
)
grid_result = grid_search.fit(X_train, y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
print('Best Estimator: ',grid_result.best_estimator_)
best_hyperparams=grid_result.best_params_
best_cv_score=grid_result.best_score_
Best: 0.927736 using {'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 6, 'max_features': None, 'min_samples_leaf': 2, 'n_estimators': 80, 'random_state': 7, 'subsample': 1.0}
Best Estimator: GradientBoostingClassifier(max_depth=6, min_samples_leaf=2, n_estimators=80,
random_state=7)
#Model
gbc = GradientBoostingClassifier(**grid_result.best_params_)
# Estimate accuracy on validation dataset
gbc.fit(X_train, y_train)
predictions = gbc.predict(X_test)
print('The FINAL Model ROC-AUC score on the Test Set is: ')
gbc_roc_auc = roc_auc_score(y_test, predictions)
print(gbc_roc_auc)
print(confusion_matrix(y_test, predictions))
new_row = { 'Name':'GBC-Ensemble',
'ROC-AUC(Test)':roc_auc_score(y_test, predictions),
'Model':gbc,
'BestEstimator':grid_result.best_estimator_}
tunedAlgorithmTable = tunedAlgorithmTable.append(new_row, ignore_index=True)
The FINAL Model ROC-AUC score on the Test Set is:
0.8430215806949843
[[947 66]
[104 314]]
The final score is 84.3%. We have improved it!
2.2.4. Voting
We rank all the tuned algorithms by their final ROC-AUC score on the test set.
#From this table we will select the most promising algorithms
tunedAlgorithmTable.sort_values(by='ROC-AUC(Test)',inplace=True,ascending=False)
tunedAlgorithmTable.head(10)
We select the top 2 algorithms for the Voting Ensemble
In this process, we combine the best two algorithms, get their joint predictions, and see whether, combined, they do better than the best estimator's predictions alone.
#Select the Number of the TOP Algorithms to use
n_Algorithms=2
# Voting Ensemble for Classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
nameVoting=[]
bestEstimatorVoting=[]
for i in range(n_Algorithms):
    nameVoting.append(tunedAlgorithmTable.iloc[i,:].Name)
    bestEstimatorVoting.append(tunedAlgorithmTable.iloc[i,:].BestEstimator)
newList=zip(nameVoting,bestEstimatorVoting)
newList=list(newList)
theEstimators = newList
VotingPredictor = VotingClassifier( estimators = theEstimators,
voting='soft',
n_jobs = -1)
VotingPredictor = VotingPredictor.fit(X_train, y_train)
scores = cross_val_score( VotingPredictor,
X_train,
y_train,
cv = cv,
n_jobs = -1,
scoring = scoring)
print("The algorithms used are:")
for i in range(n_Algorithms):
    print("{}".format(nameVoting[i]))
print('\nThe Summary')
print(round(np.mean(scores)*100, 2))
The algorithms used are:
RF-Ensemble
GBC-Ensemble
The Summary
92.97
predictions = VotingPredictor.predict(X_test)
voting_roc_auc=roc_auc_score(y_test, predictions)
print('The FINAL Model ROC-AUC on the Test Set is: ',voting_roc_auc)
if voting_roc_auc>tunedAlgorithmTable.iloc[0,1]:
    print('\nThe top {} combination of models ({}) do better than the best model ({}) alone.'.format(n_Algorithms,theEstimators,tunedAlgorithmTable.iloc[0,0]))
    predictorToUse=VotingPredictor
else:
    print('\nThe best model ({}) alone does better than the Top {} combination of models ({})'.format(tunedAlgorithmTable.iloc[0,0],n_Algorithms,nameVoting))
    predictorToUse=bestEstimatorVoting[0]
The FINAL Model ROC-AUC on the Test Set is: 0.8466679576982482
The best model (RF-Ensemble) alone does better than the Top 2 combination of models (['RF-Ensemble', 'GBC-Ensemble'])
ROC Graph
# Create ROC Graph
from sklearn.metrics import roc_curve
rfc_fpr, rfc_tpr, rfc_thresholds = roc_curve(y_test, rfc.predict_proba(X_test)[:,1])
xgb_fpr, xgb_tpr, xgb_thresholds = roc_curve(y_test, xgb.predict_proba(X_test)[:,1])
gbc_fpr, gbc_tpr, gbc_thresholds = roc_curve(y_test, gbc.predict_proba(X_test)[:,1])
voting_fpr, voting_tpr, voting_thresholds = roc_curve(y_test, VotingPredictor.predict_proba(X_test)[:,1])
plt.figure()
# Plot RFC ROC
plt.plot(rfc_fpr, rfc_tpr, label='Random Forest Classifier (area = %0.4f)' % rfc_roc_auc)
# Plot XGB ROC
plt.plot(xgb_fpr, xgb_tpr, label='XGBoost (area = %0.4f)' % xgb_roc_auc)
# Plot GBC ROC
plt.plot(gbc_fpr, gbc_tpr, label='Gradient Boost (area = %0.4f)' % gbc_roc_auc)
# Plot Voting ROC
plt.plot(voting_fpr, voting_tpr, label='Voting Ensemble (area = %0.4f)' % voting_roc_auc)
# Plot Base Rate ROC
plt.plot([0,1], [0,1], 'k--', label='Base Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Graph')
plt.legend(loc="lower right")
plt.show()
As we can see, the best estimator alone (Random Forest) does better than the best two combined, or any other single estimator.
2.2.5. Getting Information from the best model (Random Forest)
As we have seen, the top three features in the three models developed are the same as those used in the Logistic Regression model from the previous point, although ranked differently in terms of importance. Nevertheless, since the Random Forest model got the better ROC-AUC score, let's try to forecast some possible outcomes under different scenarios.
> It's important to mention that, with the predict call used here, we don't get a turnover percentage, but only whether the employee is expected to leave the company or not.
#As the three main features explain most of the model, we will get the means for the rest of the features.
department=df.department.mean()
promoted=df.promoted.mean()
projects=df.projects.mean()
salary=df.salary.mean()
tenure=df.tenure.mean()
bonus=df.bonus.mean()
#Scenario 1
averageOverHours1=170
review1=0.6
satisfaction1=0.8
#Scenario 2
averageOverHours2=180
review2=0.7
satisfaction2=0.8
#Scenario 3
averageOverHours3=188
review3=0.8
satisfaction3=0.8
dataSetToTry=[ [department,promoted,review1,projects,salary,tenure, satisfaction1,bonus,averageOverHours1 ],
[department,promoted,review2,projects,salary,tenure, satisfaction2,bonus,averageOverHours2 ],
[department,promoted,review3,projects,salary,tenure, satisfaction3,bonus,averageOverHours3 ],]
2.2.5.1. Forecasting with Logistic Model equation
Applying the equation we got in point 2.1, we forecast:
getTurnOver(coef, averageOverHours1, review1, satisfaction1)
getTurnOver(coef, averageOverHours2, review2, satisfaction2)
getTurnOver(coef, averageOverHours3, review3, satisfaction3)
2.2.5.2. Forecasting with Random Forest Model
def getMessage(thePred):
    for i in range(len(thePred)):
        if thePred[i]==0:
            alert='stays in the company'
        else:
            alert='leaves the company'
        message='The outcome is that employee {} {}'.format(i,alert)
        print(message)
thePrediction=predictorToUse.predict(dataSetToTry)
getMessage(thePrediction)
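If the Command Board also needs a probability rather than only a stay/leave label, scikit-learn classifiers such as the tuned Random Forest (and the soft-voting ensemble) expose predict_proba. A minimal sketch, reusing the three scenarios defined above:
# Class probabilities for the same three scenarios: column 1 is P(left = 1), i.e. the chance of leaving
probabilities = predictorToUse.predict_proba(dataSetToTry)[:, 1]
for i, p in enumerate(probabilities):
    print('Scenario {}: estimated probability of leaving = {}%'.format(i, np.round(p * 100, 1)))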
Summary
Both models are good enough, but according to the ROC-AUC score, the Random Forest one scores better. Nevertheless, at the end of this report, we will show how the Logistic Regression equation can be used to build a Command Board.
Part 4 will be the last one, where we will do the Prescriptive Analysis, closing all the ideas, and proposing a Command Board.
---------------------------------------------------------------------------------------------------------------
This case is part of a DataCamp competition: Employees TurnOver - DataCamp
The full report is available in my GitHub repository: GitHub - vascoarizna
Here you will find not only the full explanation with the graphics and the solution but also the Jupyter Notebook codes in case you want to take anything for you.
Author: Ignacio Ariznabarreta - JIAF Consulting