Part 2: Predicting results, and working with Command Boards using Machine Learning

Part 2: Predicting results, and working with Command Boards using Machine Learning

Analyzing the Employees TurnOver: Part 2

If you haven't seen Part 1, please refer to Part 1: Introduction

As an introductory overview of what we should do from an analytical perspective, we must take into Data Science follows the next steps:

  1. Understanding the Problem
  2. Exploratory Data Analysis
  3. Data Visualization
  4. Feature Engineering (Preparing the Dataset for the Machine Learning models)
  5. Model Building
  6. Model Deployment

In this report, we will go through all these steps (the last one we will just simulate some data).

After introducing the company's concern, understanding the problem, and showing the strategy we will take, it's time to start with the Exploratory Data Analysis.

A. Exploratory Analysis

Importing

import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
from pandas.plotting import scatter_matrix        


df = pd.read_csv('dataset/employee_churn_data.csv')df.head()        
No alt text provided for this image


Data Exploration

df.info()        
No alt text provided for this image

As we mentioned in Part 1: Introduction of the Report, the variables (features) found are:

  • 'department' -> The department where the employees works/used to work.
  • 'promoted' -> If the employee was promoted in the past 2 years or not.
  • 'review' -> From 0 to 1, the review score the employee received.
  • 'projects' -> The number of projects each employee has been working on.
  • 'salary' -> The level of salary. In this case, the variable is categorical and it can be: 'low', 'medium', and 'high'.
  • 'tenure' -> The employee number of years in the company.
  • 'satisfaction' -> From 0 to 1, the level of satisfaction of the employee (this was captured through surveys).
  • 'bonus' -> Whether the employee has received any bonus or not.
  • 'avr_hrs_mont' -> The average worked hours per month of the employee.
  • 'left' -> If the employee left the company or not. This is our so-called 'target'. This is the value we will try to predict later.

A.1. Info Summary

  • We have 9540 entries, in 10 columns
  • We have no NULL values and all the attributes seem to have the correct DataType

df.describe()        
No alt text provided for this image

Statistical Summary

  • Satisfaction: its mean is 50%
  • Average Hours Worked per month: 184hs
  • In most countries, the legal working schedule per day is (8+1)hs = 9hs. Monthly, this means 180hs.
  • Here, we are saying that the mean is already above the legal number (above this number, the Company should extra compensate the employee). And, on top of that, the first 25th percentil already pass this quantity of hours. This means that more than 75% of the Company works more than what the law establishes (in a different description, we would have to analyze if the employee chose to do the overtime hours or if they have to, and if they are well retributed for these extra-hours or not. As this might impact on the employee's satisfaction)
  • Promotion: Only 3% of the employees received any kind of promotion in the past two years.
  • Years in the Company (tenure): the average time in the Company is 6.5 years.
  • Bonus: Only 21% of the employees received any kind of Bonus.
  • Projects: The average # of projects managed by an employee is 3.

A.2. TurnOver (target value)

df.left.value_counts()
no     6756
yes    2784
Name: left, dtype: int64        


sns.countplot(x='left', data=df)
plt.xticks(rotation=45)
plt.show()        
No alt text provided for this image

Overview of summary (Turnover V.S. Non-turnover)

turnover_Summary = df.groupby('left')
turnOverSummaryMean=turnover_Summary.mean()
turnOverSummaryMean        
No alt text provided for this image
turnover_rate = df.left.value_counts() / df.shape[0]
turnover_rate
no     0.708176
yes    0.291824
Name: left, dtype: float64        

  • Looks like about 71% of employees stayed and 29% of employees left.

A.3.Correlation

Is there any correlation between the variables?

#For analysis purposes, we label-encode the target
dfNumeric=df.copy()
dfNumeric['leftNumeric']=0
dfNumeric.loc[dfNumeric['left']=='yes','leftNumeric']=1
dfNumeric.drop(columns='left',axis=1,inplace=True)
dfNumeric.rename(columns={'leftNumeric':'left'},inplace=True)

firstCol=dfNumeric.loc[:,'left']
restDF=dfNumeric.drop(columns='left',axis=1)
dfNumericCorr=pd.concat([firstCol,restDF], axis=1)
corrmat = dfNumericCorr.corr()
corrmat = (corrmat)
f, ax = plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, annot = True, vmax=.8, square=True)
plt.title('Heatmap of Correlation Matrix')
corrmat        
No alt text provided for this image
No alt text provided for this image


A.3.1. Correlation against the Target

dfNumeric.drop(columns='left',axis=1).corrwith(dfNumeric.loc[:,'left'])
promoted        -0.036777
review           0.304294
projects        -0.012408
tenure           0.010521
satisfaction    -0.009721
bonus           -0.011485
avg_hrs_month    0.009008
dtype: float64        

A.3.2. Summary of Correlations

Positive Correlations:

  • There is a strong positive relation between 'tenure' and 'avg_hrs_month.' This could mean that the longer the employee is in the company, the more hours he/she works.
  • Also, there is a curious case in the relation between 'review' and 'left' as it looks like the higher the review, the higher the possibilities of leaving the company.

Negative Correlations:

  • There is a low-level negative relation between 'review' and 'avg_hrs_month'. This might mean that the lower the review, the employees tend to work more hours.
  • There is a negative relation between 'review' and 'satisfaction'. This means that the higher the review of the employee is, the lower the satisfaction is. This is strange, actually. Maybe the satisfaction level is not acquiring the real employees' satisfaction.

A.4.Plots

A.4.1. Distribution

# Graph Employee Satisfaction

# Sort the dataframe by target
target_0 = dfNumeric[dfNumeric.left==0]
target_1 = dfNumeric[dfNumeric.left==1]

sns.distplot(target_0[['satisfaction']], rug=True,kde=False,label='Stayed')
sns.distplot(target_1[['satisfaction']], kde=False, rug=True,label='Left')
plt.title('Employees Satisfaction')
plt.legend()

plt.show()        
No alt text provided for this image

The distribution tends to be a normal dist.

  • The distribution of the employees who left its 'inside' to one of the employees who Stayed, but it is slightly right-skewed.

# Graph Employee Review

# Sort the dataframe by target
target_0 = dfNumeric[dfNumeric.left==0]
target_1 = dfNumeric[dfNumeric.left==1]

sns.distplot(target_0[['review']], rug=True,kde=False,label='Stayed')
sns.distplot(target_1[['review']], kde=False, rug=True,label='Left')
plt.title('Employees Review')
plt.legend()

plt.show()        
No alt text provided for this image

  • The distribution of the employees who stayed tends to be a normal dist.
  • The distribution of the employees who left it's left-skewed.

# Graph Employee avg_hrs_month

# Sort the dataframe by target
target_0 = dfNumeric[dfNumeric.left==0]
target_1 = dfNumeric[dfNumeric.left==1]

sns.distplot(target_0[['avg_hrs_month']], rug=True,kde=False,label='Stayed')
sns.distplot(target_1[['avg_hrs_month']], kde=False, rug=True,label='Left')
plt.title('Employees Average Hrs Worked per Month')
plt.legend()

plt.show()        
No alt text provided for this image

  • Here, we see that the maximum values of Average Hours worked per Month belong to the employees who stayed.
  • However, there is an important peak between 185 and 190 hours from the employees who left the company.

A.4.2. Scatter

# scatter plot matrix

sns.set()
sns.pairplot(df, size = 2.5)
plt.show()        
No alt text provided for this image


A.5. Correlations between pair of values

A.5.1. Relation between Salary and Target (employees who left)

ax = plt.subplots(figsize=(15, 4))
sns.countplot(x="salary", hue='left', data=df).set_title('Employee Salary Turnover')
plt.show()        
No alt text provided for this image

  • There are no important remarks in regards to the Salary distribution

A.5.2. Relation between Department and Target (employees who left)

dfSalaryByDepartment=df
dfSalaryByDepartment['salaryNumeric'] = 0
dfSalaryByDepartment.loc[dfSalaryByDepartment['salary']=='medium','salaryNumeric']=1
dfSalaryByDepartment.loc[dfSalaryByDepartment['salary']=='high','salaryNumeric']=2

dfSalaryByDepartment=df.groupby(['department','salary','salaryNumeric']).count()/df.groupby(['department']).count()
dfSalaryByDepartment=dfSalaryByDepartment[['bonus']]

dfSalaryByDepartment.reset_index(inplace=True)
dfSalaryByDepartment.iloc[:,[0,1,3]]        
No alt text provided for this image
sns.barplot(x='department', y='bonus',hue='salary',data=dfSalaryByDepartment.sort_values(by='salaryNumeric')).set_title('Employee Department Quantity')
plt.xticks(rotation=90)
plt.show()        
No alt text provided for this image


totalPerDepartment=df.groupby(['department'])[['left']].count()
totalPerDepartment.rename(columns={'left':'totalEmployees'},inplace=True)
totalPerDepartment=totalPerDepartment.reset_index().sort_values(by='totalEmployees',ascending=False)
totalPerDepartment        
No alt text provided for this image


# Employee distribution
sns.barplot(x='department', y='totalEmployees',data=totalPerDepartment).set_title('Employee Department Quantity')
 
# Rotate x-labels
plt.xticks(rotation=90)
plt.show()        
No alt text provided for this image
numberOfLeftOver=dfNumeric.groupby('department')[['left']].sum()
totalPerDep=totalPerDepartment.rename(columns={'department':'departmentTotal'}).sort_values(by='departmentTotal')
numberOfLeftOver=numberOfLeftOver.reset_index().sort_values(by='left',ascending=False)
numberOfLeft2=numberOfLeftOver.sort_values(by='department')
departmentsDF=pd.concat([numberOfLeft2,totalPerDep],axis=1)
departmentsDF.drop(columns=['departmentTotal'],inplace=True)
departmentsDF['absoluteRatio']=departmentsDF['left']/departmentsDF['totalEmployees'].sum()
departmentsDF['relativeRatio']=departmentsDF['left']/departmentsDF['totalEmployees']
departmentsDF        
No alt text provided for this image


f, ax = plt.subplots(figsize=(15, 5))
sns.countplot(x="department", hue='left', data=df).set_title('Employee Department Turnover')
plt.xticks(rotation=90)
plt.show()        
No alt text provided for this image

  • The sales, retail, and engineering department were the top 3 employee turnover departments, in absolute numbers.
  • In terms of TurnOver Ratio: IT is the first one, then logistics and then marketing.
  • Why is it important also to consider the ratio? The sales department is the biggest one and had 537 persons who left the company (turnover ratio: 0.285183). This number is already bigger than the whole IT department, which had 110 employees who left the company. Now, taking into consideration that the total # of employees the IT department had was 356, the turnover ratio (0.308989) is bigger than the one from the sales department.
  • The finance department had the smallest amount of turnover both in terms of absolute and relative levels.

Analyzing by Absolute Values/Ratio

dfNumeric[dfNumeric.department=='sales'].groupby('left').mean()        
No alt text provided for this image

We see that, on average, the people who left the Company had a lower bonus than one who stayed, but a better review, which might mean they were good employees.

  • The bigger the review is, the more the employees tend to leave the Company

dfNumeric[dfNumeric.department=='finance'].groupby('left').mean()        
No alt text provided for this image

We see that the employees from Finance (the sector with the smallest turnover both in absolute and relative levels) who left the Company had the highest review than the ones who stayed, and also the biggest bonus, but they were less satisfied. Anyway, both groups have a bonus bigger than the Company's average one.

  • The satisfaction level is important
  • The bigger the review is, the more the employees tend to leave the Company


Analyzing by Relative Levels

Analyzing the employees from IT

dfNumeric[dfNumeric.department=='IT'].groupby('left').mean()        
No alt text provided for this image

In terms of relative TurnOver, the employees who left the Company

  • tended to work more hours per month.
  • had a bigger review
  • had been promoted a lot less

A.5.3. Relation between # of Projects and Target (employees who left)

ax = sns.countplot(x="projects", hue="left", data=df)
plt.show()        
No alt text provided for this image

  • No remarks from this analysis

A.5.4. Relation between Reviews and Target (employees who left)

turnOverSummaryMean        
No alt text provided for this image


target_0=df['left'] == 'no'
target_1=df['left'] == 'yes'
# Kernel Density Plot
fig = plt.figure(figsize=(15,4),)
ax=sns.kdeplot(df.loc[(target_0),'review'] , color='b',shade=True,label='Stayed')
ax=sns.kdeplot(df.loc[(target_1),'review'] , color='r',shade=True, label='Left')
plt.title('Employee Review Distribution - Left the Company V.S. Stayed')
plt.legend()
plt.show()        
No alt text provided for this image
ax = sns.boxplot(x="left", y="review", data=df)        
No alt text provided for this image


Breaking the Analysis in Ranges

def splitRange(theDf,column,minValue,maxValue,numberOfSplits=0,theSteps=0):
    """This function receives a Dataframe, the column that you want to create Ranges from, the Starting Value, the Finishing Value
- the Number of Splits you would like to have, or the steps you would like your range to have.
The first 4 values are mandatory, and then you need to enter either of the last two values."""

    if (theSteps==0) & (numberOfSplits==0):
        print ('Enter the Number or Steps ot the Number of Splits.')
        return
    if numberOfSplits!=0:
        theSteps=maxValue/numberOfSplits

    for i in np.arange(minValue,maxValue,theSteps):
        if i == minValue:
            theDf.loc[(theDf[column]==i),column+'Range']=(str(np.round(i,1))+' - '+str(np.round(i+theSteps,1)))
        
        theDf.loc[(theDf[column]>i) & (theDf[column]<=(i+theSteps)),column+'Range']=(str(np.round(i,1))+' - '+str(np.round(i+theSteps,1)))

        if i+theSteps == maxValue:
            theDf.loc[(theDf[column]==np.float(maxValue)),column+'Range']=(str(np.round(i,1))+' - '+str(np.round(i+theSteps,1)))
    return theDf
tuenumberOfSplits=10
theColumn='review'
theMinValue=0
theMaxValue=1

rangedDf=splitRange(dfNumeric,theColumn,theMinValue,theMaxValue,tuenumberOfSplits)
rangedDf.sort_values(theColumn,inplace=True)
sns.countplot(x='reviewRange',hue='left',data=rangedDf)
plt.xticks(rotation=45)
plt.show()        
No alt text provided for this image
rangedDf.groupby('reviewRange').mean()        
No alt text provided for this image


sns.scatterplot(x='review',y='tenure',hue='left',data=rangedDf)
plt.show()        
No alt text provided for this image

  • The mean of the employees who left the company was 0.69.
  • We can see that the higher the review score, the more chances the employee leave the company
  • On top of that, employees with a review above 0.8 leave the company 90% of the time ((0.8+1)/2)


A.5.5. Relation between Average Hours worked per Month and Target (employees who left)

# Kernel Density Plot
fig = plt.figure(figsize=(15,4),)
ax=sns.kdeplot(df.loc[(target_0),'avg_hrs_month'] , color='b',shade=True,label='Stayed')
ax=sns.kdeplot(df.loc[(target_1),'avg_hrs_month'] , color='r',shade=True, label='Left')
plt.title('Employee Average Hours worked per Month Distribution - Left the Company V.S. Stayed')
plt.legend()
plt.show()        
No alt text provided for this image

  • For the employees who stayed in the company, there is a right skewness: the mean is smaller than the median.
  • For the employees who left the company, there is a clear left skewness. In this case, they tend to work many more hours than the average of the company.
  • In fact, there is a big peak between 185 and 190 hours.

In general, in all the company the employees tend to do overtime hours.

total=df.groupby('left')[['department']].count()
overTimeHours=df[(df.avg_hrs_month>=185) & (df.avg_hrs_month<=190)]
overTimeHours.groupby('left')[['department']].count()/total        
No alt text provided for this image

  • 56% of the people who left the company were doing between 185 and 190 hs per month. This means between 9:15 and 9:30 hours per day.

Breaking the Analysis in Ranges

tuenumberOfSplits=0
theColumn='avg_hrs_month'
theMinValue=170
theMaxValue=205
steps=5

rangedDf=splitRange(dfNumeric,theColumn,theMinValue,theMaxValue,tuenumberOfSplits,steps)
rangedDf.sort_values(theColumn,inplace=True)
sns.countplot(x='avg_hrs_monthRange',hue='left',data=rangedDf)
plt.xticks(rotation=45)
plt.show()        
No alt text provided for this image
rangedAVGDf=rangedDf.groupby('avg_hrs_monthRange').mean()
rangedAVGDf.reset_index(inplace=True)
rangedAVGDf        
No alt text provided for this image


A.5.6. Relation between Satisfaction and Target (employees who left)

# Kernel Density Plot
fig = plt.figure(figsize=(15,4),)
ax=sns.kdeplot(df.loc[(target_0),'satisfaction'] , color='b',shade=True,label='Stayed')
ax=sns.kdeplot(df.loc[(target_1),'satisfaction'] , color='r',shade=True, label='Left')
plt.title('Employee Average Hours worked per Month Distribution - Left the Company V.S. Stayed')
plt.legend()
plt.show()        
No alt text provided for this image

  • For the employees that stayed in the company, the distribution is slightly normal.
  • However, for the employees who left the company, we can see a small right skewness, and we can see a peak below the mean.

Breaking the Analysis in Ranges

tuenumberOfSplits=10
theColumn='satisfaction'
theMinValue=0
theMaxValue=1
steps=0

rangedDf=splitRange(dfNumeric,theColumn,theMinValue,theMaxValue,tuenumberOfSplits,steps)
#The database has a an error where the row 1755 has a Satisfaction of 1.0000000000000002, when it should be 1.
#With this line, we modify the error
theSatisfaction=df.iloc[1755].satisfaction
df.index[rangedDf['satisfaction']==theSatisfaction]
rangedDf.iloc[9539,12]='0.9 - 1.0'
rangedDf.sort_values('satisfaction',inplace=True)
sns.countplot(x='satisfactionRange',hue='left',data=rangedDf)
plt.xticks(rotation=45)
plt.show()        
No alt text provided for this image
rangedDf.groupby('satisfactionRange').mean()        
No alt text provided for this image


A.5.7. Relation between Satisfaction and Review

#The database has a an error where the row 1755 has a Satisfaction of 1.0000000000000002, when it should be 1.
#With this line, we modify the error
theSatisfaction=df.iloc[1755].satisfaction
df.index[rangedDf['satisfaction']==theSatisfaction]
rangedDf.iloc[9539,12]='0.9 - 1.0'
sns.lineplot(x='reviewRange',y='tenure',data=rangedDf.sort_values('review'))
plt.xticks(rotation=45)
plt.show()        
No alt text provided for this image

  • The bigger the satisfaction is, the lower the review is. And vice versa. (with exception of the first steps of the range)

sns.scatterplot(x='satisfaction',y='review',hue='left',data=df)
plt.show()        
No alt text provided for this image

In terms of employees who left the company, we find two distinctive clusters:

  1. Employees with low satisfaction and high review: those might have been good employees but not happy in the company.
  2. Most of the employees who left were grouped with a satisfaction above the mean and with an average review.

A.5.8. Relation between Tenure (years in the company) and Target (employees who left)

f, ax = plt.subplots(figsize=(15, 5))
sns.countplot(x="tenure", hue='left', data=df).set_title('Employee years in the company Turnover')
plt.show()        
No alt text provided for this image


f, ax = plt.subplots(figsize=(15, 5))
sns.barplot(x="tenure", y="tenure", hue="left", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")
plt.show()        
No alt text provided for this image

  • We see that in the years 3 & 4, and 7 & 8 years in the company, the turnover ratio almost reaches a parity.
  • However, due to statistical significance, I would pay special attention to the years 7 and 8
  • This might mean that some of the employees who left the company might have been headhunted.

A.5.9. Relation between Tenure and Average Hours worked in the Month

sns.lineplot(x='avg_hrs_month',y='tenure',hue='left',data=df)
plt.xticks(rotation=45)
plt.show()        
No alt text provided for this image

  • Both the employees who stay as the ones who left tend to work more as they have a bigger tenure (more years in the company)

A.5.10 Relation between Review and Tenure

sns.lineplot(x='reviewRange',y='tenure',data=rangedDf.sort_values('review'))
plt.xticks(rotation=45)
plt.show()        
No alt text provided for this image

  • Also, we see that the employees who got the higher reviews were the employees with fewer years in the company


Preliminary Summary

  • The unique indicator found that directly (and clearly) impacts the outcome (if the employee leaves the Company or not) is the Review of the Employee. The higher the review, the more chances the employee has to leave the Company.
  • Linked to the Review indicator, we have the Satisfaction, as higher the review is, the lower the Satisfaction is (it's almost a perfect negative relation with the exception of the first Satisfaction's ranges).

Statistical Summary

  • Satisfaction: its mean is 50%.
  • Average Hours Worked per month: 184hs
  • Promotion: Only 3% of the employees received any kind of promotion in the past two years.
  • Years in the Company (tenure): the average time in the Company is 6.5 years.
  • Bonus: Only 21% of the employees received any kind of Bonus.
  • Projects: The average # of projects managed by an employee is 3.

Review

  • Employees with a review above 0.8 leave the Company 90% of the times
  • Employees who got the higher reviews were the employees with fewer years in the Company

TurnOver

  • about 71% of employees stayed, and 29% of employees left the Company in the last analyzed period.

Correlations

Positive Correlations

  • There is a strong positive relation between 'tenure' and 'avg_hrs_month'. This could mean that the longer the employee is in the Company, the more hours he/she works.
  • Also, there is a curious case in the relation between 'review' and 'left' as it looks like the higher the review, the higher the possibilities of leaving the Company.

Negative Correlations

  • There is a low-level negative relation between 'review' and 'avg_hrs_month'. This might mean that the lower the review, the employees tend to work more hours.
  • There is a negative relation between 'review' and 'satisfaction'. This means that the higher the review of the employee is, the lower the Satisfaction is. This is strange, actually. Maybe the satisfaction level is not acquiring the real employees' Satisfaction.


ANALYSIS OF THE EMPLOYEES FROM SALES

We see that, on average, the people who left the Company had a lower bonus than one who stayed, but a better review, which might mean they were good employees.

ANALYSIS OF THE EMPLOYEES FROM FINANCES

We see that the employees from Finance (the sector with the smallest turnover both in absolute and relative levels) who left the Company had the highest review than the ones who stayed, and also the biggest Bonus, but they were less satisfied. Anyway, both groups had a bonus bigger than the Company's average one.

ANALYSIS OF THE EMPLOYEES FROM IT

In terms of relative TurnOver, the employees who left the Company:

  • tended to work more hours per month.
  • had a bigger review
  • had been promoted a lot less


In the next post, we will start answering the specific questions, and creating & deploying the Machine Learning chosen model. Finally, we will finish with the specific insights.

Go to Part 3: Predictive Analysis


---------------------------------------------------------------------------------------------------------------

This case is part of a DataCamp competition: Employees TurnOver - DataCamp

The full report is available in my GitHub repository GitHub - vascoarizna

Here you will find not only the full explanation with the graphics and the solution but also the Jupyter Notebook codes in case you want to take anything for you.

Author: Ignacio Ariznabarreta - JIAF Consulting

要查看或添加评论,请登录

Jose Ignacio Ariznabarreta Fossati的更多文章

社区洞察

其他会员也浏览了