Part 2: Predicting results, and working with Command Boards using Machine Learning
Analyzing the Employees TurnOver: Part 2
If you haven't seen Part 1, please refer to Part 1: Introduction
As an introductory overview of what we should do from an analytical perspective, we must take into Data Science follows the next steps:
In this report, we will go through all these steps (the last one we will just simulate some data).
After introducing the company's concern, understanding the problem, and showing the strategy we will take, it's time to start with the Exploratory Data Analysis.
A. Exploratory Analysis
Importing
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
from pandas.plotting import scatter_matrix
df = pd.read_csv('dataset/employee_churn_data.csv')df.head()
Data Exploration
df.info()
As we mentioned in Part 1: Introduction of the Report, the variables (features) found are:
A.1. Info Summary
df.describe()
Statistical Summary
A.2. TurnOver (target value)
df.left.value_counts()
no 6756
yes 2784
Name: left, dtype: int64
sns.countplot(x='left', data=df)
plt.xticks(rotation=45)
plt.show()
Overview of summary (Turnover V.S. Non-turnover)
turnover_Summary = df.groupby('left')
turnOverSummaryMean=turnover_Summary.mean()
turnOverSummaryMean
turnover_rate = df.left.value_counts() / df.shape[0]
turnover_rate
no 0.708176
yes 0.291824
Name: left, dtype: float64
A.3.Correlation
Is there any correlation between the variables?
#For analysis purposes, we label-encode the target
dfNumeric=df.copy()
dfNumeric['leftNumeric']=0
dfNumeric.loc[dfNumeric['left']=='yes','leftNumeric']=1
dfNumeric.drop(columns='left',axis=1,inplace=True)
dfNumeric.rename(columns={'leftNumeric':'left'},inplace=True)
firstCol=dfNumeric.loc[:,'left']
restDF=dfNumeric.drop(columns='left',axis=1)
dfNumericCorr=pd.concat([firstCol,restDF], axis=1)
corrmat = dfNumericCorr.corr()
corrmat = (corrmat)
f, ax = plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, annot = True, vmax=.8, square=True)
plt.title('Heatmap of Correlation Matrix')
corrmat
A.3.1. Correlation against the Target
dfNumeric.drop(columns='left',axis=1).corrwith(dfNumeric.loc[:,'left'])
promoted -0.036777
review 0.304294
projects -0.012408
tenure 0.010521
satisfaction -0.009721
bonus -0.011485
avg_hrs_month 0.009008
dtype: float64
A.3.2. Summary of Correlations
Positive Correlations:
Negative Correlations:
A.4.Plots
A.4.1. Distribution
# Graph Employee Satisfaction
# Sort the dataframe by target
target_0 = dfNumeric[dfNumeric.left==0]
target_1 = dfNumeric[dfNumeric.left==1]
sns.distplot(target_0[['satisfaction']], rug=True,kde=False,label='Stayed')
sns.distplot(target_1[['satisfaction']], kde=False, rug=True,label='Left')
plt.title('Employees Satisfaction')
plt.legend()
plt.show()
The distribution tends to be a normal dist.
# Graph Employee Review
# Sort the dataframe by target
target_0 = dfNumeric[dfNumeric.left==0]
target_1 = dfNumeric[dfNumeric.left==1]
sns.distplot(target_0[['review']], rug=True,kde=False,label='Stayed')
sns.distplot(target_1[['review']], kde=False, rug=True,label='Left')
plt.title('Employees Review')
plt.legend()
plt.show()
# Graph Employee avg_hrs_month
# Sort the dataframe by target
target_0 = dfNumeric[dfNumeric.left==0]
target_1 = dfNumeric[dfNumeric.left==1]
sns.distplot(target_0[['avg_hrs_month']], rug=True,kde=False,label='Stayed')
sns.distplot(target_1[['avg_hrs_month']], kde=False, rug=True,label='Left')
plt.title('Employees Average Hrs Worked per Month')
plt.legend()
plt.show()
A.4.2. Scatter
# scatter plot matrix
sns.set()
sns.pairplot(df, size = 2.5)
plt.show()
A.5. Correlations between pair of values
A.5.1. Relation between Salary and Target (employees who left)
ax = plt.subplots(figsize=(15, 4))
sns.countplot(x="salary", hue='left', data=df).set_title('Employee Salary Turnover')
plt.show()
A.5.2. Relation between Department and Target (employees who left)
dfSalaryByDepartment=df
dfSalaryByDepartment['salaryNumeric'] = 0
dfSalaryByDepartment.loc[dfSalaryByDepartment['salary']=='medium','salaryNumeric']=1
dfSalaryByDepartment.loc[dfSalaryByDepartment['salary']=='high','salaryNumeric']=2
dfSalaryByDepartment=df.groupby(['department','salary','salaryNumeric']).count()/df.groupby(['department']).count()
dfSalaryByDepartment=dfSalaryByDepartment[['bonus']]
dfSalaryByDepartment.reset_index(inplace=True)
dfSalaryByDepartment.iloc[:,[0,1,3]]
sns.barplot(x='department', y='bonus',hue='salary',data=dfSalaryByDepartment.sort_values(by='salaryNumeric')).set_title('Employee Department Quantity')
plt.xticks(rotation=90)
plt.show()
totalPerDepartment=df.groupby(['department'])[['left']].count()
totalPerDepartment.rename(columns={'left':'totalEmployees'},inplace=True)
totalPerDepartment=totalPerDepartment.reset_index().sort_values(by='totalEmployees',ascending=False)
totalPerDepartment
# Employee distribution
sns.barplot(x='department', y='totalEmployees',data=totalPerDepartment).set_title('Employee Department Quantity')
# Rotate x-labels
plt.xticks(rotation=90)
plt.show()
numberOfLeftOver=dfNumeric.groupby('department')[['left']].sum()
totalPerDep=totalPerDepartment.rename(columns={'department':'departmentTotal'}).sort_values(by='departmentTotal')
numberOfLeftOver=numberOfLeftOver.reset_index().sort_values(by='left',ascending=False)
numberOfLeft2=numberOfLeftOver.sort_values(by='department')
departmentsDF=pd.concat([numberOfLeft2,totalPerDep],axis=1)
departmentsDF.drop(columns=['departmentTotal'],inplace=True)
departmentsDF['absoluteRatio']=departmentsDF['left']/departmentsDF['totalEmployees'].sum()
departmentsDF['relativeRatio']=departmentsDF['left']/departmentsDF['totalEmployees']
departmentsDF
f, ax = plt.subplots(figsize=(15, 5))
sns.countplot(x="department", hue='left', data=df).set_title('Employee Department Turnover')
plt.xticks(rotation=90)
plt.show()
Analyzing by Absolute Values/Ratio
dfNumeric[dfNumeric.department=='sales'].groupby('left').mean()
We see that, on average, the people who left the Company had a lower bonus than one who stayed, but a better review, which might mean they were good employees.
dfNumeric[dfNumeric.department=='finance'].groupby('left').mean()
We see that the employees from Finance (the sector with the smallest turnover both in absolute and relative levels) who left the Company had the highest review than the ones who stayed, and also the biggest bonus, but they were less satisfied. Anyway, both groups have a bonus bigger than the Company's average one.
Analyzing by Relative Levels
Analyzing the employees from IT
dfNumeric[dfNumeric.department=='IT'].groupby('left').mean()
In terms of relative TurnOver, the employees who left the Company
领英推荐
A.5.3. Relation between # of Projects and Target (employees who left)
ax = sns.countplot(x="projects", hue="left", data=df)
plt.show()
A.5.4. Relation between Reviews and Target (employees who left)
turnOverSummaryMean
target_0=df['left'] == 'no'
target_1=df['left'] == 'yes'
# Kernel Density Plot
fig = plt.figure(figsize=(15,4),)
ax=sns.kdeplot(df.loc[(target_0),'review'] , color='b',shade=True,label='Stayed')
ax=sns.kdeplot(df.loc[(target_1),'review'] , color='r',shade=True, label='Left')
plt.title('Employee Review Distribution - Left the Company V.S. Stayed')
plt.legend()
plt.show()
ax = sns.boxplot(x="left", y="review", data=df)
Breaking the Analysis in Ranges
def splitRange(theDf,column,minValue,maxValue,numberOfSplits=0,theSteps=0):
"""This function receives a Dataframe, the column that you want to create Ranges from, the Starting Value, the Finishing Value
- the Number of Splits you would like to have, or the steps you would like your range to have.
The first 4 values are mandatory, and then you need to enter either of the last two values."""
if (theSteps==0) & (numberOfSplits==0):
print ('Enter the Number or Steps ot the Number of Splits.')
return
if numberOfSplits!=0:
theSteps=maxValue/numberOfSplits
for i in np.arange(minValue,maxValue,theSteps):
if i == minValue:
theDf.loc[(theDf[column]==i),column+'Range']=(str(np.round(i,1))+' - '+str(np.round(i+theSteps,1)))
theDf.loc[(theDf[column]>i) & (theDf[column]<=(i+theSteps)),column+'Range']=(str(np.round(i,1))+' - '+str(np.round(i+theSteps,1)))
if i+theSteps == maxValue:
theDf.loc[(theDf[column]==np.float(maxValue)),column+'Range']=(str(np.round(i,1))+' - '+str(np.round(i+theSteps,1)))
return theDf
tuenumberOfSplits=10
theColumn='review'
theMinValue=0
theMaxValue=1
rangedDf=splitRange(dfNumeric,theColumn,theMinValue,theMaxValue,tuenumberOfSplits)
rangedDf.sort_values(theColumn,inplace=True)
sns.countplot(x='reviewRange',hue='left',data=rangedDf)
plt.xticks(rotation=45)
plt.show()
rangedDf.groupby('reviewRange').mean()
sns.scatterplot(x='review',y='tenure',hue='left',data=rangedDf)
plt.show()
A.5.5. Relation between Average Hours worked per Month and Target (employees who left)
# Kernel Density Plot
fig = plt.figure(figsize=(15,4),)
ax=sns.kdeplot(df.loc[(target_0),'avg_hrs_month'] , color='b',shade=True,label='Stayed')
ax=sns.kdeplot(df.loc[(target_1),'avg_hrs_month'] , color='r',shade=True, label='Left')
plt.title('Employee Average Hours worked per Month Distribution - Left the Company V.S. Stayed')
plt.legend()
plt.show()
In general, in all the company the employees tend to do overtime hours.
total=df.groupby('left')[['department']].count()
overTimeHours=df[(df.avg_hrs_month>=185) & (df.avg_hrs_month<=190)]
overTimeHours.groupby('left')[['department']].count()/total
Breaking the Analysis in Ranges
tuenumberOfSplits=0
theColumn='avg_hrs_month'
theMinValue=170
theMaxValue=205
steps=5
rangedDf=splitRange(dfNumeric,theColumn,theMinValue,theMaxValue,tuenumberOfSplits,steps)
rangedDf.sort_values(theColumn,inplace=True)
sns.countplot(x='avg_hrs_monthRange',hue='left',data=rangedDf)
plt.xticks(rotation=45)
plt.show()
rangedAVGDf=rangedDf.groupby('avg_hrs_monthRange').mean()
rangedAVGDf.reset_index(inplace=True)
rangedAVGDf
A.5.6. Relation between Satisfaction and Target (employees who left)
# Kernel Density Plot
fig = plt.figure(figsize=(15,4),)
ax=sns.kdeplot(df.loc[(target_0),'satisfaction'] , color='b',shade=True,label='Stayed')
ax=sns.kdeplot(df.loc[(target_1),'satisfaction'] , color='r',shade=True, label='Left')
plt.title('Employee Average Hours worked per Month Distribution - Left the Company V.S. Stayed')
plt.legend()
plt.show()
Breaking the Analysis in Ranges
tuenumberOfSplits=10
theColumn='satisfaction'
theMinValue=0
theMaxValue=1
steps=0
rangedDf=splitRange(dfNumeric,theColumn,theMinValue,theMaxValue,tuenumberOfSplits,steps)
#The database has a an error where the row 1755 has a Satisfaction of 1.0000000000000002, when it should be 1.
#With this line, we modify the error
theSatisfaction=df.iloc[1755].satisfaction
df.index[rangedDf['satisfaction']==theSatisfaction]
rangedDf.iloc[9539,12]='0.9 - 1.0'
rangedDf.sort_values('satisfaction',inplace=True)
sns.countplot(x='satisfactionRange',hue='left',data=rangedDf)
plt.xticks(rotation=45)
plt.show()
rangedDf.groupby('satisfactionRange').mean()
A.5.7. Relation between Satisfaction and Review
#The database has a an error where the row 1755 has a Satisfaction of 1.0000000000000002, when it should be 1.
#With this line, we modify the error
theSatisfaction=df.iloc[1755].satisfaction
df.index[rangedDf['satisfaction']==theSatisfaction]
rangedDf.iloc[9539,12]='0.9 - 1.0'
sns.lineplot(x='reviewRange',y='tenure',data=rangedDf.sort_values('review'))
plt.xticks(rotation=45)
plt.show()
sns.scatterplot(x='satisfaction',y='review',hue='left',data=df)
plt.show()
In terms of employees who left the company, we find two distinctive clusters:
A.5.8. Relation between Tenure (years in the company) and Target (employees who left)
f, ax = plt.subplots(figsize=(15, 5))
sns.countplot(x="tenure", hue='left', data=df).set_title('Employee years in the company Turnover')
plt.show()
f, ax = plt.subplots(figsize=(15, 5))
sns.barplot(x="tenure", y="tenure", hue="left", data=df, estimator=lambda x: len(x) / len(df) * 100)
ax.set(ylabel="Percent")
plt.show()
A.5.9. Relation between Tenure and Average Hours worked in the Month
sns.lineplot(x='avg_hrs_month',y='tenure',hue='left',data=df)
plt.xticks(rotation=45)
plt.show()
A.5.10 Relation between Review and Tenure
sns.lineplot(x='reviewRange',y='tenure',data=rangedDf.sort_values('review'))
plt.xticks(rotation=45)
plt.show()
Preliminary Summary
Statistical Summary
Review
TurnOver
Correlations
Positive Correlations
Negative Correlations
ANALYSIS OF THE EMPLOYEES FROM SALES
We see that, on average, the people who left the Company had a lower bonus than one who stayed, but a better review, which might mean they were good employees.
ANALYSIS OF THE EMPLOYEES FROM FINANCES
We see that the employees from Finance (the sector with the smallest turnover both in absolute and relative levels) who left the Company had the highest review than the ones who stayed, and also the biggest Bonus, but they were less satisfied. Anyway, both groups had a bonus bigger than the Company's average one.
ANALYSIS OF THE EMPLOYEES FROM IT
In terms of relative TurnOver, the employees who left the Company:
In the next post, we will start answering the specific questions, and creating & deploying the Machine Learning chosen model. Finally, we will finish with the specific insights.
---------------------------------------------------------------------------------------------------------------
This case is part of a DataCamp competition: Employees TurnOver - DataCamp
The full report is available in my GitHub repository GitHub - vascoarizna
Here you will find not only the full explanation with the graphics and the solution but also the Jupyter Notebook codes in case you want to take anything for you.
Author: Ignacio Ariznabarreta - JIAF Consulting