Predicting Employee Turnover: Data Science Capstone Project

Introduction

This capstone project for my Advanced Data Analytics Certificate showcases all the skills I've learned, applied to a fictional scenario provided by Coursera, aimed at solving real-world business challenges.

In the rapidly evolving business landscape, understanding the factors behind employee turnover is crucial for maintaining a healthy corporate culture and minimizing the costs of recruitment and training. This article walks through a data analysis project aimed at uncovering the reasons behind the high turnover rate at Salifort Motors and offering actionable recommendations to improve employee retention.

The Scenario

Salifort Motors faces a high employee turnover rate, and the senior leadership team is concerned about its impact on the company's culture and financial health. My task was to analyze employee survey data to identify the factors contributing to turnover and to develop a model that predicts employee departures. With the key predictors of turnover identified, Salifort Motors aims to implement targeted interventions to enhance job satisfaction, promote professional development, and ultimately reduce turnover, in line with the company's commitment to supporting employee success.

Skills I utilized in this project:

  • Python Programming
  • Exploratory Data Analysis (EDA)
  • Statistical Modeling
  • Machine Learning Modeling
  • Data Presentation & Reporting

Analyzing the Data

I was provided with a list of the variables collected in the survey, along with a description of each.

The first step was to load the dataset along with the packages for data manipulation, visualization, statistical analysis, and modeling.

# For data manipulation
import numpy as np
import pandas as pd

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For displaying all of the columns in dataframes
pd.set_option('display.max_columns', None)

# For data modeling
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# For metrics and helpful functions
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report

df = pd.read_csv("HR_capstone_dataset.csv")        

Next, I gathered basic information and descriptive statistics in exploratory data analysis (EDA).

# Looking at the data
df.head()
df.describe()
# Displaying all column names
list(df)
# Renaming columns
df = df.rename(columns={'Work_accident': 'work_accident',
                        'Department': 'department',
                        'average_montly_hours': 'average_monthly_hours',
                        'time_spend_company': 'tenure',
                        'number_project': 'projects',
                        'promotion_last_5years': 'promoted_last_five_years'})
# Checking for missing values
df.isna().sum()
# Checking for duplicates
df.duplicated().sum()
df1 = df.drop_duplicates(keep='first')

# Compute the interquartile range (IQR) of `tenure`
percentile25 = df1["tenure"].quantile(0.25)
percentile75 = df1["tenure"].quantile(0.75)
iqr = percentile75 - percentile25

# Values more than 1.5 * IQR beyond the quartiles count as outliers
lower_limit = percentile25 - 1.5 * iqr
upper_limit = percentile75 + 1.5 * iqr
df1["tenure"].value_counts(normalize=False)
outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]
print("Outliers in `tenure`:", len(outliers))

In these first EDA steps, I corrected misspelled column names, simplified others, and removed duplicate rows. I also found 824 outliers in 'tenure'. I left them in, because the tree-based models I use later are not very sensitive to outliers.

I then started data exploration, first by looking at how many employees have left:

df1["left"].value_counts(normalize=False)
df1["left"].value_counts(normalize=True)        

  • 10,000 employees stayed (83.4%)
  • 1,991 employees left (16.6%)
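The percentages follow directly from the two group sizes. As a quick check, and as a reminder of why accuracy alone can mislead when one group is much larger than the other, the sketch below recomputes the shares and the accuracy of a hypothetical baseline that always predicts the larger group. The names `counts` and `majority_baseline_accuracy` are illustrative and not part of the original notebook.

```python
# Group sizes reported above (after removing duplicates)
counts = [10_000, 1_991]
total = sum(counts)

# Share of each group, as a percentage rounded to one decimal place
shares = [round(100 * c / total, 1) for c in counts]
print(total, shares)

# A naive model that always predicts the larger group would already
# reach this accuracy without learning anything about the employees
majority_baseline_accuracy = max(counts) / total
print(round(majority_baseline_accuracy, 3))
```

This is one reason the models later in the article are tuned on recall instead of accuracy.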

I then continued to explore data with many different visualizations, looking for trends or patterns in the data. This was a long process, so I will only display some results.

fig = plt.figure(figsize=(5,3))
sns.barplot(data=df,
            x='salary',
            y='left',
            )
plt.title('Salaries vs percentage left');        

  • This graph shows a clear relationship between salary level and turnover: the share of employees who left is higher in the low salary band than in the high one

fig = plt.figure(figsize=(15,8))
sns.boxplot(data=df1, x='average_monthly_hours', y='projects', hue='left', orient="h")        

  • This boxplot shows an interesting relationship between the number of projects, average monthly hours, and employees leaving
  • The orange boxes toward the bottom right of the image show that many of the employees who left had both many projects and long hours, which suggests they were overworked

plt.figure(figsize=(16, 3))
# The column was renamed to 'promoted_last_five_years' earlier
sns.scatterplot(data=df1, x='average_monthly_hours', y='promoted_last_five_years', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by promotion last 5 years', fontsize='14');        

  • Very few employees who were promoted in the last five years left
  • Very few employees who worked the most hours were promoted
  • All of the employees who left were working the longest hours

plt.figure(figsize=(16, 9))
sns.scatterplot(data=df1, x='average_monthly_hours', y='last_evaluation', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by last evaluation score', fontsize='14');        

  • The scatterplot indicates two groups of employees who left: overworked employees who performed very well and employees who worked slightly under the nominal monthly average of 166.67 hours with lower evaluation scores
  • Most of the employees in this company work well over 167 hours per month
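The article does not derive the 166.67-hour reference line. One plausible reading, stated here as an assumption, is a full-time schedule of 40 hours per week for 50 working weeks, spread evenly over 12 months. The constants below are illustrative.

```python
# Assumed full-time schedule (not stated in the article)
HOURS_PER_WEEK = 40
WORKING_WEEKS_PER_YEAR = 50
MONTHS_PER_YEAR = 12

# Annual hours divided evenly across the months of the year
nominal_monthly_hours = HOURS_PER_WEEK * WORKING_WEEKS_PER_YEAR / MONTHS_PER_YEAR
print(round(nominal_monthly_hours, 2))  # 166.67
```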

plt.figure(figsize=(16, 9))
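# numeric_only=True skips the text columns (salary, department); requires pandas 1.5+
heatmap = sns.heatmap(df1.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette("vlag", as_cmap=True))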
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12);        

  • This correlation heatmap shows that the number of projects, monthly hours, and evaluation scores all have some positive correlation with each other, and whether an employee leaves is negatively correlated with their satisfaction level

My next step was to start building the model. I chose a tree-based model because it would let me predict whether an employee is likely to leave based on their survey responses and show which variables contribute the most to an employee's decision to leave.

X = df1.copy()
X = pd.get_dummies(X,
                   columns=['salary', 'department'],
                   drop_first=True)
y = X['left']
X = X.drop(['left'], axis=1)

X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape        
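The two chained calls to `train_test_split` produce a 60/20/20 split, because the second `test_size` applies to what is left after the first. A minimal sketch of that arithmetic, using exact fractions; the helper name `split_fractions` is hypothetical and not part of the notebook.

```python
from fractions import Fraction

def split_fractions(test_size, val_size_of_remainder):
    """Return the (train, val, test) shares of the full dataset."""
    test = Fraction(test_size)
    remainder = 1 - test                      # share passed to the second split
    val = Fraction(val_size_of_remainder) * remainder
    train = remainder - val
    return train, val, test

# Same sizes as the two train_test_split calls above
train, val, test = split_fractions("0.2", "0.25")
print(train, val, test)  # 3/5 1/5 1/5
```

The actual row counts can differ by one or two from these shares because scikit-learn rounds each split to whole rows.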

I started off with a random forest model:

rf = RandomForestClassifier(random_state=0)

cv_params = {'max_depth': [5, 7, None],
             'max_features': [0.3, 0.6],
             'max_samples': [0.7],
             'min_samples_leaf': [1,2],
             'min_samples_split': [2,3],
             'n_estimators': [75,100,200],
             }

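# Pass multiple metrics as a list; recent scikit-learn versions reject a set here
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='recall')

# %%time  (Jupyter cell magic; it must be the first line of its own cell)
rf_cv.fit(X_train, y_train)
rf_cv.best_score_
rf_cv.best_params_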
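A grid search fits one model per parameter combination per fold, so the size of the grid largely determines the run time. The sketch below counts the fits implied by the random forest grid above using only the standard library. The names `n_candidates` and `n_fits` are illustrative, and the count excludes the final refit on the full training set.

```python
from math import prod

# Same grid as the random forest search above
cv_params = {'max_depth': [5, 7, None],
             'max_features': [0.3, 0.6],
             'max_samples': [0.7],
             'min_samples_leaf': [1, 2],
             'min_samples_split': [2, 3],
             'n_estimators': [75, 100, 200],
             }
n_folds = 5

# Number of distinct parameter combinations in the grid
n_candidates = prod(len(values) for values in cv_params.values())

# Each combination is trained once per cross-validation fold
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 72 360
```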

  • This model's best cross-validated recall was 0.9148, a strong result
  • Note: Recall is the share of actual positives (here, employees who left) that the model correctly identifies as positive
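The recall definition can be made concrete with a small hand-checkable example. This is a plain-Python sketch on made-up labels, where 1 stands for "left"; in the notebook itself, scikit-learn's `recall_score` does this work.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that were predicted as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn)

# Toy labels: four employees actually left, the model caught three of them
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(recall(y_true, y_pred))  # 0.75
```

The false positive in the second-to-last position does not affect recall; it would lower precision instead.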

I then built an XGBoost model so I could compare the recall score with the random forest model:

xgb = XGBClassifier(objective='binary:logistic', random_state=0)

cv_params = {'max_depth': [4,8,12],
             'min_child_weight': [3, 5],
             'learning_rate': [0.01, 0.1],
             'n_estimators': [300, 500]
             }

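# Pass multiple metrics as a list; recent scikit-learn versions reject a set here
scoring = ['accuracy', 'precision', 'recall', 'f1']

xgb_cv = GridSearchCV(xgb, cv_params, scoring=scoring, cv=5, refit='recall')

# %%time  (Jupyter cell magic; it must be the first line of its own cell)
xgb_cv.fit(X_train, y_train)
xgb_cv.best_score_
xgb_cv.best_params_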

  • This model's best cross-validated recall was 0.9164
  • This gives it a very slight edge over the random forest model, so I used it for my predictions

I then plotted the confusion matrix from the XGBoost model to see how it performed:

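# Predict on the validation set with the best XGBoost model
y_pred = xgb_cv.best_estimator_.predict(X_val)
cm = confusion_matrix(y_val, y_pred)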

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                             display_labels=None)
disp.plot(values_format='');        

  • The model's main weakness is false negatives (bottom left): it predicted that 36 people stayed when they actually left
  • Overall accuracy was still very good, with the correct outcome predicted about 98% of the time
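To read the plot, it helps to know how the four cells are laid out. The sketch below rebuilds the same layout scikit-learn uses for binary labels 0 and 1 (rows are actual classes, columns are predicted classes) on made-up labels. The helper `confusion_counts` is hypothetical and only mirrors what `confusion_matrix` returns.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows = actual, columns = predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Toy labels: 1 = left, 0 = stayed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / (tn + fp + fn + tp)

# The bottom-left cell (fn) holds employees predicted to stay who actually left
print(tn, fp, fn, tp, accuracy)  # 5 1 1 3 0.8
```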

I then moved on to using this XGBoost model on the test set to see how it would perform on data it did not train on:

# Use the best XGBoost estimator, the model selected above
y_preds = xgb_cv.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_preds)

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                             display_labels=None)
disp.plot(values_format='');        

  • The model did better on the test set than on the validation set, but false negatives (bottom left) were still its main weakness: it predicted that 27 people stayed when they actually left
  • Accuracy was again very high, at almost 99%

I then looked further into this model to find the most important variables in predicting whether an employee is going to leave or not:

plot_importance(xgb_cv.best_estimator_);        

The top 5 variables in predicting whether or not an employee is going to leave are:

  • Monthly hours
  • Satisfaction level
  • Last evaluation results
  • Tenure
  • Projects

We can use this and look back at our earlier analysis to determine what changes the company can make to decrease the high turnover rate.

Conclusions:

The models and the feature importances extracted from them support the conclusion that employees at the company are overworked.

To better retain employees, the following recommendations could be presented to the stakeholders:

  • Limit the number of projects that employees can work on
  • Consider promoting employees who have been with the company for longer periods
  • Promote more often, even if the promotions are small. Promotions are currently rare, which may contribute to lower satisfaction
  • Better reward employees for working longer hours or don't require them to do so
  • Add a cap for monthly hours. Almost all employees with 290+ monthly hours have left
  • Rebuild the evaluation process: High evaluation scores should not be reserved for employees who work 200+ hours per month. A more proportionate scale for rewarding employees is needed

This capstone project has been an enriching experience. It has allowed me to apply and deepen my analytics skills in a real-world context. Thank you for taking the time to read about it. Your interest and support are greatly appreciated.

Albert Galick

Founder at Systems Behavioral Research

11 months ago

Hey Noah Owsiany! You wanted to explore data science and you didn't contact me?! I googled this exercise and found a link suggesting there were 3008 duplicates in the data. Also, they removed outliers on tenure for the logistic regression calculation. I'd have thrown out the logistic regression instead of the outliers! I got bored browsing this example. This data is probably simulated from an engineered, static, unitary typical employee "die" (and no missing values!). In reality, you could do much more by tracking individuals and considering the unexplored dimensions of individuality and time. Employees leaving is by definition out-of-equilibrium, which makes me doubt the static distribution statistical approach for this case (but my method requires that!) https://github.com/krsign/Advanced-Data-Analytics-Capstone-Projects/blob/main/Activity_%20Course%207%20Salifort%20Motors%20project%20lab.ipynb
