Predicting Employee Turnover: Data Science Capstone Project
Introduction
This capstone project for my Advanced Data Analytics Certificate showcases the skills I learned throughout the program, applied to a fictional scenario provided by Coursera that mirrors a real-world business challenge.
In today's rapidly evolving business landscape, understanding the factors behind employee turnover is crucial for maintaining a healthy corporate culture and minimizing the costs of recruitment and training. This article walks through a comprehensive data analysis project aimed at uncovering the reasons behind the high turnover rate at Salifort Motors and offering actionable insights to improve employee retention.
The Scenario
Salifort Motors faces a significant challenge: a high employee turnover rate that has the senior leadership team concerned about its impact on the company's culture and financial health. Tasked with addressing this issue, my objective was to analyze employee survey data to identify the underlying factors contributing to turnover and to develop a predictive model that forecasts employee departures. By identifying key predictors of turnover, Salifort Motors can implement targeted interventions to enhance job satisfaction, promote professional development, and ultimately reduce the turnover rate, in line with the company's commitment to supporting employee success.
Skills I utilized in this project:
- Data cleaning and exploratory data analysis with pandas and NumPy
- Data visualization with matplotlib and seaborn
- Machine learning model building and hyperparameter tuning with scikit-learn and XGBoost
- Model evaluation with precision, recall, F1, and confusion matrices
- Communicating results and recommendations to stakeholders
Analyzing the Data
I was provided with a list of the variables and descriptions from the survey conducted. The dataset contains the following columns: satisfaction_level (self-reported satisfaction, 0–1), last_evaluation (most recent performance evaluation score, 0–1), number_project (number of projects the employee contributes to), average_montly_hours (average monthly hours worked), time_spend_company (years with the company), Work_accident (whether the employee experienced a workplace accident), left (whether the employee left the company), promotion_last_5years (whether the employee was promoted in the last five years), Department (the employee's department), and salary (salary level: low, medium, or high).
The first step was to load the packages for data manipulation, visualization, and modeling, along with the dataset.
# For data manipulation
import numpy as np
import pandas as pd
# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# For displaying all of the columns in dataframes
pd.set_option('display.max_columns', None)
# For data modeling
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# For metrics and helpful functions
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
df = pd.read_csv("HR_capstone_dataset.csv")
Next, I gathered basic information and descriptive statistics to begin the exploratory data analysis (EDA).
# Looking at the data
df.head()
df.describe()
# Displaying all column names
list(df)
# Renaming columns
df = df.rename(columns={'Work_accident': 'work_accident', 'Department': 'department','average_montly_hours': 'average_monthly_hours','time_spend_company': 'tenure','number_project': 'projects','promotion_last_5years': 'promoted_last_five_years'})
# Checking for missing values
df.isna().sum()
# Checking for duplicates
df.duplicated().sum()
# Dropping duplicate rows, keeping the first occurrence
df1 = df.drop_duplicates(keep='first')

# Computing IQR-based limits to flag outliers in `tenure`
percentile25 = df1["tenure"].quantile(0.25)
percentile75 = df1["tenure"].quantile(0.75)
iqr = percentile75 - percentile25
lower_limit = percentile25 - 1.5 * iqr
upper_limit = percentile75 + 1.5 * iqr

# Inspecting the distribution of tenure values
df1["tenure"].value_counts(normalize=False)

outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]
print("Outliers in `tenure`:", len(outliers))
Through these first steps of EDA, I corrected misspellings in the column names, simplified others, and removed duplicate rows. I also found 824 outliers in 'tenure'. Because tree-based models split on feature thresholds rather than relying on distances or distributional assumptions, they are relatively robust to outliers, so I left these rows in place.
I then started data exploration, first by looking at how many employees have left:
df1["left"].value_counts(normalize=False)
df1["left"].value_counts(normalize=True)
The counts showed that employees who left are a clear minority, meaning the classes are imbalanced; this is worth keeping in mind when choosing evaluation metrics later. I then continued exploring the data with many different visualizations, looking for trends and patterns. This was a long process, so I will only display some of the results.
# Bar plot of mean `left` rate by salary level (using the deduplicated data)
fig = plt.figure(figsize=(5, 3))
sns.barplot(data=df1, x='salary', y='left')
plt.title('Salary level vs. percentage left');
# Box plots of monthly hours by number of projects, split by `left`
fig = plt.figure(figsize=(15, 8))
sns.boxplot(data=df1, x='average_monthly_hours', y='projects', hue='left', orient='h')
# Scatter plot of monthly hours vs. promotion status, split by `left`
plt.figure(figsize=(16, 3))
sns.scatterplot(data=df1, x='average_monthly_hours', y='promoted_last_five_years', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', ls='--')  # reference: 166.67 hrs./mo. ≈ 40-hour weeks, 50 weeks/yr
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by promotion in last 5 years', fontsize='14');
# Scatter plot of monthly hours vs. last evaluation score, split by `left`
plt.figure(figsize=(16, 9))
sns.scatterplot(data=df1, x='average_monthly_hours', y='last_evaluation', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by last evaluation score', fontsize='14');
# Correlation heatmap of the numeric variables
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(df1.corr(numeric_only=True), vmin=-1, vmax=1, annot=True,
                      cmap=sns.color_palette('vlag', as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 14}, pad=12);
My next step was to build the model. I chose a tree-based approach because it can predict whether an employee is likely to leave based on their survey responses and also reveal which variables contribute most to that prediction.
# One-hot encoding the categorical variables
X = df1.copy()
X = pd.get_dummies(X, columns=['salary', 'department'], drop_first=True)

# Separating the target from the features
y = X['left']
X = X.drop(['left'], axis=1)

# 60/20/20 train/validation/test split, stratified to preserve the class balance
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=0)
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape
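Before tuning the ensemble models, it can help to have a quick baseline for comparison. The sketch below is my own addition (using the DecisionTreeClassifier imported earlier, which otherwise goes unused): it fits an untuned decision tree and scores its recall on the validation set.
# A minimal, untuned baseline for comparison (my own addition, not in the original workflow)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
baseline_preds = tree.predict(X_val)
print("Baseline decision tree recall:", recall_score(y_val, baseline_preds))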
I started off with a random forest model:
# Instantiating the random forest and defining the hyperparameter grid
rf = RandomForestClassifier(random_state=0)
cv_params = {'max_depth': [5, 7, None],
             'max_features': [0.3, 0.6],
             'max_samples': [0.7],
             'min_samples_leaf': [1, 2],
             'min_samples_split': [2, 3],
             'n_estimators': [75, 100, 200],
             }

# 5-fold cross-validated grid search, refit on the best recall score
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='recall')
%%time
rf_cv.fit(X_train, y_train)
rf_cv.best_score_
rf_cv.best_params_
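GridSearchCV stores every candidate's mean scores in cv_results_. As a convenience, a small helper (my own sketch, not part of the original code) can pull the mean validation scores for the best-recall candidate into a one-row table, which makes comparing models side by side easier:
# A small helper (my own addition) to summarize multi-metric CV results
def make_results(model_name, model_cv):
    # Row of mean validation scores for the best-recall parameter setting
    cv_results = pd.DataFrame(model_cv.cv_results_)
    best_row = cv_results.iloc[model_cv.best_index_]
    return pd.DataFrame({'model': [model_name],
                         'accuracy': [best_row['mean_test_accuracy']],
                         'precision': [best_row['mean_test_precision']],
                         'recall': [best_row['mean_test_recall']],
                         'f1': [best_row['mean_test_f1']]})

make_results('random forest cv', rf_cv)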
I then built an XGBoost model so I could compare the recall score with the random forest model:
# Instantiating the XGBoost classifier and defining its hyperparameter grid
xgb = XGBClassifier(objective='binary:logistic', random_state=0)
cv_params = {'max_depth': [4, 8, 12],
             'min_child_weight': [3, 5],
             'learning_rate': [0.01, 0.1],
             'n_estimators': [300, 500]
             }

# Grid search refit on recall, as with the random forest
scoring = {'accuracy', 'precision', 'recall', 'f1'}
xgb_cv = GridSearchCV(xgb, cv_params, scoring=scoring, cv=5, refit='recall')
%%time
xgb_cv.fit(X_train, y_train)
xgb_cv.best_score_
xgb_cv.best_params_
I then plotted the confusion matrix of the XGBoost model's predictions on the validation set to see how it performed:
# Predicting on the validation set with the tuned XGBoost model
y_pred = xgb_cv.best_estimator_.predict(X_val)
cm = confusion_matrix(y_val, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=None)
disp.plot(values_format='');
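For a fuller numeric summary of the validation performance, the classification_report function imported earlier reports per-class precision, recall, and F1 in one call (the label names here are my own; 0 = stayed, 1 = left):
# Per-class precision, recall, and F1 on the validation set
print(classification_report(y_val, y_pred, target_names=['stayed', 'left']))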
I then used this XGBoost model on the test set to see how it would perform on data it had not been trained on:
# Predicting on the held-out test set with the tuned XGBoost model
y_preds = xgb_cv.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=None)
disp.plot(values_format='');
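To attach single numbers to the test-set performance, the scalar metric functions imported at the top can be applied to the same predictions; a short sketch:
# Scalar test-set metrics using the functions imported earlier
print("Accuracy: ", accuracy_score(y_test, y_preds))
print("Precision:", precision_score(y_test, y_preds))
print("Recall:   ", recall_score(y_test, y_preds))
print("F1:       ", f1_score(y_test, y_preds))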
I then looked further into this model to find the most important variables in predicting whether an employee is going to leave or not:
plot_importance(xgb_cv.best_estimator_);
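The same ranking can also be read off programmatically from the fitted estimator's feature_importances_ attribute; a sketch, assuming the fitted xgb_cv from above:
# Ranking features by the fitted XGBoost model's importance scores
importances = pd.Series(xgb_cv.best_estimator_.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances.head(5))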
The importance plot ranks the top variables in predicting whether or not an employee is going to leave.
We can combine this ranking with the earlier exploratory analysis to determine what changes the company could make to decrease its high turnover rate.
Conclusions:
The models, and the feature importances extracted from them, confirm that employees at the company are overworked.
To better retain employees, recommendations aimed at addressing this overwork could be presented to the stakeholders.
This capstone project has been an enriching experience, allowing me to apply and deepen my analytics skills in a realistic business context. Thank you for taking the time to read about it; your interest and support are greatly appreciated.