Predicting Employee Turnover: Data Science Capstone Project
Introduction
This capstone project for my Advanced Data Analytics Certificate showcases the skills I learned throughout the program, applied to a fictional scenario provided by Coursera that mirrors a real-world business challenge.
In today's rapidly evolving business landscape, understanding the factors behind employee turnover is crucial for maintaining a healthy corporate culture and minimizing the costs of recruitment and training. This article walks through a comprehensive data analysis project aimed at uncovering the reasons behind the high turnover rate at Salifort Motors and offering actionable insights to improve employee retention.
The Scenario
Salifort Motors faces a significant challenge: a high employee turnover rate that has the senior leadership team concerned about its impact on the company's culture and financial health. Tasked with addressing this issue, my objective was to analyze employee survey data to identify the underlying factors contributing to turnover and to develop a predictive model that forecasts employee departures. By identifying key predictors of turnover, Salifort Motors can implement targeted interventions to enhance job satisfaction, promote professional development, and ultimately reduce the turnover rate, in line with the company's commitment to supporting employee success.
Skills I utilized in this project:
- Data cleaning and exploratory data analysis with pandas and NumPy
- Data visualization with matplotlib and seaborn
- Machine learning model building and hyperparameter tuning with scikit-learn and XGBoost
- Model evaluation with precision, recall, F1, and confusion matrices
- Communicating results and recommendations to stakeholders
Analyzing the Data
I was provided with a list of the variables and descriptions from the survey conducted. The dataset contains the following columns: satisfaction_level (self-reported satisfaction, 0–1), last_evaluation (most recent performance evaluation score, 0–1), number_project (number of projects the employee contributes to), average_montly_hours (average monthly hours worked), time_spend_company (years with the company), Work_accident (whether the employee experienced a workplace accident), left (whether the employee left the company), promotion_last_5years (whether the employee was promoted in the last five years), Department (the employee's department), and salary (salary level: low, medium, or high).
The first step was to load the packages for data manipulation, visualization, and modeling, along with the dataset.
# For data manipulation
import numpy as np
import pandas as pd
# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# For displaying all of the columns in dataframes
pd.set_option('display.max_columns', None)
# For data modeling
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# For metrics and helpful functions
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
df = pd.read_csv("HR_capstone_dataset.csv")
Next, I gathered basic information and descriptive statistics to begin the exploratory data analysis (EDA).
# Looking at the data
df.head()
df.describe()
# Displaying all column names
list(df)
# Renaming columns
df = df.rename(columns={'Work_accident': 'work_accident', 'Department': 'department','average_montly_hours': 'average_monthly_hours','time_spend_company': 'tenure','number_project': 'projects','promotion_last_5years': 'promoted_last_five_years'})
# Checking for missing values
df.isna().sum()
# Checking for duplicates
df.duplicated().sum()
# Dropping duplicate rows, keeping the first occurrence
df1 = df.drop_duplicates(keep='first')

# Computing IQR-based limits to flag outliers in `tenure`
percentile25 = df1["tenure"].quantile(0.25)
percentile75 = df1["tenure"].quantile(0.75)
iqr = percentile75 - percentile25
lower_limit = percentile25 - 1.5 * iqr
upper_limit = percentile75 + 1.5 * iqr

# Inspecting the distribution of tenure values
df1["tenure"].value_counts(normalize=False)

outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]
print("Outliers in `tenure`:", len(outliers))
Through these first steps of EDA, I corrected misspellings in the column names, simplified others, and removed duplicate rows. I also found 824 outliers in 'tenure'. Because tree-based models split on feature thresholds rather than relying on distances or distributional assumptions, they are relatively robust to outliers, so I left these rows in place.
I then started data exploration, first by looking at how many employees have left:
df1["left"].value_counts(normalize=False)
df1["left"].value_counts(normalize=True)
The counts showed that employees who left are a clear minority, meaning the classes are imbalanced; this is worth keeping in mind when choosing evaluation metrics later. I then continued exploring the data with many different visualizations, looking for trends and patterns. This was a long process, so I will only display some of the results.
# Bar plot of mean `left` rate by salary level (using the deduplicated data)
fig = plt.figure(figsize=(5, 3))
sns.barplot(data=df1, x='salary', y='left')
plt.title('Salary level vs. percentage left');
# Box plots of monthly hours by number of projects, split by `left`
fig = plt.figure(figsize=(15, 8))
sns.boxplot(data=df1, x='average_monthly_hours', y='projects', hue='left', orient='h')
# Scatter plot of monthly hours vs. promotion status, split by `left`
plt.figure(figsize=(16, 3))
sns.scatterplot(data=df1, x='average_monthly_hours', y='promoted_last_five_years', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', ls='--')  # reference: 166.67 hrs./mo. ≈ 40-hour weeks, 50 weeks/yr
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by promotion in last 5 years', fontsize='14');
# Scatter plot of monthly hours vs. last evaluation score, split by `left`
plt.figure(figsize=(16, 9))
sns.scatterplot(data=df1, x='average_monthly_hours', y='last_evaluation', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by last evaluation score', fontsize='14');
# Correlation heatmap of the numeric variables
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(df1.corr(numeric_only=True), vmin=-1, vmax=1, annot=True,
                      cmap=sns.color_palette('vlag', as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 14}, pad=12);
My next step was to build the model. I chose a tree-based approach because it can predict whether an employee is likely to leave based on their survey responses and also reveal which variables contribute most to that prediction.
# One-hot encoding the categorical variables
X = df1.copy()
X = pd.get_dummies(X, columns=['salary', 'department'], drop_first=True)

# Separating the target from the features
y = X['left']
X = X.drop(['left'], axis=1)

# 60/20/20 train/validation/test split, stratified to preserve the class balance
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=0)
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape
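Before tuning the ensemble models, it can help to have a quick baseline for comparison. The sketch below is my own addition (using the DecisionTreeClassifier imported earlier, which otherwise goes unused): it fits an untuned decision tree and scores its recall on the validation set.
# A minimal, untuned baseline for comparison (my own addition, not in the original workflow)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
baseline_preds = tree.predict(X_val)
print("Baseline decision tree recall:", recall_score(y_val, baseline_preds))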
I started off with a random forest model:
# Instantiating the random forest and defining the hyperparameter grid
rf = RandomForestClassifier(random_state=0)
cv_params = {'max_depth': [5, 7, None],
             'max_features': [0.3, 0.6],
             'max_samples': [0.7],
             'min_samples_leaf': [1, 2],
             'min_samples_split': [2, 3],
             'n_estimators': [75, 100, 200],
             }

# 5-fold cross-validated grid search, refit on the best recall score
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='recall')
%%time
rf_cv.fit(X_train, y_train)
rf_cv.best_score_
rf_cv.best_params_
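GridSearchCV stores every candidate's mean scores in cv_results_. As a convenience, a small helper (my own sketch, not part of the original code) can pull the mean validation scores for the best-recall candidate into a one-row table, which makes comparing models side by side easier:
# A small helper (my own addition) to summarize multi-metric CV results
def make_results(model_name, model_cv):
    # Row of mean validation scores for the best-recall parameter setting
    cv_results = pd.DataFrame(model_cv.cv_results_)
    best_row = cv_results.iloc[model_cv.best_index_]
    return pd.DataFrame({'model': [model_name],
                         'accuracy': [best_row['mean_test_accuracy']],
                         'precision': [best_row['mean_test_precision']],
                         'recall': [best_row['mean_test_recall']],
                         'f1': [best_row['mean_test_f1']]})

make_results('random forest cv', rf_cv)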
I then built an XGBoost model so I could compare the recall score with the random forest model:
# Instantiating the XGBoost classifier and defining its hyperparameter grid
xgb = XGBClassifier(objective='binary:logistic', random_state=0)
cv_params = {'max_depth': [4, 8, 12],
             'min_child_weight': [3, 5],
             'learning_rate': [0.01, 0.1],
             'n_estimators': [300, 500]
             }

# Grid search refit on recall, as with the random forest
scoring = {'accuracy', 'precision', 'recall', 'f1'}
xgb_cv = GridSearchCV(xgb, cv_params, scoring=scoring, cv=5, refit='recall')
%%time
xgb_cv.fit(X_train, y_train)
xgb_cv.best_score_
xgb_cv.best_params_
I then plotted the confusion matrix of the XGBoost model's predictions on the validation set to see how it performed:
# Predicting on the validation set with the tuned XGBoost model
y_pred = xgb_cv.best_estimator_.predict(X_val)
cm = confusion_matrix(y_val, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=None)
disp.plot(values_format='');
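For a fuller numeric summary of the validation performance, the classification_report function imported earlier reports per-class precision, recall, and F1 in one call (the label names here are my own; 0 = stayed, 1 = left):
# Per-class precision, recall, and F1 on the validation set
print(classification_report(y_val, y_pred, target_names=['stayed', 'left']))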
I then used this XGBoost model on the test set to see how it would perform on data it had not been trained on:
# Predicting on the held-out test set with the tuned XGBoost model
y_preds = xgb_cv.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=None)
disp.plot(values_format='');
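To attach single numbers to the test-set performance, the scalar metric functions imported at the top can be applied to the same predictions; a short sketch:
# Scalar test-set metrics using the functions imported earlier
print("Accuracy: ", accuracy_score(y_test, y_preds))
print("Precision:", precision_score(y_test, y_preds))
print("Recall:   ", recall_score(y_test, y_preds))
print("F1:       ", f1_score(y_test, y_preds))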
I then looked further into this model to find the most important variables in predicting whether an employee is going to leave or not:
plot_importance(xgb_cv.best_estimator_);
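The same ranking can also be read off programmatically from the fitted estimator's feature_importances_ attribute; a sketch, assuming the fitted xgb_cv from above:
# Ranking features by the fitted XGBoost model's importance scores
importances = pd.Series(xgb_cv.best_estimator_.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances.head(5))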
The importance plot ranks the top variables in predicting whether or not an employee is going to leave.
We can combine this ranking with the earlier exploratory analysis to determine what changes the company could make to decrease its high turnover rate.
Conclusions:
The models, and the feature importances extracted from them, confirm that employees at the company are overworked.
To better retain employees, recommendations aimed at addressing this overwork could be presented to the stakeholders.
This capstone project has been an enriching experience, allowing me to apply and deepen my analytics skills in a realistic business context. Thank you for taking the time to read about it; your interest and support are greatly appreciated.