Analyzing Employee Attrition using R
Pic 1

INTRODUCTION

Human Resources departments fight a constant battle to retain long-term employees. Longer-tenured employees are a great resource to companies, as their experience can have a far-reaching influence throughout the business. However, retention is not always easy, because employee needs and expectations are always changing. For this project, I'll be a data analyst intern for IBM in the Human Resources department. Recently, many people have been leaving the company, driving up the attrition rate. My boss wants me to explore the reasons why.


THE DATA

This is an augmented data set created by a real IBM data scientist. The data used in this project can be found on Kaggle.com. The dataset comprises 35 columns and 1,471 rows (the job role totals below sum to 1,470 employees, so the row count presumably includes the header row) and consists of the following job role totals:

  Healthcare Representative 131
  Human Resources            52
  Laboratory Technician     259
  Manager                   102
  Manufacturing Director    145
  Research Director          80
  Research Scientist        292
  Sales Executive           326
  Sales Representative       83        
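Totals like the ones above are the kind of output base R's table() function produces. Here is a minimal sketch, assuming the data sits in a data frame with a JobRole column. The small hr data frame is made up for illustration and is not the Kaggle data; with the real file you would load the CSV with read.csv() first.

```r
# Hypothetical stand-in for the Kaggle data: a few made-up rows.
hr <- data.frame(
  JobRole = c("Manager", "Sales Executive", "Manager",
              "Research Scientist", "Sales Executive", "Sales Executive"),
  stringsAsFactors = FALSE
)

# Number of rows and columns in the data frame.
print(dim(hr))

# Head count per job role, sorted alphabetically by role name.
role_totals <- table(hr$JobRole)
print(role_totals)
```

The counts always sum to the number of rows, which is a quick way to check a reported row total against a table like the one above.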

What we will learn:

What are the strongest correlations in this data?

Are longer tenured employees more likely to be laid off?

Understanding of: Statistical significance using Box Plots

Hypothesis Testing using a t.test

Linear regression using linear models

KEY TAKEAWAYS

The data columns with the highest correlations for leavers are:

Age - TotalWorkingYears

Age - MonthlyIncome


Pic 2


THE ANALYSIS

My boss wants to get an overview of how some of the most important demographics correlate. A correlation is a relationship between two or more measures. I will first build a correlation matrix using R's cor function, which computes the correlation between every pair of the columns chosen below:

"Age", "DailyRate", "DistanceFromHome", "Education", "HourlyRate", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "TotalWorkingYears", "TrainingTimesLastYear"
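The original query is only available as an image, so here is a minimal sketch of what a cor call of this kind can look like. The hr data frame and its values are simulated for illustration (an assumption, not the Kaggle data), and only four of the columns are used to keep the output small.

```r
# Simulated stand-in data (hypothetical values, not the Kaggle file).
set.seed(42)
n <- 200
age <- round(runif(n, 18, 60))
hr <- data.frame(
  Age = age,
  TotalWorkingYears = pmax(0, age - 18 - rpois(n, 4)),
  MonthlyIncome = round(1500 + 120 * (age - 18) + rnorm(n, sd = 1500)),
  DistanceFromHome = sample(1:29, n, replace = TRUE)
)

# Choose the numeric columns to compare.
cols <- c("Age", "TotalWorkingYears", "MonthlyIncome", "DistanceFromHome")

# cor() returns a symmetric matrix of pairwise Pearson correlations
# with 1s on the diagonal (each column against itself).
cor_matrix <- cor(hr[, cols])
print(round(cor_matrix, 3))
```

Each cell holds a value between -1 and 1. Values near 0 indicate little linear relationship, and values near 1 or -1 indicate a strong one.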


Query

[Image: R query]

Result

[Image: query result]

The highest correlations are between the yellow highlighted columns. They are:

Age - Education at .208

Age - MonthlyIncome at .497

Age - NumCompaniesWorked at .299

Age - TotalWorkingYears at .68

These results are expected, as wages generally go up with age and tenure at a company. Education also correlates with age, though much more weakly. Let's examine this correlated data in a more visual way by using a scatter plot matrix. While the Age - Education correlation shows up in the correlation matrix above, it does not come through well in the scatter plot, as can be seen below.
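A scatter plot matrix can be drawn with base R's pairs function. The sketch below is not the author's query (which is only available as an image). It uses simulated data, and it assumes Education is stored as a small set of whole-number levels, which is why its panels show up as stacked bands rather than a cloud. The plot is written to a temporary PDF so the example also runs in a non-interactive session.

```r
# Simulated stand-in data (hypothetical values, not the Kaggle file).
set.seed(7)
n <- 150
age <- round(runif(n, 18, 60))
hr <- data.frame(
  Age = age,
  Education = sample(1:5, n, replace = TRUE),  # assumed coding: levels 1 to 5
  MonthlyIncome = round(1500 + 120 * (age - 18) + rnorm(n, sd = 1500))
)

# Draw every column against every other column in one grid of panels.
out_file <- file.path(tempdir(), "scatter_matrix.pdf")
pdf(out_file)
pairs(hr, main = "Scatter plot matrix (made-up data)")
invisible(dev.off())

print(out_file)
```

In an interactive session you can drop the pdf() and dev.off() lines and call pairs(hr) directly to see the plot on screen.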


Query

[Image: R query]

Result

[Image: query result]

I decided to run another scatter plot matrix by substituting in two different variables, YearsSinceLastPromotion and YearsInCurrentRole, to see if they had any correlation using the same two methods as above.

[Image: correlation matrix for YearsSinceLastPromotion and YearsInCurrentRole]

Years since last promotion has a noticeable correlation with years in current role. Again, this is not a surprising relationship. Here are the scatter plot query and matrix. The scatter plot view does not really show the strength of the relationship shown above.

Query

[Image: R query]

Result

[Image: query result]

Statistical significance using box plots

Statistical significance indicates whether the observed differences or relationships between variables in a study are unlikely to have occurred by chance alone. It suggests that the results are likely to reflect a true effect in the population examined.

Age is sometimes seen as a determining factor when layoffs occur, because longer-tenured employees normally earn higher wages. So, was age a factor in who was fired and who was not? I decided to build a box plot to compare the ages of employees who lost their jobs with the ages of those who stayed. A higher age seems to be slightly advantageous, judging by the lines in the middle of the boxes shown below, which mark the median age of each group. The median age of people who were retained was a few years higher than that of those who were not. Interesting!
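A box plot of this kind can be sketched with base R's boxplot function. The data below is simulated, and the Attrition column with "Yes" and "No" values is an assumption about how leavers are flagged. Note that the line inside each box is the group median, not the mean. The plot goes to a temporary PDF so the example also runs non-interactively.

```r
# Simulated stand-in data (hypothetical values, not the Kaggle file).
set.seed(3)
hr <- data.frame(
  Age = c(round(rnorm(50, mean = 34, sd = 9)),
          round(rnorm(200, mean = 38, sd = 9))),
  Attrition = rep(c("Yes", "No"), times = c(50, 200))
)

# One box per Attrition group; boxplot() also returns the plotted statistics.
out_file <- file.path(tempdir(), "age_by_attrition.pdf")
pdf(out_file)
bp <- boxplot(Age ~ Attrition, data = hr,
              ylab = "Age", main = "Age by attrition (made-up data)")
invisible(dev.off())

# Row 3 of bp$stats holds the median of each group (the line in the box).
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```

Pulling the medians out of the returned object makes it possible to quote the exact numbers instead of reading them off the chart.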

Query

[Image: R query]

Result

[Image: query result]

Hypothesis Testing using a t.test

Hypothesis testing is another statistical method used to draw conclusions about a population based on a sample of data. It helps us determine whether there is enough evidence to reject a specific claim, or hypothesis, about a population parameter. It is widely used in research, experiments, quality control, and many other fields to validate assumptions about populations. In the query below we create the variables yes_age for employees who were let go and no_age for the employees who were able to keep their jobs. R's t.test function then compares the two groups.
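As a rough sketch of the step just described, the code below builds yes_age and no_age and passes them to t.test. The data is simulated, and the Attrition column name and its "Yes"/"No" values are assumptions, since the original query is only available as an image.

```r
# Simulated stand-in data (hypothetical values, not the Kaggle file).
set.seed(1)
hr <- data.frame(
  Age = c(round(rnorm(60, mean = 34, sd = 9)),
          round(rnorm(240, mean = 38, sd = 9))),
  Attrition = rep(c("Yes", "No"), times = c(60, 240))
)

# Ages of employees who left and of employees who stayed.
yes_age <- hr$Age[hr$Attrition == "Yes"]
no_age  <- hr$Age[hr$Attrition == "No"]

# Two-sample t-test; by default R runs Welch's version,
# which does not assume equal variances in the two groups.
result <- t.test(yes_age, no_age)
print(result)

# The group means reported at the bottom of the printed output.
print(result$estimate)
```

In the printed result, "mean of x" refers to the first argument (yes_age) and "mean of y" to the second (no_age).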

Query

[Image: R query]

Result

The t.test mirrors the results of the box plot. Average age values are shown at the bottom of the results as mean of x and mean of y.

[Image: t.test result]


P-values, or how much do we trust the results?

A p-value tells us how likely we would be to observe data at least as extreme as ours if the null hypothesis (no difference between groups) were true. It helps us gauge the strength of evidence against the null hypothesis and make decisions in hypothesis testing. A smaller p-value suggests stronger evidence against the null hypothesis, while a larger p-value suggests weaker evidence.
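To make the "smaller versus larger" idea concrete, here is a small sketch that pulls the p-value out of two t.test results and compares each with the conventional 0.05 threshold. The three age vectors are made up for illustration: group_b is very similar to group_a, while group_c is group_a shifted up by 15 years.

```r
# Made-up ages for three hypothetical groups.
group_a <- c(29, 31, 33, 35, 36, 38, 40, 41)
group_b <- c(30, 32, 33, 35, 37, 38, 39, 42)  # close to group_a
group_c <- group_a + 15                       # clearly older than group_a

alpha <- 0.05  # conventional significance threshold

# t.test() returns a list; the p-value is stored in $p.value.
p_similar <- t.test(group_a, group_b)$p.value
p_shifted <- t.test(group_a, group_c)$p.value

cat("similar groups, p-value:", p_similar,
    "-> reject null?", p_similar < alpha, "\n")
cat("shifted groups, p-value:", p_shifted,
    "-> reject null?", p_shifted < alpha, "\n")
```

A p-value below alpha counts as evidence against the null hypothesis of no difference. A p-value above it means the data does not give enough evidence to reject that hypothesis.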


Query

[Image: R query]

Result

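The value shown below is 0.6768. Keep in mind that a p-value is a probability, not a correlation percentage, and only a small p-value (conventionally below 0.05) is evidence against the null hypothesis. Read as 0.6768 percent (that is, 0.006768), it would be small enough to reject the null hypothesis; read as 0.6768, it would be far too large to do so.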

[Image: query result]

Linear Regression Models

Pic 4: linear regression example


Linear regression is a statistical method used to understand the relationship between two variables: an independent variable (often called the "predictor" or "input") and a dependent variable (often called the "response" or "output"). It finds the straight line that best fits the data points in a scatter plot. It shows how changes in the independent variable are associated with changes in the dependent variable and allows for predictions or forecasts based on that relationship. The formula for linear regression is y = mx + b, where y is the target variable, x is the input (or predicting) variable, m is the slope, and b is the y-intercept.
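The y = mx + b formula maps directly onto R's lm function. In the sketch below, the ages and incomes are made-up numbers, and using Age to predict MonthlyIncome is an assumption chosen for illustration. The fitted intercept plays the role of b and the fitted Age coefficient plays the role of m.

```r
# Made-up data: income rises by roughly 100 per year of age, plus some noise.
age <- c(22, 25, 30, 35, 40, 45, 50, 55)
income <- 1000 + 100 * age + c(-200, 150, -50, 100, -100, 50, 200, -150)
hr <- data.frame(Age = age, MonthlyIncome = income)

# Fit the straight line MonthlyIncome = m * Age + b by least squares.
fit <- lm(MonthlyIncome ~ Age, data = hr)

b <- coef(fit)[["(Intercept)"]]  # y-intercept
m <- coef(fit)[["Age"]]          # slope
cat("slope m:", m, " intercept b:", b, "\n")

# Predict income for a 33-year-old, with predict() and by hand.
predicted <- predict(fit, newdata = data.frame(Age = 33))
by_hand <- m * 33 + b
cat("predict():", predicted, " m * 33 + b:", by_hand, "\n")
```

The two predictions agree, which shows that predict() is doing nothing more than plugging x into the fitted line.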


Query

[Image: R query]

Result

[Image: query result]

The multiple R-squared value above is 0.2479, meaning the model explains about 25% of the variation in the response. The p-value is reported as less than 2.2e-16 (e-16 means times 10 to the power of -16), so it is basically 0. Since p is less than 0.05, the model is statistically significant at the 5% level. I will add another variable, TotalWorkingYears, and see how the results change.

Query

[Image: R query]

Result

[Image: query result]

The multiple R-squared value above is 0.5988. Again the p-value is reported as less than 2.2e-16, essentially 0. Since p is less than 0.05, this model is statistically significant at the 5% level, and its higher R-squared makes it an even better predictor of monthly income.
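The comparison of the two models can be sketched as follows. The data is simulated, and the model formulas (MonthlyIncome predicted by Age, then by Age plus TotalWorkingYears) are an assumption, since the queries are only available as images. That assumption is at least consistent with the numbers reported: squaring the Age - MonthlyIncome correlation of .497 gives about 0.247, close to the first model's R-squared of 0.2479.

```r
# Simulated stand-in data (hypothetical values, not the Kaggle file).
set.seed(11)
n <- 300
age <- round(runif(n, 18, 60))
total_years <- pmax(0, age - 18 - rpois(n, 5))
hr <- data.frame(
  Age = age,
  TotalWorkingYears = total_years,
  MonthlyIncome = round(2000 + 30 * age + 400 * total_years +
                          rnorm(n, sd = 2500))
)

# Model 1: one predictor. Model 2: add TotalWorkingYears.
fit_one <- lm(MonthlyIncome ~ Age, data = hr)
fit_two <- lm(MonthlyIncome ~ Age + TotalWorkingYears, data = hr)

# Multiple R-squared is stored in the model summary.
r2_one <- summary(fit_one)$r.squared
r2_two <- summary(fit_two)$r.squared
cat("R-squared, Age only:               ", r2_one, "\n")
cat("R-squared, Age + TotalWorkingYears:", r2_two, "\n")

# With a single predictor, R-squared equals the squared correlation.
r2_from_cor <- cor(hr$Age, hr$MonthlyIncome)^2
cat("Squared Age - MonthlyIncome correlation:", r2_from_cor, "\n")
```

Multiple R-squared never goes down when a predictor is added, so the size of the increase matters more than the fact that it rose. The adjusted R-squared in the same summary output is a fairer comparison between models of different sizes.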


CONCLUSION

One of the more powerful tools you can use for data analysis is the R programming language. R was designed primarily for statistical computing, and it excels at handling and analyzing data, making it a preferred choice among experienced statisticians and data scientists. R does have a steeper learning curve than some other data analysis programs.

In this analysis I used box plots, scatter plot matrices, t.tests and linear regression models to show correlations between several columns of employee attrition data. The data columns with the highest correlations for leavers are Age - TotalWorkingYears and Age - MonthlyIncome. These results are not too surprising, as both working years and income tend to rise with age and tenure. I also showed how age is not necessarily a determining factor in who leaves the company. Using statistical significance, hypothesis testing and linear regression, you can really narrow down results and provide the information that you need.

Thank you for taking the time out of your busy day to read my article. Feel free to reach out with any questions or comments. If you, or someone you know, has a vacancy for an entry level data analyst please let me know!



Pic 1 Credit: Getty Images/iStockphoto

Pic 2 https://images.app.goo.gl/6uS128Gyi5WRct2r8

Pic 3 https://images.app.goo.gl/nPmPRmL1gLer7m3bA

Pic 4 https://www.ablebits.com/office-addins-blog/linear-regression-analysis-excel/
