Analyzing Employee Attrition using R
Jon Ekroth
Data Analyst | Excel | Tableau | SQL | Data Visualization | Problem solving | Troubleshooting
INTRODUCTION
Human Resources departments fight a constant battle to retain long-term employees. Longer-tenured employees are a great resource to companies, as their experience can have a far-reaching influence throughout the business. However, retention is not always easy because employee needs and expectations are always changing. For this project, I'll play the role of a data analyst intern in IBM's Human Resources department. Recently, many people have been leaving the company, driving up the attrition rate. My boss wants me to explore the reasons why.
THE DATA
This is an augmented data set created by a real IBM data scientist. The data used in this project can be found on Kaggle.com. The dataset comprises 35 columns and 1,471 rows; the job role totals below sum to 1,470, so the row count appears to include the header row. The job role totals are:
Healthcare Representative 131
Human Resources 52
Laboratory Technician 259
Manager 102
Manufacturing Director 145
Research Director 80
Research Scientist 292
Sales Executive 326
Sales Representative 83
What we will learn:
What are the strongest correlations in this data?
Are longer tenured employees more likely to be laid off?
An understanding of:
Statistical significance using box plots
Hypothesis testing using a t-test
Linear regression using linear models
KEY TAKEAWAYS
The pairs of columns with the highest correlation coefficients for leavers are:
Age - TotalWorkingYears
Age - MonthlyIncome
THE ANALYSIS
My boss wants an overview of how some of the most important demographics correlate. A correlation measures the strength and direction of the relationship between two variables, on a scale from -1 to 1. I will first build a correlation matrix using R's cor function, which computes the correlation between every pair of the columns chosen below:
"Age", "DailyRate", "DistanceFromHome", "Education", "HourlyRate", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "TotalWorkingYears", "TrainingTimesLastYear"
Query
Result
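As a rough illustration of the kind of call described above, here is a minimal sketch of a correlation matrix built with cor. The data frame `hr` is synthetic stand-in data that only borrows a few of the column names listed above. It is not the Kaggle file, so the printed numbers will not match the article's result.

```r
# Synthetic stand-in data (an assumption for illustration, NOT the Kaggle file).
set.seed(42)
n <- 200
Age <- sample(18:60, n, replace = TRUE)
TotalWorkingYears <- pmax(0, Age - 18 - sample(0:10, n, replace = TRUE))
MonthlyIncome <- round(2000 + 300 * TotalWorkingYears + rnorm(n, sd = 2000))
DistanceFromHome <- sample(1:29, n, replace = TRUE)
hr <- data.frame(Age, TotalWorkingYears, MonthlyIncome, DistanceFromHome)

# Select the numeric columns of interest and compute every pairwise
# Pearson correlation (the default method of cor), rounded for readability.
cols <- c("Age", "TotalWorkingYears", "MonthlyIncome", "DistanceFromHome")
cor_matrix <- round(cor(hr[, cols]), 3)
print(cor_matrix)
```

With the real data, `hr` would instead come from calling read.csv on the downloaded file, and `cols` would hold the ten column names listed above.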
In the result, the highest correlations are between the yellow highlighted columns. They are:
Age - Education at .208
Age - MonthlyIncome at .497
Age - NumCompaniesWorked at .299
Age - TotalWorkingYears at .68
These results are expected, as wages generally go up with age and tenure at a company. Education also correlates with age, though much more weakly. Let's examine this correlated data in a more visual way by using a scatter plot matrix. While the Age - Education correlation shows up in the correlation matrix above, it does not come through well in the scatter plot, as can be seen below.
Query
Result
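A scatter plot matrix like the one described can be sketched with base R's pairs function. The example below again uses made-up data. The choice of three columns, and the idea that Education is a small integer scale, are assumptions for illustration only.

```r
# Synthetic stand-in data (an assumption for illustration, NOT the Kaggle file).
set.seed(1)
n <- 150
Age <- sample(18:60, n, replace = TRUE)
Education <- sample(1:5, n, replace = TRUE)  # assumed ordinal scale with few levels
MonthlyIncome <- round(1500 + 120 * Age + rnorm(n, sd = 1500))
hr <- data.frame(Age, Education, MonthlyIncome)

# pairs() draws one scatter plot panel for every pair of columns.
pairs(hr, main = "Scatter plot matrix (synthetic data)")
```

A column that takes only a handful of distinct values, such as the assumed Education scale here, shows up as a few horizontal or vertical bands of points. That is one plausible reason a weak correlation is hard to see in this kind of plot.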
I decided to run another scatter plot matrix by substituting in two different variables, YearsSinceLastPromotion and YearsInCurrentRole, to see if they had any correlation using the same two methods as above.
YearsSinceLastPromotion has a noticeable correlation with YearsInCurrentRole. Again, this is not a surprising relationship. Here is the scatter plot query and matrix. The scatter plot view does not really show the strength of the relationship that the correlation value indicates.
Query
Result
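One possible reason a scatter plot understates a relationship between two whole-number "years" columns is overplotting: many employees share exactly the same pair of values, so they are drawn on top of each other. The sketch below is not the author's method. It uses synthetic data and base R's jitter function, as an assumption about what might help, to spread the points slightly.

```r
# Synthetic stand-in data (an assumption for illustration, NOT the Kaggle file).
set.seed(7)
n <- 300
YearsInCurrentRole <- sample(0:15, n, replace = TRUE)
YearsSinceLastPromotion <- pmax(0, YearsInCurrentRole - sample(0:8, n, replace = TRUE))

# The correlation coefficient uses every row, including the duplicates.
r <- cor(YearsInCurrentRole, YearsSinceLastPromotion)

# Count how many distinct (x, y) positions the n points actually occupy.
distinct_points <- nrow(unique(data.frame(YearsInCurrentRole, YearsSinceLastPromotion)))
cat("rows:", n, " distinct plotted positions:", distinct_points, "\n")

# jitter() adds a small amount of random noise so stacked points become visible.
plot(jitter(YearsInCurrentRole), jitter(YearsSinceLastPromotion),
     xlab = "YearsInCurrentRole (jittered)",
     ylab = "YearsSinceLastPromotion (jittered)",
     main = "Jittered scatter plot (synthetic data)")
```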
Statistical significance using box plots
Statistical significance indicates whether the observed differences or relationships between variables in a study are unlikely to have occurred by chance alone. It suggests that the results are likely to reflect a true effect in the population examined.
Age is sometimes seen as a determining factor when layoffs occur, because longer-tenured employees normally earn higher wages. So, was age a factor in who was let go and who was not? I decided to build a box plot to compare the age distribution of employees who lost their jobs with that of employees who stayed. A higher age seems to be slightly advantageous, as shown by the lines in the middle of the boxes below, which mark the median of each group. The median age of people who were retained was a few years higher than that of those who were not. Interesting! A box plot only suggests a difference, though; a formal test is needed to judge whether it is statistically significant.
Query
Result
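A box plot comparison of this kind can be sketched as follows. The data is synthetic. The grouping column name `Attrition` and its "Yes"/"No" values are assumptions, chosen to match the yes_age and no_age naming used later in the article.

```r
# Synthetic stand-in data (an assumption for illustration, NOT the Kaggle file).
set.seed(3)
hr <- data.frame(
  Attrition = factor(rep(c("No", "Yes"), times = c(160, 40))),
  Age = c(round(rnorm(160, mean = 38, sd = 9)),
          round(rnorm(40, mean = 34, sd = 9)))
)

# One box per group; boxplot() also returns the plotted statistics invisibly.
bp <- boxplot(Age ~ Attrition, data = hr,
              xlab = "Attrition", ylab = "Age",
              main = "Age by attrition (synthetic data)")

# Row 3 of bp$stats holds the line in the middle of each box: the group median.
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)
```

Reading the medians out of the returned object makes it easy to confirm that the middle line is a median rather than a mean.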
Hypothesis Testing using a t.test
Hypothesis testing is another statistical method used to draw conclusions about a population based on a sample of data. It helps us determine whether there is enough evidence to reject a specific claim, or hypothesis, about a population parameter. It is widely used in research, experiments, quality control, and many other fields to validate assumptions about populations. In the query below we create the variables yes_age, for employees who were let go, and no_age, for employees who kept their jobs. The t.test function then compares the mean ages of the two groups.
Query
Result
The t.test mirrors the results of the box plot. Average age values are shown at the bottom of the results as mean of x and mean of y.
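The two-group comparison described in this section can be sketched like this. The variable names yes_age and no_age come from the article. The data frame, the `Attrition` column name and all values are synthetic assumptions, so the output will differ from the article's result.

```r
# Synthetic stand-in data (an assumption for illustration, NOT the Kaggle file).
set.seed(3)
hr <- data.frame(
  Attrition = factor(rep(c("No", "Yes"), times = c(160, 40))),
  Age = c(round(rnorm(160, mean = 38, sd = 9)),
          round(rnorm(40, mean = 34, sd = 9)))
)

# Split the ages into the two groups named in the article.
yes_age <- hr$Age[hr$Attrition == "Yes"]
no_age  <- hr$Age[hr$Attrition == "No"]

# With two numeric vectors, t.test() runs a Welch two-sample t-test by default.
result <- t.test(yes_age, no_age)
print(result)

# The group means printed at the bottom are also available programmatically.
print(result$estimate)  # named "mean of x" and "mean of y"
```

Here x is the first argument (yes_age) and y is the second (no_age), which is how the "mean of x" and "mean of y" labels map back to the two groups.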
P-values, or how much do we trust the results?
A p-value is the probability of observing data at least as extreme as ours if the null hypothesis (no difference between groups) were true. It helps us gauge the strength of the evidence against the null hypothesis and make decisions in hypothesis testing. A smaller p-value suggests stronger evidence against the null hypothesis, while a larger p-value suggests weaker evidence.
Query
Result
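The p-value reported by the t.test equals 0.6768 percent. Read as stated, that is about 0.0068 as a proportion, which is below the conventional 0.05 threshold. A p-value is a probability rather than a correlation, and a value this small counts as evidence against the null hypothesis, so we reject the idea that the two groups have the same mean age.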
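The decision rule behind a p-value can be sketched in a few lines. The helper `decide` is a hypothetical name introduced only for this illustration, and the two groups are synthetic. The last two lines show why units matter when reading a figure such as 0.6768, on the assumption that the article's value really is a percentage.

```r
# Hypothetical helper: compare a p-value (as a proportion) with a threshold.
decide <- function(p, alpha = 0.05) {
  if (p < alpha) "reject the null hypothesis" else "fail to reject the null hypothesis"
}

# Synthetic groups (an assumption for illustration, NOT the Kaggle file).
set.seed(11)
group_a <- rnorm(100, mean = 34, sd = 9)
group_b <- rnorm(100, mean = 38, sd = 9)

# t.test() stores the p-value as a proportion between 0 and 1.
p <- t.test(group_a, group_b)$p.value
cat("p-value:", format(p, digits = 4), "->", decide(p), "\n")

# The same digits lead to opposite decisions depending on the unit.
cat("0.6768 percent    ->", decide(0.6768 / 100), "\n")
cat("0.6768 proportion ->", decide(0.6768), "\n")
```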
Linear Regression Models
Linear regression is a statistical method used to understand the relationship between two variables: an independent variable (often called the "predictor" or "input") and a dependent variable (often called the "response" or "output"). It aims to find the straight line that best fits the data points in a scatter plot. It provides insights into how changes in the independent variable affect the dependent variable and allows for predictions or forecasts based on that relationship. The formula for simple linear regression is y = mx + b, where y is the target variable, x is the input (or predicting) variable, m is the slope, and b is the y-intercept.
Query
Result
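To connect y = mx + b to R's lm function, here is a minimal sketch on synthetic data. Using Age as x and MonthlyIncome as y is an assumption for illustration. The fitted values will not match the article's result.

```r
# Synthetic stand-in data (an assumption for illustration, NOT the Kaggle file).
set.seed(5)
n <- 200
Age <- sample(18:60, n, replace = TRUE)
MonthlyIncome <- 1000 + 150 * Age + rnorm(n, sd = 2500)
hr <- data.frame(Age, MonthlyIncome)

# Fit y = m*x + b with y = MonthlyIncome and x = Age.
fit <- lm(MonthlyIncome ~ Age, data = hr)

b <- coef(fit)[["(Intercept)"]]  # y-intercept
m <- coef(fit)[["Age"]]          # slope: change in income per extra year of age
cat("slope m =", round(m, 2), " intercept b =", round(b, 2), "\n")

# A prediction from the model is just m * x + b.
predicted <- predict(fit, newdata = data.frame(Age = 40))
by_hand <- m * 40 + b
cat("predict():", round(predicted, 2), " m * 40 + b:", round(by_hand, 2), "\n")
```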
The multiple R-squared value reported for this model is 0.2479, meaning the model explains about 25% of the variance in the response. The p-value is less than 2.2e-16 (e-16 means times 10 to the power of -16); basically 0. Since p is less than 0.05, the model is statistically significant at the 5% level. I will add another variable, TotalWorkingYears, and see how the results change.
Query
Result
The multiple R-squared value reported for this model is 0.5988, so it explains about 60% of the variance. Again the p-value is less than 2.2e-16, essentially 0. Since p is less than 0.05, the model is statistically significant at the 5% level, and it is an even better predictor of monthly income than the first model.
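The comparison of the two models can be sketched as below, again on synthetic data, so the R-squared values will differ from 0.2479 and 0.5988. Taking Age as the first model's only predictor is an assumption, since the original query is not reproduced in the text.

```r
# Synthetic stand-in data (an assumption for illustration, NOT the Kaggle file).
set.seed(9)
n <- 300
Age <- sample(18:60, n, replace = TRUE)
TotalWorkingYears <- pmax(0, Age - 18 - sample(0:12, n, replace = TRUE))
MonthlyIncome <- 1500 + 40 * Age + 350 * TotalWorkingYears + rnorm(n, sd = 2500)
hr <- data.frame(Age, TotalWorkingYears, MonthlyIncome)

# Model 1: one predictor. Model 2: the same model plus TotalWorkingYears.
fit1 <- lm(MonthlyIncome ~ Age, data = hr)
fit2 <- lm(MonthlyIncome ~ Age + TotalWorkingYears, data = hr)

# Multiple R-squared: the share of variance in MonthlyIncome each model explains.
r2 <- c(age_only = summary(fit1)$r.squared,
        age_and_years = summary(fit2)$r.squared)
print(round(r2, 4))

# The overall p-value comes from the model's F statistic.
f <- summary(fit2)$fstatistic
p_overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
print(p_overall)
```

Two details help when reading summary output like the article's. R prints very small p-values as "< 2.2e-16" because that is its machine precision limit, not because the p-value equals that number. Also, in a one-predictor model R-squared equals the squared correlation: the Age - MonthlyIncome correlation of .497 squares to about .247, which is at least consistent with the 0.2479 reported for the first model.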
CONCLUSION
One of the more powerful tools you can use for data analysis is the R programming language. R was designed primarily for statistical computing, and it excels at handling and analyzing data, making it a preferred choice among experienced statisticians and data scientists. R does have a steeper learning curve than other data analysis programs.
In this analysis I used box plots, scatter plot matrices, t-tests and linear regression models to show correlations between several columns of employee attrition data. The pairs of columns with the highest correlation coefficients for leavers are Age - TotalWorkingYears and Age - MonthlyIncome. These results are not too surprising, as more working years and higher income both go with longer-tenured employees. I also showed how age is not necessarily a determining factor in who leaves the company. Using statistical significance, hypothesis testing and linear regression you can really narrow down results and provide the information that you need.
Thank you for taking the time out of your busy day to read my article. Feel free to reach out with any questions or comments. If you, or someone you know, has a vacancy for an entry level data analyst please let me know!
Pic 1 Credit: Getty Images/iStockphoto
Pic 2 https://images.app.goo.gl/6uS128Gyi5WRct2r8
Pic 3 https://images.app.goo.gl/nPmPRmL1gLer7m3bA
Pic 4 https://www.ablebits.com/office-addins-blog/linear-regression-analysis-excel/