Hypothesis Testing With R: Understanding relationships within IBM's HR data


Overview

In this project, I will be using the statistical programming language R (along with the RStudio IDE) to perform hypothesis testing on a dataset provided by IBM's HR department.

The dataset can be found here: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

The purpose of the analysis is to determine which factors are most strongly associated with employees leaving the organisation (whether through resignation, redundancy, retirement and so forth).

Creating a correlation matrix

Firstly we need to import the dataset. We'll use the read.csv function for this and assign the result to the variable 'hrdata' using the <- assignment operator.

[Image: Importing data and assigning it to a variable in R]
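
Since the screenshot isn't reproduced here, the import step would look roughly like this (the CSV file name is an assumption based on the Kaggle download):

# Read the IBM HR dataset into a dataframe (file name assumed)
hrdata <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")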

Before we perform any kind of analysis it's always a good idea to take a quick look at the data to get a sense of its structure. We can use the head function to do this, which returns the following table.

[Image: Using the head function in R to view the data structure]
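
The commands behind that screenshot are presumably along these lines (str is my addition for where the column count and data types come from):

head(hrdata)  # shows the first six rows of the dataframe
str(hrdata)   # shows the number of rows and columns and each column's data type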

From this command we see the first 6 rows of the data (the default for head) and all 35 columns, along with each column's data type (e.g. int, chr). The columns are: Age, Attrition, BusinessTravel, DailyRate, Department, DistanceFromHome, Education, EducationField, EmployeeCount, EmployeeNumber, EnvironmentSatisfaction, Gender, HourlyRate, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, MonthlyRate, NumCompaniesWorked, Over18, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager.

As we're currently exploring the question "What causes people to leave?" we want to know how all these categories relate to one another. Let's have a look by creating a correlation matrix.

We can create a correlation matrix with this command:

[Image: Creating a correlation matrix]
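
The exact columns chosen in the screenshot aren't visible, but the call would look something like this (the column selection below is an assumption based on the values discussed later):

# Correlation matrix over a subset of the numeric (non-ordinal) columns
cor(hrdata[, c("Age", "DailyRate", "DistanceFromHome", "MonthlyIncome",
               "NumCompaniesWorked", "TotalWorkingYears", "YearsAtCompany")])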


In the above command we call the correlation function (cor()), pass it our dataframe (hrdata) and select which columns we'd like to include in the correlation matrix. As there are 35 columns I decided not to select them all; additionally, a lot of the variables in the dataframe are ordinal (e.g. education, job satisfaction, performance rating), and including these in a correlation matrix is bad practice, so I left them out.

The output of our correlation matrix is as follows:

[Image: The output of the correlation matrix]


What the correlation function has done is calculate the correlation coefficient for every selected column crossed with every other, which forms the matrix. The correlation coefficient ranges from -1 to 1, and the further the value is from zero (in either direction) the stronger the correlation; for example, a coefficient of -0.89 indicates a strong negative correlation whereas a coefficient of 0.14 indicates hardly any correlation at all. So in the above table let's look for values greater than 0.4. Ignoring the diagonal, where each category is crossed with itself and the value is always exactly 1 (plotting a variable against itself gives the line y = x), the following correlations stand out: Monthly Income vs Age (0.49), Years At Company vs Total Working Years (0.63), Years At Company vs Monthly Income (0.51) and Age vs Total Working Years (0.68). We might expect age and total working years to be strongly correlated: the older someone is, the more years they have typically been working. However this is not always the case, as many people step out of the workforce at some point during their careers, most notably to take care of children.

Creating a pair plot

Let's look at some of these relationships visually by creating a pair plot. A pair plot is a grid of scatterplots showing the pairwise relationships between each combination of a chosen set of variables.

The command for this in R uses the pairs function: we pass in the dataframe and specify which columns we'd like to include in the pair plot.

[Image: Creating a pairplot]
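
A sketch of the call (the column selection is again an assumption, chosen to include the relationships discussed below):

# Pair plot: a grid of scatterplots for each pair of the selected columns
pairs(hrdata[, c("Age", "MonthlyIncome", "TotalWorkingYears", "YearsAtCompany")])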

Our output looks like this.

[Image: The output of our pairplot command]

To interpret this plot we need to understand what each panel represents within the matrix. The lower left panel shows the relationship between years at the company and monthly income; we can tell this by looking at the identifier boxes at the top left and lower right of its row and column respectively. Visually this particular relationship doesn't appear very strong, although the correlation matrix tells us it is 0.51, which is not nothing. It does appear that the longer you're at the company the higher your monthly income, but there are plenty of datapoints saying the opposite, with people on relatively few years at the company in top earning positions. New hires in a specialist field perhaps, or new C-suite members?


Creating boxplots

We're almost ready to start our hypothesis testing, but it's usually good practice to visualise the data beforehand. We're going to be performing a two sample t test, looking for a significant difference between the means of two samples. First, though, we'll visualise the samples with box plots. We'll use the boxplot function and plug in our variables 'Attrition' and 'Age'. The question we're asking is whether employees who left the business differ on a certain characteristic (in this example age) from those who stayed. The Attrition column takes a value of either Yes or No, which will form our two sample groups: a Yes means the employee has left the business and a No means they remained.

[Image: Creating a boxplot in R]
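
A sketch of the boxplot call (the axis labels are my own):

# Age split by the two Attrition groups (Yes = left, No = stayed)
boxplot(Age ~ Attrition, data = hrdata, xlab = "Attrition", ylab = "Age")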

The box plot looks like this. The edges of each box represent the interquartile range (the difference between the first and third quartiles) and the thick black line inside the box is the median. The whiskers extend to the most extreme datapoints within 1.5 times the interquartile range of the box, and anything beyond that is plotted as an individual outlier point. From this visualisation we can see that the data is not too spread out and both samples have a similar distribution.

[Image: The output of our boxplot in R]

From the plot it looks as though the people in the Yes group are, on average, younger than those in the No group. In other words, the people who left the company tended to be younger. But eyeballing a plot is not the best way to answer this question; we need to perform a hypothesis test to determine whether this observation holds and whether the difference is statistically significant.

Performing t-tests

Before we perform a t test we first need to state the null and alternative hypotheses. After performing the t test we can then either fail to reject the null hypothesis or reject it in favour of the alternative.

Null Hypothesis: H0 "There is no relationship between the variables"

Alternative Hypothesis: H1 "There is a relationship between the variables"

What we're saying with the first statement (H0) is that the two variables are independent of each other; in the context of our two sample t test, this means the mean age of the leavers and the stayers is the same. For H1 we are saying the opposite is true. Let's now perform the t test and find out.

Mathematically, to perform a two sample t test by hand we'd need to do several things: calculate both samples' means, variances and standard deviations, check that the sample variances don't differ by more than a factor of about 3, calculate the common (pooled) population variance, and finally compute the test statistic. The test statistic follows a t distribution, so we then compare it with the critical value from a t table; if the test statistic is greater than the table value we reject the null hypothesis in favour of the alternative, and if it is less we fail to reject it.
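
As a rough sketch of that textbook procedure (the pooled, equal-variance version) written out in R:

# Manual pooled-variance two sample t statistic; a sketch of the textbook calculation
pooled_t_test <- function(a, b) {
  na <- length(a); nb <- length(b)
  sp2 <- ((na - 1) * var(a) + (nb - 1) * var(b)) / (na + nb - 2)  # pooled variance
  t_stat <- (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))   # test statistic
  c(t = t_stat, df = na + nb - 2)                                 # statistic and degrees of freedom
}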

Fortunately R can calculate this for us with relative ease. The first thing we need to do is create two new variables, yes_age and no_age: for the former we take the Age column of our dataframe where Attrition is equal (==) to "Yes", and for no_age we do the opposite. This creates two groups which differ only by that value. We can then call the t.test function and pass in these two new variables.

[Image: Setting up the conditions for a two sample t test]
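
The subsetting in the screenshot may be written slightly differently, but the idea is:

yes_age <- hrdata$Age[hrdata$Attrition == "Yes"]  # ages of employees who left
no_age  <- hrdata$Age[hrdata$Attrition == "No"]   # ages of employees who stayed
t.test(yes_age, no_age)                           # two sample t test (Welch by default)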

This returns the following output.

[Image: The output of our t test]

The result of our t test gives us the test statistic (t), the degrees of freedom (df) and the p-value. It also gives the 95% confidence interval for the difference between the means, along with the means of x and y.

The key thing to look at here is the t statistic, the degrees of freedom and the p value. The procedure that calculates the test statistic compares our data to what is expected under the null hypothesis. So the greater this value the more the sample data differs from the null hypothesis.

An example of the t distribution is shown below; notice how as the degrees of freedom (ν) increase, the peak gets higher and the tails thinner. This is one reason why a larger number of sample datapoints makes it easier to establish statistical significance. That brings us to the next item in our output, the degrees of freedom. For the classic pooled two sample t test the degrees of freedom are calculated as Na + Nb - 2; R's t.test actually defaults to the Welch version of the test, which estimates the degrees of freedom slightly differently (which is why our value below is not a whole number), but the principle is the same: the more datapoints in the samples, the greater the degrees of freedom and the more sensitive the test.

[Image: The t distribution (source: https://www.jmp.com/en_us/statistics-knowledge-portal/t-test/t-distribution.html)]


An example of a t distribution table is shown below. You'll notice that as the number of degrees of freedom increases, the critical value of t needed for a given significance level gets smaller.


[Image: T table (source: https://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf)]

Our t test returned a t value of -5.828 with 316.93 degrees of freedom. For the comparison with the table we consider the absolute value (5.828); the negative sign simply reflects the direction of the difference, i.e. which of the two groups has the larger mean. We can see that 5.828 is greater than the critical value in the row for 1000 degrees of freedom (the closest row shown to our 316.93 df). We can therefore reject the null hypothesis even at the 0.1% significance level (as 5.828 > 3.291). In other words, the t test tells us the two sample means really are different from one another, and so age does play a role in employee attrition.

The sign of our t value depends on how we set up the t test in R; if we pass the variables in the opposite order the result is positive. We originally asked whether the probability of leaving the company increases with age, and the box plot suggested the opposite: younger people were more likely to leave, so the mean of the first (Yes) group is lower and the t value comes out negative.

We must also consider the p value, which is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. At the 95% confidence level we are looking for a p value less than 5% (0.05); if this is obtained then we can call the result 'statistically significant' at that level (although the use of such thresholds is hotly debated amongst the statistical community).

P values can be read from tables similar to the t table above, or calculated directly from the probability distribution of the test statistic: the larger the deviation between the observed statistic and what we'd expect under the null hypothesis, the lower the p value. So to get a statistically significant outcome there needs to be a large enough deviation between the observed value and the value expected under the null. This is another reason why a larger sample size improves the reliability of the test: with a very small sample, even a sizeable difference between groups may not produce a p value small enough to give us confidence in the observed results.

Our t test gave us a p value of 1.3e-08 (or 0.000000013). When interpreting p values we're looking for a value < 0.05, and as 0.000000013 < 0.05 we can say that the result of our test is statistically significant. We reject the null hypothesis in favour of the alternative hypothesis: there is a relationship between the variables.


Performing t-tests with a different factor

Let's now perform another t test to determine whether employee number is a significant factor in employee attrition. The logic here is that lower employee numbers might correspond to earlier hires, so perhaps new hires are more likely to leave the company than people who have been there for a longer period of time. This could reflect generational differences in attitudes towards work, or differences in management, culture or structure within the organisation that aren't captured in the dataset. Let's first create a boxplot again.

[Image: Creating another boxplot]
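
A sketch of the call (labels are my own):

# Employee number split by the two Attrition groups
boxplot(EmployeeNumber ~ Attrition, data = hrdata, xlab = "Attrition", ylab = "Employee Number")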

We get the following boxplot.

[Image: The output of our second boxplot]

We can see from the above plot that both samples look very similar. Let's now perform a t test and analyse the results.
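
The setup mirrors the previous test (the variable names here are my own):

yes_en <- hrdata$EmployeeNumber[hrdata$Attrition == "Yes"]
no_en  <- hrdata$EmployeeNumber[hrdata$Attrition == "No"]
t.test(yes_en, no_en)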

[Image: The output of our second t test]

Here we have very different results from our initial t test. We have a very small t value (in general, the larger the absolute t value, the greater the chance of rejecting the null hypothesis), and we have a much higher p value; it's actually higher than our 0.05 'alpha', so in this case we fail to reject the null hypothesis. In other words, we have no evidence of a relationship between these variables (at the 95% confidence level).

In reality the above example is not best practice for hypothesis testing: employee number is essentially arbitrary and doesn't really predict anything. If we wanted to know whether new employees are more likely to leave than people who have been at the company for a while, we'd perform a t test using the YearsAtCompany column. The purpose of the above example was simply to show a scenario in which you fail to reject the null hypothesis. Interestingly however, when performing a t test using YearsAtCompany we get the following result:

[Image: Our third t test for evaluating the number of years an employee spent at the company]
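
For reference, that third test could be set up along the same lines (a sketch; the variable names are my own):

yes_years <- hrdata$YearsAtCompany[hrdata$Attrition == "Yes"]
no_years  <- hrdata$YearsAtCompany[hrdata$Attrition == "No"]
t.test(yes_years, no_years)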

The t value is negative, again reflecting the direction of the difference, but it is large in absolute terms. This test essentially shows the same relationship we saw in our first t test: people who have been at the company longer are less likely to leave than those who haven't.


Linear Regression in R

From our correlation matrix we saw some relationships emerge between columns, for example between monthly income and age. We can try to predict someone's monthly income based on their age by using linear regression.

Linear regression is a tool for modelling the relationship between variables and predicting outcomes. It works by fitting a straight line through the datapoints on a graph; this line is then used to predict new values. To make sure the line drawn is not arbitrary, the vertical distance between each data point and the line is measured; this distance is called a residual. The residuals are squared and summed, giving the sum of squared residuals, and the fitting procedure chooses the line that makes this sum as small as possible. This 'least squares' line is our line of best fit.

To get a measure of how well this line predicts what we want to predict, we can look at the R-squared value. R-squared tells you how much of the variation in the outcome is explained by the relationship with the predictor. For example, if we measured a group's heights and weights, built a linear model to predict weight from height, and calculated an R-squared value of 0.7, we would say that 70% of the variation in weight is accounted for by height. So it's a good measure of how well your model is working. Generally speaking, the higher the R-squared value the more accurate your prediction can be.

R-squared is calculated as the variation around the mean minus the variation around the fitted line, all divided by the variation around the mean (equivalently, 1 minus the residual sum of squares over the total sum of squares). Fortunately all of this can be done in R using the lm (linear model) function. Here we pass in the formula MonthlyIncome ~ Age and specify which dataframe to model from, then use the summary function to show the output of the model.


[Image: Creating a linear regression model in R]
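
A sketch of the model fit (the variable name is my own):

income_model <- lm(MonthlyIncome ~ Age, data = hrdata)  # simple linear regression
summary(income_model)                                   # coefficients, R-squared and p-value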


[Image: The output of our linear regression model]

In the above output we're mainly looking at the R-squared value and the p-value: the former tells us how well our model fits, and the latter whether the relationship is statistically significant. The R-squared value is given as 0.2479, let's call it 25%, so our model suggests that a worker's age accounts for about 25% of the variation in their monthly income. The p value here is 0.00000000000000022 (2.2e-16, the smallest value R reports by default), which is far below 0.05, so we can say the model is statistically significant at the 95% confidence level; in fact the evidence is far stronger than even the 1% level would require.

We can also plot this model to see what the regression line looks like with respect to the data, using the plot function like so.

[Image: Creating a plot with trendline for our linear model]
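
The screenshot likely combines plot with abline to overlay the fitted line; a sketch, reusing the income_model variable from above:

plot(hrdata$Age, hrdata$MonthlyIncome, xlab = "Age", ylab = "Monthly Income")
abline(income_model, col = "red")  # draw the fitted regression line over the points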

This returns the following output. We can see there are a lot of datapoints far away from the fitted line, which explains why our R-squared value was only 25%.

[Image: The output of our linear regression plot]


Our model wasn't very good at predicting monthly income, partly because we only used a single predictor. We can add more predictors to get more accurate results; this is often called multiple (or multivariate) regression. So instead of predicting monthly income based on age alone, let's try adding total working years as well.

[Image: Multivariate linear regression]

As before, we assign the model to a variable so we can pass it to the summary() function. We use the linear model function with the formula MonthlyIncome ~ Age + TotalWorkingYears and select our dataframe.
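
A sketch of that call (the variable name is my own):

income_model2 <- lm(MonthlyIncome ~ Age + TotalWorkingYears, data = hrdata)  # two predictors
summary(income_model2)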

The output is as follows:

[Image: The output of our multivariate analysis]

Here we can see that our R-squared value is 0.5988, approximately 60%, which makes this a much better model for predicting monthly income. We can't easily plot it as a single scatterplot any more, since the model now has two predictors, but we can keep adding factors to the formula for more accurate predictions.


Conclusion

Thanks for reading about my project using R to analyse HR data from IBM. In this project I have covered:

  • Creating correlation matrices with R
  • Creating pair plots with R
  • An overview of hypothesis testing
  • The t statistic and p value
  • Creating boxplots with R
  • Two sample t tests with R
  • Linear regression with R
  • Multivariate regression with R

If you'd like to look at some of my other projects they can be found on my portfolio website in the link below.
