Analyzing Employee Turnover - Predictive Methods
Richard Rosenow
Keeping the People in People Analytics | VP, Strategy at One Model | Speaker, Podcast Guest, Advisor
Why study turnover?
At first glance, ‘intent to leave’ seems like it should be a pretty good predictor of turnover. If a coworker told me that they were going to quit, I feel like I’d have a pretty good sense of how likely they were to leave. However, many researchers have developed constructs to measure this intention and the results are surprising.
For example, a 2000 meta-analytic study (i.e., a study of studies) on turnover by Rodger Griffeth and Peter Hom found that the construct ‘intent to leave’ had a shared variance of 12% with actually leaving across all studies. In other words, stated intent explains roughly 12% of the variation in who actually leaves, which corresponds to a correlation of about 0.35. That’s pretty good for a study on human behavior, but it does leave a reader wondering what is going on. If an employee’s own stated intention to leave the organization accounts for only about 12% of the variation in actual departures, we know we have a lot more to understand about why people quit before we can start predicting it.
I covered some of the descriptive methods of analyzing turnover in my last post, but those measures are not enough to learn what else could be causing turnover. There’s a quote attributed to Yogi Berra that goes “It’s difficult to make predictions, especially about the future” and that holds doubly true for predictions about people. However, there are a number of sophisticated methods that have been developed to get us closer to predicting turnover.
Advanced Methods in Turnover Analysis
I call these advanced methods in turnover because of the statistical background needed to apply them. As much as I wish I could, I will not be able to teach you how to perform these methods by the end of this post. My goal here is to make sure these methods are on your radar and to point you to resources where you can learn more about them. I want this to be a starter resource for anyone looking to predict turnover.
In the sections below, I’m going to introduce logistic regression and survival analysis and then speak briefly to tree methods (decision trees and random forests). Lastly, I’ll give an overview of IBM’s Watson as a tool for analyzing turnover and point you toward some predictive analytics vendors automating these methods for HR applications.
For the non-technical readers, these methods start to bring in some advanced statistical concepts and language. Since writing the HR analytics starter kits (pt 1 and pt 2), another goal of mine has been to keep these posts entry level where I can, to ease anyone into understanding these topics and how to learn more.
If you are curious about learning more (or just wish you remembered more) about statistics, there are two highly digestible reference books I'd recommend for refreshing your memory on the topics.
- The Cartoon Guide to Statistics by Larry Gonick - seems gimmicky at first but you'll pick up the equivalent of Stat 101 before you realize you're learning.
- Statistics in Plain English by Timothy Urdan - Title says it all. Also a great reference guide for explaining statistics to a non-technical audience.
Logistic Regression
To start with why this matters, logistic regression is the method I've seen used most often in creating predictions for turnover. This technique is one way that we can start to gain insights into questions such as "why are people leaving", "what can we do to affect turnover rates", or "who is most likely to quit in the next year?"
The goal of any type of regression is to predict an outcome using one or more other factors. We want to predict turnover, the dependent variable, using our workplace data which are the independent variables. If you're new to data science and interested in getting a better handle on what regression can accomplish, there's a great non-technical primer on regression from journalistsresource.org - Regression Analysis Primer for Journalists.
I'll freely admit that it took me a while to wrap my head around regression. Adding "logistic" to regression makes it sound newly terrifying. Here's the good news: logistic regression, in basic terms, is a form of regression that is used when the outcome you're trying to predict is either a 1 or a 0. This is the case for predicting turnover; people either quit or they don't.
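To make the "1 or 0" idea a little more concrete, here is a minimal Python sketch of a one-variable logistic regression fit with plain gradient descent. Everything in it is an assumption for illustration: the tenure and quit data are made up, and the function names (`sigmoid`, `fit_logistic`) are my own rather than anything from the resources linked in this post. In practice you would lean on a statistics package instead of hand-rolling the fit.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(quit) = sigmoid(b0 + b1 * x) by gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # predicted probability minus actual 0/1
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Made-up toy data: tenure in years, and whether the person quit (1) or stayed (0).
tenure = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
quit_flag = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

b0, b1 = fit_logistic(tenure, quit_flag)
p_new_hire = sigmoid(b0 + b1 * 1.0)  # estimated quit probability at 1 year of tenure
p_veteran = sigmoid(b0 + b1 * 8.0)   # estimated quit probability at 8 years of tenure
print(f"1 year: {p_new_hire:.2f}, 8 years: {p_veteran:.2f}")
```

The model never outputs a hard 1 or 0. It outputs a probability, and in this toy data that probability falls as tenure rises because the quits are concentrated among short-tenure employees.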
I recently came across what I think is one of the best explanations of logistic regression out there. Paul Dalen, consultant at Clarity Solution Group, wrote an article on LinkedIn titled "Who's afraid of Logistic Regression". I used to be afraid, but after reading his post I am fearless. For anyone looking to dip their toe into logistic regression this is a great resource to bookmark.
For an HR example of how to apply logistic regression, there is an excellent walkthrough on using logistic regression to study turnover by Rupesh Khare. Rupesh, currently VP of Analytics at HSBC India, and his team used demographic data to create a risk model for attrition. His PDF download linked here walks through the process step-by-step: Employee Attrition Risk Assessment Using Logistic Regression.
To level-set with the capabilities of logistic regression, this is where I’m sometimes jealous of functions like engineering. In engineering, if the inputs are right and the environment is controlled, then someone can predict failure of a machine or tool part to an incredibly narrow window with high accuracy. With human behavior, however, we are not that close. The phenomena that we study are still too complex and human behavior too varied to give a narrow window.
Another downside to logistic regression is that the output can be a little difficult to interpret. The first output you'll likely receive is an odds ratio comparing the odds of quitting for one group to the odds of quitting for another group. To convert this odds ratio into a percentage likelihood of turnover, you'll have to walk through a few more steps. Luckily, since I started putting this post together, Paul Dalen has followed up with another post on logistic regression walking through these steps: "Logistic Regression for Small Business Decision Makers". While it does involve some math, this is another good resource to bookmark.
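As a rough illustration of those extra steps, the sketch below converts between probabilities, odds, and odds ratios in Python. The 10% baseline turnover rate and the odds ratio of 2.0 are hypothetical numbers chosen for the example, and the helper names are my own.

```python
import math

def probability_to_odds(p):
    """Odds = chance of quitting divided by chance of staying."""
    return p / (1.0 - p)

def odds_to_probability(odds):
    """Invert the odds back into a probability."""
    return odds / (1.0 + odds)

def odds_ratio_from_coefficient(coef):
    """A logistic regression coefficient is a log-odds ratio; exponentiate it."""
    return math.exp(coef)

def apply_odds_ratio(baseline_probability, odds_ratio):
    """Turnover probability for a group whose odds are `odds_ratio` times the baseline."""
    return odds_to_probability(probability_to_odds(baseline_probability) * odds_ratio)

# Hypothetical inputs: 10% baseline turnover, and a group with an odds ratio of 2.0.
p_group = apply_odds_ratio(0.10, 2.0)
print(f"{p_group:.3f}")  # about 0.182, not 0.20
```

The point of the example is that doubling the odds is not the same as doubling the probability, which is why an odds ratio should not be read directly as "twice as likely to quit".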
Survival Analysis
Survival analysis, also known as event history analysis, is an advanced statistical technique used to estimate the probability of an event occurring over time. This technique has a history in the medical sciences where it was used to predict survival of patients. While it may have a morbid history, it’s a technique that translates well to estimating turnover. If you recall back to the Analyzing Turnover - Descriptive Methods post, this technique is essentially an advanced and statistically sound cohort analysis. Here are some examples of output from a survival analysis.
These slides above are the output of a survival analysis which was clipped from a wonderful slideshare by Tom Briggs, currently a researcher for the US DOD. In his presentation, Tom reviews an application of survival analysis to analyze the difference in turnover for candidates who were given a realistic job preview, and those that were not. As you can see above, those who were given the RJP appear to have a higher cumulative survival rate over time than those who were given the traditional job preview.
This change appears to result in a 15% higher likelihood of being with the company at 12 months out from hire, which is a wonderful discovery. I personally think the output of survival analysis is one of the most visually appealing and easily digestible of the different methods. Tom’s slideshare goes through the technique in more detail - Survival Analysis for Predicting Employee Turnover.
As a further example, I’d like to provide you with a link to an overview of how to create and apply survival analysis in R - Employee Attrition Survival Analysis. To make it clear up front, I am not the Richard who made the analysis on this site. The author, Richard Puzon, has put together some of the best examples of applying advanced R techniques to HR issues (more references to him later). However, he unfortunately does not publish his contact information on the site, so I don’t know much more about him as of now. His work however is detailed and highly informative on how to go about applying this technique.
In contrast to the early outputs of logistic regression, survival analysis can be used to produce a likelihood of attrition at a given point in time for a particular employee. The added ability to quickly produce survival charts and attrition likelihood over time makes this a great measure in my book.
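For a sense of what sits behind a survival chart, here is a minimal Python sketch of the Kaplan-Meier estimator, one common way to build a survival curve. I am assuming this estimator purely for illustration; the resources linked above may use other techniques, and the eight employee records below are made up. Employees who are still with the company when the data is pulled are treated as "censored": they count as at risk up to their last observed month, but not as departures.

```python
def kaplan_meier(durations, left):
    """Return [(month, share still employed)] at each month where someone left.

    durations: months each employee was observed
    left: True if the employee left at that month, False if still employed (censored)
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(durations, left) if e})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)  # still observed at month t
        events = sum(1 for d, e in zip(durations, left) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve

# Made-up cohort: months observed, and whether each person left.
months = [3, 6, 6, 9, 12, 12, 12, 12]
left = [True, True, False, True, False, False, False, False]

curve = kaplan_meier(months, left)
for month, share in curve:
    print(f"month {month}: {share:.3f} still employed")
```

Running the same function separately on two groups (for example, those who did and did not receive a realistic job preview) gives two curves that can be plotted and compared side by side.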
Tree Methods - Decision Trees and Random Forest
I debated not including decision trees and random forest in this post. Since I’m still getting up to speed on decision trees and random forest techniques myself, I wasn’t sure if I could do them justice. However, I’ve seen them used and referenced often with regard to predicting turnover and I feel I would be remiss to leave them out. As such, this will be a high-level section with more resources and links to authors who can explain this in much greater detail.
Compared to logistic regression or survival analysis, which get down to the individual level right away, a decision tree model starts with all of the employees and then sorts them into smaller and smaller groups based on their likelihood of attrition. This creates a tree-like structure with a single root node at the top and a leaf at the end of each path.
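The sorting step can be sketched in a few lines of Python. The example below finds the single best split on one variable using Gini impurity, which is one common splitting criterion and an assumption on my part; the tenure data is made up and the function names are my own. A full decision tree simply repeats this search inside each resulting group.

```python
def gini(labels):
    """Gini impurity of 0/1 labels: 0.0 when all alike, 0.5 when evenly mixed."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Find the threshold on one numeric variable that yields the purest two groups."""
    best_threshold, best_score = None, float("inf")
    n = len(values)
    for threshold in sorted(set(values))[1:]:
        below = [y for x, y in zip(values, labels) if x < threshold]
        above = [y for x, y in zip(values, labels) if x >= threshold]
        # Weighted average impurity of the two groups created by this threshold.
        score = (len(below) * gini(below) + len(above) * gini(above)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Made-up toy data: tenure in years, and whether the person quit (1) or stayed (0).
tenure = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
quit_flag = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

threshold, score = best_split(tenure, quit_flag)
print(threshold, round(score, 2))
```

On this toy data the best split lands at four years of tenure: everyone at or above that mark stayed, while four of the five people below it quit.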
For a really beautifully done introduction to decision trees, take a look at r2d3.us for their visual introduction to machine learning. Scrolling through their page gives you an engaging overview of how and why this technique works.
For an HR example, Divyabh Misra, Founder of CrowdAnalytix, has put together a great slideshare of how decision trees can be applied to turnover. In his study, "employee attrition analysis", he reviews how tenure among other factors drives turnover at SanDisk. Image from his slideshare below:
The random forest technique then builds on the decision tree model. At a high level, random forest takes random selections of data from your dataset and builds a separate decision tree from each one. It then takes the average of all of the trees that it creates to make a prediction. The idea here is that many smaller predictions, when taken together, can end up creating a stronger prediction. For a more thorough introduction to random forest, there is a great plain-English Quora answer - "How does Randomization work in a Random Forest".
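Here is a deliberately simplified Python sketch of that idea. To keep it short, each "tree" is only a single split (often called a stump), each one is trained on a random resample of the data and one randomly chosen variable, and the forest's prediction is the average across all of them. The data, the two variables (tenure and commute minutes), and all function names are made up for illustration; real random forest implementations grow much deeper trees.

```python
import random

def gini(labels):
    """Gini impurity of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(rows, labels, feature):
    """One-split tree on one variable: (feature, threshold, quit rate below, quit rate at/above)."""
    n = len(rows)
    overall = sum(labels) / n
    best = (feature, None, overall, overall)  # fallback when no split helps
    best_score = gini(labels)
    for threshold in sorted({r[feature] for r in rows})[1:]:
        below = [y for r, y in zip(rows, labels) if r[feature] < threshold]
        above = [y for r, y in zip(rows, labels) if r[feature] >= threshold]
        score = (len(below) * gini(below) + len(above) * gini(above)) / n
        if score < best_score:
            best_score = score
            best = (feature, threshold, sum(below) / len(below), sum(above) / len(above))
    return best

def predict_stump(stump, row):
    feature, threshold, rate_below, rate_above = stump
    if threshold is None:
        return rate_below
    return rate_below if row[feature] < threshold else rate_above

def fit_forest(rows, labels, n_trees=200, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # random resample with replacement
        feature = rng.randrange(len(rows[0]))           # random variable for this tree
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Average the predicted quit rate over every tree in the forest."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Made-up toy data: (tenure in years, commute in minutes) and quit (1) or stayed (0).
rows = [(0.5, 20), (1.0, 45), (1.5, 30), (2.0, 15), (3.0, 60),
        (4.0, 25), (5.0, 40), (6.0, 10), (8.0, 35), (10.0, 50)]
quit_flag = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

forest = fit_forest(rows, quit_flag)
p_new_hire = predict_forest(forest, (1.0, 30))
p_veteran = predict_forest(forest, (8.0, 30))
print(f"1 year: {p_new_hire:.2f}, 8 years: {p_veteran:.2f}")
```

Any single stump trained on a small resample can be badly wrong, but averaging a couple hundred of them smooths out the individual mistakes, which is the intuition behind the technique.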
Here’s a rough example of how this process could look.
The image above was pulled from an article posted by Dan Kellet on AnalyticBridge.com. The article, "Making Data Science Accessible - Machine Learning - Tree Method" is a fantastic overview of both methods above.
The main reason I wanted to introduce tree methods despite having a loose grip on them is to introduce examples I've found from practitioners. There are two fantastic examples, the first from Lyndon Sundmark and second from Richard Puzon.
- "People Analytics Using R - Employee Churn Example" - Lyndon has a great series of articles applying R to analyze workforce data. He has created a mock dataset and a great example of using decision trees and random forest to understand turnover.
- "Employee Attrition, Exploratory Data Analysis" - Richard Puzon put together another great example of how to apply R to analyze turnover using decision trees and random forest. His analysis includes his R code.
Lastly I’d like to make a plug for the R library – Rattle. Rattle is a graphical user interface which will run regression, decision trees, and random forest on your datasets. I know that these techniques can be used to predict turnover and with Rattle I can run them quickly, which officially makes me dangerous and on my way towards useful.
I still have a lot more to learn about the statistics behind decision trees and random forests before I’d feel comfortable analyzing employee data or presenting on them, but as I learn and find additional resources I’ll be sure to pass them on to you as well. I hope you’ll do the same.
Analytics as a Service
IBM Watson is described as “A smart data discovery service available on the cloud, it guides data exploration, automates predictive analytics and enables effortless dashboard and infographic creation.” What IBM and other service providers in this area are doing is creating platforms that automate some or all of the modeling involved in creating predictive models.
It won’t be as customizable or nuanced as an analysis you prepare yourself, but the speed and accessibility of the service appear to make up for that in full. If you’d like to take a look and try this out, IBM has produced a sample dataset, and a great walkthrough for how to go about using IBM Watson as a modeling tool for understanding turnover.
IBM Watson - "Use Case for HR Retaining Valuable Employees" - Dataset
While I’m mentioning services, I’d like to point you to four others in addition to IBM that stand out in this space. The five companies below have developed software platforms that automate most of the data science process. They utilize the techniques I mentioned throughout this article and build on that with others I haven't been able to cover.
Listed alphabetically.
Going beyond
I hope the above gave you a sampling of techniques you can use to start analyzing and eventually predicting turnover. I also hope some of the examples and external resources have shown you how to move forward with learning more. I want this to be a base resource for anyone looking to go down this path and I'd love to hear about other techniques you use or resources you've found in the comments below. A big thank you to all of the sources and practitioner examples I pulled content from to put this together. I look forward to hearing your thoughts about this topic.
- Richard Rosenow
Other articles I've published on LinkedIn:
- Analyzing Employee Turnover - Descriptive Methods
- In Defense of Middle Measures: The Use of Constructs in HR Analytics
- HR Analytics Starter Kit Part 1 - Intro to HR analytics
- HR Analytics Starter Kit - Part 2 - Intro to R programming
- HR Analytics Starter Kit - Part 3 - Podcasts
- HR has Last Mover Advantage in HR Analytics
Educator | Speaker | People Data Enthusiast. I believe in data FOR not just about people.
(4 years ago) I have no clue how I never saw this article before (or maybe I did, but am rediscovering it's joy anew now). Great work; I love it!
Head of Enterprise Analytics @ Ambuja Cements Limited
(6 years ago) Hi Richard, good article. What are the different ways you suggest to improve turnover models? Adding more variables, creating more sophisticated models, creating focused models for each areas like separate models for top management or for sales employees, etc.
HR - Business Partner
(6 years ago) Hi Richard, I am an HR professional. Our datas are mostly in excel sheets. Wanted to know if the function "Forecast" is a good alternative to understand the employee turnover for atleast the next quarter. What are your thoughts ?
Director - Org Effectiveness; Chief of Staff to CHRO
(7 years ago) What a great article with links to so many helpful resources. A request/topic idea - prepping data for these analyses. For example, when putting together data for the logistic regression, where do you start/stop the window of time? How do you deal with changing variables like salary/title. How would you calculate years with company (based on end date of your window)? Any resources related to variable engineering that would be helpful?
Excellent article Richard. Thanks for sharing.