People Analytics: Building for Interpretability in Turnover Models
Recently I had the pleasure of working with a talented group of students from the Singapore University of Technology and Design. They were tasked with helping build a very common people analytics application: predicting employee turnover (the merits, relevance and ethics of such an application are open to debate and can be discussed separately).
The Brief: Build a model that can accurately predict an employee's risk of turnover at a 0-6 month, 6-12 month and >12 month timeframe.
The two non-negotiable requirements were:
- Accuracy: A high True Positive rate and a low False Positive rate. Most practitioners would also emphasize a low False Negative rate, but we had our reasons not to (the short sketch after this list shows how these rates are computed).
- Interpretability: In People Analytics, interpretability of a model is key to its adoption. End users will often want to understand why a model is predicting what it is. In fact, GDPR introduced provisions around the explainability of automated decisions.
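For clarity, here is a minimal sketch (not the team's actual code) of how those two rates fall out of a confusion matrix; the label arrays are hypothetical placeholders:

```python
# Minimal sketch: True Positive and False Positive rates for a binary
# turnover classifier. Labels below are hypothetical placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])   # 1 = employee actually left
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])   # model's predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # share of leavers correctly flagged
fpr = fp / (fp + tn)   # share of stayers incorrectly flagged as at risk
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```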
Now any analytics practitioner will quickly point out that there is a built-in trade-off between these two requirements: accurate models are rarely interpretable, and interpretable models are rarely accurate. But we wanted to test this assumed dichotomy, because in People Analytics it is not enough for a model to be accurate - it also needs to be interpretable by its users.
In addition to our two strict requirements, the team was provided with a strong set of HR metrics, a sufficiently large data set and the infrastructure needed to evaluate a range of algorithms, from logistic regression and GLMs to gradient-boosted trees such as XGBoost.
As is often the case, XGBoost performed best at predicting turnover (it is frequently cited as one of the most used algorithms on Kaggle), and its True Positive and False Positive rates satisfied our accuracy requirement. The easy-to-interpret models such as GLMs and logistic regression simply did not compare.
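A minimal sketch of that kind of comparison, assuming a pandas DataFrame loaded from a hypothetical hr_data.csv with a binary left column; the file name, column names and train/test split are illustrative, not the team's actual pipeline:

```python
# Illustrative comparison of an interpretable baseline against XGBoost
# on a turnover label. File and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

df = pd.read_csv("hr_data.csv")                # hypothetical HR metrics file
X = df.drop(columns=["left"])                  # features: pay, tenure, team size, ...
y = df["left"]                                 # 1 = employee left within the window

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def tpr_fpr(model):
    """True Positive and False Positive rates on the held-out test set."""
    tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
    return tp / (tp + fn), fp / (fp + tn)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("xgboost", XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")),
]:
    model.fit(X_tr, y_tr)
    tpr, fpr = tpr_fpr(model)
    print(f"{name}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```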
However, anyone who has worked with this algorithm before can testify to how hard it is to figure out what's happening in its black box. We could tell stakeholders Bob was at a high risk of turnover - but we could not explain why.
Or could we?
Building interpretability into an algorithm like XGBoost was not straightforward - but it was possible. In addition to giving stakeholders the name of an employee at risk, we gave them an interactive playground in which they could modify that employee's features and re-run the model, pointing them to the features that were driving the at-risk rating. If Bob had received a promotion in the last year, would the model come to the same conclusion? Yes it would. If Bob were in a smaller team, would the model come to the same conclusion? Yes it would. What if he were paid above market rate? No. Voila.
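A minimal sketch of the perturb-and-re-score idea behind that kind of playground, reusing the hypothetical model and test split from the sketch above; the feature names (promoted_last_year, team_size, compa_ratio) are made up for illustration:

```python
# Illustrative "what-if" probe: take one employee's feature row, override a
# single feature, and re-score with the trained model to see whether the
# predicted risk changes. Reuses the hypothetical `model` and `X_te` above.
def what_if(model, row, feature, new_value):
    """Predicted turnover probability after overriding one feature."""
    modified = row.copy()
    modified[feature] = new_value
    return float(model.predict_proba(modified)[:, 1][0])

bob = X_te.iloc[[0]]                  # one-row DataFrame for a single employee
baseline = float(model.predict_proba(bob)[:, 1][0])

print(f"baseline risk:        {baseline:.2f}")
print(f"recent promotion:     {what_if(model, bob, 'promoted_last_year', 1):.2f}")
print(f"smaller team:         {what_if(model, bob, 'team_size', 5):.2f}")
print(f"paid above market:    {what_if(model, bob, 'compa_ratio', 1.15):.2f}")
```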
This is somewhat effort-intensive, as the user has to work through multiple iterations to understand each case - but it allowed us to maintain high accuracy while giving stakeholders enough of the model's inner workings to make it interpretable.
A few disclaimers:
- This post is meant to address the false dichotomy between interpretability and accuracy - not to encourage the use of individual turnover models. In fact, I would go so far as to say that actions such as pay increases and promotions should never be based on turnover risk; doing so can be disastrous for a culture of meritocracy. An aggregated analysis of common turnover drivers should be as far as turnover modelling goes.
- There is a lot of debate around the need for interpretability in the first place. Dr. John Elder of Elder Research believes humans are far too prone to confirmation bias based on prior experience to objectively interpret a model's results anyway. The debate is ongoing. Read more here.
- The visuals are based entirely on fake data and are used only to illustrate the methodology.
- Opinions are my own.
Are we there yet?
A reader comment: I agree that an increase in accuracy (AUC) does not mean you need to sacrifice interpretability. A deep learning model does not necessarily offer a better AUC than XGBoost on a classification problem, yet XGBoost can produce much better interpretability (feature importance) than a deep learning model. Another way we interpret is the SHAP summary chart: it ranks features by importance (e.g. compa ratio has a higher weight than length of service), and each dot represents a single occurrence in the sample data, with the colour depicting whether that sample's value is high or low. Just sharing :)
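For readers who want to try the SHAP summary chart the comment describes, here is a minimal sketch built on the hypothetical XGBoost model from the earlier sketches (assuming the shap package is installed):

```python
# Illustrative SHAP summary ("beeswarm") chart for the hypothetical XGBoost
# model above. Requires the shap package (pip install shap).
import shap

explainer = shap.TreeExplainer(model)       # tree-based explainer for XGBoost
shap_values = explainer.shap_values(X_te)   # one SHAP value per employee per feature
shap.summary_plot(shap_values, X_te)        # features ranked by importance; each dot
                                            # is one employee, colour = feature value
```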