5 Things to Consider in Developing a Useful and Impactful Predictive Analytics Model
Five Things is a series of thoughts on the art and science of finance, analytics, and corporate development, with occasional forays into leadership, communication, and other topics for the well-rounded professional.
The focus of this article is not to explore the latest statistical technique or "big data" concepts. Rather, it examines a pretty generic, vanilla application of a tried-and-true statistical method and how to make the results useful and impactful.
A few years ago I took on a consulting project for a client who wanted to predict the number of hours required to serve a given customer in a given year. This was important because the customers were not charged on an hourly basis, but on a negotiated fee for a bundle of different services. So predicting the drivers of the time required would really help to drive pricing decisions and understand customer-level profitability.
This was a fun and challenging project, and one that highlighted the key elements of truly useful predictive analytics applied to an important business problem.
So, here are the 5 things gleaned from this project and my experience that can help you make a predictive modeling exercise useful and impactful:
1. Building a predictive model is an exercise in using data and math, but mostly it's about the data
The foundation of success on a project like this is having good, reliable data. The greatest, most mathematically and statistically sound analysis will be useless if applied to bad data. You need to:
- Understand where your data is coming from, including the who, what, why, and when. In this case, the data was gathered by individual time tracking over the past 18 months.
- Ask - is it enough? In this case, we measured the time to serve 2,000 customers for 18 months - not bad.
- Ask - is it good? Do some initial exploration. Look for outliers. In this case, there were some clients where the cost to serve was close to 0 and some where it was very high. This led to some even better discussions about how the data was gathered and led us to (carefully) remove or replace "bad" data.
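As a minimal sketch of that outlier check, the snippet below flags suspiciously low or high cost-to-serve records with a simple IQR fence. The data and column names here are illustrative stand-ins, not the client's actual schema, and the IQR rule is just one reasonable screening choice.

```python
import pandas as pd

# Hypothetical time-tracking data: hours to serve each customer in a year
# (customer IDs and hours are made up for illustration)
df = pd.DataFrame({
    "customer_id": list(range(1, 9)),
    "hours": [120.0, 0.5, 95.0, 110.0, 2400.0, 88.0, 130.0, 105.0],
})

# Flag suspicious records with a 1.5x IQR fence before deciding,
# together with the client, whether to remove or replace them
q1, q3 = df["hours"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspects = df[(df["hours"] < low) | (df["hours"] > high)]
print(suspects)  # the near-zero and the extreme record surface here
```

The point is not the fence itself but that flagged rows become conversation starters with whoever gathered the data.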
2. The road to predictive analysis starts with descriptive and diagnostic analysis
Once your data is ready, jumping to the modeling phase is premature. Thinking about Gartner's analytics maturity model (descriptive, diagnostic, predictive, prescriptive), we do not want to jump into predictive analytics just yet. There's too much to be gained in exploring the descriptive and diagnostic analytics first, looking at things like:
- Simple statistics for each potential variable (mean, standard deviation, percentiles, histograms)
- Looking at scatterplots (in this case, does each candidate variable look related to the cost?)
- Delving a bit into diagnostics, explore the correlation matrix. (Which of the potential descriptive variables are correlated with cost? Which are correlated with each other?)
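The exploration steps above can be sketched in a few lines of pandas. The toy dataset and variable names below are invented for illustration; the real "fact pack" covered every candidate variable in the client's data.

```python
import pandas as pd

# Toy stand-in for the client dataset (names and values are illustrative)
df = pd.DataFrame({
    "hours":        [100, 150, 80, 200, 120, 90],   # the Y: cost to serve
    "num_services": [3, 5, 2, 7, 4, 3],             # candidate x
    "headcount":    [40, 60, 30, 90, 50, 35],       # candidate x
})

# Simple statistics per variable: count, mean, std, percentiles
print(df.describe())

# Correlation matrix: which candidate x's track cost ("hours"),
# and which track each other (a warning sign of multicollinearity)?
print(df.corr())
```

Even this small summary surfaces the two questions in the bullet above: relation to cost, and relation among the candidates themselves.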
I found a lot of value in reviewing with my client the "fact pack" of these summary statistics for each variable in that it 1) provided another level of scrutiny in terms of data quality, 2) gave them some very interesting and useful information they had never seen, and 3) was a great "warm up" for what was to come.
3. Balance building the "best" model with building the "most useful" model
Now that we have data that we understand well and have some initial insights, we are ready to build our model, considering:
- What kind of statistical technique? In this case, we used good old multiple linear regression. Essentially, we are looking for a formula of the form Y = b0 + b1x1 + b2x2 + ... + bkxk (where Y is "cost to serve" and the x's are the most important descriptive variables).
- Which variables are important enough to use in our model (the x's)? There are some techniques that make this more of a science (e.g. stepwise linear regression) and then there is the trial-and-error approach. The best method, in my opinion, is to use what we learned in our data exploration to understand the most likely candidates for "the x's" and then to do a bit of trial and error on likely models until you get the right fit. You can also investigate, where it seems to make sense, non-linear forms (e.g. exponential or logarithmic terms) and interactions between variables.
- How many descriptive variables to use? In this case, limiting the number of x variables in our final model was quite important - to ensure effective communication, understanding, and easier implementation. Is a model with 40 variables better than a model with 10 variables if it has slightly better fit statistics? The key is to get to "just right" in terms of the best descriptive variables.
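Fitting such a multiple linear regression can be sketched with plain numpy least squares. The two x variables and their coefficients below are fabricated so the fit is easy to verify by eye; in practice you would also inspect fit statistics and residuals, not just the coefficients.

```python
import numpy as np

# Toy data: cost-to-serve (y) generated from two illustrative drivers
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)  # e.g. number of services
x2 = np.array([5, 3, 8, 2, 7, 4], dtype=float)  # e.g. customer headcount
y = 10 + 20 * x1 + 1.5 * x2                     # known "true" relationship

# Multiple linear regression: y = b0 + b1*x1 + b2*x2, via ordinary
# least squares on a design matrix with an intercept column
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)  # recovers [10, 20, 1.5] on this noise-free toy data
```

In a production setting a dedicated library (e.g. statsmodels) would also give you p-values and fit diagnostics for the variable-selection discussion above.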
4. Don't trust your model - test it!
Good in-sample and out-of-sample test results are really important in communicating that your model works well in real life. In this case, I ignored, or "held out", at random, 25% of the data for the last calendar year and all of the data for the 6 months of the current year. Since this data wasn't used in building the model in step 3, I could use this data to validate that my model holds up (and, yes, it pretty much did).
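The holdout idea can be illustrated as follows: fit on one portion of the data, then score the model on rows it never saw. The synthetic data and the simple random 25% split below are stand-ins for the client's time-based holdout.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the customer data (the real holdout was 25% of
# last year plus the current 6 months; here we just split at random)
n = 200
x = rng.uniform(1, 10, size=n)
y = 10 + 20 * x + rng.normal(0, 5, size=n)

# Hold out 25% of the rows; fit only on the remaining 75%
idx = rng.permutation(n)
test, train = idx[: n // 4], idx[n // 4:]
X_train = np.column_stack([np.ones_like(x[train]), x[train]])
coefs, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)

# Validate: out-of-sample R^2 on the held-out rows only
pred = coefs[0] + coefs[1] * x[test]
ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

A high out-of-sample R-squared is the concrete evidence behind "my model holds up" in the paragraph above.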
5. The real main event: communicating and implementing
My clients were not data scientists or statisticians, but they needed to be comfortable in using this model to make impactful business decisions. So probably one of my most critical tasks was to communicate and gain buy-in for the methodology and results of this model. Specifically, I needed to address questions like:
- What process did you follow? (The five-step process of data gathering, data cleaning, data exploration, model building, and model validation)
- What methodology/technique did you choose and why? (I chose an appropriate model, given the structure of the problem, that is regularly taught in intermediate/MBA-level statistics courses)
- How accurate is this model going to be at estimating the cost of services? (I was able to compare my model vs. the current estimation process and show that it was far better)
- What variables factor into the estimation? Why did you choose variable x? Why didn't you choose variable y? (I had to spend a lot of time discussing how variables were chosen, how many of them are correlated with one another, and how fewer variables can be better)
In Conclusion
I feel good that this outcome was both useful and impactful for my client. The key was not having the most sophisticated modeling technique, using the newest "big data" techniques, or leveraging enormous computing power; rather, success came from following a structured process, getting regular client engagement and feedback, and executing a strong communication plan.
Until next time. . .
Work hard, work smart, and keep in touch!
With 20+ years of experience helping firms make strategically and financially sound decisions that drive profitable growth, Joe Krekelberg has held finance, corporate development, and actuarial leadership positions in multiple industries. He is located in the greater Minneapolis-St. Paul area and can be reached at [email protected].