Too Clever by Half: Avoiding Analytics Pitfalls
In my first “real” job after finishing my engineering degree, I worked under an executive who had a reputation for trusting his “gut” in almost any situation. It wasn’t long before I quipped that he was the only person I knew who could draw a trend line with only one data point. Since then, I’ve worked with others suffering from the same affliction, but surely such attitudes are now a thing of the past. After all, this is the era of big data, machine learning, and predictive analytics—we’ve left “gut feel” behind us. Right?
Interest in predictive analytics has grown steadily over the past few years, with a notable spike beginning about two years ago. Much of the excitement is justified, and it is built upon a solid foundation of technologies and practices. Consider Netflix. In 2006, the company launched an open competition promising a $1 million prize to any individual or group that could demonstrate a 10% improvement over its own Cinematch movie-recommendation algorithm (it took three years before anyone collected). Since then, Netflix and many other companies have continued to invest in machine learning technology to enable predictive analytics. Million-dollar prizes and technology investments aren’t based upon whims. There is real value in predictive models.
Lies, Damned Lies, and Statistics
Mark Twain popularized the expression, “There are three kinds of lies: lies, damned lies, and statistics.” He didn’t invent the saying, and the underlying idea is as old as humanity. As humans, we are prone to see patterns in everything, and we filter our observations of the world to fit those patterns. Confirmation bias is a powerful psychological force, and it operates whether we are intentionally misleading others or unconsciously misleading ourselves. The principle applies equally to an executive drawing a trend line through a single point and to the analysis of massive amounts of information.
Consider constellations. In reality, the stars visible in the night sky are nothing more than a random set of coordinates, each with a different spectral frequency distribution and intensity. And yet, our ancient ancestors applied patterns to this random “dataset” in the form of constellations. An entire mythology was built around these patterns, developed into symbols of the zodiac, and extrapolated into the pseudoscience of astrology. All of this from a completely random set of data, demonstrating that given a sufficiently complex mental model, it’s possible to “explain” any dataset.
As Simple as Possible, but No Simpler
The same principle applies in the world of machine learning, where it’s known as overfitting. Given a sufficiently large number of variables and enough computing power, it’s possible to find correlations within very large datasets even where no real relationship, let alone causation, exists. Who would have thought that a support vector machine could exhibit the same flaws as an ancient stargazer?
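To make this concrete, here is a small Python sketch (NumPy is my choice here, purely for illustration, and the numbers are made up). Every column in this dataset is random noise, and so is the target, yet searching enough columns still turns up a seemingly impressive correlation.

```python
# Spurious correlation by brute force: every "predictor" and the "outcome"
# are pure noise, yet the best of 10,000 candidates still looks meaningful.
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 50, 10_000
X = rng.normal(size=(n_samples, n_features))   # random "predictors"
y = rng.normal(size=n_samples)                 # random "outcome", unrelated to X

# Correlation of each feature with the target.
correlations = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])

best = int(np.argmax(np.abs(correlations)))
print(f"Strongest correlation found: r = {correlations[best]:.2f} (feature {best})")
# With 10,000 chances, the strongest |r| is typically around 0.5 or better --
# every bit of it coincidence.
```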
The solution to overfitting is model simplification. Albert Einstein is reputed to have said, “Everything should be as simple as it can be, but not simpler.” There’s no evidence that he ever stated the idea so succinctly, but he did say, “It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.” Though Einstein may have been a bit wordy, his point is rock-solid. The best possible model for a set of data is one that balances the quality of “fit” with model simplicity.
The trick, of course, is to achieve the right balance. Fortunately, there are techniques to help us manage this problem. The foundation of machine learning is an error (or loss) function that measures the difference between the output of a hypothetical model and the sample dataset used for learning. A more complex hypothesis will typically have lower error (although as complexity increases, the rate of improvement declines). Simply optimizing for minimum error across the sample dataset, however, will take us down the road of excess complexity. Before you know it, you’re creating constellations where none exist.
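A minimal sketch of that road to excess complexity, again in NumPy and with illustrative numbers of my own choosing: the underlying signal is a straight line, but higher-degree polynomials keep driving the training error down anyway.

```python
# Training error always improves with complexity -- even when the extra
# complexity is only memorizing noise.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)   # a truly linear signal plus noise

for degree in (1, 3, 6, 12):
    fit = Polynomial.fit(x, y, deg=degree)          # least-squares polynomial fit
    mse = np.mean((y - fit(x)) ** 2)
    print(f"degree {degree:2d}   training MSE = {mse:.4f}")
# The error falls every time the degree rises, but past degree 1 the
# "improvement" is mostly the model fitting the noise.
```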
Divide and Conquer
The key to achieving model simplicity is to separate the data into subsets. One subset is used to train a hypothesis, while a non-overlapping test subset is used for validation. A meaningful hypothesis will show similar error levels across the training sample and the test sample. By contrast, an overly complex hypothesis will show impressively low error rates on the training sample but won’t be able to repeat that performance on the test sample. It’s possible, of course, that all of our hypotheses fail the validation test. If that’s the case, there are two possibilities. It may be that we simply need to apply a different learning algorithm; instead of an SVM, perhaps we need a long short-term memory neural network. The other possibility is that, like the night sky, our dataset is effectively random and there is nothing to be learned from it. In that case, either there is no underlying causation at all, or we failed to collect the variables needed to identify it.
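Here is a sketch of that holdout check, assuming scikit-learn is available (it isn’t required by anything above, it just keeps the example short). A simple and a deliberately over-complex hypothesis are trained on the same subset and judged on a test subset they never saw.

```python
# Train/test split: a sound hypothesis performs similarly on both subsets;
# an overfit one shines on training data and stumbles on the holdout.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=(40, 1))
y = 2.0 * x.ravel() + rng.normal(scale=0.3, size=40)   # linear signal plus noise

# Non-overlapping training and test subsets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}   train MSE = {train_mse:.3f}   test MSE = {test_mse:.3f}")
# Expect similar train and test error for degree 1, and a large gap for
# degree 15 -- the signature of a hypothesis that memorized its training data.
```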
Where to From Here?
The lessons we should learn from all of this are quite simple, as they should be. When analyzing data, we need to maintain a healthy skepticism, and remain open to the possibility that the data we have won’t really tell us anything. When we find patterns, we need to rigorously validate them. Finally, we need to accept that even the best models are imperfect. The data we base our models on is usually noisy and incomplete, and this translates into errors for both training and testing. “Good enough” is truly good enough, and it’s important to establish a non-zero level of acceptable error.
There is a great deal of progress being made on several fronts in predictive analytics. One of the most interesting areas is the development of applications and frameworks that automate the learning process with little human supervision. These tools handle model exploration, sample segregation, learning, and model validation, and can produce impressive results. Some providers have adapted these frameworks into solutions tailored to specific domains, such as marketing analytics or operations optimization, and to different industry verticals.
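I won’t name specific products, but a rough feel for what this automation looks like can be had from scikit-learn’s GridSearchCV, used here purely as a generic stand-in: candidate models are explored over a parameter grid, cross-validation handles the sample segregation, and the winner is selected on held-out performance.

```python
# A generic sketch of automated model exploration and validation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(scale=0.2, size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ("poly", PolynomialFeatures()),
    ("reg", Ridge()),
])
param_grid = {
    "poly__degree": [1, 2, 3, 5, 8, 12],   # model exploration
    "reg__alpha": [1e-3, 1e-1, 1.0],
}

# Five-fold cross-validation does the sample segregation; the grid search
# does the learning and validation for every candidate.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(x_train, y_train)

print("Selected parameters:", search.best_params_)
print(f"Held-out R^2: {search.best_estimator_.score(x_test, y_test):.3f}")
```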
There is a massive opportunity to be realized from predictive analytics and the machine learning technologies that underpin the field. Just remember that sophisticated technology is not magic, and it can be misapplied. If you’re considering implementing predictive analytics, don’t rush into a technology investment. Instead, start by identifying the goal of your analytics initiative, then invest in expert guidance and training so that you can plan a sustainable investment in technology that delivers real business value. Analytics also needs to keep up with changes in the market and the business. Don’t implement it in isolation; make it part of a sustainable business process, with defined governance that includes employee training, ongoing technology assessment, and regular review of model relevance and accuracy.