Look Inside Your Box
I recently had the privilege of speaking with two data scientists from a very, very large corporation. Some confidences must be kept, but suffice it to say that their publicly traded employer has a market cap in the hundreds of billions. It employs over two hundred data scientists, half of whom have PhDs (typically in Statistics or Mathematics). I asked them whether they prefer parametric or non-parametric models.
For those unaware, most multivariate statistical models are parametric, non-parametric, or a combination of the two. "Parametric" refers to mathematical relationships between variables which can be written as a function with a fixed set of parameters, the simplest being a linear trendline. "Non-parametric" refers to relationships between variables which are difficult to characterize with straightforward mathematical functions. (The typical approach in such situations is to segment the data into many partitions which can be tackled with parametric approaches individually, creating lengthy piecewise functions. When multiple variables have piecewise functions, things get complicated fast.) Furthermore, "non-parametric" is often used to refer to machine-learning techniques collectively (since they often resort to non-parametric approaches). Hence, "non-parametric" often encompasses artificial intelligence (AI), machine learning (ML), artificial neural networks (ANN), and deep learning (DL).
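To make the distinction concrete, here is a minimal sketch in Python (standard library only). The data are invented for illustration: a relationship with a kink at x = 5. The helper names (fit_line, sse) and the choice of breakpoint are my own assumptions, not part of any library. One global line stands in for the parametric approach, and two lines fitted to separate partitions stand in for the piecewise approach described above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared errors of a fitted line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: slope 1 below x = 5, slope 3 from x = 5 onward.
xs = [float(i) for i in range(11)]
ys = [x if x < 5 else 5 + 3 * (x - 5) for x in xs]

# Parametric ("white box"): one line, one slope for the whole data set.
global_fit = fit_line(xs, ys)
global_sse = sse(xs, ys, *global_fit)

# Piecewise: partition the data at an assumed breakpoint, fit each part.
breakpoint_x = 5.0
segments = []
piecewise_sse = 0.0
for keep in (lambda x: x < breakpoint_x, lambda x: x >= breakpoint_x):
    seg_x = [x for x in xs if keep(x)]
    seg_y = [y for x, y in zip(xs, ys) if keep(x)]
    fit = fit_line(seg_x, seg_y)
    segments.append(fit)
    piecewise_sse += sse(seg_x, seg_y, *fit)

print("global (intercept, slope):", global_fit, "SSE:", global_sse)
print("piecewise (intercept, slope) per segment:", segments, "SSE:", piecewise_sse)
```

On this toy data the piecewise fit has the smaller error, but it reports two slopes, each valid only within its own partition. With several variables and several breakpoints each, the number of partitions multiplies quickly.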
For simplicity's sake, let's refer to parametric models as being "white box" models and non-parametric as being "black box" models. (I'll also allow the term "gray box" for models with just one or a few piecewise functions and a manageable number of steps.) One of the starkest differences between the white and black boxes is their transparency or lack thereof. It is easy to see what is going on in a white-box model, but nearly impossible to discern what is happening in most black-box models (more details are in this MIT Technology Review article). The black boxes are opaque.
Now, a lot of hype about data science has focused on the AI side (including ML, ANN, and DL). Perhaps the mystery of what happens in black boxes lends some mysticism and intrigue to the hype. However, if you need a model which is easily interpretable, you would likely be better off with one of those transparent, white-box solutions. Furthermore, all statistical models need to be validated before being used. It is typically much easier to track your "data lineage" (here meaning what happens to your data within a multivariate model) and validate your model's stability when you are dealing with a parametric, white-box model. A prominent statistician (one whose publications have been cited over 10,000 times) has told me that, paired with a subject-matter expert, he can build parametric models which beat non-parametric models 9 times out of 10. That is worth consideration.
So, getting back to the conversation mentioned at the top with the aforementioned data scientists, what do you suppose their model preference was?
Their organization prefers parametric models. When asked why, they responded that it is far easier to communicate the results from parametric models to upper management. After all, the parameters become quantified model-wide, not just for a small piecewise segment in multidimensional space. White-box interpretations are widely applicable. They retain their utility in a wider variety of scenarios.
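As a rough illustration of why model-wide parameters are easy to communicate, the sketch below fits a two-predictor linear model and turns each coefficient into a single sentence. Everything here is assumed for the example: the data are invented (generated exactly from y = 10 + 2*spend - 0.5*price), the variable names are hypothetical, and the fit uses the textbook least-squares normal equations rather than any particular library.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered sums of squares and cross-products.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero only if x1 and x2 are collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Hypothetical inputs, invented for this example.
spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
price = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
sales = [10 + 2 * s - 0.5 * p for s, p in zip(spend, price)]

b0, b1, b2 = fit_two_predictors(spend, price, sales)

# One sentence per parameter, valid across the whole range of the data.
summary = [
    f"Each extra unit of {name} changes sales by {coef:+.2f}, holding the other input fixed."
    for name, coef in (("spend", b1), ("price", b2))
]
for line in summary:
    print(line)
```

A piecewise model would need one such sentence per segment, per variable, each qualified by the region in which it applies.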
In all fairness, there are plenty of scenarios where black-box solutions are best. If you are drilling a well several thousand feet below the surface and want it to automatically adapt its drilling direction based on changes in the mineralogy, porosity, brittleness, and hardness of the rocks around it, an automated non-parametric approach might be just the trick. After all, each decision point for the device is specific to its point in space and time. Besides, it is impractical to transmit the data to the surface, update a parametric model, and send instructions back down when you are drilling a well at 300 feet per hour. As for autonomous vehicles (a.k.a. driverless cars), they have even more pronounced time considerations.
As can be seen in the above expansion on Drew Conway's original Data Science Venn Diagram, the trouble with machine learning in general is that it lacks intuition. (I delve deeper into this in my KDnuggets post titled The Essential Data Science Venn Diagram.) Some might argue that it may gain intuition soon. I am not aware of any ML engines which know what salt tastes like, though, and it might take a while. Most humans benefit from decades of interactions with the physical world and society around them, giving them a vast depth of intuition. Hence, building models with input from subject-matter experts is regarded as a best practice.
As some readers have already surmised, there is an irony embedded in the title of this article. If you attempt to look inside a multivariate data model, you may have a hard time deconvoluting it if piecewise analysis has been applied to more than one of your variables.
-----------------------
The world is what we make of it, and it needs to be smarter. If you enjoyed this article, please like and share it so that others may find and benefit from it as well.
I speak about these subjects in greater depth in a lecture series I give to clients. A shortened version of the first of my talks was presented to the Houston Geological Society on 4/17/2017 (a portion of which is now available on the HGSGeoEducation YouTube channel) and the AAPG SWS meeting in Midland, TX on 5/1/2017. Organizations interested in my lecture series can learn more at my website (www.adret-llc.com).
Automation Project Manager and Product Manager
(7 years ago) I especially like the comment that we choose parametric because it is easier to communicate results to upper management. So true. The most successful argument is the one which more closely fits management's preconceptions.
IT Developer and Epidemiologist skilled in Analytics, AI and Machine Learning , Database Administration and Cloud Infrastructure. Also skilled in Epidemiology and Laboratory Science. Problem Solver, highly creative.
(7 years ago) Fantastic read; Sheather would be very proud. As usual, your analysis and attention to this component of data analytics is spot on. The interplay between automation, validity, and intuition, juxtaposed with parametric and non-parametric models, was especially useful in showing the application of data. I'm looking forward to further articles, research, and the like, and wish you and Adret LLC much success.
HSE Manager Skyborn Renewables GmbH
(7 years ago) Great read.
Visioning and Execution of Digital Twin & Digital Transformation Solutions for Process Industry
(7 years ago) Great article
Vice President & Chief Commercial Officer @ Honeywell | MS in Analytics
(7 years ago) Good article Andrew