Data Science: theory free or hypothesis based?
Tom Breur
July 2022
A question that sometimes comes up is whether data science projects should be “theory driven”, or whether discoveries from data alone can be sufficient to guide the work, i.e. be “data-driven.” It strikes me that this presumed dichotomy resembles the apparent paradox between statistics and machine learning approaches. To my mind, this is not a real contradiction but rather a matter of “horses for courses”: sometimes I resort to one, and sometimes the other. Both approaches have their legitimate use cases. If you want to be a versatile data scientist, you absolutely need both sets of arrows in your quiver.
There is a place for “theory free” data mining. In this context, I’ll be using “theory free” as the opposite of “theory driven”: either the data (theory free) or our notions about how the world works (theory driven) determine which patterns we inspect first. Especially when data are abundant, this can be a fertile exploratory step to surface patterns in the data that stand out and merit further inspection. In some domains, a lot of research has already been done and we know quite a lot a priori. Other domains are relatively barren, and little prior knowledge is available. Flagging remarkable associations in the data can be a viable first step to generating hypotheses. If knowledge in a particular domain is at a very early stage, then finding remarkable associations and digging further into those can be as good a place to start as any.
What is important to realize, though, is that patterns in data are only the start of a journey of discovery. As such, patterns in the data are merely “nice to know.” Only when we can make plausible connections to the mechanics of real-world processes do these patterns acquire value, and it is only insight into real-world phenomena that enables us to act in beneficial ways. Data in and of themselves are just, well, data. Chris Anderson, Wired’s Editor-in-Chief, was once recorded saying: “… with enough data, the numbers speak for themselves.” Let it be noted that I emphatically do not buy into that vision.
Both machine learning and statistics aim to learn from data. They both strive to generate insights, although their approaches to getting there may appear to differ wildly. Machine learning is usually seen as a subfield of computer science, whereas statistics is seen as a subfield of mathematics, more specifically applied probability theory. Thought leader Leo Breiman wrote an insightful paper describing some of these differences, “Statistical Modeling: The Two Cultures”, with reference to profoundly different philosophies on the underlying mechanisms that generated the data we use to derive insights.
Breiman pits “traditional” statistics against contemporary data science, although he uses slightly different terms. Mind you, in 2001 when he wrote this paper, the term “data science” was still very new! The key distinction is that one of those two cultures (“traditional statistics”) “… assumes that the data are generated by a given stochastic data model,” which explains the need to make assumptions about underlying distributions (e.g. Gaussian, Binomial, Poisson, etc.). The other culture Breiman refers to “… uses algorithmic models and treats the data mechanism as unknown.” In the introduction of his paper Breiman points out that “Algorithmic modeling, both in theory and in practice, has developed rapidly in fields outside statistics.”
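To make Breiman’s contrast concrete, here is a minimal sketch in Python: a logistic regression stands in for the stochastic “data model” culture, and a random forest for the algorithmic culture. The synthetic data, the Gaussian features, and the Binomial outcome are illustrative assumptions of mine, not anything taken from Breiman’s paper.

```python
# A minimal sketch of Breiman's two cultures, fit to the same synthetic data.
# All data and parameter choices are illustrative assumptions, not from the article.
import numpy as np
from sklearn.linear_model import LogisticRegression    # "data model" culture
from sklearn.ensemble import RandomForestClassifier    # "algorithmic model" culture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))                          # assume Gaussian features
logit = 0.8 * X[:, 0] - 1.2 * X[:, 1] + 0.5 * X[:, 2]   # an assumed "true" data-generating mechanism
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))           # Binomial outcome, as a stochastic data model posits

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Culture 1: posit a stochastic data model (logistic/Binomial) and estimate its parameters.
data_model = LogisticRegression().fit(X_tr, y_tr)
print("logistic coefficients:", np.round(data_model.coef_, 2))   # interpretable parameters

# Culture 2: treat the data mechanism as unknown and fit a flexible algorithm.
algo_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("logistic accuracy:", round(data_model.score(X_te, y_te), 3))
print("forest accuracy:  ", round(algo_model.score(X_te, y_te), 3))
```

The point is not which model scores higher, but that the first yields interpretable parameters of an assumed mechanism, while the second makes no such commitment and is judged on predictive performance alone.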
Some authors (e.g. Ian Hacking, The Emergence of Probability, 1975) attribute this divide to historical reasons that gave rise to the controversy between Frequentist and Bayesian methods of statistics. There is a view on statistics where its value comes from applying knowledge about frequency of occurrence (“distributions”) in support of decision-making. Another, more or less parallel strand developed where probability and statistics were leveraged to quantify (!) one’s strength of belief. In many ways, it’s quite remarkable how this dichotomy between Frequentist and Bayesian statistics has endured over the centuries. This bifurcation in the meaning of “probability” started essentially around 1660, when Pascal’s conversations with Fermat laid the groundwork for modern probability theory as we know and still apply it today.
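To illustrate how the two readings of “probability” differ in practice, here is a minimal sketch. The coin-flip data (12 heads in 20 flips) and the uniform Beta(1, 1) prior are assumptions for illustration only.

```python
# The same data, read through the Frequentist and the Bayesian lens.
# The data (12 heads out of 20 flips) and the Beta(1, 1) prior are illustrative assumptions.
from scipy import stats

heads, flips = 12, 20

# Frequentist reading: probability as long-run frequency; estimate p and give a confidence interval.
p_hat = heads / flips
ci = stats.binomtest(heads, flips).proportion_ci(confidence_level=0.95)
print(f"Frequentist: p_hat = {p_hat:.2f}, 95% CI = ({ci.low:.2f}, {ci.high:.2f})")

# Bayesian reading: probability as quantified strength of belief; update a prior into a posterior.
posterior = stats.beta(1 + heads, 1 + flips - heads)   # Beta prior + Binomial likelihood -> Beta posterior
print(f"Bayesian:    posterior mean = {posterior.mean():.2f}, "
      f"95% credible interval = ({posterior.ppf(0.025):.2f}, {posterior.ppf(0.975):.2f})")
```

The numbers come out similar here, but the interpretations differ: the confidence interval is a statement about the long-run behavior of a procedure, while the credible interval expresses a degree of belief about the parameter itself.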
Data-driven
The data-driven or machine learning approach to analysis tends to favor prediction over explanation (see e.g. Galit Shmueli’s excellent paper from 2010, “To Explain or to Predict?”). A predictive model serves only to provide a quantitative estimate for the likelihood of occurrence of some future event. Maybe because of the way our brains are wired, we humans tend to attribute at least a partially causal explanation to such predictions, no matter how adamant the warnings to the contrary may be. These are emphatically associative relations, without any claim whatsoever with regard to causality.
This “flavor” of predictive modeling has been a driving force behind the spectacular growth of data science. It is a capability that has been applied in many domains: fraud detection, churn modeling, credit scoring, insurance claims behavior, direct marketing, and many, many others. Because of the commercial potential (which can be considerable), I consider this to be the bread and butter of data science. This kind of application is utilitarian or even opportunistic from a business perspective, rather than geared towards insight and understanding (like statisticians or social scientists might pursue). But decades of highly profitable applications have silenced calls for substantive explanations or even claims to causality. The money is (too) good.
Machine learning aims to capitalize on associations in data, without making claims about the underlying (“principled”) reasons why these correlations occur. There are also no assumptions whatsoever as to the nature of underlying distributions. As long as these patterns in the data can be extrapolated and monetized, businesses typically find sufficient justification to pursue the endeavor. In another post (How do you predict the future?, 2018) I laid out a parallel but patently different approach to “predictive modeling” that most people associate with system dynamics modeling. Note that although both are “forward looking” methods, they otherwise bear exceedingly little resemblance when it comes to prediction.
Theory-driven
As Breiman (2001, Statistical Modeling: The Two Cultures) has convincingly laid out, the statistical (“Frequentist”) tradition historically aimed to provide insight and explanations for “how the world works”, labeled by Breiman as “data models.” Breiman uses the term “data models” in this context to describe a mechanism that gave rise to how data were generated. In contrast, machine learners rarely spend much thought on the mechanism that generated the data: they typically consider the data a given. They also tend to be agnostic with regard to causal explanations. Such explanations may be useful to have, but they are secondary to the ultimate pursuit of the most accurate predictions possible.
Some phenomena, like Benford’s Law (sometimes also called the Newcomb-Benford law), might be incompletely understood, but are nonetheless very useful to apply. Fraud examiners use it to scan documents containing numeric quantities to see whether the distribution of leading digits matches what Benford’s Law predicts. Another, arguably more well-known example is Riemann’s Hypothesis, which is widely believed to be true, even if a proof still hasn’t been delivered (see e.g. https://www.sciencenews.org/article/mathematicians-progress-riemann-hypothesis-proof). Many technical processes, like cellphone-to-tower reception, hinge on Riemann’s conjecture, and every day we enjoy the success of wireless telephony. Examples like this show that application can sometimes precede a full understanding of theory, even when we don’t have comprehensive insight into the underlying mechanisms that drive or cause the consistent associations in the data.
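Returning to the Benford example: here is a minimal sketch of what such a screen might look like, comparing the first digits of a handful of made-up invoice amounts against Benford’s expected frequencies with a chi-square test. The amounts are fabricated for illustration; a real screen would run on far larger samples.

```python
# A minimal sketch of a Benford's Law screen, as a fraud examiner might run one.
# The invoice amounts are made up for illustration; real screens use far larger samples.
import numpy as np
from scipy import stats

amounts = np.array([1234.50, 2875.00, 1190.10, 345.99, 1020.00, 1877.25,
                    912.40, 1105.00, 2430.80, 187.60, 1560.00, 3109.75])

# Benford's expected first-digit frequencies: P(d) = log10(1 + 1/d), for d = 1..9
digits = np.arange(1, 10)
expected_p = np.log10(1 + 1 / digits)

# Observed first (leading) digits of the amounts
first_digits = np.array([int(str(a).lstrip("0.")[0]) for a in amounts])
observed = np.array([(first_digits == d).sum() for d in digits])

# Chi-square goodness-of-fit: does the observed digit pattern deviate from Benford's Law?
chi2, p_value = stats.chisquare(observed, f_exp=expected_p * len(amounts))
print("observed first-digit counts:", observed)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")
```

A low p-value would flag the batch of documents for closer inspection; it would not, by itself, prove anything about fraud.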
Statistics, one could argue, takes the opposite tack from machine learning. It starts with “theory” – what Breiman refers to as the data model. This is an assumed mechanism that generated the data, and those data then usually also have some assumed distribution, like a Gaussian, Binomial, etc. In Frequentist statistics, this distribution must always be assumed, in order to “lean on” (benefit from) the mathematical principles (axioms) we use to make inferences. Note how in this tradition, theory drives analysis, not the other way around.
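To make this theory-driven workflow concrete, here is a minimal sketch of a one-sample t-test: we assume a Gaussian data model and then test a hypothesis about its mean. The sample, the hypothesized mean of 100, and all parameter choices are illustrative assumptions.

```python
# A minimal sketch of theory-driven, Frequentist inference: assume a Gaussian data model,
# then test a hypothesis about its mean. All numbers are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=101.5, scale=4.0, size=30)   # assumed Gaussian data-generating mechanism

# Null hypothesis: the population mean equals 100. The t-test "leans on" the Gaussian
# assumption to derive the sampling distribution of its test statistic.
t_stat, p_value = stats.ttest_1samp(sample, popmean=100.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

The distributional assumption comes first, and the analysis follows from it; that is the sense in which theory drives the analysis rather than the other way around.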
Conclusion
Data science is coming of age, and has proven its value both in applications and in driving theory development. In this vignette, I have tried to set traditional (“Frequentist”) statistics against data science, to clarify the distinction. Needless to say, any seasoned data scientist will want to embrace a solid grounding in statistics to complement their data science tool set.
When there is a relatively sound and agreed-upon body of knowledge, more formal hypothesis testing may well be the most suitable path to progress. When there is little established knowledge, the exploratory methods that data scientists are most associated with are more likely to bear fruit. Obviously this is not an either/or consideration. In this case, you can have your cake and eat it, too.
Whether you are a die-hard statistician, grounded in academics and seeking to develop theories, or an opportunistic data scientist, geared towards commercial gains rather than insight, making yourself familiar with the entire spectrum of options seems prudent. Academics can benefit from data science methods, as the progress of the last few decades has shown. Breiman saw this right, and very early. At the same time, data scientists worth their salt will immediately understand that gaining a deeper understanding of data-generating mechanisms will ultimately further their objectives of prediction, too. Make it so.