'Use this data' fallacy
Having repeatedly seen companies and problem owners fall into the same fallacy about what data actually is, I thought it would be worthwhile to write some comments on it.
So, what is that fallacy?
It occurs when problem owners provide a subset of the data that they possess, on the assumption that this data is related to the problem and could help to solve it.
It might seem that there is nothing wrong with these statements. In fact there is, and it lies in the ‘… they possess ...’ part.
It is undoubtedly true that systems are recording the activities around us, and we then call those recordings data. Data is just a set of recorded observations of some reality, stored in a predefined structure that was decided and designed by the programmer and/or the data/database modeler of the collecting system. When problem owners provide the data that they have been storing for a while, they are just sharing the result of some recording activity, and somehow they believe that the information hidden in that data is related to the problem itself.
The fallacy arises right at this point.
It is true that some information extracted from some data could solve the problem. However, the required data, that is, the data carrying the information that could help to solve the problem, should be identified without even considering what kind of data, technology, or infrastructure the problem owners already have!
Existing data, technology, and infrastructure are in fact problem constraints, and they should be addressed afterward. The process of creating data requirements should not be biased by these constraints.
A well-known way to eliminate those biases is a hypothesis-driven problem-solving approach (sometimes called McKinsey's problem-solving approach), which converts ideas into facts by proving or disproving the hypotheses derived from them. The resulting fact statements then identify the data that should be used, or collected if it is not being collected already. Of course, additional data may be required to carry out the hypothesis tests themselves, but the effort ends with a statistically tested list of data requirements for solving the actual problem. This approach is easily applied to problems in the domain of predictive analytics, where we have to find the distinguishing features that are factors for the dependent variable.
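As a minimal sketch of what testing one such hypothesis might look like, assume a hypothetical binary outcome (customers who churned versus customers who were retained) and one candidate feature measured for both groups. The permutation test below is only one possible choice of test, and the function name, the group names, and the numbers are all made up for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Hypothesis under test: the candidate feature has the same
    distribution in both groups, i.e. it does not distinguish them.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical feature values (e.g. logins per week) for each outcome group.
churned = [2, 3, 1, 4, 2, 3, 2, 1]
retained = [7, 8, 6, 9, 7, 8, 6, 7]

p = permutation_p_value(churned, retained)
print(f"p-value: {p:.4f}")
```

A small p-value here would support keeping the feature on the list of data requirements, while a large one would suggest that this particular recording, however readily available, tells us little about the outcome.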
When these kinds of approaches are not followed, or the problem owner allocates no time for them, what we are expected to do is simply to build some system from the data that is currently stored and provided, rather than to design a proven system that solves the problem.
Briefly, we, as data scientists/decision scientists/data professionals (or whatever you like to call us), require data that has somehow been proven to help to solve the problem, not some recordings that are allegedly related to it. However, to be honest, what I have always observed is that clients only ask us to extract whatever information we can, however we can, from the data they possess, and they give no time for hypothesis derivation or issue tree studies. Data should indeed speak for the information it holds; however, as with politicians, we sometimes should not let everyone who can speak do so…
To succeed with these approaches, we must also follow a structured and systematic approach to problem definition and formulation, which involves defining the purpose, the symptoms, the systems that the problem and the roles interact with, the metrics and decision criteria, the problem users, the scope, and so forth. When this is not done, any solution alternative will be the product of blind activity, because hypotheses can only be formulated on the basis of a formal problem definition…
Best,
Business Development Manager - Navea Electronics FZE (TE Connectivity)
3 years ago: Sincere thx for very insightful reading. As a freshman on data driven business solutions, my tendency is to see the data at hand as a main source to solve the problem. I'll question the data twice