'Use this data' fallacy

Meliksah C.

Senior Data Scientist at Visa

Published: January 13, 2022

After repeatedly seeing companies and problem owners make the same conceptual mistake about data, I thought it would be worthwhile to write down some comments on it.

So, what is that fallacy?

It occurs when problem owners hand over a subset of the data they possess, on the assumption that this data is related to the problem and will help solve it.

It might seem that nothing is wrong with that statement. In fact there is a problem, and it lies in the ‘… they possess ...’ part.

It is undoubtedly true that systems record the activities around us, and we call those recordings data. Data is just a set of recorded observations of some reality, stored in a predefined structure that was decided and designed by the programmer and/or the data/database modeler of the system that collects it. When problem owners provide the data they have been storing for a while, they are only sharing the result of those recording activities, and they somehow believe that the information hidden in that data is related to the problem itself.

The fallacy arises right on this point.

It is true that information extracted from some data can solve the problem. However, the required data (the data that carries the information that could help solve the problem) should be identified without even considering what kind of data, technology, or infrastructure the problem owners already have!

These are in fact problem constraints, and they should be addressed afterward. The process of creating data requirements should not be biased by these constraints.

A well-known way to eliminate those biases is the hypothesis-driven problem-solving approach, sometimes called McKinsey's problem-solving approach, which eventually converts ideas into facts by proving or disproving the hypotheses derived from them. The resulting fact statements then point to all the data that should be used, or collected if it is not being collected already. Of course, additional data may be required to run these hypothesis tests, but the effort ends with a statistically supported list of data requirements for solving the actual problem. The approach applies readily to predictive analytics problems, where we have to find the distinguishing features that act as factors for the dependent variable.
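To make the idea concrete, here is a minimal sketch of how a list of hypotheses could be turned into a tested list of data requirements. It is only an illustration under stated assumptions: the variable names, the sample values, and the choice of a permutation test on group means are all hypothetical and are not taken from any real project or prescribed by the approach itself.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference between two group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels and recompute the difference in means.
        rng.shuffle(pooled)
        left, right = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


def data_requirements(hypotheses, alpha=0.05):
    """Keep the candidate variables whose hypothesis survives the test."""
    return [
        name
        for name, (group_a, group_b) in hypotheses.items()
        if permutation_p_value(group_a, group_b) < alpha
    ]


# Hypothetical example: each hypothesis says "this variable differs between
# customers who churned (first list) and customers who stayed (second list)".
hypotheses = {
    "days_since_last_purchase": (
        [30, 42, 38, 45, 51, 36, 40, 47],
        [5, 9, 12, 7, 3, 10, 8, 6],
    ),
    "customer_id_last_digit": (
        [1, 2, 3, 4, 5, 6, 7, 8],
        [8, 1, 6, 3, 5, 4, 7, 2],
    ),
}

required = data_requirements(hypotheses)
print(required)
```

Note that the starting point is the list of hypotheses, not the list of tables the problem owner happens to store. A variable that survives the test becomes a data requirement whether or not it is currently being collected.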

When such approaches are not followed, or the problem owner allocates no time for them, all that is expected of us is to build some system from the data currently stored and provided, rather than to design a proven system that solves the problem.

In short, we, as data scientists/decision scientists/data professionals (or whatever you like to call us), need data that has somehow been shown to help solve the problem, not recordings that are allegedly related to it. To be honest, however, what I have always observed is that clients only ask us to extract whatever information we can, however we can, from the data they possess, and they give no time for hypothesis derivation or issue tree studies. Data should indeed speak for, or represent, the information it holds; however, as with politicians, sometimes we should not let everyone who can speak…

To succeed with these approaches, we must also follow a structured, systematic approach to problem definition and formulation, which involves defining the purpose, the symptoms, the systems that the problem and the roles interact with, the metrics and decision criteria, the problem users, the scope, and so forth. When this is not done, any solution alternatives will be the result of blind activity, because hypotheses can only be formulated from a formal problem definition…

Best,
