Detecting Data Distortions: The Three Types of Biases every Manager and Data Scientist should know

Update/disclaimer: This article only discusses cognitive, statistical, and social/communicative biases relevant to data analytics. In the context of AI, bias has a different meaning, designating useful cues that algorithms can exploit, for instance for grouping purposes. This article does not address these "useful" machine learning biases; nor does it discuss more fundamental (historic/encultured) biases that stem from our overall socio-technical constraints and preferences.

Like most of you, I trust my intuition in areas where I have lots of experience, but I'd rather rely on data where I don't. In the latter case, I simply feel more confident making a decision when I have relevant data. But beware: the mere availability of data may give us a false sense of certainty.

In other words: whenever there is relevant data, we feel that our decision making must improve. But that ain't necessarily so. Sometimes data is the very reason why we make a wrong decision.

Why? Because data, its analysis (including AI), or the way that it is communicated or used may be severely biased or misleading. That is the bad news.

The good news is that you can cultivate a healthy skepticism against data biases: You can detect or even prevent such distortions and de-bias your data. You can immunize your analytics endeavors against these recurring thinking errors.

How? By knowing about them, by recognizing them in your analytics work, by understanding and addressing their root causes, and of course by knowing their remedies.

Here, then, in a nutshell, are ten crucial biases in the analytics process that every manager and analyst should know. If you are using data in HR processes, in risk assessment, in marketing or sales, in controlling, or in credit approval, then you want to make sure that you are steering clear of these pitfalls. In these contexts, any of the following biases can lead to decision fiascoes, as biased data may lead you to wrong recommendations.

The biases are structured along the data gathering, analysis, and application (i.e., communication and usage) process. From the large group of biases (see our interactive map at bias.visual-literacy.org for more than 180 of them) we have chosen the ten below for the following three reasons:

· We have seen that they occur frequently in the analytics process of many organizations.

· They have a big negative impact on the quality of analytics and the subsequent decisions.

· They can be prevented, as effective countermeasures exist against them.

In the next section, we discuss why these biases happen (their root causes), how you can recognize them (the symptoms), and of course how to fight them (their remedies) to improve the quality of data analytics.

Managers should be especially mindful of the first and third type of data biases, whereas analysts should (hopefully) already be aware of the biases in group II.


Figure 1: Ten analytics biases every professional should know.

I. DATA GATHERING BIASES

It sounds paradoxical, but one of the biggest mistakes that you can make in analytics is working only with the data that you already have.

It's like the story of the man searching for his keys under a streetlight: A passerby asks him whether he is sure that he lost them there, to which the man replies: "No, I lost them over there, but the lighting is much better over here." So just because you have data doesn't mean it's the right data for your decisions.

When looking for data, be aware of three specific biases that may distort your data sourcing: our tendency to use conveniently available data instead of the right data, our tendency to look at data that was completed rather than data that is still missing (think of customer surveys), and our tendency to seek data that confirms our initial opinion. Here is our snapshot of each data gathering bias:

· Selection bias:

o Description: our tendency to use conveniently available data instead of representative data (e.g., the participants in a study differ systematically from the population of interest).

o Root causes: time pressure, laziness, budget constraints, technical constraints.

o Symptoms: skewed data that does not represent the full spectrum of the underlying population (e.g., overly positive product evaluations), gaps between expected outcomes (e.g., successful product launch) and reality (e.g., the product flops).

o Remedies: examine your sampling approach and the inclusion/exclusion criteria that you apply, use randomization methods when selecting items from your population of interest.
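
To make the randomization remedy concrete, here is a minimal Python sketch (the `population` frame, its columns, and all figures are hypothetical): a seeded random sample keeps segment shares close to the population, whereas a convenience sample of whoever arrived first does not.

```python
import pandas as pd

# Hypothetical population frame: one row per customer we could survey.
population = pd.DataFrame({
    "customer_id": range(1000),
    "segment": ["online"] * 700 + ["retail"] * 300,
})

# Convenience sample: the first 100 rows (e.g., the earliest sign-ups),
# which differ systematically from the population.
convenience = population.head(100)

# Random sample: every customer has the same chance of being picked.
random_sample = population.sample(n=100, random_state=42)

# Quick representativeness check: compare segment shares.
print(population["segment"].value_counts(normalize=True))    # 70% / 30%
print(convenience["segment"].value_counts(normalize=True))   # 100% online
print(random_sample["segment"].value_counts(normalize=True)) # close to 70/30
```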

· Survivor bias:

o Description: this is a specific type of selection bias, namely focusing on the results that came through and ignoring those that did not. For example, analyzing only completed customer surveys and ignoring those that were abandoned halfway.

o Root causes: overlooked data collection opportunities, barriers to data completion at the source, a cumbersome data entry process.

o Symptoms: data is skewed (for example, only happy customers or really upset clients have answered the survey), gaps between expected outcomes and reality.

o Remedies: follow up on data sources that did not yield data and find alternative ways to achieve completion. If possible, make data entry a more seamless experience.
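
One practical check, sketched below with invented data (the column names `tenure_years` and `completed` are hypothetical): compare the respondents to the full invite list on attributes you know for everyone. A gap suggests the "survivors" are not typical.

```python
import pandas as pd

# Hypothetical survey data: everyone who was invited, with a flag
# for who actually completed the survey.
invited = pd.DataFrame({
    "customer_id":  range(8),
    "tenure_years": [1, 2, 2, 3, 5, 6, 8, 10],
    "completed":    [0, 0, 0, 1, 1, 1, 1, 1],
})

respondents = invited[invited["completed"] == 1]

# Response rate: how much of the picture is missing?
print(f"response rate: {invited['completed'].mean():.0%}")

# Compare respondents to the full invite list on attributes known
# for everyone -- a gap signals that the "survivors" are not typical.
print("mean tenure, all invited:", invited["tenure_years"].mean())
print("mean tenure, respondents:", respondents["tenure_years"].mean())
```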

· Confirmation bias:

o Description: data analysts sometimes only seek data to confirm their (or their managers') opinions.

o Root causes: social / peer pressure, opinionated mindset, overly homogenous analytics team, time pressure.

o Symptoms: data corresponds perfectly to one's own hypotheses ("too good to be true").

o Remedies: actively seek out contradictory data. Split up the data gathering and/or analysis tasks among two independent teams. Ask for data or variables that have been excluded from the analysis.
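
A lightweight variant of the two-teams remedy that even a single analyst can apply is an exploration/confirmation split, sketched here with invented inline data (the split ratio and column names are purely illustrative): form hypotheses freely on one half of the data, then test them exactly once on the held-out half.

```python
import pandas as pd

# Hypothetical campaign data (inline for the sketch).
data = pd.DataFrame({
    "channel": ["mail", "web", "mail", "web", "mail", "web", "mail", "web"],
    "conversion": [0.10, 0.30, 0.12, 0.28, 0.09, 0.33, 0.11, 0.29],
})

# Shuffle once, then split: explore on one half, confirm on the other.
shuffled = data.sample(frac=1.0, random_state=7).reset_index(drop=True)
explore = shuffled.iloc[: len(shuffled) // 2]
confirm = shuffled.iloc[len(shuffled) // 2:]

# Form hypotheses freely on `explore`; any pattern you want to report
# must also hold up on `confirm`, which you inspect exactly once.
print(explore.groupby("channel")["conversion"].mean())
print(confirm.groupby("channel")["conversion"].mean())
```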

II. DATA ANALYSIS BIASES

Once you have de-biased your data gathering approach, make sure that you also immunize your data analysis against typical statistical biases. These classic statistics blunders are not just the result of sloppy thinking. They may also result from a naïve treatment of data, or an overly narrow analysis focus. Here are our top four data analysis biases:

· Confounding variables:

o Description: not taking into account variables that affect the association between two things, resulting in a mixing of effects: thinking that a drives b just because a and b move in the same direction (e.g., swimming pool visits do not drive ice cream sales; both are driven by hot temperatures).

o Root causes: incomplete hypotheses or models.

o Symptoms: spurious associations among variables, or no observed association although it would be reasonable to assume that there is one.

o Remedies: measure and report all variables that might affect an outcome, include potential confounding variables in your analyses, and provide adjusted estimates for associations after the effects of the confounder have been removed.
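
The swimming-pool example can be simulated in a few lines of Python (all numbers are invented for illustration). This sketch shows how the raw correlation looks impressive, and how the adjusted estimate, a regression that also includes temperature, reveals that pool visits carry no effect of their own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data following the article's example: hot days drive BOTH
# swimming pool visits and ice cream sales; pools do not drive sales.
temperature = rng.normal(25, 5, n)
pool_visits = 2.0 * temperature + rng.normal(0, 3, n)
ice_cream   = 3.0 * temperature + rng.normal(0, 3, n)

# The raw correlation looks impressive -- and is entirely spurious.
print("corr(pool, ice cream):", np.corrcoef(pool_visits, ice_cream)[0, 1])

# Adjusted estimate: regress ice cream on pool visits AND temperature.
X = np.column_stack([np.ones(n), pool_visits, temperature])
coef, *_ = np.linalg.lstsq(X, ice_cream, rcond=None)
print("pool coefficient after adjusting for temperature:", coef[1])  # ~0
```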

· Neglecting outliers:

o Description: not acknowledging outliers (radically different items in a sample) at all or simply eliminating them.

o Root causes: exotic or extreme items in data sets that go unchecked.

o Symptoms: when you plot your data, you see a few items that are far apart from the rest.

o Remedies: identify outliers and their impact on the data's descriptive statistics, use appropriate measures of central tendency (e.g., median instead of mean), run analyses without the outliers and compare results.
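
A minimal numpy sketch of these remedies, with invented salary figures: the mean is dragged up by a single extreme value, the median is not, and the common 1.5 × IQR rule flags the outlier for a with/without comparison.

```python
import numpy as np

# Nine ordinary salaries plus one extreme value (figures invented).
salaries = np.array([48, 50, 52, 51, 49, 53, 47, 50, 52, 400])

print("mean:  ", salaries.mean())      # 85.2 -- dragged up by one value
print("median:", np.median(salaries))  # 50.5 -- robust to the outlier

# A common flagging rule: values beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
mask = (salaries >= q1 - 1.5 * iqr) & (salaries <= q3 + 1.5 * iqr)
print("flagged outliers:", salaries[~mask])

# Re-run the analysis without the outliers and compare the results.
print("mean without outliers:", salaries[mask].mean())
```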

· Normality bias:

o Description: not taking the actual distribution of the sample into account (for example, an employee survey where most employees are quite happy with their working conditions, so the scores cluster at the top of the scale).

o Root causes: assuming a normal distribution for a data set (even if it is not a bell curve) and running statistical tests that are valid only for normally distributed data.

o Symptoms: unreliable quality indicators for the statistical tests.

o Remedies: examine the real frequency distribution of the sample and run tests that fit that kind of distribution.
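
Here is a minimal sketch of this remedy in Python (the satisfaction scores are simulated, and the 0.05 threshold is just a convention): check the distribution first, then pick a test that fits it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical satisfaction scores for two departments; capped at 10,
# so the scores pile up near the top and are not bell-shaped.
dept_a = np.clip(rng.normal(8.5, 1.5, 40), 1, 10)
dept_b = np.clip(rng.normal(7.5, 1.5, 40), 1, 10)

# Check the distribution before reaching for a t-test.
_, p_norm = stats.shapiro(dept_a)
if p_norm < 0.05:
    # Normality rejected: use a rank-based (non-parametric) test.
    _, p = stats.mannwhitneyu(dept_a, dept_b)
    print(f"Mann-Whitney U: p = {p:.3f}")
else:
    _, p = stats.ttest_ind(dept_a, dept_b)
    print(f"t-test: p = {p:.3f}")
```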

· Overfitting:

o Description: tweaking models until they fit the data we have perfectly, but generalize poorly beyond it.

o Root causes: a limited data sample, a model that is too specific (too many parameters for the data at hand).

o Symptoms: a seemingly perfect model that accommodates all the available data, but is bad at predicting future observations (beyond the dataset).

o Remedies: collect additional data to extend and re-validate the model, remove variables that do not really have a relationship with the outcome.
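
Overfitting is easy to demonstrate with held-out data. In this numpy sketch (all data simulated, degrees chosen for illustration), a high-degree polynomial fits the training points almost perfectly but predicts the held-out points far worse than a simple line.

```python
import numpy as np

rng = np.random.default_rng(2)

# A simple linear truth with noise (all data simulated).
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 2, 30)

# Hold out part of the data BEFORE fitting.
x_train, y_train = x[:20], y[:20]
x_test,  y_test  = x[20:], y[20:]

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    # The high-degree model fits the training data almost perfectly
    # but typically predicts the held-out points far worse.
    print(f"degree {degree}: train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")
```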

III. DATA APPLICATION BIASES

Data has no value if it is not properly communicated and used. The last step in the analytics process – communication and use – is thus of special importance. In this crucial step several things can go wrong: the data analysts could communicate their results badly (incomprehensibly, using jargon), or the managers could misinterpret the results (because they overestimate their own data literacy or confuse correlation with causation). The corresponding biases read as follows:

· Curse of knowledge:

o Description: analysts fail to adequately communicate (simplify) their analyses to managers because they have forgotten how complex their procedures are.

o Root causes: lack of knowledge about the target groups of the analysis, lack of data storytelling skills.

o Symptoms: puzzled looks on managers' faces, off-topic questions, lack of follow-up.

o Remedies: the grandma test (how would you explain it to your grandmother?). Seek feedback from managers on what they find most difficult to understand. Communication training for data scientists.

· Dunning-Kruger effect:

o Description: managers overestimate their grasp of statistics at times and are unaware of their wrong data interpretation or use.

o Root causes: overoptimism of managers regarding their own statistics understanding.

o Symptoms: superficial data conversations.

o Remedies: the first rule of the Dunning-Kruger club is that you don't know that you're a member of it, so enable managers to pre-test their data literacy and discover their knowledge gaps. Ask them challenging questions so that they can see the limitations of their own statistics know-how (in a face-saving way).

· Causation bias:

o Description: believing that one factor causes another simply because the two are correlated (for example, employee turnover and sales).

o Root causes: limited statistics understanding.

o Symptoms: "strange" relationships that contradict common sense, a design that does not allow for such inferences (e.g., because data was not gathered using strict experimental methods).

o Remedies: inform managers about the difference between correlation and causation. Show the additional tests that need to be made to establish causation (beyond correlation).
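
One reason correlation cannot establish causation is that it is perfectly symmetric, as this small simulation illustrates (all data invented): whether a drives b or b drives a, the correlation coefficient looks the same, so only the study design, not the statistic, can settle the direction of influence.

```python
import numpy as np

rng = np.random.default_rng(3)

# Scenario 1: a genuinely drives b.
a = rng.normal(0, 1, 1000)
b = 0.8 * a + rng.normal(0, 0.6, 1000)

# Scenario 2: d genuinely drives c -- the direction is reversed.
d = rng.normal(0, 1, 1000)
c = 0.8 * d + rng.normal(0, 0.6, 1000)

# Correlation is symmetric: it looks identical in both scenarios and
# therefore cannot reveal which variable drives which (nor rule out a
# third factor driving both).
print("corr(a, b):", np.corrcoef(a, b)[0, 1])
print("corr(c, d):", np.corrcoef(c, d)[0, 1])
```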

You now know ten of the most relevant biases in the analytics context. Use this knowledge wisely: Bust those biases, detect those distortions, and control the quality of your data-based decision making.

Inform both managers and data analysts about these risks and their remedies, and install safeguards or countermeasures wherever possible. First and foremost, however, protect yourself against the specific biases that are most likely to affect you (take our self-test for this). The following famous Shakespeare quote is a useful reminder of this last point:

"A fool thinks himself to be wise, but a wise man knows himself to be a fool."

Further biases with a relevance for analytics can be found here:

https://data36.com/statistical-bias-types-explained/#:~:text=The%20most%20important%20statistical%20bias%20types,-There%20is%20a&text=These%20are%3A,Recall%20bias

https://blogs.oracle.com/analytics/10-cognitive-biases-in-business-analytics-and-how-to-avoid-them

https://www.allerin.com/blog/avoiding-bias-in-data-analytics

https://medium.com/de-bijenkorf-techblog/cognitive-biases-in-data-analytics-b53ea3f688e4

A great article by my St. Gallen colleagues on the key biases in machine learning can be found here:

https://aisel.aisnet.org/cgi/viewcontent.cgi?article=1166&context=wi2021

