Detecting Data Distortions: The Three Types of Biases every Manager and Data Scientist should know

Update/disclaimer: This article only discusses cognitive, statistical, and social/communicative biases relevant to data analytics. In the context of AI, bias has a different meaning, designating useful cues that algorithms can exploit, for instance for grouping purposes. This article does not address these "useful" machine learning biases; nor does it discuss more fundamental (historic/encultured) biases that stem from our overall socio-technical constraints and preferences.

Like most of you, I trust my intuition in areas where I have lots of experience, but I'd rather rely on data where I don't. In the latter case, I simply feel more confident making a decision when I have relevant data. But beware: the mere availability of data may give us a false sense of certainty.

In other words: whenever there is relevant data, we feel that our decision making must improve. But that ain't necessarily so. Sometimes data is the very reason why we make a wrong decision.

Why? Because data, its analysis (including AI), or the way that it is communicated or used may be severely biased or misleading. That is the bad news.

The good news is that you can cultivate a healthy skepticism against data biases: You can detect or even prevent such distortions and de-bias your data. You can immunize your analytics endeavors against these recurring thinking errors.

How? By knowing about them, by recognizing them in your analytics work, by understanding and addressing their root causes, and of course by knowing their remedies.

Here, then, in a nutshell, are ten crucial biases in the analytics process that every manager and analyst should know. If you are using data in HR processes, in risk assessment, in marketing or sales, in controlling, or in credit approval, then you want to make sure that you are steering clear of these pitfalls. In these contexts, any of the following biases can lead to decision fiascoes, as biased data may lead you to wrong recommendations.

The biases are structured along the data gathering, analysis, and application (i.e., communication and usage) process. From the large group of biases (see our interactive map at bias.visual-literacy.org for more than 180 of them) we have chosen the ten below for the following three reasons:

· We have seen that they occur frequently in the analytics process of many organizations.

· They have a big negative impact on the quality of analytics and the subsequent decisions.

· They can be prevented, as effective countermeasures exist against them.

In the next section, we discuss why these biases happen (their root causes), how you can recognize them (the symptoms), and of course how to fight them (their remedies) to improve the quality of data analytics.

Managers should be especially mindful of the first and third type of data biases, whereas analysts should (hopefully) already be aware of the biases in group II.


Figure 1: Ten analytics biases every professional should know.

I. DATA GATHERING BIASES

It sounds paradoxical, but one of the biggest mistakes that you can make in analytics is working only with the data that you already have.

It's like the story of the man searching for his keys under a streetlight: A passerby asks him whether he is sure that he lost them there, to which the man replies: "No, I lost them over there, but the lighting is much better over here." So just because you have data doesn't mean it's the right data for your decisions.

When looking for data, be aware of three specific biases that may distort your data sourcing: our tendency to use conveniently available data instead of the right data, our tendency to look at data that was completed rather than data that is still missing (think of customer surveys), and our tendency to seek data that confirms our initial opinion. Here is our snapshot of each data gathering bias:

· Selection bias:

o Description: our tendency to use conveniently available data instead of representative data (e.g., the participants in a study differ systematically from the population of interest).

o Root causes: time pressure, laziness, budget constraints, technical constraints.

o Symptoms: skewed data that does not represent the full spectrum of the underlying population (e.g., overly positive product evaluations), gaps between expected outcomes (e.g., successful product launch) and reality (e.g., the product flops).

o Remedies: examine your sampling approach and the inclusion/exclusion criteria that you apply, use randomization methods when selecting items from your population of interest.
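
To make the randomization remedy concrete, here is a minimal Python sketch (the `population` frame, its columns, and all figures are hypothetical): a seeded random sample keeps segment shares close to the population, whereas a convenience sample of whoever arrived first does not.

```python
import pandas as pd

# Hypothetical population frame: one row per customer we could survey.
population = pd.DataFrame({
    "customer_id": range(1000),
    "segment": ["online"] * 700 + ["retail"] * 300,
})

# Convenience sample: the first 100 rows (e.g., the earliest sign-ups),
# which differ systematically from the population.
convenience = population.head(100)

# Random sample: every customer has the same chance of being picked.
random_sample = population.sample(n=100, random_state=42)

# Quick representativeness check: compare segment shares.
print(population["segment"].value_counts(normalize=True))    # 70% / 30%
print(convenience["segment"].value_counts(normalize=True))   # 100% online
print(random_sample["segment"].value_counts(normalize=True)) # close to 70/30
```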

· Survivor bias:

o Description: this is a specific type of selection bias, namely focusing on the results that came through and ignoring those that did not. For example, analyzing only completed customer surveys and ignoring those that were abandoned halfway.

o Root causes: overlooked data collection opportunities, barriers to data completion at the source, a cumbersome data entry process.

o Symptoms: data is skewed (for example, only happy customers or really upset clients have answered the survey), gaps between expected outcomes and reality.

o Remedies: follow up on data sources that did not yield data and find alternative ways to achieve completion. If possible, make data entry a more seamless experience.
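
One practical check, sketched below with invented data (the column names `tenure_years` and `completed` are hypothetical): compare the respondents to the full invite list on attributes you know for everyone. A gap suggests the "survivors" are not typical.

```python
import pandas as pd

# Hypothetical survey data: everyone who was invited, with a flag
# for who actually completed the survey.
invited = pd.DataFrame({
    "customer_id":  range(8),
    "tenure_years": [1, 2, 2, 3, 5, 6, 8, 10],
    "completed":    [0, 0, 0, 1, 1, 1, 1, 1],
})

respondents = invited[invited["completed"] == 1]

# Response rate: how much of the picture is missing?
print(f"response rate: {invited['completed'].mean():.0%}")

# Compare respondents to the full invite list on attributes known
# for everyone -- a gap signals that the "survivors" are not typical.
print("mean tenure, all invited:", invited["tenure_years"].mean())
print("mean tenure, respondents:", respondents["tenure_years"].mean())
```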

· Confirmation bias:

o Description: data analysts sometimes only seek data to confirm their (or their managers') opinions.

o Root causes: social / peer pressure, opinionated mindset, overly homogenous analytics team, time pressure.

o Symptoms: data corresponds perfectly to one's own hypotheses ("too good to be true").

o Remedies: actively seek out contradictory data. Split up the data gathering and/or analysis tasks among two independent teams. Ask for data or variables that have been excluded from the analysis.
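
A lightweight variant of the two-teams remedy that even a single analyst can apply is an exploration/confirmation split, sketched here with invented inline data (the split ratio and column names are purely illustrative): form hypotheses freely on one half of the data, then test them exactly once on the held-out half.

```python
import pandas as pd

# Hypothetical campaign data (inline for the sketch).
data = pd.DataFrame({
    "channel": ["mail", "web", "mail", "web", "mail", "web", "mail", "web"],
    "conversion": [0.10, 0.30, 0.12, 0.28, 0.09, 0.33, 0.11, 0.29],
})

# Shuffle once, then split: explore on one half, confirm on the other.
shuffled = data.sample(frac=1.0, random_state=7).reset_index(drop=True)
explore = shuffled.iloc[: len(shuffled) // 2]
confirm = shuffled.iloc[len(shuffled) // 2:]

# Form hypotheses freely on `explore`; any pattern you want to report
# must also hold up on `confirm`, which you inspect exactly once.
print(explore.groupby("channel")["conversion"].mean())
print(confirm.groupby("channel")["conversion"].mean())
```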

II. DATA ANALYSIS BIASES

Once you have de-biased your data gathering approach, make sure that you also immunize your data analysis against typical statistical biases. These classic statistics blunders are not just the result of sloppy thinking. They may also result from a naïve treatment of data, or an overly narrow analysis focus. Here are our top four data analysis biases:

· Confounding variables:

o Description: not taking into account variables that affect the association between two things, resulting in a mixing of effects: thinking that a drives b just because a and b move in the same direction (e.g., swimming pool visits do not drive ice cream sales; both are driven by hot temperatures).

o Root causes: incomplete hypotheses or models.

o Symptoms: spurious associations among variables, or no observed association although it would be reasonable to assume that there is one.

o Remedies: measure and report all variables that might affect an outcome, include potential confounding variables in your analyses, and provide adjusted estimates for associations after the effects of the confounder have been removed.
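
The swimming-pool example can be simulated in a few lines of Python (all numbers are invented for illustration). This sketch shows how the raw correlation looks impressive, and how the adjusted estimate, a regression that also includes temperature, reveals that pool visits carry no effect of their own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data following the article's example: hot days drive BOTH
# swimming pool visits and ice cream sales; pools do not drive sales.
temperature = rng.normal(25, 5, n)
pool_visits = 2.0 * temperature + rng.normal(0, 3, n)
ice_cream   = 3.0 * temperature + rng.normal(0, 3, n)

# The raw correlation looks impressive -- and is entirely spurious.
print("corr(pool, ice cream):", np.corrcoef(pool_visits, ice_cream)[0, 1])

# Adjusted estimate: regress ice cream on pool visits AND temperature.
X = np.column_stack([np.ones(n), pool_visits, temperature])
coef, *_ = np.linalg.lstsq(X, ice_cream, rcond=None)
print("pool coefficient after adjusting for temperature:", coef[1])  # ~0
```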

· Neglecting outliers:

o Description: not acknowledging outliers (radically different items in a sample) at all or simply eliminating them.

o Root causes: exotic or extreme items in data sets that go unchecked.

o Symptoms: when you plot your data, you see a few items that are far apart from the rest.

o Remedies: identify outliers and their impact on the data's descriptive statistics, use appropriate measures of central tendency (e.g., median instead of mean), run analyses without the outliers and compare results.
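
A minimal numpy sketch of these remedies, with invented salary figures: the mean is dragged up by a single extreme value, the median is not, and the common 1.5 × IQR rule flags the outlier for a with/without comparison.

```python
import numpy as np

# Nine ordinary salaries plus one extreme value (figures invented).
salaries = np.array([48, 50, 52, 51, 49, 53, 47, 50, 52, 400])

print("mean:  ", salaries.mean())      # 85.2 -- dragged up by one value
print("median:", np.median(salaries))  # 50.5 -- robust to the outlier

# A common flagging rule: values beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
mask = (salaries >= q1 - 1.5 * iqr) & (salaries <= q3 + 1.5 * iqr)
print("flagged outliers:", salaries[~mask])

# Re-run the analysis without the outliers and compare the results.
print("mean without outliers:", salaries[mask].mean())
```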

· Normality bias:

o Description: not taking the actual distribution of the sample into account (for example, an employee survey where most employees are quite happy with their working conditions, so the scores cluster at the top of the scale).

o Root causes: assuming a normal distribution for a data set (even if it is not a bell curve) and running statistical tests that are valid only for normally distributed data.

o Symptoms: unreliable quality indicators for the statistical tests.

o Remedies: examine the real frequency distribution of the sample and run tests that fit that kind of distribution.
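
Here is a minimal sketch of this remedy in Python (the satisfaction scores are simulated, and the 0.05 threshold is just a convention): check the distribution first, then pick a test that fits it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical satisfaction scores for two departments; capped at 10,
# so the scores pile up near the top and are not bell-shaped.
dept_a = np.clip(rng.normal(8.5, 1.5, 40), 1, 10)
dept_b = np.clip(rng.normal(7.5, 1.5, 40), 1, 10)

# Check the distribution before reaching for a t-test.
_, p_norm = stats.shapiro(dept_a)
if p_norm < 0.05:
    # Normality rejected: use a rank-based (non-parametric) test.
    _, p = stats.mannwhitneyu(dept_a, dept_b)
    print(f"Mann-Whitney U: p = {p:.3f}")
else:
    _, p = stats.ttest_ind(dept_a, dept_b)
    print(f"t-test: p = {p:.3f}")
```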

· Overfitting:

o Description: tweaking models until they fit the data we have perfectly, but generalize poorly beyond it.

o Root causes: a limited data sample, a model that is too specific (too many parameters for the data at hand).

o Symptoms: a seemingly perfect model that accommodates all the available data, but is bad at predicting future observations (beyond the dataset).

o Remedies: collect additional data to extend and re-validate the model, remove variables that do not really have a relationship with the outcome.
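
Overfitting is easy to demonstrate with held-out data. In this numpy sketch (all data simulated, degrees chosen for illustration), a high-degree polynomial fits the training points almost perfectly but predicts the held-out points far worse than a simple line.

```python
import numpy as np

rng = np.random.default_rng(2)

# A simple linear truth with noise (all data simulated).
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 2, 30)

# Hold out part of the data BEFORE fitting.
x_train, y_train = x[:20], y[:20]
x_test,  y_test  = x[20:], y[20:]

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    # The high-degree model fits the training data almost perfectly
    # but typically predicts the held-out points far worse.
    print(f"degree {degree}: train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")
```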

III. DATA APPLICATION BIASES

Data has no value if it is not properly communicated and used. The last step in the analytics process – communication and use – is thus of special importance. In this crucial step several things can go wrong: the data analysts could communicate their results badly (incomprehensibly, using jargon), or the managers could misinterpret the results (because they overestimate their own data literacy or confuse correlation with causation). The corresponding biases read as follows:

· Curse of knowledge:

o Description: analysts fail to adequately communicate (simplify) their analyses to managers because they have forgotten how complex their procedures are.

o Root causes: lack of knowledge about the target groups of the analysis, lack of data storytelling skills.

o Symptoms: puzzled looks on managers' faces, off-topic questions, lack of follow-up.

o Remedies: the grandma test (how would you explain it to your grandmother?). Seek feedback from managers on what they find most difficult to understand. Communication training for data scientists.

· Dunning-Kruger effect:

o Description: managers overestimate their grasp of statistics at times and are unaware of their wrong data interpretation or use.

o Root causes: overoptimism of managers regarding their own statistics understanding.

o Symptoms: superficial data conversations.

o Remedies: the first rule of the Dunning-Kruger club is that you don't know that you're a member of it, so enable managers to pre-test their data literacy and discover their knowledge gaps. Ask them challenging questions so that they can see the limitations of their own statistics know-how (in a face-saving way).

· Causation bias:

o Description: believing that one factor causes another simply because the two are correlated (for example, employee turnover and sales).

o Root causes: limited statistics understanding.

o Symptoms: "strange" relationships that contradict common sense, a design that does not allow for such inferences (e.g., because data was not gathered using strict experimental methods).

o Remedies: inform managers about the difference between correlation and causation. Show the additional tests that need to be made to establish causation (beyond correlation).
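
One reason correlation cannot establish causation is that it is perfectly symmetric, as this small simulation illustrates (all data invented): whether a drives b or b drives a, the correlation coefficient looks the same, so only the study design, not the statistic, can settle the direction of influence.

```python
import numpy as np

rng = np.random.default_rng(3)

# Scenario 1: a genuinely drives b.
a = rng.normal(0, 1, 1000)
b = 0.8 * a + rng.normal(0, 0.6, 1000)

# Scenario 2: d genuinely drives c -- the direction is reversed.
d = rng.normal(0, 1, 1000)
c = 0.8 * d + rng.normal(0, 0.6, 1000)

# Correlation is symmetric: it looks identical in both scenarios and
# therefore cannot reveal which variable drives which (nor rule out a
# third factor driving both).
print("corr(a, b):", np.corrcoef(a, b)[0, 1])
print("corr(c, d):", np.corrcoef(c, d)[0, 1])
```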

You now know ten of the most relevant biases in the analytics context. Use this knowledge wisely: Bust those biases, detect those distortions, and control the quality of your data-based decision making.

Inform both managers and data analysts about these risks and their remedies, and install safeguards or countermeasures wherever possible. First and foremost, however, protect yourself against the specific biases that are most likely to affect you (take our self-test for this). The following famous Shakespeare quote is a useful reminder of this last point:

"A fool thinks himself to be wise, but a wise man knows himself to be a fool."

Further biases with a relevance for analytics can be found here:

https://data36.com/statistical-bias-types-explained/#:~:text=The%20most%20important%20statistical%20bias%20types,-There%20is%20a&text=These%20are%3A,Recall%20bias

https://blogs.oracle.com/analytics/10-cognitive-biases-in-business-analytics-and-how-to-avoid-them

https://www.allerin.com/blog/avoiding-bias-in-data-analytics

https://medium.com/de-bijenkorf-techblog/cognitive-biases-in-data-analytics-b53ea3f688e4

A great article by my St. Gallen colleagues on the key biases in machine learning can be found here:

https://aisel.aisnet.org/cgi/viewcontent.cgi?article=1166&context=wi2021

