The loneliness of the penguin: data science and cognitive biases

I recently came across an interesting article, The Psychology of Data Science, by Lisa Christina Winter. Her contribution deals with the fundamental issue of cognitive biases. In particular, she mentions confirmation bias, “the human tendency to confirm, rather than disconfirm, pre-existing hypotheses”. Since it is one of the most common biases, it is worth investigating in more depth.

As Karl Popper insisted in his outline of the scientific method, hypotheses can be falsified (proved to be false) but not verified (proved to be true). No matter how much evidence is collected to corroborate (i.e. to support) a hypothesis, it can never be proved true, because we cannot rule out that sooner or later some counter-evidence will be found. Even a single contrary observation is enough to prove that the hypothesis is false:

  • Hypothesis: All birds fly.
  • Evidence: doves, crows, pigeons (corroboration).
  • Counter-evidence: a penguin (falsification).
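The asymmetry between corroboration and falsification can be sketched in a few lines of Python. This is only an illustration: the `CAN_FLY` table and the helper names `flies` and `find_counterexample` are assumptions made up for this example. Any number of confirming observations leaves the universal claim merely unrefuted, while a single penguin refutes it.

```python
from typing import Callable, Iterable, Optional

# Hypothetical observations, invented for this illustration.
CAN_FLY = {"dove": True, "crow": True, "pigeon": True, "penguin": False}


def flies(species: str) -> bool:
    """Look up whether the observed species flies."""
    return CAN_FLY[species]


def find_counterexample(
    observations: Iterable[str], hypothesis: Callable[[str], bool]
) -> Optional[str]:
    """Return the first observation that contradicts the hypothesis, if any.

    A result of None means "not falsified so far", never "proved true".
    """
    for observation in observations:
        if not hypothesis(observation):
            return observation
    return None


corroborating = ["dove", "crow", "pigeon"]
print(find_counterexample(corroborating, flies))                # None
print(find_counterexample(corroborating + ["penguin"], flies))  # penguin
```

Note that the function can only ever report a refutation or the absence of one; there is no return value that means "verified".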

Nonetheless, a scientist may unconsciously try to confirm a hypothesis in different ways (if this is done consciously, it becomes an ethical problem). The researcher may select a sample which is not properly randomized, and therefore does not represent the general population. If only flying birds are selected (for example because the sample is small), then the study suffers from sampling bias. Similarly, in a review study, only papers favoring a hypothesis may be collected (selection bias). Sampling bias and selection bias are almost synonyms, but the former usually refers to the choice of the data to be collected, while the latter refers to the choice of the results to be considered.
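A small simulation may make the effect of a non-random sample more concrete. It is a sketch under invented assumptions: the population size, the 1% share of flightless birds and all function names are hypothetical, chosen only for illustration. A sample drawn only from birds seen in the air can never falsify the hypothesis, however large it is, whereas a random sample can.

```python
import random


def make_population(n_flying: int = 9_900, n_flightless: int = 100) -> list:
    """Build a hypothetical bird population; True means the bird can fly."""
    return [True] * n_flying + [False] * n_flightless


def random_sample(population: list, k: int, seed: int = 1) -> list:
    """Draw k birds uniformly at random, without replacement."""
    return random.Random(seed).sample(population, k)


def biased_sample(population: list, k: int) -> list:
    """Draw k birds only among those observed flying (a biased procedure)."""
    flying_only = [bird for bird in population if bird]
    return flying_only[:k]


def hypothesis_survives(sample: list) -> bool:
    """'All birds fly' survives only if every sampled bird flies."""
    return all(sample)


population = make_population()
print(hypothesis_survives(biased_sample(population, 1_000)))  # True, by construction
print(hypothesis_survives(random_sample(population, 1_000)))
print(hypothesis_survives(random_sample(population, 5)))
```

The second and third results depend on the draw: a large random sample is very likely to include a flightless bird, while a sample of five will usually miss them, which is how a small sample can end up confirming the hypothesis by accident.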

In other cases, scientists may hold on to the initial hypothesis, no matter how much counter-evidence is collected, because they rely too heavily on the first information they acquire. For example, they may never have been to Antarctica, so they have only ever observed flying birds. This bias is called anchoring, or insufficient adjustment.

Ad hoc hypotheses may be used to anchor to a prior hypothesis:

  • Hypothesis: All birds fly.
  • Counter-evidence: a penguin.
  • Ad hoc hypothesis: Penguins are not birds.

Note that hypotheses are usually general, while evidence is particular, as in classic deduction. Ad hoc hypotheses are not necessarily false, but their adoption in order to make a hypothesis more robust to falsification may imply a confirmation bias. To mitigate this risk, explanations should be kept as simple as possible, and the number of necessary hypotheses minimized (adopting the so-called Ockham's razor).

Similarly, judges may confirm the presumption of innocence, even if the collected evidence is against the suspect, according to the general assumption that in democratic countries a criminal going free (false negative) is preferred to an innocent being convicted (false positive). A famous argument used to confirm a hypothesis is the No True Scotsman, in which a general claim is redefined ad hoc so that the counterexample no longer counts.

The opposite of anchoring, the base rate fallacy, may happen in medicine. Doctors may hospitalize a patient who shows some symptoms of an illness, even if the prevalence of that illness (the base rate or relative frequency) is very low in the general population (i.e. the disease is rare). The base rate is more likely to be neglected if the suspected illness is life-threatening. In an article in Psychology Today, JI Krueger suggested that the two opposing biases (anchoring and base rate fallacy) could work together to mitigate each other.
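The weight of the base rate can be shown with Bayes' theorem. The figures below are invented purely for illustration (a prevalence of 1 in 1,000, a symptom present in 90% of the ill and in 5% of the healthy), and the function name is an assumption of this sketch, not a clinical tool.

```python
def probability_ill_given_symptom(
    prevalence: float, sensitivity: float, false_positive_rate: float
) -> float:
    """Bayes' theorem: P(ill | symptom).

    prevalence          -- P(ill), the base rate in the general population
    sensitivity         -- P(symptom | ill)
    false_positive_rate -- P(symptom | healthy)
    """
    ill_with_symptom = sensitivity * prevalence
    healthy_with_symptom = false_positive_rate * (1.0 - prevalence)
    return ill_with_symptom / (ill_with_symptom + healthy_with_symptom)


# Hypothetical rare disease: the symptom is suggestive, yet illness stays unlikely.
rare = probability_ill_given_symptom(0.001, 0.90, 0.05)
print(f"{rare:.1%}")  # 1.8%

# Same symptom, but a hypothetical common disease: the picture changes entirely.
common = probability_ill_given_symptom(0.5, 0.90, 0.05)
print(f"{common:.1%}")  # 94.7%
```

With identical symptoms, the probability of illness moves from under 2% to almost 95% only because the base rate changes, which is exactly the information the fallacy discards.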

Nonetheless, confirmation bias seems to be intrinsic to the classic deductive approach to inference, which relies on a priori hypotheses (and therefore on assumptions or prejudices). Thus, part of the problem (and of the solution) may be methodological, not only psychological. For example, exploratory data analysis (i.e. hypothesis finding) can be performed before confirmatory data analysis (i.e. hypothesis testing) in order to evaluate new hypotheses which are suggested by the data themselves.
