Data Science Bias; Lying Computers
I read an absolutely brilliant thriller by Terry Hayes called I Am Pilgrim, in which one of the characters, while reading an analytics report, comments: "Computers don't lie, but liars can compute". Before I put down my thoughts on that comment: do read the book. It is fantastic, and Terry Hayes knows how to tell a story. He wrote the screenplays for two of my favorite films, Mad Max 2 and Dead Calm, both thrillers that were way above the norm.
I am a data scientist. No, correct that: I am a statistician, and I am starting to come to terms with what Data Science is as compared to Statistical Science. Once I have a better handle on that, I will write about it, but today I want to think about the comment "Computers don't lie, but liars can compute". I want to adapt it a bit and move the focus from "a lie" to "bias". My adapted comment would be:
"Computers are not biased, but biased people can compute"
The above statement deserves serious thought in the context of data science. In my current understanding, a good data scientist needs, besides being a good statistician: (1) the ability to think and act seamlessly across business problems, analytical problems, analytical solutions and business solutions; (2) the ability to develop and implement smart algorithms and programs; (3) data assimilation from varied data sources; (4) data visualization and storytelling. Based on this, you can see that a data scientist is exposed to multiple sources of bias. As a statistician, you are formally taught about biases and how to avoid or minimize them, but a data scientist has to worry about other sources of bias, not only statistical bias. This risk is important to acknowledge, and that is why you can say that computers are unbiased, but biased people who can compute (including yourself) are problematic. Below are some sources of bias that should be taken care of so that the effect of "biased people computing" can be minimized.
Bias due to ...
- Using the wrong statistical technique
- Using a programming environment that is not fit for purpose, or not having access to the best tool for the job
- Answering the "wrong version" of the business problem and implementing a "sub-optimal business solution" despite solving the associated analytical problem correctly
- Assimilating data that was collected for a purpose other than the business problem at hand; assimilating too much data; leaving out data sources because of constraints, which could be business related or due to personal bias
- Showing selective data that makes the underlying story either too clear or too obscure; telling the associated story without the transparency needed to make it credible as unbiased
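To make the last bucket concrete, here is a minimal sketch of how "showing selective data" changes the story. The monthly figures and the `summarize` helper are invented purely for illustration; they are assumptions, not data or code from any real project.

```python
from statistics import mean

# Hypothetical monthly conversion rates (%), made up for this illustration.
monthly_rates = {
    "Jan": 2.1, "Feb": 1.9, "Mar": 2.0, "Apr": 3.4,
    "May": 2.2, "Jun": 3.6, "Jul": 1.8, "Aug": 2.0,
}


def summarize(rates, months=None):
    """Return the mean rate over the chosen months (all months if None)."""
    chosen = rates if months is None else {m: rates[m] for m in months}
    return mean(chosen.values())


# The same computation, run on all of the data and on a hand-picked subset.
full_picture = summarize(monthly_rates)
cherry_picked = summarize(monthly_rates, months=["Apr", "Jun"])

print(f"All months:      {full_picture:.2f}%")
print(f"Selected months: {cherry_picked:.2f}%")
```

The computer does exactly the same arithmetic in both cases. The only difference is which months a person chose to pass in, and that choice is where the bias enters.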
As you can see, the above are very broad buckets of potential bias. On top of that, the bias can come from you, from someone higher up in the organization, from various constraints, from a gap in technical knowledge, or from outright lying by the people involved. That is why it is important to think carefully about "biased people can compute".
A simple way to do that, at least as a thought process, is to act like a sceptic of the highest order and question everything, but do it without becoming a cynic, a doubter.
Happy Unbiasing, as it will make data science all the more powerful!
Master’s in Business Administration | Project Data Manager | Global Business Solutions at Novartis
5 years ago: Magnificent!
Associate Director Compliance and Training at Novotech Clinical Research India Private Limited
5 years ago: Wonderful, Sir. I have noticed each of these biases while learning about and applying RWD/RWE. You have put it precisely.
Founder at CP+ Associates GmbH (Switzerland) and CEO at Pharmacometrics Africa NPC (South Africa)
5 years ago: Thoughtful post: "Computers don't lie, but liars can compute".
Founder and CEO of MeDaStats LLC | Statistician
5 years ago: Fantastic post, Ashwini!