Epic Data Science Fails and how to avoid them

We blame “Data Science”, but it is rarely the science at fault and more commonly the humans. After all, Data Scientists talk about K-Means, Regression, TensorFlow, Neural Networks and Natural Language Processing (NLP) as if they will magically remove the potential for failure. But failure is still there and always has been. In fact, the propensity for failure was identified by the father of modern computing, Charles Babbage, and the term “Garbage In, Garbage Out” (GIGO) is usually credited to William D. Mellin, who used it in 1957 to explain it.

GIGO simply means that if there are errors in the input or training data, the output will be useless or, worse, misleading. Humans gather the training data, and their mistakes can be costly. Just a few million here and there!

Let’s take a look at four recent demonstrations of the principle from Google and Amazon amongst others.

Ignoring collectors – when data is collected, those who set up the collection know its limitations. Ignore them at your peril! An example: one company had an IoT sensor out in the field that collected humidity data. Having only 2% of measurements missing due to radio transmission errors didn’t seem to be a problem until they realised that transmission only failed when it rained. The missing 2% was therefore not a random sample: it was exactly the readings taken in the wettest conditions. Failure to understand how the data was collected led to erroneous correlations and incorrect predictions. In extreme cases, a correlation can be the result of the collection methodology rather than the item being measured.
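A quick way to spot this kind of problem is to compare the missing rate across conditions instead of looking only at the overall figure. The sketch below uses synthetic data and assumes a separate rain record is available for each reading; the function name, the 10% rain rate and the 20% loss rate are all hypothetical and are not taken from the company in the example.

```python
import random

def missing_rate_by_condition(readings):
    """Return {condition: fraction of readings that are missing (None)}."""
    totals, missing = {}, {}
    for condition, value in readings:
        totals[condition] = totals.get(condition, 0) + 1
        if value is None:
            missing[condition] = missing.get(condition, 0) + 1
    return {c: missing.get(c, 0) / totals[c] for c in totals}

# Synthetic data: a hypothetical sensor that loses readings only when it rains.
rng = random.Random(0)
readings = []
for _ in range(10_000):
    raining = rng.random() < 0.10                # assumed: rain 10% of the time
    humidity = rng.uniform(85, 100) if raining else rng.uniform(30, 70)
    lost = raining and rng.random() < 0.20       # assumed: 20% loss in rain
    readings.append(("rain" if raining else "dry", None if lost else humidity))

overall = sum(value is None for _, value in readings) / len(readings)
rates = missing_rate_by_condition(readings)

print(f"overall missing rate: {overall:.1%}")
for condition, rate in rates.items():
    print(f"missing rate when {condition}: {rate:.1%}")
```

A small overall missing rate that is concentrated entirely in one condition is a sign that the gaps are systematic, and that any average computed from the surviving readings will be biased.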

Correlation v. Causation – Correlation is a measure of a relationship or connection between two or more things. Causation is where one event contributes to the production of another. Just because two things are highly correlated does not mean one caused the other. The problem occurs because it is extremely easy to find correlations between measures: simply do enough analysis and you’ll find some correlations purely by chance. For example, in “Spurious correlations: Margarine linked to divorce?” the author found an example of a 99% correlation between margarine consumption and divorce. Does that mean there is a link? (Just in case you’re unsure, the answer is “No”!)
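To see how easily chance correlations appear, the sketch below compares one random series against 1,000 other random series and reports the strongest correlation found. Everything here is synthetic noise; the series length of 10 points and the count of 1,000 candidates are arbitrary assumptions chosen for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
n_points = 10  # e.g. ten yearly observations

# One "target" measure and 1,000 unrelated candidate measures, all pure noise.
target = [rng.gauss(0, 1) for _ in range(n_points)]
candidates = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(1000)]

best = max(abs(pearson(target, c)) for c in candidates)
print(f"strongest correlation among 1,000 random series: {best:.2f}")
```

None of these series has anything to do with the target, yet the best match will typically look impressive. The more measures you compare, the stronger the best accidental match becomes.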

Poor model validation – Considerable thought must be given to the data that is used to validate a model, as Amazon discovered with its new recruiting engine. It turned out that the majority of the training and validation data came from men, so the AI algorithm ended up being biased against women. In fact, the bias was learned so thoroughly that even after all the pronouns were changed to female, the algorithm still picked out the men! The AI project was ultimately abandoned. To improve model validation, consider the issue at the beginning of the project: does your data represent the real world, or does it have in-built bias that your predictions could exacerbate?
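One simple check before trusting a validation score is to break it down by group: how much of the validation data does each group contribute, and how well does the model do on each? The sketch below is a minimal illustration with made-up toy records; the function name, the group labels and the numbers are hypothetical and have nothing to do with Amazon's actual data.

```python
from collections import Counter

def group_report(records):
    """records: list of (group, actual, predicted) tuples.

    Returns {group: (share_of_data, accuracy_within_group)}.
    """
    counts = Counter(group for group, _, _ in records)
    correct = Counter(group for group, actual, pred in records if actual == pred)
    total = sum(counts.values())
    return {g: (counts[g] / total, correct[g] / counts[g]) for g in counts}

# Toy validation set (made up): 8 records from men, 2 from women.
records = (
    [("men", 1, 1)] * 6 + [("men", 1, 0)] * 2
    + [("women", 1, 1)] * 1 + [("women", 1, 0)] * 1
)

report = group_report(records)
for group, (share, accuracy) in report.items():
    print(f"{group}: {share:.0%} of validation data, accuracy {accuracy:.0%}")
```

In this toy data the overall accuracy is 70%, which hides the fact that the under-represented group is both a small share of the data and much worse served by the model.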

Out-of-band data – data that appears to be a long way from the norm should be investigated and resolved rather than simply ignored as a matter of routine policy. For example, in Steven Levy’s “In the Plex: How Google Thinks, Works, and Shapes Our Lives”, Reese explains how this type of error cost Google’s ISP millions. Google needed to transmit huge quantities of data from the west coast to the east coast of the USA, so they rented a fibre connection. Full usage of the fibre would have cost $250,000 per month, but they exploited a loophole in the billing process. The ISP removed all “outlier” bandwidth measurements and charged for the remainder. So, Google transferred all their data in 24 hours per month, which meant that, after outliers were removed, they apparently used no bandwidth. The ISP charged them zero! Epic Fail! The lesson: review why the outliers exist before removing or ignoring them.
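The arithmetic behind the loophole can be sketched in a few lines. The article does not say exactly how the ISP defined an outlier, so the code below assumes a percentile-style rule: discard the top 5% of hourly samples and bill on the highest one that remains. The function name, the 5% threshold and the traffic figures are all assumptions for illustration.

```python
def billable_rate(samples, discard_fraction=0.05):
    """Bill on the highest sample left after discarding the top fraction."""
    ordered = sorted(samples)
    n_discard = int(len(ordered) * discard_fraction)
    kept = ordered[: len(ordered) - n_discard]
    return kept[-1]

hours_in_month = 30 * 24           # 720 hourly samples
burst = [10_000] * 24              # hypothetical: 10,000 Mbps for 24 hours
idle = [0] * (hours_in_month - 24)
samples = burst + idle

average = sum(samples) / len(samples)
print(f"average usage: {average:.0f} Mbps")
print(f"billable rate after outlier removal: {billable_rate(samples)} Mbps")
```

Under this assumed rule, 24 busy hours out of 720 is about 3.3% of the samples, so every busy hour falls inside the discarded 5% and the bill is based on zero. Had the ISP asked why those samples were so extreme, it would have seen that they were the only real traffic on the link.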

When you next see an unusual result, go to the source. Have we ignored contrary data? Have we mixed correlation and causation? Did we validate the model correctly? Did we exclude data which could lead to a dramatically different result?
