Avoiding the Biggest Big Data Fallacy: The Belief That More Data Means Higher Accuracy, With Real-World Examples
The go-to response when your Machine Learning model gets low accuracy scores tends to be that you need more data, and it is easy to see why. Being the easiest and fastest way to improve a model, it is the first solution Data Scientists struggling to meet deadlines reach for. However, sometimes too much data can lead to serious consequences, from misleading outputs to straight-up discrimination. It is these issues that tend not to be discussed, which is the main point of this article. I would like to share some of the common traps that come from thinking that “More Data equals better results” and accompany them with real-world examples of the consequences that ensue.
Getting Lost in the Data
When I was studying to be a Data Scientist, datasets were provided for us to work on and to build models out of. I used to shudder at the contents of these datasets, which came with anywhere from 100 to over 300 columns of data. Imagine doing feature engineering on every column. So what did we do? We started with the following (a rough code sketch follows the list):
- Dropping columns where the majority of the data is missing
- Dropping columns that are highly correlated with other feature columns
- And when times are tough, PCA!
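Here is a minimal sketch of that column-shrinking workflow, assuming a pandas DataFrame named `df` and arbitrary thresholds (the names and numbers are my own illustration, not from any particular course dataset):

```python
import pandas as pd
from sklearn.decomposition import PCA

def shrink_features(df: pd.DataFrame, missing_thresh=0.5, corr_thresh=0.9, n_components=20):
    # 1. Drop columns where the majority of values are missing
    df = df.loc[:, df.isna().mean() < missing_thresh]

    # 2. Drop one column out of every highly correlated pair
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_thresh:
                to_drop.add(cols[j])
    df = df.drop(columns=to_drop)

    # 3. And when times are tough, PCA on whatever numeric columns remain
    numeric = df.select_dtypes("number").fillna(df.median(numeric_only=True))
    pca = PCA(n_components=min(n_components, numeric.shape[1]))
    return pca.fit_transform(numeric)
```

Note that the thresholds (50% missing, 0.9 correlation) are common defaults, but they are judgment calls, not principled answers.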
Once that was done, we looked at the remaining columns and let the accuracy score lead the way. Boy, was I wrong. The problem with this approach is that we tend not to ask the real question, which is:
What kind of data should I be asking for to answer my objective?
You see, I started with “This is the data I have, let’s see how it answers the problem at hand” when I should have started with “This is the problem, what information do I need to solve it?” By taking this route, we lay out the kind of data we need and then see which columns in the dataset best match those requirements. This prevents us from blindly using columns and getting lost in the data.
Speaking of getting lost in data, here is an example of data analysts comparing the wrong types of data with each other:
Here we have a study comparing the average life expectancy of artists from different genres with the average life expectancy for males and females in the country. At first glance you might think, “Oh my… it’s amazing that just a choice of music genre could severely impact life expectancy!” However, something seems off about this chart…
The female and male life expectancy lines seem to show an increase in life expectancy over time, which makes sense, but years are not on the horizontal axis! The horizontal axis is 14 musical genres. Why are the trends in U.S. life expectancy graphed against musical genres rather than against years?
Also, how is it possible that rappers and hip-hop artists on average live only until their 30s? Don’t some of them survive far longer? Well, it could be that the “age” of the genre was not considered.
Hip hop began in the late 1970s. People who began doing hip hop in 1980 at age 20 and are still alive would be 57 years old in 2017. Anyone who died before 2017 would be younger than 57. People who began doing hip hop after 1980 would be even younger. In contrast, blues, jazz, and country have been around long enough for performers to grow old.
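A quick toy simulation (entirely made-up numbers, not the actual study data) shows why a young genre’s observed “average age at death” is mechanically capped:

```python
import random

random.seed(42)

def observed_avg_age_at_death(genre_start_year, obs_year=2017, n=100_000):
    """Average age at death observable by obs_year, for artists who join the
    genre at age 20 in some random year after the genre begins."""
    observed = []
    for _ in range(n):
        debut_year = random.randint(genre_start_year, obs_year)
        birth_year = debut_year - 20
        lifespan = random.gauss(75, 10)      # everyone shares the same lifespan distribution
        death_year = birth_year + lifespan
        if death_year <= obs_year:           # only deaths that have already happened get averaged
            observed.append(lifespan)
        # Artists still alive in 2017 contribute nothing to the "age at death" average.
    return sum(observed) / len(observed)

print(round(observed_avg_age_at_death(1920), 1))  # mature genre: close to the true ~75
print(round(observed_avg_age_at_death(1980), 1))  # young genre: far lower, only early deaths are visible
```

Every simulated artist has the same life expectancy; the “die young” effect comes entirely from how recently the genre started.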
Never let your dataset drive your direction. Think critically and plan the direction you want to take.
Putting Data Before Critical Thinking
Have you heard about the Google Flu Project? It was a grand project by Google to use big data on searched words to accurately estimate the current level of weekly influenza activity in each region of the United States, with a proclaimed accuracy of 97.5%. Here is an excerpt about the project:
Google Flu Trends (GFT) was a web service operated by Google. It provided estimates of influenza activity for more than 25 countries. By aggregating Google Search queries, it attempted to make accurate predictions about flu activity. This project was first launched in 2008 by Google.org to help predict outbreaks of flu.
https://en.wikipedia.org/wiki/Google_Flu_Trends
This sounded amazing at the time! Google’s data-mining program looked at 50 million search queries and identified the 45 queries that were most closely correlated with the incidence of flu. It was pure-and-simple data mining and a terrific example of the Feynman Trap.
What is the Feynman Trap?
It is the act of ransacking data for patterns without any preconceived idea of what one is looking for.
It was praised by many Data Scientists, with an MIT professor explaining that “This seems like a really clever way of using data that is created unintentionally by the users of Google to see patterns in the world that would otherwise be invisible. I think we are just scratching the surface of what’s possible with collective intelligence.”
How wrong he was. While the model scored well on its training data, its real-world predictions had an accuracy of less than 50%, which makes it about as useful as flipping a coin. The project was eventually scrapped. Here was the reason:
One source of problems is that people making flu-related Google searches may know very little about how to diagnose flu; searches for flu or flu symptoms may well be researching disease symptoms that are similar to flu, but are not actually flu. Furthermore, analysis of search terms reportedly tracked by Google, such as "fever" and "cough", as well as effects of changes in their search algorithm over time, have raised concerns about the meaning of its predictions.
Basically, a bunch of data that may not have been related to the flu they were trying to estimate got piled into the dataset used to build the model. When the accuracy was high, they took it at face value.
One thing to remember is that with a huge amount of data, the probability of finding the pattern you want increases during training, but when it comes to actual application, do not be surprised when it fails you.
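This GFT-style failure mode is easy to reproduce in miniature: generate thousands of random “search-query” series that have nothing to do with the target, keep the handful that correlate best with it over the training window, and watch them fall apart on new data. The sketch below is purely illustrative, not Google’s actual methodology or data:

```python
import numpy as np

rng = np.random.default_rng(0)

n_weeks, n_queries = 100, 50_000
flu = rng.normal(size=n_weeks)                    # the "flu signal" (pure noise here)
queries = rng.normal(size=(n_queries, n_weeks))   # unrelated search-volume series

train, test = slice(0, 50), slice(50, 100)

# Correlation of every query with flu activity on the training weeks only
flu_tr = flu[train] - flu[train].mean()
q_tr = queries[:, train] - queries[:, train].mean(axis=1, keepdims=True)
corr_train = (q_tr @ flu_tr) / (np.linalg.norm(q_tr, axis=1) * np.linalg.norm(flu_tr))

best = np.argsort(-np.abs(corr_train))[:45]       # the 45 "most predictive" queries

# How do those same 45 queries do on weeks they were not selected on?
flu_te = flu[test] - flu[test].mean()
q_te = queries[best][:, test] - queries[best][:, test].mean(axis=1, keepdims=True)
corr_test = (q_te @ flu_te) / (np.linalg.norm(q_te, axis=1) * np.linalg.norm(flu_te))

print("mean |corr| on training weeks:", np.abs(corr_train[best]).mean())  # looks impressive
print("mean |corr| on test weeks:    ", np.abs(corr_test).mean())         # collapses toward zero
```

With enough candidate features, some will always correlate strongly with whatever you are predicting, purely by chance.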
Forcing the Data to work for you (or p-hacking)!
I tend to cringe whenever I hear a study being discussed on a news show, and I get more skeptical about what the results show. While I do understand that the topics brought up on these shows are meant to be provocative to drive views (things like whether “fats” are good for you or whether “milk” is killing you), sometimes I wish they checked whether the results were statistically significant before announcing them. As Data Scientists, that is usually the key metric we use to decide whether our results are acceptable. However, because of its importance as a metric for getting findings accepted by the public, it tends to be abused.
In 2011, a project called the Reproducibility Project was initiated by Brian Nosek to reproduce the results of 100 published papers that claimed statistical significance. The idea was to make visible the growing problem of failed reproducibility in the social sciences (it is a common problem, as incentives exist only for publishing new papers, not for checking existing ones). What the project found was that if one were to follow the steps outlined in these papers word for word, only a meager 35 (36.1%) of the publications could be replicated. Think about that the next time the news starts talking about the results of a study.
So why does this happen? It could be due to many reasons, but a common one is p-hacking: the practice of repeating an experiment until you get the golden p-value. Here is usually how it works. You have a huge dataset from which you take a sample to run your experiment on. At the end of the experiment, you find that your results are not statistically significant. So, do you report this? With p-hacking you don’t! Instead, you take another sample and repeat, and you keep doing this until you get statistically significant results. Think about it: how often do you read papers where the authors report their failure-to-success ratio? If you have 9 insignificant results and only 1 significant result, you’ll most likely only hear about that 1.
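Here is roughly what that looks like in code: draw two samples from populations where there is genuinely no difference, run a t-test, and simply keep resampling until a “significant” p-value shows up. The numbers are made up; the point is how little effort it takes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two groups drawn from the SAME distribution: the true effect is exactly zero.
population_a = rng.normal(loc=100, scale=15, size=100_000)
population_b = rng.normal(loc=100, scale=15, size=100_000)

attempts = 0
while True:
    attempts += 1
    sample_a = rng.choice(population_a, size=30, replace=False)
    sample_b = rng.choice(population_b, size=30, replace=False)
    _, p_value = stats.ttest_ind(sample_a, sample_b)
    if p_value < 0.05:          # the "golden" p-value
        break

print(f"'Significant' difference found after {attempts} attempts (p = {p_value:.3f})")
# On average this loop exits after about 20 attempts, even though there is nothing to find.
```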
Don’t p-hack!
Forgetting Regression Toward the Mean
This is a little difficult to explain, so let’s start with a simple way to show the effects of Regression Toward the Mean:
Imagine you have 5 golf players of equal skill and you ask each of them to attempt a short putt 10 times, with the following results being recorded:
Player 1: 9 out of 10
Player 2: 8 out of 10
Player 3: 8 out of 10
Player 4: 4 out of 10
Player 5: 5 out of 10
You would think that Players 1 to 3 are much better at putting than Players 4 and 5. However, if you were to repeat this experiment, what do you think the results would be? Would Players 1 to 3 score just as well, or would Players 4 and 5 outdo them in the next round?
That’s the effect of regression toward the mean. When we get a dataset, it is likely that we are seeing the effect of single occurrences (like round 1 of putting). If we were to repeat the experiment, the players’ performances would move closer to their mean level of performance, and that movement would not be reflected in our model.
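You can see this with a few lines of simulation: give five golfers the exact same true putting skill, record a round of 10 putts, then have them putt again. The “best” and “worst” players from round 1 tend to drift back toward the shared average in round 2. Purely illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

true_skill = 0.7          # every player genuinely sinks 70% of short putts
players, putts = 5, 10

round_1 = rng.binomial(putts, true_skill, size=players)
round_2 = rng.binomial(putts, true_skill, size=players)

order = np.argsort(-round_1)   # rank players by their round-1 score
for rank, p in enumerate(order, start=1):
    print(f"Round-1 rank {rank}: {round_1[p]}/10 putts  ->  round 2: {round_2[p]}/10 putts")
# The round-1 leaders usually slip and the stragglers usually improve,
# because every score is just noise around the same 70% skill level.
```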
An example of how ignoring this effect can be detrimental is a study that claimed a specific blood pressure drug could help students get better scores on the Scholastic Aptitude Test (SAT).
Basically, an experiment was done where students who had done poorly on the SAT the first time were given this drug before retaking the exam. What the study showed was that, on average, students’ test scores increased by 20 points if they took the drug. Now that sounds amazing, but is it?
There could be many other reasons for this increase; for example, students may have been too nervous the first time around, made silly mistakes on the test, and so performed further from their mean. Because these students were selected precisely because of their unusually low first scores, we would expect their retake scores to move back toward their mean regardless of any drug.
When you find that your model’s performance shows some slight variance from actual outcomes, check whether it’s a matter of regression toward the mean before throwing it out.
I will be sharing more about my journey in this area, and if you are interested in how it’s going, feel free to check me out at davidraj.tech! Also, do check out the book by Gary Smith and Jay Cordes titled “The 9 Pitfalls of Data Science”, which inspired this article. It goes into much more detail about how big data tends to mislead Data Scientists.