Avoiding the Biggest Big Data Fallacy: The Belief That More Data Means Higher Accuracy, With Real-World Examples
The go-to response when your Machine Learning model gets low accuracy scores tends to be that you need more data, and it is easy to see why. Being the easiest and fastest way to improve a model, it is the first solution Data Scientists struggling to meet deadlines reach for. However, sometimes too much data can lead to serious consequences, from misleading outputs to straight-up discrimination. It is these issues that tend not to be discussed, which is the main point of this article. I would like to share some of the common traps that come from thinking that “More Data equals better results” and accompany them with real-world examples of the consequences that ensue.
Getting Lost in the Data
When I was studying to be a Data Scientist, datasets were provided for us to work on and to build models out of. I used to shudder at the contents of these datasets, which came with anywhere from 100 to over 300 columns of data. Imagine doing feature engineering on every column. So what did we do? We started with the following (a rough code sketch follows the list):
- Dropping columns where the majority of the data is missing
- Dropping columns that are highly correlated with other feature columns
- And when times are tough, PCA!
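Here is a minimal sketch of that column-shrinking workflow, assuming a pandas DataFrame named `df` and arbitrary thresholds (the names and numbers are my own illustration, not from any particular course dataset):

```python
import pandas as pd
from sklearn.decomposition import PCA

def shrink_features(df: pd.DataFrame, missing_thresh=0.5, corr_thresh=0.9, n_components=20):
    # 1. Drop columns where the majority of values are missing
    df = df.loc[:, df.isna().mean() < missing_thresh]

    # 2. Drop one column out of every highly correlated pair
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_thresh:
                to_drop.add(cols[j])
    df = df.drop(columns=to_drop)

    # 3. And when times are tough, PCA on whatever numeric columns remain
    numeric = df.select_dtypes("number").fillna(df.median(numeric_only=True))
    pca = PCA(n_components=min(n_components, numeric.shape[1]))
    return pca.fit_transform(numeric)
```

Note that the thresholds (50% missing, 0.9 correlation) are common defaults, but they are judgment calls, not principled answers.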
Once that was done, we looked at the remaining columns and let the accuracy score lead the way. Boy, was I wrong. The problem with this approach is that we tend not to ask the real question, which is:
What kind of data should I be asking for to answer my objective?
You see, I started with “This is the data I have, let’s see how it answers the problem at hand” when I should have started with “This is the problem, what information do I need to solve it?” By taking this route, we lay out the kind of data we need and then see which columns in the dataset best match those requirements. This prevents us from blindly using columns and getting lost in the data.
Speaking of getting lost in data, here is an example of data analysts comparing the wrong types of data with each other:
Here we have a study comparing the average life expectancy of artists from different genres with the average life expectancy for males and females in the country. At first glance you might think, “Oh my… it’s amazing that just a choice of music genre could severely impact life expectancy!” However, something seems off about this chart…
The female and male life expectancy lines seem to show an increase in life expectancy over time, which makes sense, but years are not on the horizontal axis! The horizontal axis is 14 musical genres. Why are the trends in U.S. life expectancy graphed against musical genres rather than against years?
Also, how is it possible that rappers and hip-hop artists on average live only until their 30s? Don’t some of them survive far longer? Well, it could be that the “age” of the genre was not considered.
Hip hop began in the late 1970s. People who began doing hip hop in 1980 at age 20 and are still alive would be 57 years old in 2017. Anyone who died before 2017 would be younger than 57. People who began doing hip hop after 1980 would be even younger. In contrast, blues, jazz, and country have been around long enough for performers to grow old.
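A quick toy simulation (entirely made-up numbers, not the actual study data) shows why a young genre’s observed “average age at death” is mechanically capped:

```python
import random

random.seed(42)

def observed_avg_age_at_death(genre_start_year, obs_year=2017, n=100_000):
    """Average age at death observable by obs_year, for artists who join the
    genre at age 20 in some random year after the genre begins."""
    observed = []
    for _ in range(n):
        debut_year = random.randint(genre_start_year, obs_year)
        birth_year = debut_year - 20
        lifespan = random.gauss(75, 10)      # everyone shares the same lifespan distribution
        death_year = birth_year + lifespan
        if death_year <= obs_year:           # only deaths that have already happened get averaged
            observed.append(lifespan)
        # Artists still alive in 2017 contribute nothing to the "age at death" average.
    return sum(observed) / len(observed)

print(round(observed_avg_age_at_death(1920), 1))  # mature genre: close to the true ~75
print(round(observed_avg_age_at_death(1980), 1))  # young genre: far lower, only early deaths are visible
```

Every simulated artist has the same life expectancy; the “die young” effect comes entirely from how recently the genre started.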
Never let your dataset drive your direction. Think critically and plan the direction you want to take.
Putting Data Before Critical Thinking
Have you heard about the Google Flu Project? It was a grand project by Google to use big data on searched words to accurately estimate the current level of weekly influenza activity in each region of the United States, with a proclaimed accuracy of 97.5%. Here is an excerpt about the project:
Google Flu Trends (GFT) was a web service operated by Google. It provided estimates of influenza activity for more than 25 countries. By aggregating Google Search queries, it attempted to make accurate predictions about flu activity. This project was first launched in 2008 by Google.org to help predict outbreaks of flu.
https://en.wikipedia.org/wiki/Google_Flu_Trends
This sounded amazing at the time! Google’s data-mining program looked at 50 million search queries and identified the 45 queries that were most closely correlated with the incidence of flu. It was pure-and-simple data mining and a terrific example of the Feynman Trap.
What is the Feynman Trap?
It is the act of ransacking data for patterns without any preconceived idea of what one is looking for.
It was praised by many Data Scientists, with an MIT professor explaining that “This seems like a really clever way of using data that is created unintentionally by the users of Google to see patterns in the world that would otherwise be invisible. I think we are just scratching the surface of what’s possible with collective intelligence.”
How wrong he was. While the model scored well on its training data, its real-world predictions had an accuracy of less than 50%, which makes it about as useful as flipping a coin. The project was eventually scrapped. Here was the reason:
One source of problems is that people making flu-related Google searches may know very little about how to diagnose flu; searches for flu or flu symptoms may well be researching disease symptoms that are similar to flu, but are not actually flu. Furthermore, analysis of search terms reportedly tracked by Google, such as "fever" and "cough", as well as effects of changes in their search algorithm over time, have raised concerns about the meaning of its predictions.
Basically, a bunch of data that may not have been related to the flu they were trying to estimate got piled into the dataset used to build the model. When the accuracy was high, they took it at face value.
One thing to remember is that with a huge amount of data, the probability of finding the pattern you want increases during training, but when it comes to actual application, do not be surprised when it fails you.
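This GFT-style failure mode is easy to reproduce in miniature: generate thousands of random “search-query” series that have nothing to do with the target, keep the handful that correlate best with it over the training window, and watch them fall apart on new data. The sketch below is purely illustrative, not Google’s actual methodology or data:

```python
import numpy as np

rng = np.random.default_rng(0)

n_weeks, n_queries = 100, 50_000
flu = rng.normal(size=n_weeks)                    # the "flu signal" (pure noise here)
queries = rng.normal(size=(n_queries, n_weeks))   # unrelated search-volume series

train, test = slice(0, 50), slice(50, 100)

# Correlation of every query with flu activity on the training weeks only
flu_tr = flu[train] - flu[train].mean()
q_tr = queries[:, train] - queries[:, train].mean(axis=1, keepdims=True)
corr_train = (q_tr @ flu_tr) / (np.linalg.norm(q_tr, axis=1) * np.linalg.norm(flu_tr))

best = np.argsort(-np.abs(corr_train))[:45]       # the 45 "most predictive" queries

# How do those same 45 queries do on weeks they were not selected on?
flu_te = flu[test] - flu[test].mean()
q_te = queries[best][:, test] - queries[best][:, test].mean(axis=1, keepdims=True)
corr_test = (q_te @ flu_te) / (np.linalg.norm(q_te, axis=1) * np.linalg.norm(flu_te))

print("mean |corr| on training weeks:", np.abs(corr_train[best]).mean())  # looks impressive
print("mean |corr| on test weeks:    ", np.abs(corr_test).mean())         # collapses toward zero
```

With enough candidate features, some will always correlate strongly with whatever you are predicting, purely by chance.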
Forcing the Data to work for you (or p-hacking)!
I tend to cringe whenever I hear a study being discussed on a news show, and I get more skeptical about what the results show. While I do understand that the topics brought up on these shows are meant to be provocative to drive views (things like whether “fats” are good for you or whether “milk” is killing you), sometimes I wish they checked whether the results were statistically significant before announcing them. As Data Scientists, that is usually the key metric we use to decide whether our results are acceptable. However, because of its importance as a metric for getting findings accepted by the public, it tends to be abused.
In 2011, a project called the Reproducibility Project was initiated by Brian Nosek to reproduce the results of 100 published papers that claimed statistical significance. The idea was to make visible the growing problem of failed reproducibility in the social sciences (it is a common problem, as incentives exist only for publishing new papers, not for checking existing ones). What the project found was that if one were to follow the steps outlined in these papers word for word, only a meager 35 (36.1%) of the publications could be replicated. Think about that the next time the news starts talking about the results of a study.
So why does this happen? It could be due to many reasons, but a common one is p-hacking: the practice of repeating an experiment until you get the golden p-value. Here is usually how it works. You have a huge dataset from which you take a sample to run your experiment on. At the end of the experiment, you find that your results are not statistically significant. So, do you report this? With p-hacking you don’t! Instead, you take another sample and repeat, and you keep doing this until you get statistically significant results. Think about it: how often do you read papers where the authors report their failure-to-success ratio? If you have 9 insignificant results and only 1 significant result, you’ll most likely only hear about that 1.
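Here is roughly what that looks like in code: draw two samples from populations where there is genuinely no difference, run a t-test, and simply keep resampling until a “significant” p-value shows up. The numbers are made up; the point is how little effort it takes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two groups drawn from the SAME distribution: the true effect is exactly zero.
population_a = rng.normal(loc=100, scale=15, size=100_000)
population_b = rng.normal(loc=100, scale=15, size=100_000)

attempts = 0
while True:
    attempts += 1
    sample_a = rng.choice(population_a, size=30, replace=False)
    sample_b = rng.choice(population_b, size=30, replace=False)
    _, p_value = stats.ttest_ind(sample_a, sample_b)
    if p_value < 0.05:          # the "golden" p-value
        break

print(f"'Significant' difference found after {attempts} attempts (p = {p_value:.3f})")
# On average this loop exits after about 20 attempts, even though there is nothing to find.
```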
Don’t p-hack!
Forgetting Regression Toward the Mean
This is a little difficult to explain, so let’s start with a simple way to show the effects of Regression Toward the Mean:
Imagine you have 5 golf players of equal skill and you ask each of them to attempt a short putt 10 times, with the following results being recorded:
Player 1: 9 out of 10
Player 2: 8 out of 10
Player 3: 8 out of 10
Player 4: 4 out of 10
Player 5: 5 out of 10
You would think that Players 1 to 3 are much better at putting than Players 4 and 5. However, if you were to repeat this experiment, what do you think the results would be? Would Players 1 to 3 score just as well, or would Players 4 and 5 outdo them in the next round?
That’s the effect of regression toward the mean. When we get a dataset, it is likely that we are seeing the effect of single occurrences (like round 1 of putting). If we were to repeat the experiment, the players’ performances would move closer to their mean level of performance, and that movement would not be reflected in our model.
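You can see this with a few lines of simulation: give five golfers the exact same true putting skill, record a round of 10 putts, then have them putt again. The “best” and “worst” players from round 1 tend to drift back toward the shared average in round 2. Purely illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

true_skill = 0.7          # every player genuinely sinks 70% of short putts
players, putts = 5, 10

round_1 = rng.binomial(putts, true_skill, size=players)
round_2 = rng.binomial(putts, true_skill, size=players)

order = np.argsort(-round_1)   # rank players by their round-1 score
for rank, p in enumerate(order, start=1):
    print(f"Round-1 rank {rank}: {round_1[p]}/10 putts  ->  round 2: {round_2[p]}/10 putts")
# The round-1 leaders usually slip and the stragglers usually improve,
# because every score is just noise around the same 70% skill level.
```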
An example of how ignoring this effect can be detrimental is a study that claimed a specific blood pressure drug could help students get better scores on the Scholastic Aptitude Test (SAT).
Basically, an experiment was done where students who had done poorly on the SAT the first time were given this drug before retaking the exam. What the study showed was that, on average, students’ test scores increased by 20 points if they took the drug. Now that sounds amazing, but is it?
There could be many other reasons for this increase; for example, students may have been too nervous the first time around, made silly mistakes on the test, and so performed further from their mean. Because these students were selected precisely because of their unusually low first scores, we would expect their retake scores to move back toward their mean regardless of any drug.
When you find that your model’s performance shows some slight variance from actual outcomes, check whether it’s a matter of regression toward the mean before throwing it out.
I will be sharing more about my journey in this area, and if you are interested in how it’s going, feel free to check me out at davidraj.tech! Also, do check out the book by Gary Smith and Jay Cordes titled “The 9 Pitfalls of Data Science”, which inspired this article. It goes into much more detail about how big data tends to mislead Data Scientists.