A case of spongy datasets: Missing values
Manu Nellutla
Manager, Digital Integration Architect @ KPMG US | Leading Digital Transformation efforts
It is often said that 80 percent of any work is preparation, and in data science it might be even more. Why? For the simple fact of "garbage in -> garbage out". One of the first steps in any analysis is cleaning the data and dealing with its irregularities.
The most common of these irregularities is missing values, or as I like to call it, "spongy datasets".
Yes, it is all about the pores. We start by understanding the observations, their associations, their correlations, and how significant each data point is to the outcome of the analysis. For example: will the lack of values for that data point introduce bias, or amplify bias that already exists?
Checking for missing values
So, step 1 is to check whether there are missing values in the dataset and in which columns. The simple way of doing this is eyeballing, but when the datasets are large we can check programmatically. I work with Python and R, and listed below are some functions I use:
- Python: the Pandas library has a function called isnull(), and dataframe.isnull().sum() will list each column with its number of missing values (see the sketch after this list)
- R: colSums(is.na(data)) will do the job.
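To make the Python check concrete, here is a minimal sketch, assuming a pandas DataFrame loaded from a hypothetical file data.csv:

```python
# Minimal missing-value check, assuming a hypothetical file "data.csv".
import pandas as pd

df = pd.read_csv("data.csv")

# Count missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Optionally express the counts as a percentage of all rows,
# which makes it easier to judge how much is missing.
missing_pct = df.isnull().mean() * 100
print(missing_pct.round(2))
```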
Now, how much missing data is acceptable and how much is not is another discussion. I am skipping it for now.
About Missing Values
Disclaimer: As far as I know, there is no SILVER BULLET to this problem. And this is my method to the madness.
So, I started making checklists on how to handle missing values. The more I researched, the more complex it got, and very quickly. Before we go deep, we need to understand 2 main things.
- How and why is the data missing?
- How should we handle them?
Let's start with the "how and why". After some research, I found that (assumptions about) missing data can be divided as follows; a small simulated sketch after the list illustrates all three:
- MCAR - Missing Completely At Random: assume MCAR if the probability of a value being missing is independent of any other variable in the dataset. Example: failure to capture data - a computer failure, the system died, etc.
- MAR - Missing At Random: assume MAR if the probability of a value being missing depends on other observed data, not on the missing value itself. From what I have read, MAR is the most common and realistic assumption. Example: data is not captured because of a respondent's choices ("I don't want to identify my gender or race"), or weights are missing because the scale is on a soft surface.
- MNAR - Missing Not At Random: assume MNAR if neither of the above two cases applies, i.e., the probability of being missing depends on the unobserved value itself. Example (iffy, I made it up): a measuring device that works only intermittently.
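To make these assumptions a little more tangible, here is a small sketch on simulated (entirely made-up) data; the age/weight columns and the missingness probabilities are my own illustration, not from any real study:

```python
# Hypothetical illustration of MCAR, MAR, and MNAR on simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "weight": rng.normal(75, 12, size=n),
})

# MCAR: every weight has the same 10% chance of being missing,
# independent of anything else in the data (e.g., a random glitch).
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "weight"] = np.nan

# MAR: the chance of a missing weight depends on an observed
# variable - here, older respondents skip the scale more often.
mar = df.copy()
p_mar = np.where(df["age"] > 60, 0.30, 0.05)
mar.loc[rng.random(n) < p_mar, "weight"] = np.nan

# MNAR: the chance of a missing weight depends on the unobserved
# weight itself - here, heavier values go missing more often.
mnar = df.copy()
p_mnar = np.where(df["weight"] > 85, 0.40, 0.05)
mnar.loc[rng.random(n) < p_mnar, "weight"] = np.nan

print(mcar["weight"].isnull().mean(),
      mar["weight"].isnull().mean(),
      mnar["weight"].isnull().mean())
```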
Handling Missing Values
Now that we have an assumption about why the values are missing, the next question is how to handle them. Again, broadly in 2 ways:
- Drop the rows: that says it all. We exclude the rows/columns with missing values from our modeling/analysis. When should you? When they are an extremely small percentage of the dataset. When should you not? When they have some kind of dependency, so removing them can introduce or amplify bias. One way to check is to plot a histogram and see if the distribution is skewed.
- Fill the rows (a.k.a. imputation): there are many, many ways to do this. Let's look at a few:
  - Fill with zero or a constant: good for categorical variables, but known to introduce bias.
  - Fill with the mean or median: the most common approach; good for small numeric datasets.
  - Fill by the KNN algorithm: another popular approach, which uses the K nearest neighbors to estimate missing values.
Filling missing values demands its own discussion; the above methods are just for starters, and the sketch below shows what they look like in code.
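Here is a hedged sketch of those options with pandas and scikit-learn's KNNImputer; the small DataFrame and its column names are hypothetical placeholders, not a real dataset:

```python
# Sketch of the handling options above with pandas and scikit-learn.
# The data and column names here are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, 40, 31, 67, np.nan, 52],
    "weight": [70.0, np.nan, 65.5, 80.2, 77.0, np.nan],
    "gender": ["F", "M", np.nan, "F", "M", "F"],
})

# Option 1: drop rows that contain any missing value.
# Reasonable only when such rows are a very small share of the data.
dropped = df.dropna()

# Option 2a: fill a categorical column with a constant.
filled_const = df.fillna({"gender": "unknown"})

# Option 2b: fill a numeric column with its mean or median.
filled_mean = df.fillna({"weight": df["weight"].mean()})
filled_median = df.fillna({"weight": df["weight"].median()})

# Option 2c: KNN imputation - estimate each missing numeric value
# from the k nearest rows in the numeric feature space.
numeric_cols = ["age", "weight"]
imputer = KNNImputer(n_neighbors=2)
imputed = df.copy()
imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print(dropped.shape)
print(imputed)
```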
Note for the future:
There is a push to reduce data cleaning and preparation time with data wrangling, AI, and machine learning. I am including just 2 links I found that are trying to reverse the 80-20 rule of data cleaning vs. task execution.
That's it for now. This article is part of my learning so please provide feedback, observations, corrections, comments, and any tips.