A case of spongy datasets: Missing values
Manu Nellutla's photo edit.

It is often said that 80 percent of any work is preparation, and in data science it might be even a bit more. Why? For the simple reason of "Garbage in -> Garbage out". One of the first steps in any analysis is cleaning the data and fixing any irregularities.

The most common irregularity is missing values, or what I like to call "spongy datasets".

Yes, it is all about the pores. We start by understanding the observations, their associations, their correlations, and how significant each data point is to the outcome of the analysis. For example: will the lack of values for that data point introduce bias, or amplify a bias that already exists?

Checking for Missing Values

So, step 1 is to check whether there are missing values in the dataset, and in which columns. The simple way of doing this is eyeballing, but when datasets are large we can check programmatically. I work with Python and R, and listed below are some functions I use:

  • Python: the Pandas library has a method called isnull(), and dataframe.isnull().sum() will print every column along with its number of missing values (a short example follows this list).
  • R: colSums(is.na(data)) will do the job.
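
For example, here is a minimal Python sketch (the DataFrame and column names are made up purely for illustration) that counts missing values per column and also shows them as a percentage of rows:

```python
import numpy as np
import pandas as pd

# A small illustrative DataFrame with some missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "weight": [68.0, 72.5, np.nan, 80.1, 65.3],
    "gender": ["F", "M", None, "F", "M"],
})

# Number of missing values per column
print(df.isnull().sum())

# The same information as a percentage of rows, which is often
# easier to reason about when deciding how to handle the gaps
print(df.isnull().mean() * 100)
```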
Now, how much data can be missing, and how much is acceptable versus not, is another discussion. I am skipping it for now.

About Missing Values

Disclaimer: As far as I know, there is no SILVER BULLET for this problem. What follows is simply my method to the madness.

So, I started making checklists on how to handle missing values, and the more I researched, the more complex it got, and very quickly. Before we go deep, we need to understand two main things:

  • How and why is the data missing?
  • How should we handle them?

Let's start with the "how and why". After some research I found that (assumptions about) missing data can be divided as follows (a small simulation sketch follows the list):

  • MCAR - Missing Completely At Random: assume MCAR if the probability of a value being missing is independent of every other variable in the dataset. Example: failure to capture data - a computer failure, the system died, etc.
  • MAR - Missing At Random: assume MAR if the probability of a value being missing depends on other observed data. From what I have read, MAR is the most common and realistic assumption. Example: data is not captured because of some option (e.g., "I don't want to identify my gender or race"), or weights are missing because the scale was standing on a soft surface.
  • MNAR - Missing Not At Random: assume MNAR if neither of the above two cases applies, i.e., the probability of a value being missing depends on the unobserved value itself. Example (iffy, I made it up): a measuring device that works only intermittently.
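
To make these three assumptions a bit more concrete, here is a small, hypothetical simulation in Python (the columns, probabilities, and thresholds are invented purely for illustration): under MCAR every value has the same chance of going missing, under MAR the chance depends on another observed column, and under MNAR it depends on the unobserved value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "age":    rng.integers(18, 80, n),
    "weight": rng.normal(75, 12, n),
})

# MCAR: every weight has the same 10% chance of going missing,
# regardless of anything else in the data.
mcar = df["weight"].mask(rng.random(n) < 0.10)

# MAR: the chance of a missing weight depends on another OBSERVED
# column (here, older respondents skip the scale more often).
p_mar = np.where(df["age"] > 60, 0.30, 0.05)
mar = df["weight"].mask(rng.random(n) < p_mar)

# MNAR: the chance of a missing weight depends on the weight itself
# (heavier values are withheld more often), which cannot be seen
# from the observed data alone.
p_mnar = np.where(df["weight"] > 90, 0.40, 0.05)
mnar = df["weight"].mask(rng.random(n) < p_mnar)

# Fraction of missing values produced by each mechanism
print(mcar.isnull().mean(), mar.isnull().mean(), mnar.isnull().mean())
```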

Handling Missing Values

Now that we have an assumption about why the values are missing, the next question is how to handle them. Again, there are broadly two ways.

  • Drop the rows: that says it all. We exclude the rows/columns with missing values from our modeling/analysis. When should you? When they are an extremely small percentage of the dataset. When should you not? When they have some kind of dependency and removing them can introduce or amplify bias. One way to check is to plot a histogram and see if the distribution is skewed.
  • Fill the values (a.k.a. imputation): there are many, many ways to do this. Let's look at a few (a short sketch follows this list).
      • Fill with zero or a constant - this method is good for categorical variables, but it is known to introduce bias.
      • Fill with the mean or median - the most common way to fill values. This method works well for small numeric datasets.
      • Fill with the KNN algorithm - another popular approach, which uses the k nearest neighbors to predict missing values.
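
Here is a rough Python sketch of both options, dropping and imputing (the DataFrame and column names are invented for illustration, and the KNN part assumes scikit-learn's KNNImputer is available):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, 52, np.nan],
    "income": [40000, 52000, np.nan, 61000, np.nan, 58000],
    "city":   ["NY", "SF", None, "NY", "SF", "LA"],
})

# Option 1: drop every row that contains a missing value
dropped = df.dropna()

# Option 2a: fill a categorical column with a constant
const_filled = df.assign(city=df["city"].fillna("Unknown"))

# Option 2b: fill a numeric column with its mean or median
median_filled = df.assign(age=df["age"].fillna(df["age"].median()))

# Option 2c: KNN imputation - estimate each missing numeric value
# from the k most similar rows (numeric columns only)
knn_filled = df.copy()
knn_filled[["age", "income"]] = KNNImputer(n_neighbors=2).fit_transform(
    df[["age", "income"]]
)

print(dropped)
print(knn_filled)
```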

Filling missing values demands a discussion of its own; the methods above are just for starters.

Note for the future:

There is a push to reduce data cleaning and preparation time with data wrangling tools, AI, and machine learning. I am including just two links I found that are trying to reverse the 80-20 rule of data cleaning versus task execution.

That's it for now. This article is part of my learning, so please provide feedback, observations, corrections, comments, and any tips.

