A case of spongy datasets: Missing values
Manu Nellutla
Manager, Digital Integration Architect @ KPMG US | Leading Digital Transformation efforts
It is often said that 80 percent of any work is preparation, and in data science it might be even more. Why? For the simple fact of "garbage in -> garbage out". One of the first steps in any analysis is cleaning the data and dealing with its irregularities.
The most common of these irregularities is missing values, or as I like to call it, "spongy datasets".
Yes, it is all about the pores. We start by understanding the observations, their associations, their correlations, and how significant each data point is to the outcome of the analysis. For example: will the lack of values for that data point introduce bias, or amplify bias that already exists?
Checking for missing values
So, step 1 is to check whether there are missing values in the dataset and in which columns. The simple way of doing this is eyeballing, but when the datasets are large we can check programmatically. I work with Python and R, and listed below are some functions I use:
- Python: the Pandas library has a function called isnull(), and dataframe.isnull().sum() will list each column with its number of missing values (see the sketch after this list)
- R: colSums(is.na(data)) will do the job.
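To make the Python check concrete, here is a minimal sketch, assuming a pandas DataFrame loaded from a hypothetical file data.csv:

```python
# Minimal missing-value check, assuming a hypothetical file "data.csv".
import pandas as pd

df = pd.read_csv("data.csv")

# Count missing values per column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Optionally express the counts as a percentage of all rows,
# which makes it easier to judge how much is missing.
missing_pct = df.isnull().mean() * 100
print(missing_pct.round(2))
```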
Now, how much missing data is acceptable and how much is not is another discussion. I am skipping it for now.
About Missing Values
Disclaimer: As far as I know, there is no SILVER BULLET to this problem. And this is my method to the madness.
So, I started making checklists on how to handle missing values. The more I researched, the more complex it got, and very quickly. Before we go deep, we need to understand 2 main things.
- How and why is the data missing?
- How should we handle them?
Let's start with the "how and why". After some research, I found that (assumptions about) missing data can be divided as follows; a small simulated sketch after the list illustrates all three:
- MCAR - Missing Completely At Random: assume MCAR if the probability of a value being missing is independent of any other variable in the dataset. Example: failure to capture data - a computer failure, the system died, etc.
- MAR - Missing At Random: assume MAR if the probability of a value being missing depends on other observed data, not on the missing value itself. From what I have read, MAR is the most common and realistic assumption. Example: data is not captured because of a respondent's choices ("I don't want to identify my gender or race"), or weights are missing because the scale is on a soft surface.
- MNAR - Missing Not At Random: assume MNAR if neither of the above two cases applies, i.e., the probability of being missing depends on the unobserved value itself. Example (iffy, I made it up): a measuring device that works only intermittently.
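To make these assumptions a little more tangible, here is a small sketch on simulated (entirely made-up) data; the age/weight columns and the missingness probabilities are my own illustration, not from any real study:

```python
# Hypothetical illustration of MCAR, MAR, and MNAR on simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "weight": rng.normal(75, 12, size=n),
})

# MCAR: every weight has the same 10% chance of being missing,
# independent of anything else in the data (e.g., a random glitch).
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "weight"] = np.nan

# MAR: the chance of a missing weight depends on an observed
# variable - here, older respondents skip the scale more often.
mar = df.copy()
p_mar = np.where(df["age"] > 60, 0.30, 0.05)
mar.loc[rng.random(n) < p_mar, "weight"] = np.nan

# MNAR: the chance of a missing weight depends on the unobserved
# weight itself - here, heavier values go missing more often.
mnar = df.copy()
p_mnar = np.where(df["weight"] > 85, 0.40, 0.05)
mnar.loc[rng.random(n) < p_mnar, "weight"] = np.nan

print(mcar["weight"].isnull().mean(),
      mar["weight"].isnull().mean(),
      mnar["weight"].isnull().mean())
```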
Handling Missing Values
Now that we have an assumption about why the values are missing, the next question is how to handle them. Again, broadly in 2 ways:
- Drop the rows: that says it all. We exclude the rows/columns with missing values from our modeling/analysis. When should you? When they are an extremely small percentage of the dataset. When should you not? When they have some kind of dependency, so removing them can introduce or amplify bias. One way to check is to plot a histogram and see if the distribution is skewed.
- Fill the rows (a.k.a. imputation): there are many, many ways to do this. Let's look at a few:
  - Fill with zero or a constant: good for categorical variables, but known to introduce bias.
  - Fill with the mean or median: the most common approach; good for small numeric datasets.
  - Fill by the KNN algorithm: another popular approach, which uses the K nearest neighbors to estimate missing values.
Filling missing values demands its own discussion; the above methods are just for starters, and the sketch below shows what they look like in code.
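Here is a hedged sketch of those options with pandas and scikit-learn's KNNImputer; the small DataFrame and its column names are hypothetical placeholders, not a real dataset:

```python
# Sketch of the handling options above with pandas and scikit-learn.
# The data and column names here are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, 40, 31, 67, np.nan, 52],
    "weight": [70.0, np.nan, 65.5, 80.2, 77.0, np.nan],
    "gender": ["F", "M", np.nan, "F", "M", "F"],
})

# Option 1: drop rows that contain any missing value.
# Reasonable only when such rows are a very small share of the data.
dropped = df.dropna()

# Option 2a: fill a categorical column with a constant.
filled_const = df.fillna({"gender": "unknown"})

# Option 2b: fill a numeric column with its mean or median.
filled_mean = df.fillna({"weight": df["weight"].mean()})
filled_median = df.fillna({"weight": df["weight"].median()})

# Option 2c: KNN imputation - estimate each missing numeric value
# from the k nearest rows in the numeric feature space.
numeric_cols = ["age", "weight"]
imputer = KNNImputer(n_neighbors=2)
imputed = df.copy()
imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print(dropped.shape)
print(imputed)
```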
Note for the future:
There is a push to reduce data cleaning and preparation time with data wrangling, AI, and machine learning. I am including just 2 links I found that are trying to reverse the 80-20 rule of data cleaning vs. task execution.
That's it for now. This article is part of my learning so please provide feedback, observations, corrections, comments, and any tips.