Data Imputation Techniques
Sakthivel A
Software Engineer | From Ideas to Reality: Creating Web Solutions That Make an Impact
“The idea of imputation is both seductive and dangerous”
(R.J.A. Little & D.B. Rubin)
MISSING DATA
Missing data (or missing values) are defined as data values that are not stored for a variable in the observation of interest. The problem of missing data is relatively common in almost all research and can have a significant effect on the conclusions that can be drawn from the data.
Missing data present various problems. First, the absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false. Second, the lost data can cause bias in the estimation of parameters. Third, it can reduce the representativeness of the samples. Fourth, it may complicate the analysis of the study. Each of these distortions may threaten the validity of the trials and can lead to invalid conclusions.
Types of Missing Data
Rubin first described and divided the types of missing data according to assumptions based on the reasons for the missingness. In general, there are three types of missing data according to the mechanism of missingness:
1. MCAR (Missing completely at random)
2. MAR (Missing at random)
3. MNAR (Missing not at random)
Missing completely at random
Missing completely at random (MCAR) means that the probability that the data are missing is related neither to the specific value that was supposed to be obtained nor to the set of observed responses. MCAR is an ideal but unreasonable assumption for many studies performed in the field of anesthesiology. However, if data are missing by design, because of an equipment failure, or because the samples were lost in transit or were technically unsatisfactory, such data are regarded as MCAR.
The statistical advantage of data that are MCAR is that the analysis remains unbiased. Power may be lost in the design, but the estimated parameters are not biased by the absence of the data.
Missing at random
Missing at random (MAR) is a more realistic assumption for studies performed in the anesthetic field. Data are regarded as MAR when the probability that the responses are missing depends on the set of observed responses, but is not related to the specific missing values that were expected to be obtained.
As we tend to assume that randomness does not produce bias, we may think that MAR is not a problem. However, MAR does not mean that the missing data can be ignored. If a variable that drops out is MAR, the probability that it drops out in a given case is conditionally independent of its current and future values, given the history of the values observed before that case.
Missing not at random
If the characteristics of the data meet neither those of MCAR nor those of MAR, then they fall into the category of missing not at random (MNAR).
The cases of MNAR data are problematic. The only way to obtain an unbiased estimate of the parameters in such a case is to model the missing data. The model may then be incorporated into a more complex one for estimating the missing values.
Handling Missing Data
In real-world data, there are instances where a particular element is absent for various reasons, such as corrupt data, failure to load the information, or incomplete extraction. Handling missing values is one of the greatest challenges faced by analysts, because making the right decision on how to handle them produces robust data models. Let us look at different ways of imputing the missing values.
Aside from this, there are three main problems that missing data causes:
- Bias
- More laborious processing
- Reduced efficiency in outcomes
Imputation simply means replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values. The main approaches are:
1. Deleting Rows
2. Replacing with Mean/Median/Mode
3. Assign a new unique category
4. Predicting the missing value
5. Using algorithms which support Missing values
The examples that follow use the Titanic dataset.
Deleting Rows
This method is commonly used to handle null values. Here, we either delete a particular row if it has a null value for a particular feature, or delete a particular column if more than 70-75% of its values are missing. This method is advised only when there are enough samples in the data set, and one has to make sure that deleting the data does not introduce bias. Removing data leads to a loss of information, which may not give the expected results while predicting the output.
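A minimal sketch of this approach with pandas, assuming the Titanic data is in a local file named titanic.csv (the file name and the 70% threshold are assumptions):

```python
import pandas as pd

# Assumed: the Titanic data lives in a local CSV file
df = pd.read_csv("titanic.csv")

# Drop columns in which more than ~70% of the values are missing
df = df.loc[:, df.isnull().mean() <= 0.70]

# Drop the remaining rows that still contain at least one null value
df = df.dropna(axis=0)

print(df.isnull().sum())  # every count should now be zero
```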
Pros:
- Complete removal of data with missing values can result in a robust and highly accurate model
- Deleting a particular row or column with no specific information is better, since that data does not carry much weight
Cons:
- Loss of information and data
- Works poorly if the percentage of missing values is high (say 30%), compared to the whole dataset
Replacing with Mean/Median/Mode
This strategy can be applied to a feature with numeric data, such as the age of a person or the ticket fare. We can calculate the mean, median or mode of the feature and replace the missing values with it. This is an approximation that can add variance to the data set, but it negates the loss of data, which yields better results than removing rows and columns. Replacing missing values with one of these three approximations is a statistical approach to handling them; because the replacement value is computed from the full data set, this is sometimes described as leaking data into training. Another way is to approximate the missing value from the deviation of neighbouring values, which works better if the data is linear.
To replace missing values with the mean, median or mode, we can use the following:
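A minimal sketch with pandas; 'Age', 'Fare' and 'Embarked' are Titanic columns, and the file name is an assumption:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name

# Mean imputation for a numeric feature
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Median is computed the same way and is more robust to outliers
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Mode (the most frequent value) also works for categorical features
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
```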
Pros:
- This is a better approach when the data set is small
- It prevents the loss of data that results from removing rows and columns
Cons:
- Imputing these approximations adds variance and bias
- Works poorly compared to multiple-imputation methods
Assign a new unique category
A categorical feature has a definite number of possible classes, gender for example. Since the number of classes is finite, we can assign an additional class to the missing values. Here, the features Cabin and Embarked have missing values that can be replaced with a new category, say U for 'unknown'. This strategy adds more information to the dataset, which will change its variance. Since these features are categorical, we need one-hot encoding to convert them to a numeric form the algorithm can understand. Let us look at how it can be done in Python.
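A minimal sketch, again assuming the data is in a DataFrame df; the label 'U' and the file name are assumptions:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed file name

# Replace missing categorical values with a new 'U' (unknown) class
df['Cabin'] = df['Cabin'].fillna('U')
df['Embarked'] = df['Embarked'].fillna('U')

# One-hot encode Embarked so the algorithm can consume it
# (Cabin has too many distinct values to encode this way directly)
df = pd.get_dummies(df, columns=['Embarked'])
```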
Pros:
- Fewer possibilities with one extra category, resulting in low variance after one-hot encoding, since the feature is categorical
- Negates the loss of data by adding a unique category
Cons:
- Adds less variance
- Adds another feature to the model while encoding, which may result in poor performance
Predicting the missing value
Using the features that do not have missing values, we can predict the nulls with the help of a machine learning algorithm. This method may result in better accuracy, unless a missing value is expected to have very high variance. We will use linear regression to replace the nulls in the feature 'Age', using the other available features. One can experiment with different algorithms and check which gives the best accuracy, instead of sticking to a single one.
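A minimal sketch using scikit-learn's LinearRegression; the choice of predictor columns and the file name are assumptions and should be adapted to the data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("titanic.csv")  # assumed file name

# Predictors assumed to be fully observed numeric features
predictors = ['Pclass', 'SibSp', 'Parch', 'Fare']

known = df[df['Age'].notnull()]
unknown = df[df['Age'].isnull()]

# Fit on the rows where Age is present, then predict where it is missing
model = LinearRegression()
model.fit(known[predictors], known['Age'])
df.loc[df['Age'].isnull(), 'Age'] = model.predict(unknown[predictors])
```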
Pros:
- Imputing the missing variable is an improvement as long as the bias from the same is smaller than the omitted variable bias
- Yields unbiased estimates of the model parameters
Cons:
- Bias also arises when an incomplete conditioning set is used for a categorical variable
- Considered only as a proxy for the true values
Using algorithms which support Missing values
KNN is a machine learning algorithm that works on the principle of distance measures. It can be used when there are nulls present in the dataset. When applied, KNN imputes each missing value from the majority (or average) of its K nearest neighbours. In this particular dataset, taking into account a person's age, sex, class, etc., we assume that people with the same values for these features will have a similar fare.
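A minimal sketch using scikit-learn's KNNImputer on the numeric Titanic columns; the column selection, K=5 and the file name are assumptions:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("titanic.csv")  # assumed file name

# KNNImputer works on numeric data, so select the numeric columns first
numeric_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

# Each missing value is filled from the mean of its 5 nearest neighbours
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```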
Another algorithm that can be used here is Random Forest. This model produces robust results because it works well on non-linear and categorical data. It adapts to the data structure, taking high variance or bias into consideration, and produces better results on large datasets.
Pros:
- Does not require creation of a predictive model for each attribute with missing data in the dataset
- Correlation of the data is neglected
Cons:
- It is a very time-consuming process, which can be critical in data mining where large databases are being extracted
- The choice of distance function (Euclidean, Manhattan, etc.) does not always yield a robust result
Conclusion
Almost every dataset we come across will have some missing values that need to be dealt with. Handling them in an intelligent way to build robust models is a challenging task. We have gone through a number of ways in which nulls can be replaced. It is not necessary to handle a particular dataset in one single manner; one can use different methods on different features, depending on how and what the data is about. Having some domain knowledge about the data is important, as it can give you insight into how to approach the problem.