DATA IMPUTATION
Imputation is the process of replacing missing data with substituted values. When substituting for an entire data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Missing data causes three main problems: it can introduce a substantial amount of bias, make the handling and analysis of the data more arduous, and reduce efficiency. Because of these problems, imputation is seen as a way to avoid the pitfalls of listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding the entire case, which may introduce bias or affect the representativeness of the results. In datasets, missing values may be represented as '?', 'nan', 'N/A', a blank cell, or sometimes sentinel values such as -999, 'inf', or '-inf'.
Missing data mechanisms:-
The study of missing data was formalized by Donald Rubin with the concept of the missing-data mechanism, in which missing-data indicators are random variables assigned a distribution. The missing-data mechanism describes the underlying process that generates missing data and can be categorized into three types:
1- Missing completely at random (MCAR)
2- Missing at random (MAR)
3- Missing not at random (MNAR).
MCAR means that the occurrence of missing values is completely at random and not related to any variable. MAR implies that the missingness relates only to the observed data, and MNAR refers to the case where the missing values are related to both observed and unobserved variables, so the missing-data mechanism cannot be ignored.
It is important to consider the missing data mechanism when deciding how to deal with missing data. If the mechanism is MCAR, some simple methods may yield unbiased estimates, but when the mechanism is MNAR, no method is likely to uncover the truth unless additional information is available.
SIMPLE DATA IMPUTATION
SIMPLE DATA IMPUTATION can be defined as averages or extractions from a predictive distribution of missing values, and requires a method of creating a predictive distribution for imputation based on the observed data.
-Little and Rubin [2019]
There are two generic approaches for generating this distribution: explicit modeling and implicit modeling.
EXPLICIT MODELING:- In explicit modeling, the predictive distribution is based on a formal statistical model, for example, multivariate normal, therefore the assumptions are explicit. Examples of explicit modeling are average imputation, regression imputation, stochastic regression imputation.
IMPLICIT MODELING:- In implicit modeling, the focus is on an algorithm, which implies an underlying model. Assumptions are implied, but they still need to be carefully evaluated to ensure they are reasonable. These are examples of implicit modeling: Hot Deck imputation, imputation by replacement, and Cold Deck imputation.
MEAN/MEDIAN IMPUTATION-The simplest methods impute missing values by filling in a constant or a basic statistic of the variable, such as its mean, median, or mode. These methods are quick to apply, but because every missing value receives the same replacement, they reduce the variable's variance.
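As a quick illustration, mean and median filling take a couple of lines with NumPy (the array values here are invented for the example):

```python
import numpy as np

x = np.array([1.0, np.nan, 2.0, 9.0])

# replace each NaN with the mean (or median) of the observed values
mean_filled = np.where(np.isnan(x), np.nanmean(x), x)      # mean of 1, 2, 9 is 4.0
median_filled = np.where(np.isnan(x), np.nanmedian(x), x)  # median of 1, 2, 9 is 2.0
```

Note that the two fills diverge when the data are skewed, which is why the median is often preferred in that case.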
MODE IMPUTATION-
Mode imputation replaces missing values of a categorical variable with the mode of the non-missing cases of that variable. It is easy to apply, but using it the wrong way can degrade the quality of your data.
IMPUTATION USING (MOST FREQUENT) OR (ZERO/CONSTANT) VALUES:
Most Frequent is another statistical strategy to impute missing values. It works with categorical features (strings or numerical representations) by replacing missing data with the most frequent value within each column.
Zero or Constant imputation, as the name suggests, replaces the missing values with zero or any constant value you specify. For a categorical feature, the constant effectively becomes its own category.
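A minimal stdlib sketch of both strategies for one categorical column (the color data is invented for the example; scikit-learn's SimpleImputer offers the same strategies for whole tables):

```python
from collections import Counter

colors = ["red", "blue", None, "red", None, "green"]

# most-frequent: fill with the mode of the non-missing values
mode = Counter(c for c in colors if c is not None).most_common(1)[0][0]
most_frequent_filled = [mode if c is None else c for c in colors]

# constant: fill with a fixed placeholder, which acts as its own category
constant_filled = ["missing" if c is None else c for c in colors]
```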
REGRESSION IMPUTATION-
Regression imputation has the opposite problem of mean imputation. A regression model is estimated to predict observed values of a variable from other variables, and that model is then used to impute values in cases where the variable is missing. The problem is that the imputed data do not include an error term, so the imputed values fall perfectly on the regression line without any residual variance. This causes the strength of relationships to be overestimated and suggests greater precision in the imputed values than is warranted: because the imputed values lie exactly on a regression line with a nonzero slope, regression imputation implies a correlation of 1 between the predictors and the imputed outcome values. In contrast to mean substitution, regression imputation overestimates correlations, while the variances and covariances are underestimated. One way to improve on regression imputation is stochastic regression imputation, where a random error is added to the predicted value from the regression.
Stochastic Regression Imputation:
Stochastic regression imputation aims to reduce this bias by an extra step of augmenting each predicted score with a residual term. This residual term is normally distributed with a mean of zero and a variance equal to the residual variance from the regression of the outcome on the predictors.
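The contrast between deterministic and stochastic regression imputation can be sketched with NumPy; the synthetic data below (y roughly 2x + 1 plus noise) is made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=10)  # synthetic outcome
y[[3, 7]] = np.nan                                  # knock out two values

obs = ~np.isnan(y)
slope, intercept = np.polyfit(x[obs], y[obs], 1)    # fit y ~ x on observed cases
pred = intercept + slope * x
resid_sd = np.std(y[obs] - pred[obs], ddof=2)       # residual standard deviation

# deterministic regression imputation: imputed points sit exactly on the line
y_reg = np.where(np.isnan(y), pred, y)
# stochastic version: add a draw from N(0, resid_sd^2) to restore residual variance
y_stoch = np.where(np.isnan(y), pred + rng.normal(0.0, resid_sd, size=y.size), y)
```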
Hot-Deck Imputation:-
In this method, missing values are replaced with observed values drawn from the same column, in the simplest form at random; more refined variants draw the replacement from a "donor" case that is similar to the incomplete case on other variables. While this has the advantage of being simple, be extra careful if you're trying to examine the nature of the features and how they relate to each other, since multivariable relationships can be distorted.
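A random hot-deck draw is a one-liner per column; the ages below are invented for the example:

```python
import random

random.seed(1)  # for reproducibility

ages = [25, None, 31, None, 40, 28]
donors = [a for a in ages if a is not None]  # observed values act as donors
filled = [random.choice(donors) if a is None else a for a in ages]
```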
Substitution Imputation:
This technique is most convenient in a survey context and consists of replacing nonresponding units with alternative units not observed in the current sample.
Cold deck Imputation:
This technique consists of replacing the missing value with a constant from an external source, such as a value from a previous realization of the same survey. It is similar to substitution, but in cold deck imputation a single constant value is used, whereas substitution can use different values to replace the missing values.
K-nearest neighbor (KNN) imputation
Besides model-based imputation such as regression imputation, neighbor-based imputation can also be used. K-nearest neighbor (KNN) imputation is an example of neighbor-based imputation. For a discrete variable, the KNN imputer uses the most frequent value among the k nearest neighbors; for a continuous variable, it uses their mean (or median).
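scikit-learn ships this as KNNImputer; the idea can also be sketched directly in NumPy for a single continuous column (the small matrix below is invented for the example, and the sketch assumes the other columns are fully observed):

```python
import numpy as np

def knn_impute_column(X, col, k=2):
    """Fill NaNs in one column with the mean of that column over the
    k nearest complete rows, measuring distance on the other columns."""
    X = X.copy()
    other = np.delete(np.arange(X.shape[1]), col)
    missing = np.isnan(X[:, col])
    complete = ~missing
    for i in np.where(missing)[0]:
        d = np.linalg.norm(X[np.ix_(complete, other)] - X[i, other], axis=1)
        nearest = np.where(complete)[0][np.argsort(d)[:k]]
        X[i, col] = X[nearest, col].mean()
    return X

X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, np.nan], [10.0, 30.0]])
X_imp = knn_impute_column(X, col=1, k=2)  # neighbors of row 2 are rows 1 and 0
```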
Multiple Data Imputation
Multiple imputation is a general approach to the problem of missing data that is available in several commonly used statistical packages. It aims to allow for the uncertainty about the missing data by creating several different plausible imputed data sets and appropriately combining the results obtained from each of them.
The figure illustrates these concepts; the steps in the multiple imputation process are as follows:
1- For each attribute that has a missing value in a data set record, a set of n values to be imputed is generated;
2- A statistical analysis is performed on each of the n completed data sets, each one generated using one of the n replacement values from the previous step;
3- The results of the analyses performed are combined to produce a set of results.
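The three steps above can be sketched with NumPy; here the "analysis" is simply estimating the mean, and the draws come from a normal distribution fitted to the observed values (both choices are simplifications for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([3.0, np.nan, 5.0, 7.0, np.nan, 4.0])
obs = y[~np.isnan(y)]
n = 5  # number of imputed data sets

means = []
for _ in range(n):
    # step 1: draw each missing value from a distribution fitted to the observed data
    draws = rng.normal(obs.mean(), obs.std(ddof=1), size=int(np.isnan(y).sum()))
    completed = y.copy()
    completed[np.isnan(y)] = draws
    # step 2: run the analysis on the completed data set (here: the mean)
    means.append(completed.mean())

# step 3: combine the n results into a single pooled estimate
pooled = float(np.mean(means))
```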
Imputation Using Multivariate Imputation by Chained Equation (MICE):-
Multivariate imputation by chained equations (MICE) has emerged as a principled method of dealing with missing data. Despite properties that make MICE particularly useful for large imputation procedures and advances in software development that now make it accessible to many researchers, many psychiatric researchers have not been trained in these methods and few practical resources exist to guide researchers in the implementation of this technique.
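A toy chained-equations loop for numeric data can be written in a few lines of NumPy; this is only a sketch of the idea (real MICE software such as the R mice package or scikit-learn's IterativeImputer adds proper residual draws and per-column model choices):

```python
import numpy as np

def mice_numeric(X, n_iter=10):
    """Tiny chained-equations sketch: start from mean imputation, then
    repeatedly re-impute each incomplete column from a linear regression
    on the other columns (fit on the originally observed rows only)."""
    X = X.copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[mask[:, j], j] = col_means[j]  # initial fill
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not mask[:, j].any():
                continue
            other = np.delete(np.arange(X.shape[1]), j)
            A = np.c_[np.ones(len(X)), X[:, other]]  # design matrix with intercept
            train = ~mask[:, j]
            coef, *_ = np.linalg.lstsq(A[train], X[train, j], rcond=None)
            X[mask[:, j], j] = A[mask[:, j]] @ coef
    return X

# invented example: the second column is exactly twice the first,
# so the missing entry converges to 6.0
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
X_imp = mice_numeric(X)
```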
Imputation Using Deep Learning (DataWig):-
DataWig is a library that learns machine learning models using deep neural networks to impute missing values in a data frame. It works very well with categorical and non-numerical features and supports both CPU and GPU for training.
There are several more imputation techniques, such as Composite Data Imputation and Cascading Data Imputation.
Conclusion:
Missing data is a common problem in practical data analysis. Imputation simply means that we replace the missing values with guessed/estimated ones, typically averages or extractions from a predictive distribution built from the observed data. These methods are generally reasonable to use when the missing data mechanism is MCAR or MAR.
However, when deciding how to impute missing values in practice, it is important to consider:
· the context of the data
· amount of missing data
· missing data mechanism
For instance, if all values below/above a threshold of a variable are missing (an example of MNAR), none of these methods will impute values similar to the truth. Almost every dataset we come across will have some missing values that need to be dealt with, and handling them intelligently to produce robust models is a challenging task. One can use various methods on different features depending on how and what the data is about.