Imputation: a method for dealing with missing values.

For many reasons, datasets may contain missing values, whether as NaNs or expressed in some other form. This is a very common situation, but many machine learning algorithms can only work with complete numerical data, so we cannot simply ignore the missing entries. There are many ways to deal with missing data. The most common is to delete the rows or columns that contain missing values, but this risks discarding valuable data that sits in those same rows or columns. In this article, I will talk about how we can make our missing data valuable with a better method.

The scikit-learn library offers the sklearn.impute module to make missing data more valuable. With the help of its classes, you can fill in the missing values in your dataset by imputing them. Although scikit-learn provides four classes for handling missing values (SimpleImputer, IterativeImputer, KNNImputer, and MissingIndicator), in this article I will only discuss the univariate (SimpleImputer) and multivariate (IterativeImputer) methods.

What Are Univariate and Multivariate Imputation?

Univariate imputation uses only the non-missing values in the i-th feature dimension to impute the missing values in that same dimension. Multivariate imputation, by contrast, uses all available feature dimensions to estimate the missing values. The difference will become clear as we consider these algorithms one by one.

Univariate Imputation

The SimpleImputer class in scikit-learn provides simple strategies for filling in missing values. You can use a constant value, or a statistic (mean, median, etc.) of the column containing the missing values. Let's take a look at the following code fragment.
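The article's original listing used randomly generated NumPy data; the sketch below uses a small hand-written array instead, so the imputed values are easy to verify by hand:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A small sample dataset with missing values marked as np.nan
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each missing entry with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)

print("Original:\n", X)
print("Imputed:\n", X_imputed)
# The column means are (1 + 7) / 2 = 4.0 and (2 + 3) / 2 = 2.5,
# so the two NaNs become 4.0 and 2.5 respectively.
```

Swapping strategy="mean" for "median", "most_frequent", or "constant" (with fill_value=...) changes how the gaps are filled without touching the rest of the code.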

In this code fragment, we used the NumPy library to create sample data with missing values. Then, with the help of the SimpleImputer we imported, we filled the missing values using the "mean" strategy. Now let's look at the output of the code:


We can see both the original and the imputed sample dataset. Really useful! Now let's talk about multivariate imputation.

Multivariate Imputation

In the scikit-learn library, multivariate imputation is provided by the IterativeImputer class. This class is particularly useful when you have multiple features with missing values and you want to impute them using information from the other features. IterativeImputer models each feature with missing values as a function of the other features, and does so iteratively. At each step, one feature column is treated as the output y and the other columns are treated as the input X. A regressor is fit on (X, y) using the known values of y. The regressor is then used to predict the missing values of y, and this is repeated for each feature in turn. Let's take a look at the following code fragment:
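A minimal sketch of this approach, using BayesianRidge as the estimator; the small example array is illustrative, not the article's original data:

```python
import numpy as np
# IterativeImputer is still experimental, so it must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Sample data in which each feature is roughly a linear
# function of the others, so the imputer has signal to use
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [np.nan, 8.0, 9.0],
              [10.0, 11.0, np.nan]])

# Model each feature with missing values as a function of the others,
# refitting and re-predicting for up to max_iter rounds
imputer = IterativeImputer(estimator=BayesianRidge(),
                           max_iter=10,
                           random_state=0)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Note that fit_transform only fills in the missing entries; the observed values are passed through unchanged.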

Since it is still experimental, we first need to enable this estimator explicitly. Then we import the model that will be used for the imputation; in this example I used the BayesianRidge model from scikit-learn's linear_model module. The rest of the code is written as before, and the newly imputed dataset is printed on the screen. Let's see the code output:

As we can see, our imputed dataset no longer has any missing values. It looks very nice and tidy!

Conclusion

In conclusion, we have discussed the SimpleImputer and IterativeImputer classes from the scikit-learn library as ways to deal with the nuisance of missing values when analysing a dataset or training a model on it. I hope this was helpful to all of you!

Resources

https://scikit-learn.org/stable/modules/impute.html

https://medium.com/technofunnel/handling-missing-data-in-python-using-scikit-imputer-7607c8957740

https://medium.com/towards-data-science/imputer-class-in-python-from-scratch-66df6ae067e1

