Top 10 Ways to deal with Missing Values in Python

Python is super powerful for data analysis, but dealing with missing values can be a big challenge. There are several ways to handle them, each with pros and cons, so choosing the right approach for your data and analysis is what matters most.

In this article, I cover the top 10 ways to deal with missing values in Python while you're doing exploratory data analysis. You can follow Babu Chakraborty on LinkedIn or connect with me for a catch-up call!

But first, let's understand the types of missing data you might find while exploring a dataset and how to classify them.

  • Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to both the observed data and the missing values themselves.
  • Missing at Random (MAR): The probability that a value is missing depends only on the observed data, not on the missing values themselves.
  • Missing Not at Random (MNAR): Anything that falls outside the two categories above; here the missingness depends on the unobserved values themselves. MNAR cases are a pain to deal with, and modelling the missing-data mechanism is the only way to get a fair approximation of the parameters.

How do we handle missing data in our dataset?

Let's explore the top 10 ways to deal with missing values in Python.

Don't do anything

Do nothing about the missing data and hand total control to the algorithm, letting it decide how to respond. Different algorithms react to missing data differently: some, XGBoost for instance, learn how best to handle missing values based on training loss reduction.

In other cases, such as linear regression, an error will occur, which means you'll have to deal with the missing data either during the pre-processing phase or when the model fails and you have to figure out what went wrong. This approach is basically trial and error: depending on the reaction, you proceed.
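As a minimal sketch of the do-nothing route (assuming the xgboost package is installed; the toy matrix is made up for illustration), XGBoost can be fit directly on data containing NaNs:

    # Toy example: XGBoost accepts NaNs directly and learns, per split,
    # which branch to send missing values down during training.
    import numpy as np
    import xgboost as xgb

    X = np.array([[1.0, np.nan],
                  [2.0, 3.0],
                  [np.nan, 4.0],
                  [5.0, 6.0]])
    y = np.array([0, 1, 0, 1])

    model = xgb.XGBClassifier(n_estimators=10)
    model.fit(X, y)          # no imputation step needed
    print(model.predict(X))  # NaNs are routed internally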

Drop it if it's not in use

Unless it's a time-series model or we're dealing with date and time objects, we can work around the missing data by dropping it. My rationale is that a smaller data set with accurate values will give better results than a data set padded with imputed values (some may disagree, though!).

Excluding observations with missing data is the next easiest approach. However, you risk losing some critical data points as a result. You can do this with the dropna() function from the Python pandas package, which drops rows containing missing values (or entire columns, with axis=1).

However, rather than eliminating missing values from every column indiscriminately, use your domain knowledge, or seek the help of a domain expert, to selectively remove only the rows/columns whose missing values aren't relevant to the machine learning problem.
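Here's a short sketch of both flavours, using a small made-up DataFrame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 30],
                       "city": ["Pune", "Delhi", None]})

    rows_dropped = df.dropna()                # drop rows with any missing value
    cols_dropped = df.dropna(axis=1)          # drop columns with any missing value
    selective = df.dropna(subset=["age"])     # drop rows only where 'age' is missing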

Imputation by Mean

Using this approach, you compute the mean of a column's non-missing values and then replace that column's missing values with it; each column is treated separately and independently of the others. The most significant disadvantage is that it can only be used with numerical data. Still, it's a simple and fast method that works well with small numerical datasets.

However, there are limitations: feature correlations are ignored, and it only works on a single column at a time. Furthermore, if outlier treatment is skipped, a skewed mean value will almost certainly be substituted, lowering the model's overall quality.
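A minimal sketch, using a hypothetical salary column; pandas' fillna() and scikit-learn's SimpleImputer both work:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"salary": [50_000, np.nan, 70_000, np.nan]})

    # pandas: replace NaNs with the column mean
    df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

    # scikit-learn equivalent, fitted column by column
    imputer = SimpleImputer(strategy="mean")
    df["salary_sk"] = imputer.fit_transform(df[["salary"]]).ravel()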

Imputation by Median

Another imputation technique, which addresses the outlier problem of the previous method, is to use the median. Because the median is the middle value of the column once sorted, it ignores the influence of outliers.
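The code is the same as for the mean, with the strategy swapped; a sketch with a made-up column that contains an outlier:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"income": [40_000, 45_000, np.nan, 1_000_000]})  # note the outlier

    # The median shrugs off the 1,000,000 outlier; the mean would not
    df["income_median"] = df["income"].fillna(df["income"].median())

    imputer = SimpleImputer(strategy="median")
    df["income_sk"] = imputer.fit_transform(df[["income"]]).ravel()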

Imputation by Most frequent values (mode)

This method may be applied to categorical variables with a finite set of values: impute with the most common value. It works when the available alternatives are nominal category values such as True/False or conditions such as normal/abnormal, and especially for ordinal categorical factors such as educational attainment (pre-primary, primary, secondary, high school, graduation, and so on).

Unfortunately, because this method also ignores feature correlations, there is a danger of biasing the data. If the category values aren't balanced, you're more likely to introduce bias into the data (the class imbalance problem).
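A sketch with a hypothetical education column; note that SimpleImputer's "most_frequent" strategy also works on strings:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"education": ["primary", "secondary", np.nan, "secondary"]})

    # pandas: mode() returns a Series, so take its first entry
    df["education_mode"] = df["education"].fillna(df["education"].mode().iloc[0])

    # scikit-learn equivalent
    imputer = SimpleImputer(strategy="most_frequent")
    df["education_sk"] = imputer.fit_transform(df[["education"]]).ravel()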

Imputation for Categorical values

When categorical columns have missing values, the most frequent category may be used to fill in the gaps. Alternatively, a new category can be created to replace the missing values, which preserves the fact that a value was absent.
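A minimal sketch of the new-category idea, using a made-up "Missing" label:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"payment_type": ["card", np.nan, "cash", np.nan]})

    # An explicit new category keeps "value was absent" as a signal
    # the model can learn from, instead of hiding it behind a guess.
    df["payment_type"] = df["payment_type"].fillna("Missing")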

Last observation carried forward (LOCF)

It is a standard statistical approach for analyzing longitudinal repeated-measures data when some follow-up observations are missing: each gap is filled with the last value observed before it.
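In pandas, LOCF is simply a forward fill; a sketch on a hypothetical daily series:

    import numpy as np
    import pandas as pd

    s = pd.Series([10.0, np.nan, np.nan, 12.0],
                  index=pd.date_range("2024-01-01", periods=4))

    # Carry the last observed value forward into each gap
    s_locf = s.ffill()  # leading NaNs (with nothing before them) stay NaN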

Linear Interpolation

It's a method of approximating a missing value by joining the known points on either side of it with a straight line: the unknown value is assumed to lie on the line connecting its neighboring observations.

Because linear interpolation is the default method of pandas' interpolate(), we don't have to specify it explicitly. It is almost always used on time-series datasets.
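A sketch using pandas' interpolate(), whose default method is "linear":

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, np.nan, 4.0])

    # method="linear" is the default, so s.interpolate() is equivalent
    s_interp = s.interpolate(method="linear")
    print(s_interp.tolist())  # [1.0, 2.0, 3.0, 4.0]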

Imputation by K-NN

The k-nearest-neighbors (k-NN) algorithm is a fundamental classification approach; the outcome of k-NN classification is a class membership.

An item's classification is determined by how closely it resembles the points in the training set: the object goes to the class with the most members among its k closest neighbors. If k = 1, the item is simply assigned to the class of its single nearest neighbor.

For imputation, finding the k nearest neighbors of the observation with missing data, and then filling the gaps based on the non-missing values in that neighborhood, can generate good estimates of the missing values.
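scikit-learn packages this idea as KNNImputer; a minimal sketch with k = 2 on a made-up numeric matrix:

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [np.nan, 6.0],
                  [8.0, 8.0]])

    # Each missing entry becomes the mean of that feature across
    # the 2 nearest neighbors (by distance on the observed features).
    imputer = KNNImputer(n_neighbors=2)
    X_filled = imputer.fit_transform(X)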

Multivariate Imputation by Chained Equations (MICE)

MICE is a method for replacing missing values via multiple imputation. You start by making several copies of the data set with missing values in one or more variables; in each copy, every variable with missing data is modelled from the other variables in turn (the chained equations), and the cycle repeats until the imputed values stabilize.
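scikit-learn's IterativeImputer is an implementation in the spirit of MICE (it is still flagged experimental, so it needs an explicit enabling import); a minimal sketch on a made-up matrix:

    import numpy as np
    # IterativeImputer is experimental, so it must be enabled explicitly
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0, np.nan],
                  [3.0, np.nan, 6.0],
                  [np.nan, 8.0, 9.0],
                  [10.0, 11.0, 12.0]])

    # Each feature with missing values is regressed on the others,
    # cycling through the features for up to max_iter rounds.
    imputer = IterativeImputer(max_iter=10, random_state=0)
    X_filled = imputer.fit_transform(X)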

Final Thoughts

Exploratory data analysis is exciting, but it can get daunting when dealing with voluminous data. Indeed, it's hard to guess missing values, but with a logical approach, we can design a robust model. Again, though, there's no rule of thumb for it!
