Top 10 Ways to deal with Missing Values in Python

Python is super powerful for data analysis, but dealing with missing values can be a big challenge. There are several ways to handle them, each with pros and cons, so choosing the right approach for your data and analysis is what matters most.

In this article, I cover the top 10 ways to deal with missing values in Python while you're doing exploratory data analysis. You can follow Babu Chakraborty on LinkedIn or connect with me for a catch-up call!

But first, let's understand the types of missing data you might find while exploring a dataset and how to classify them.

  • Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to both the observed data and the missing values themselves.
  • Missing at Random (MAR): The probability that a value is missing depends only on the observed data, not on the missing values themselves.
  • Missing Not at Random (MNAR): Anything that falls outside the two categories above; here the missingness depends on the unobserved values themselves. MNAR cases are a pain to deal with, and modelling the missing-data mechanism is the only way to get a fair approximation of the parameters.

How do we handle missing data in our dataset?

Let's explore the top 10 ways to deal with missing values in Python.

Don't do anything

Do nothing about the missing data and hand total control to the algorithm, letting it decide how to respond. Different algorithms react to missing data differently: some, XGBoost for instance, learn how best to handle missing values based on training loss reduction.

In other cases, such as linear regression, an error will occur, which means you'll have to deal with the missing data either during the pre-processing phase or when the model fails and you have to figure out what went wrong. This approach is basically trial and error: depending on the reaction, you proceed.
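As a minimal sketch of the do-nothing route (assuming the xgboost package is installed; the toy matrix is made up for illustration), XGBoost can be fit directly on data containing NaNs:

    # Toy example: XGBoost accepts NaNs directly and learns, per split,
    # which branch to send missing values down during training.
    import numpy as np
    import xgboost as xgb

    X = np.array([[1.0, np.nan],
                  [2.0, 3.0],
                  [np.nan, 4.0],
                  [5.0, 6.0]])
    y = np.array([0, 1, 0, 1])

    model = xgb.XGBClassifier(n_estimators=10)
    model.fit(X, y)          # no imputation step needed
    print(model.predict(X))  # NaNs are routed internally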

Drop it if it's not in use

Unless it's a time-series model or we're dealing with date and time objects, we can work around the missing data by dropping it. My rationale is that a smaller data set with accurate values will give better results than a data set padded with imputed values (some may disagree, though!).

Excluding observations with missing data is the next easiest approach. However, you risk losing some critical data points as a result. You can do this with the dropna() function from the Python pandas package, which drops rows containing missing values (or entire columns, with axis=1).

However, rather than eliminating missing values from every column indiscriminately, use your domain knowledge, or seek the help of a domain expert, to selectively remove only the rows/columns whose missing values aren't relevant to the machine learning problem.
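Here's a short sketch of both flavours, using a small made-up DataFrame:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 30],
                       "city": ["Pune", "Delhi", None]})

    rows_dropped = df.dropna()                # drop rows with any missing value
    cols_dropped = df.dropna(axis=1)          # drop columns with any missing value
    selective = df.dropna(subset=["age"])     # drop rows only where 'age' is missing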

Imputation by Mean

Using this approach, you compute the mean of a column's non-missing values and then replace that column's missing values with it; each column is treated separately and independently of the others. The most significant disadvantage is that it can only be used with numerical data. Still, it's a simple and fast method that works well with small numerical datasets.

However, there are limitations: feature correlations are ignored, and it only works on a single column at a time. Furthermore, if outlier treatment is skipped, a skewed mean value will almost certainly be substituted, lowering the model's overall quality.
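A minimal sketch, using a hypothetical salary column; pandas' fillna() and scikit-learn's SimpleImputer both work:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"salary": [50_000, np.nan, 70_000, np.nan]})

    # pandas: replace NaNs with the column mean
    df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

    # scikit-learn equivalent, fitted column by column
    imputer = SimpleImputer(strategy="mean")
    df["salary_sk"] = imputer.fit_transform(df[["salary"]]).ravel()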

Imputation by Median

Another imputation technique, which addresses the outlier problem of the previous method, is to use the median. Because the median is the middle value of the column once sorted, it ignores the influence of outliers.
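The code is the same as for the mean, with the strategy swapped; a sketch with a made-up column that contains an outlier:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"income": [40_000, 45_000, np.nan, 1_000_000]})  # note the outlier

    # The median shrugs off the 1,000,000 outlier; the mean would not
    df["income_median"] = df["income"].fillna(df["income"].median())

    imputer = SimpleImputer(strategy="median")
    df["income_sk"] = imputer.fit_transform(df[["income"]]).ravel()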

Imputation by Most frequent values (mode)

This method may be applied to categorical variables with a finite set of values: impute with the most common value. It works when the available alternatives are nominal category values such as True/False or conditions such as normal/abnormal, and especially for ordinal categorical factors such as educational attainment (pre-primary, primary, secondary, high school, graduation, and so on).

Unfortunately, because this method also ignores feature correlations, there is a danger of biasing the data. If the category values aren't balanced, you're more likely to introduce bias into the data (the class imbalance problem).
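A sketch with a hypothetical education column; note that SimpleImputer's "most_frequent" strategy also works on strings:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"education": ["primary", "secondary", np.nan, "secondary"]})

    # pandas: mode() returns a Series, so take its first entry
    df["education_mode"] = df["education"].fillna(df["education"].mode().iloc[0])

    # scikit-learn equivalent
    imputer = SimpleImputer(strategy="most_frequent")
    df["education_sk"] = imputer.fit_transform(df[["education"]]).ravel()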

Imputation for Categorical values

When categorical columns have missing values, the most frequent category may be used to fill in the gaps. Alternatively, a new category can be created to replace the missing values, which preserves the fact that a value was absent.
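A minimal sketch of the new-category idea, using a made-up "Missing" label:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"payment_type": ["card", np.nan, "cash", np.nan]})

    # An explicit new category keeps "value was absent" as a signal
    # the model can learn from, instead of hiding it behind a guess.
    df["payment_type"] = df["payment_type"].fillna("Missing")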

Last observation carried forward (LOCF)

It is a standard statistical approach for analyzing longitudinal repeated-measures data when some follow-up observations are missing: each gap is filled with the last value observed before it.
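In pandas, LOCF is simply a forward fill; a sketch on a hypothetical daily series:

    import numpy as np
    import pandas as pd

    s = pd.Series([10.0, np.nan, np.nan, 12.0],
                  index=pd.date_range("2024-01-01", periods=4))

    # Carry the last observed value forward into each gap
    s_locf = s.ffill()  # leading NaNs (with nothing before them) stay NaN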

Linear Interpolation

It's a method of approximating a missing value by joining the known points on either side of it with a straight line: the unknown value is assumed to lie on the line connecting its neighboring observations.

Because linear interpolation is the default method of pandas' interpolate(), we don't have to specify it explicitly. It is almost always used on time-series datasets.
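A sketch using pandas' interpolate(), whose default method is "linear":

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, np.nan, 4.0])

    # method="linear" is the default, so s.interpolate() is equivalent
    s_interp = s.interpolate(method="linear")
    print(s_interp.tolist())  # [1.0, 2.0, 3.0, 4.0]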

Imputation by K-NN

The k-nearest-neighbors (k-NN) algorithm is a fundamental classification approach; the outcome of k-NN classification is a class membership.

An item's classification is determined by how closely it resembles the points in the training set: the object goes to the class with the most members among its k closest neighbors. If k = 1, the item is simply assigned to the class of its single nearest neighbor.

For imputation, finding the k nearest neighbors of the observation with missing data, and then filling the gaps based on the non-missing values in that neighborhood, can generate good estimates of the missing values.
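scikit-learn packages this idea as KNNImputer; a minimal sketch with k = 2 on a made-up numeric matrix:

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [np.nan, 6.0],
                  [8.0, 8.0]])

    # Each missing entry becomes the mean of that feature across
    # the 2 nearest neighbors (by distance on the observed features).
    imputer = KNNImputer(n_neighbors=2)
    X_filled = imputer.fit_transform(X)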

Multivariate Imputation by Chained Equations (MICE)

MICE is a method for replacing missing values via multiple imputation. You start by making several copies of the data set with missing values in one or more variables; in each copy, every variable with missing data is modelled from the other variables in turn (the chained equations), and the cycle repeats until the imputed values stabilize.
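scikit-learn's IterativeImputer is an implementation in the spirit of MICE (it is still flagged experimental, so it needs an explicit enabling import); a minimal sketch on a made-up matrix:

    import numpy as np
    # IterativeImputer is experimental, so it must be enabled explicitly
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0, np.nan],
                  [3.0, np.nan, 6.0],
                  [np.nan, 8.0, 9.0],
                  [10.0, 11.0, 12.0]])

    # Each feature with missing values is regressed on the others,
    # cycling through the features for up to max_iter rounds.
    imputer = IterativeImputer(max_iter=10, random_state=0)
    X_filled = imputer.fit_transform(X)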

Final Thoughts

Exploratory data analysis is exciting, but it can get daunting when dealing with voluminous data. Indeed, it's hard to guess missing values, but with a logical approach, we can design a robust model. Again, though, there's no rule of thumb for it!
