When to Use Mean, Median, and Mode for Handling Missing Values in Data?

Syed Burhan Ahmed

AI Engineer | AI Co-Lead @ Global Geosoft | AI Junior @ UMT | Custom Chatbot Development | Ex Generative AI Instructor @ AKTI | Ex Peer Tutor | Generative AI | Python | NLP | Cypher | Prompt Engineering

Published: February 5, 2025

Introduction

Handling missing values is a critical step in data preprocessing, as incorrect handling can lead to biased models and inaccurate predictions. Three common statistical imputation techniques—Mean, Median, and Mode—are widely used to replace missing values, each suitable for different scenarios.

In this blog, we’ll explore when to use Mean, Median, or Mode for handling missing data, along with their pros and cons.

1. Using Mean for Missing Values

When to Use Mean?

- Data is numerical (continuous) (e.g., salary, height, temperature).
- The data distribution is normal (not skewed).
- You want to preserve the overall mean of the dataset.

Example:

A dataset contains the heights of 10 individuals, but one value is missing:

| Person | Height (cm) |
|---|---|
| 1 | 170 |
| 2 | 165 |
| 3 | 175 |
| 4 | 180 |
| 5 | 168 |
| 6 | 172 |
| 7 | 160 |
| 8 | 178 |
| 9 | 169 |
| 10 | ? |

Mean Height Calculation:

$$\text{Mean} = \frac{170 + 165 + 175 + 180 + 168 + 172 + 160 + 178 + 169}{9} = \frac{1537}{9} \approx 170.78$$

So, we replace the missing value with 170.78 cm.
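The arithmetic is easy to check in a few lines. This is a minimal sketch using only the standard library; the variable names (`heights`, `observed`, `fill_value`) are illustrative, and `None` stands in for the missing entry. The nine observed heights sum to 1537, so the mean comes to about 170.78 cm.

```python
from statistics import mean

# Heights from the table; None marks the missing value for person 10.
heights = [170, 165, 175, 180, 168, 172, 160, 178, 169, None]

# Compute the mean over the observed values only.
observed = [h for h in heights if h is not None]
fill_value = round(mean(observed), 2)

# Replace the missing entry with the mean.
filled = [fill_value if h is None else h for h in heights]

print(sum(observed), fill_value)  # 1537 170.78
print(filled[-1])                 # 170.78
```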

When NOT to Use Mean?

- When the data has outliers, as they distort the mean.
- When the distribution is skewed (e.g., income data).

2. Using Median for Missing Values

When to Use Median?

- Data is numerical but has outliers (e.g., income, house prices).
- The distribution is skewed (not symmetric).
- The dataset contains extreme values that could distort the mean.

Example:

Consider a dataset of monthly incomes of 10 employees, where one value is missing:

| Employee | Income ($) |
|---|---|
| 1 | 2,000 |
| 2 | 2,500 |
| 3 | 3,000 |
| 4 | 3,200 |
| 5 | 3,500 |
| 6 | 4,000 |
| 7 | 50,000 |
| 8 | 3,300 |
| 9 | 3,700 |
| 10 | ? |

Mean Income Calculation:

$$\text{Mean} = \frac{2{,}000 + 2{,}500 + 3{,}000 + 3{,}200 + 3{,}500 + 4{,}000 + 50{,}000 + 3{,}300 + 3{,}700}{9} = \frac{75{,}200}{9} \approx 8{,}355.56$$

The mean is pulled far above every typical income by the single outlier (50,000).

Median Calculation (Sorted Data):

Sorted: 2,000, 2,500, 3,000, 3,200, 3,300, 3,500, 3,700, 4,000, 50,000

Since there are 9 values, the median is the middle one (5th value): Median = 3,300

So, we replace the missing value with the median ($3,300) instead of the mean, which the 50,000 outlier inflates to a misleading value.
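To see how much a single outlier moves the mean, here is a small standard-library sketch on the income data from the table. The names (`incomes`, `mean_fill`, `median_fill`, `typical_mean`) are illustrative, and `None` marks the missing value.

```python
from statistics import mean, median

# Monthly incomes from the table; None marks the missing value for employee 10.
incomes = [2000, 2500, 3000, 3200, 3500, 4000, 50000, 3300, 3700, None]
observed = [x for x in incomes if x is not None]

mean_fill = round(mean(observed), 2)  # dragged upward by the 50,000 entry
median_fill = median(observed)        # middle value of the sorted data

# For comparison: the mean of the eight incomes other than the outlier.
typical_mean = mean(x for x in observed if x != 50000)

print(mean_fill)     # 8355.56
print(median_fill)   # 3300
print(typical_mean)  # 3150
```

The median lands close to the mean of the non-outlier incomes, while the plain mean is more than twice as large as any of them.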

When NOT to Use Median?

- If the dataset is approximately normally distributed: the mean and median will be similar, so the median offers no real advantage over the mean.
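As a quick illustration, the nine observed heights from the mean example are roughly symmetric, so the two statistics nearly coincide. This is a sketch with illustrative variable names, not a formal normality test.

```python
from statistics import mean, median

# The nine observed heights (cm) from the mean example.
heights = [170, 165, 175, 180, 168, 172, 160, 178, 169]

# For roughly symmetric data the mean and median are close together.
gap = round(abs(mean(heights) - median(heights)), 2)

print(median(heights))  # 170
print(gap)              # 0.78
```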

3. Using Mode for Missing Values

When to Use Mode?

- Data is categorical (e.g., gender, city, product type).
- Data has discrete values (e.g., shoe sizes, survey ratings).
- You need to replace missing values with the most frequent value.

Example:

A dataset of preferred payment methods:

| Customer | Payment Method |
|---|---|
| 1 | Credit Card |
| 2 | Debit Card |
| 3 | PayPal |
| 4 | Credit Card |
| 5 | Debit Card |
| 6 | ? |
| 7 | Credit Card |
| 8 | PayPal |

Most Frequent Value (Mode):

Credit Card = 3 times
Debit Card = 2 times
PayPal = 2 times

The most frequent value is Credit Card, so we replace the missing value with Credit Card.
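The counting step can be sketched with `collections.Counter` from the standard library. The variable names are illustrative, and `None` stands in for the missing payment method of customer 6.

```python
from collections import Counter

# Payment methods from the table; None marks the missing value for customer 6.
payments = ['Credit Card', 'Debit Card', 'PayPal', 'Credit Card',
            'Debit Card', None, 'Credit Card', 'PayPal']

# Count only the observed values, then take the most frequent one.
counts = Counter(p for p in payments if p is not None)
mode_value, mode_count = counts.most_common(1)[0]

# Replace the missing entry with the mode.
filled = [mode_value if p is None else p for p in payments]

print(mode_value, mode_count)  # Credit Card 3
print(filled[5])               # Credit Card
```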

When NOT to Use Mode?

- If all values are equally frequent (there is no single most frequent value to choose).
- If the data is continuous numerical, where values rarely repeat and the mode is not informative.

Summary: Choosing the Right Imputation Technique

| Scenario | Best Choice | Why? |
|---|---|---|
| Data is numerical & normally distributed | Mean | Preserves the overall mean of the data. |
| Data has outliers or skewed distribution | Median | Reduces the impact of extreme values. |
| Data is categorical (e.g., colors, cities) | Mode | Keeps the most frequent value. |
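The table can be turned into a rough rule of thumb in code. The sketch below is only one possible encoding: `choose_fill_value` is a hypothetical helper (not part of pandas), and the skewness cut-off of 1.0 is an assumption chosen for illustration, not a value given in this article.

```python
import pandas as pd

def choose_fill_value(series, skew_threshold=1.0):
    """Hypothetical helper: return (strategy, fill_value) for one column."""
    if pd.api.types.is_numeric_dtype(series):
        # skew() ignores NaN; a large absolute value hints at outliers
        # or a long tail, where the median is the safer choice.
        if abs(series.skew()) > skew_threshold:
            return 'median', series.median()
        return 'mean', series.mean()
    # Non-numeric column: use the most frequent value
    # (the first one in sorted order if several are tied).
    return 'mode', series.mode()[0]

# The three example columns from this article, with None as the missing value.
heights = pd.Series([170, 165, 175, 180, 168, 172, 160, 178, 169, None])
incomes = pd.Series([2000, 2500, 3000, 3200, 3500, 4000, 50000, 3300, 3700, None])
payments = pd.Series(['Credit Card', 'Debit Card', 'PayPal', 'Credit Card',
                      'Debit Card', None, 'Credit Card', 'PayPal'])

for name, col in [('heights', heights), ('incomes', incomes), ('payments', payments)]:
    print(name, choose_fill_value(col))
```

On these columns the helper picks the mean for heights, the median for incomes, and the mode for payment methods, matching the table. A fixed threshold is a simplification; inspecting a histogram or box plot is still advisable before settling on a strategy.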

Python Code for Handling Missing Values

Here’s how you can handle missing values using pandas in Python:

import pandas as pd
import numpy as np

# Sample dataset
data = {'Age': [25, 30, 35, np.nan, 45, 50, np.nan],
        'Income': [3000, 3200, 4000, 50000, 3700, 4100, np.nan],
        'Gender': ['Male', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female']}

df = pd.DataFrame(data)

# Filling missing values (assign the result back: chained inplace=True
# calls raise a FutureWarning in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Using Mean
df['Income'] = df['Income'].fillna(df['Income'].median())  # Using Median
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])  # Using Mode

print(df)
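
It can help to inspect the fill values before applying them. The sketch below rebuilds the same sample data and prints each statistic; the names `age_fill`, `income_fill`, and `gender_modes` are illustrative. Note that in this sample 'Male' and 'Female' each appear three times, so `mode()` returns both values in sorted order and `mode()[0]` picks 'Female' only because it sorts first.

```python
import numpy as np
import pandas as pd

# Same sample data as the example above.
df = pd.DataFrame({
    'Age': [25, 30, 35, np.nan, 45, 50, np.nan],
    'Income': [3000, 3200, 4000, 50000, 3700, 4100, np.nan],
    'Gender': ['Male', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female'],
})

age_fill = df['Age'].mean()          # mean of the five observed ages
income_fill = df['Income'].median()  # median of the six observed incomes
gender_modes = df['Gender'].mode()   # all values tied for most frequent

print(age_fill, income_fill)   # 37.0 3850.0
print(gender_modes.tolist())   # ['Female', 'Male']

# Flag the tie so the choice of fill value is a deliberate one.
if len(gender_modes) > 1:
    print("Tie between modes; mode()[0] would pick:", gender_modes[0])
```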

Final Thoughts

Choosing the right method to handle missing values is crucial for building robust machine learning models. Using Mean, Median, or Mode depends on data type, distribution, and presence of outliers.

Understanding when to use each method ensures more accurate results and prevents misleading conclusions.

Which method do you use most often? Let me know in the comments!

Follow me for more insights on AI, ML, and Data Science!
