When to Use Mean, Median, and Mode for Handling Missing Values in Data?

Introduction

Handling missing values is a critical step in data preprocessing, as incorrect handling can lead to biased models and inaccurate predictions. Three common statistical imputation techniques—Mean, Median, and Mode—are widely used to replace missing values, each suitable for different scenarios.

In this blog, we’ll explore when to use Mean, Median, or Mode for handling missing data, along with their pros and cons.


1. Using Mean for Missing Values

When to Use Mean?

  • Data is numerical (continuous) (e.g., salary, height, temperature).
  • The data distribution is normal (not skewed).
  • You want to preserve the overall mean of the dataset.

Example:

A dataset contains the heights of 10 individuals, but one value is missing:

Person   Height (cm)
1        170
2        165
3        175
4        180
5        168
6        172
7        160
8        178
9        169
10       ?

Mean Height Calculation:

\[ \text{Mean} = \frac{170 + 165 + 175 + 180 + 168 + 172 + 160 + 178 + 169}{9} = \frac{1537}{9} \approx 170.78 \]

So, we replace the missing value with 170.78 cm.
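The same calculation can be reproduced in pandas. This is a minimal sketch that rebuilds the height table as a Series; the variable names and the `height_cm` label are illustrative choices, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Heights from the example table; person 10 is missing (NaN).
heights = pd.Series(
    [170, 165, 175, 180, 168, 172, 160, 178, 169, np.nan],
    index=range(1, 11),
    name="height_cm",
)

# Series.mean() skips NaN by default, so only the 9 known values are averaged.
mean_height = heights.mean()
filled_heights = heights.fillna(mean_height)

print(round(mean_height, 2))             # 170.78
print(round(filled_heights.loc[10], 2))  # 170.78
```

Because the fill value equals the mean of the known values, the mean of the filled column is unchanged.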

When NOT to Use Mean?

  • When the data has outliers, as they distort the mean.
  • When the distribution is skewed (e.g., income data).
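One rough way to check for skew before reaching for the mean is pandas' `Series.skew()`. The sketch below uses two small made-up samples, and the cutoff of 1.0 is a common rule of thumb used here as an assumption, not a fixed standard.

```python
import pandas as pd

# Illustrative samples (made-up values, not from a real dataset).
symmetric = pd.Series([160, 165, 168, 170, 172, 175, 180])
with_outlier = pd.Series([2000, 2500, 3000, 3200, 3500, 4000, 50000])

SKEW_CUTOFF = 1.0  # assumed rule-of-thumb threshold

for name, values in [("symmetric", symmetric), ("with_outlier", with_outlier)]:
    skew = values.skew()  # sample skewness; close to 0 for symmetric data
    if abs(skew) <= SKEW_CUTOFF:
        verdict = "mean looks reasonable"
    else:
        verdict = "mean is likely distorted"
    print(f"{name}: skew={skew:.2f} -> {verdict}")
```

A histogram or box plot is still worth a look, since a single number can hide the shape of the data.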


2. Using Median for Missing Values

When to Use Median?

  • Data is numerical but has outliers (e.g., income, house prices).
  • The distribution is skewed (not symmetric).
  • The dataset contains extreme values that could distort the mean.

Example:

Consider a dataset of monthly incomes of 10 employees, where one value is missing:

Employee   Income ($)
1          2,000
2          2,500
3          3,000
4          3,200
5          3,500
6          4,000
7          50,000
8          3,300
9          3,700
10         ?

Mean Income Calculation:

\[ \text{Mean} = \frac{2{,}000 + 2{,}500 + 3{,}000 + 3{,}200 + 3{,}500 + 4{,}000 + 50{,}000 + 3{,}300 + 3{,}700}{9} = \frac{75{,}200}{9} \approx 8{,}355.56 \]

The mean is pulled far above the typical income by the outlier (50,000).

Median Calculation (Sorted Data):

Sorted: 2,000, 2,500, 3,000, 3,200, 3,300, 3,500, 3,700, 4,000, 50,000

Since there are 9 values, the median is the middle one (5th value): Median = 3,300

So, we replace the missing value with $3,300 instead of the mean ($8,355.56), which would have been misleading.
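The income example can be checked with a short pandas sketch. The Series below mirrors the table, and the names `incomes` and `income_usd` are illustrative.

```python
import numpy as np
import pandas as pd

# Monthly incomes from the example table; employee 10 is missing (NaN).
incomes = pd.Series(
    [2000, 2500, 3000, 3200, 3500, 4000, 50000, 3300, 3700, np.nan],
    index=range(1, 11),
    name="income_usd",
)

mean_income = incomes.mean()      # pulled up by the 50,000 outlier
median_income = incomes.median()  # middle of the 9 known values

print(round(mean_income, 2))  # 8355.56
print(median_income)          # 3300.0

# Fill the gap with the median rather than the mean.
filled_incomes = incomes.fillna(median_income)
print(filled_incomes.loc[10])  # 3300.0
```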

When NOT to Use Median?

  • When the dataset is roughly normally distributed: the mean and median will be similar, so the median offers no advantage over the mean.
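To see how close the two statistics are on roughly symmetric data, here is a small sketch using the nine known heights from the mean example earlier in this post.

```python
import pandas as pd

# The nine known heights (cm) from the mean example.
heights = pd.Series([170, 165, 175, 180, 168, 172, 160, 178, 169])

print(round(heights.mean(), 2))  # 170.78
print(heights.median())          # 170.0
```

With less than 1 cm between them, either value would be a sensible fill here.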


3. Using Mode for Missing Values

When to Use Mode?

  • Data is categorical (e.g., gender, city, product type).
  • Data has discrete values (e.g., shoe sizes, survey ratings).
  • You need to replace missing values with the most frequent value.

Example:

A dataset of preferred payment methods:

Customer   Payment Method
1          Credit Card
2          Debit Card
3          PayPal
4          Credit Card
5          Debit Card
6          ?
7          Credit Card
8          PayPal

Most Frequent Value (Mode):

  • Credit Card = 3 times
  • Debit Card = 2 times
  • PayPal = 2 times

The most frequent value is Credit Card, so we replace the missing value with Credit Card.
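The counts above can be reproduced with `value_counts()` and `mode()`. This is a minimal sketch; the Series name `payment_method` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Payment methods from the example table; customer 6 is missing (NaN).
payments = pd.Series(
    ["Credit Card", "Debit Card", "PayPal", "Credit Card",
     "Debit Card", np.nan, "Credit Card", "PayPal"],
    index=range(1, 9),
    name="payment_method",
)

# value_counts() ignores NaN by default and sorts by frequency.
counts = payments.value_counts()
print(counts)

# mode() returns a Series of the most frequent value(s); take the first.
most_frequent = payments.mode()[0]
filled_payments = payments.fillna(most_frequent)
print(filled_payments.loc[6])  # Credit Card
```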

When NOT to Use Mode?

  • If all values are equally frequent, since no single value stands out as the most common.
  • If the data is continuous numerical (values rarely repeat, so the mode isn't meaningful).
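Ties are easy to miss in code because `Series.mode()` returns every tied value in sorted order, so taking `mode()[0]` quietly picks the first one. A small sketch with made-up survey answers:

```python
import numpy as np
import pandas as pd

# Made-up survey answers where two categories are equally frequent.
ratings = pd.Series(["Good", "Bad", "Good", "Bad", np.nan])

modes = ratings.mode()  # all tied values, in sorted order
if len(modes) > 1:
    print("Tie between:", list(modes))  # Tie between: ['Bad', 'Good']
else:
    print("Single mode:", modes[0])
```

When a tie shows up, filling with `modes[0]` is an arbitrary choice, so it may be better to use a separate "Unknown" category or leave the value missing.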


Summary: Choosing the Right Imputation Technique

Scenario                                     Best Choice   Why?
Data is numerical & normally distributed     Mean          Preserves the overall mean of the dataset.
Data has outliers or skewed distribution     Median        Reduces the impact of extreme values.
Data is categorical (e.g., colors, cities)   Mode          Keeps the most frequent value.
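The summary can be turned into a rough decision helper. The function `choose_fill_value` below is hypothetical, written only for illustration. It assumes every numeric column is continuous and uses a skewness cutoff of 1.0 as a rule of thumb, so treat it as a starting point rather than a rule.

```python
import numpy as np
import pandas as pd


def choose_fill_value(series, skew_cutoff=1.0):
    """Hypothetical helper: return (strategy, fill_value) for one column.

    Simplification: every numeric column is treated as continuous, and
    skew_cutoff=1.0 is an assumed rule-of-thumb threshold.
    """
    if pd.api.types.is_numeric_dtype(series):
        if abs(series.skew()) > skew_cutoff:
            return "median", series.median()
        return "mean", series.mean()
    # Non-numeric (categorical) data: use the most frequent value.
    return "mode", series.mode()[0]


# The three example columns from this post, each with one missing value.
heights = pd.Series([170, 165, 175, 180, 168, 172, 160, 178, 169, np.nan])
incomes = pd.Series([2000, 2500, 3000, 3200, 3500, 4000, 50000, 3300, 3700, np.nan])
payments = pd.Series(["Credit Card", "Debit Card", "PayPal", "Credit Card",
                      "Debit Card", np.nan, "Credit Card", "PayPal"])

for name, column in [("heights", heights), ("incomes", incomes), ("payments", payments)]:
    strategy, value = choose_fill_value(column)
    print(name, strategy, value)
```

On these columns the helper picks the mean for heights, the median for incomes, and the mode for payment methods, matching the table.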


Python Code for Handling Missing Values

Here’s how you can handle missing values using pandas in Python:

import pandas as pd
import numpy as np

# Sample dataset
data = {'Age': [25, 30, 35, np.nan, 45, 50, np.nan],
        'Income': [3000, 3200, 4000, 50000, 3700, 4100, np.nan],
        'Gender': ['Male', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female']}

df = pd.DataFrame(data)

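# Filling missing values.
# Assign the result back to the column: calling fillna(..., inplace=True) on
# df['col'] raises a FutureWarning in recent pandas and stops updating df
# once Copy-on-Write is enabled.
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Using Mean
df['Income'] = df['Income'].fillna(df['Income'].median())  # Using Median
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])  # Using Mode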

print(df)
        

Final Thoughts

Choosing the right method to handle missing values is crucial for building robust machine learning models. The choice between Mean, Median, and Mode depends on the data type, the shape of the distribution, and the presence of outliers.

Understanding when to use each method ensures more accurate results and prevents misleading conclusions.

Which method do you use most often? Let me know in the comments!

Follow me for more insights on AI, ML, and Data Science!

