Handling Missing Data

Dr. Saurav Das

Research Director | Farming Systems Trial | Rodale Institute | Soil Health, Biogeochemistry of Carbon & Nitrogen, Environmental Microbiology, and Data Science | Outreach & Extension | Vibe coding

Published: September 27, 2023

Data is the foundation of any analytical project. However, real-world data is often messy and incomplete. Missing data can introduce bias, reduce the power and efficiency of statistical methods/models, and complicate analyses. Handling missing data is therefore a critical step in ensuring the accuracy and reliability of your results.

Why Does Data Go Missing?

Before we dig into the methods for handling missing data, it's important to understand why data might be missing. Some common reasons include:

Errors during data collection: Mistakes can occur during data entry or data transfer.
Non-responses: Survey respondents might skip certain questions.
Data decay: Over time, some data points might get lost or become inaccessible.
Systematic issues: Sometimes, missing data isn't random. For instance, a faulty sensor might not record any data.

Understanding the reason behind the missing data can guide you in selecting the best method to handle it.
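
Before choosing a method, it usually helps to see how much is missing and where. The snippet below is a minimal sketch on a made-up data frame (the `soil` object and its values are invented for illustration); it counts missing values per column and lists the incomplete rows.

```r
# Hypothetical soil dataset (values invented for illustration)
soil <- data.frame(
  Site     = c("A", "B", "C", "D", "E", "F"),
  pH       = c(6.1, NA, 5.8, 7.2, NA, 6.5),
  Moisture = c(21, 19, NA, 23, 20, 22)
)

# Number and share of missing values per column
na_count <- colSums(is.na(soil))
na_share <- colMeans(is.na(soil))
print(data.frame(na_count, na_share))

# Rows that have at least one missing value
incomplete_rows <- soil[!complete.cases(soil), ]
print(incomplete_rows)
```

Looking at which rows are incomplete, and whether they cluster by site, date, or instrument, is often the first clue about why the values are missing.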

Types of Missing Data

A. Missing Completely at Random (MCAR): The probability that a value is missing is unrelated to any data, observed or unobserved. The causes of the missing data are purely random.

Scenario: A soil researcher collects samples from multiple sites to analyze soil pH. While transporting the samples to the lab, some of the containers accidentally opened, and the samples were lost.
Example: The loss of samples was purely accidental and had nothing to do with the pH or any other characteristic of the soil. This means the missingness of the pH data from these samples is completely random.

B. Missing at Random (MAR): The reason for the missing data is related to some other observed data but not the missing data itself.

Scenario: A research team is studying various soil properties, including moisture content, pH, and nutrient levels, across different altitudes. They notice that at higher altitudes, it's more challenging to measure moisture content because the equipment malfunctions in colder temperatures.
Example: The missingness of moisture content data is related to the altitude (which is observed data) but not directly to the moisture content itself. If we account for altitude in our analysis, we can potentially address the bias introduced by the missing moisture content data.

C. Missing Not at Random (MNAR): There's a specific reason for the missing data, and it's related to the missing data itself.

Scenario: A study is being conducted on soil salinity in areas near the coast. Some farmers, knowing that high salinity levels in their fields could reduce land value, decline to provide soil samples from salt-affected patches.
Example: Here, the missingness of the salinity data is directly related to the salinity level of the soil (the higher the salinity, the more likely the data is missing). The missingness mechanism is inherently related to the unobserved missing value itself, making it MNAR.
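
The practical difference between the three mechanisms can be seen in a small simulation. This is only a sketch under assumed numbers: the altitude range, the moisture model, and the missingness probabilities are all invented. It removes values under each mechanism and compares the mean of the remaining observations with the true mean.

```r
set.seed(42)
n <- 5000
altitude <- runif(n, 100, 800)                        # observed covariate
moisture <- 30 - 0.01 * altitude + rnorm(n, sd = 2)   # true values before any loss

# MCAR: every value has the same 30% chance of being lost
miss_mcar <- runif(n) < 0.3
# MAR: the chance of loss rises with altitude (observed), not with moisture itself
miss_mar <- runif(n) < plogis((altitude - 450) / 100)
# MNAR: the chance of loss rises with the moisture value itself
miss_mnar <- runif(n) < plogis(moisture - mean(moisture))

observed_means <- c(
  truth = mean(moisture),
  MCAR  = mean(moisture[!miss_mcar]),
  MAR   = mean(moisture[!miss_mar]),
  MNAR  = mean(moisture[!miss_mnar])
)
print(round(observed_means, 2))

# Under MAR, modelling moisture on altitude and predicting for every site
# recovers the overall mean, because altitude explains the missingness
dat <- data.frame(altitude, moisture)
fit_mar <- lm(moisture ~ altitude, data = dat[!miss_mar, ])
mar_adjusted <- mean(predict(fit_mar, newdata = dat))
print(round(mar_adjusted, 2))
```

In this setup the MCAR mean stays close to the truth, the raw MAR mean is biased until altitude is accounted for, and the MNAR mean is biased in a way that no observed variable can correct.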

Techniques for Handling Missing Data

Deletion:
  Listwise: Remove any row with a missing value.
  Pairwise: For each analysis, use every row that is complete for the variables involved, and drop only the rows missing one of those variables.
  Pros: Simple and quick.
  Cons: This can result in significant data loss.
Imputation:
  Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column.
  Linear Regression Imputation: Predict missing values using linear regression.
  K-Nearest Neighbors (KNN) Imputation: Replace missing values with the mean or median of their k-nearest neighbors.
  Pros: Retains the original data size.
  Cons: Can introduce bias or reduce variance.
Interpolation: Useful for time series data. Predict missing values based on the values of surrounding data points. [I will discuss more about the pros/cons in the next post]
Multiple Imputation: Impute the missing data multiple times to create several complete datasets, analyze each one separately, and then pool the results. [I will discuss more about the pros/cons in the next post]
Using Algorithms that Support Missing Values: Some algorithms like XGBoost or certain tree-based algorithms can handle missing values without any preprocessing. [I will discuss more about the pros/cons in the next post]
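
As a quick illustration of the interpolation item in the list above, base R's `approx()` can fill gaps in an ordered series by drawing a straight line between the neighbouring observations. The series below is hypothetical (invented daily soil temperatures), and this is a sketch rather than a recommendation for any particular dataset.

```r
# Hypothetical daily soil temperature series with gaps (values invented)
day  <- 1:8
temp <- c(12.0, 12.5, NA, 13.5, 14.0, NA, NA, 15.5)

# approx() interpolates linearly between the observed neighbours of each gap
filled <- approx(x = day[!is.na(temp)], y = temp[!is.na(temp)], xout = day)$y
print(data.frame(day, temp, filled))
```

With its default settings `approx()` does not extrapolate, so gaps at the very start or end of a series would stay `NA`.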

Examples:

1. Missing Completely at Random (MCAR):

Scenario & Dataset:

10 soil samples were collected to measure pH.
Due to an accident, the pH values of 2 samples were lost.

# Hypothetical dataset
set.seed(123)
soil_pH <- data.frame(Sample_ID = 1:10, pH = c(runif(8, 5, 9), NA, NA))
print(soil_pH)

Solution:

Given the data is MCAR, mean imputation leaves the estimated mean unbiased, although it understates the variability of pH.

soil_pH$pH[is.na(soil_pH$pH)] <- mean(soil_pH$pH, na.rm = TRUE)
print(soil_pH)

2. Missing at Random (MAR):

Scenario & Dataset:

Soil properties are measured across different altitudes.
At higher altitudes, moisture content measurements are missing.

# Hypothetical dataset
altitude <- c(seq(100, 800, 100))
moisture_content <- c(20, 21, 22, 19, NA, NA, NA, NA)  
# Missing data at higher altitudes
soil_data_mar <- data.frame(Altitude = altitude, Moisture_Content = moisture_content)
print(soil_data_mar)

Solution:

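Given the data is MAR and there's a trend (lower moisture with higher altitude), we can use linear regression imputation. Note that in this toy dataset every missing value lies above the observed altitude range, so the model is extrapolating and the imputed values should be treated with caution.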

model <- lm(Moisture_Content ~ Altitude, data = soil_data_mar, na.action = na.exclude)
soil_data_mar$Moisture_Content[is.na(soil_data_mar$Moisture_Content)] <- predict(model, newdata = soil_data_mar[is.na(soil_data_mar$Moisture_Content), ])
print(soil_data_mar)

3. Missing Not at Random (MNAR):

Scenario & Dataset:

Soil salinity is measured in areas near the coast.
Farmers with high salinity levels decline to provide samples.

# Hypothetical dataset
distance_from_coast <- c(seq(1, 10, 1))
salinity <- c(3, 5, 4, 6, 8, NA, NA, NA, NA, NA)
# Missing data in rows 6-10 (in this toy dataset these are the fields farther from the coast)
soil_data_mnar <- data.frame(Distance = distance_from_coast, Salinity = salinity)
print(soil_data_mnar)

Solution:

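MNAR is tricky. One could use advanced methods like multiple imputation. But for simplicity, let's use KNN imputation, keeping in mind that it does not correct for the MNAR mechanism and is shown here only for illustration.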

# DMwR has been archived on CRAN; its successor DMwR2 also provides knnImputation()
library(DMwR)
soil_data_mnar_imputed <- DMwR::knnImputation(soil_data_mnar, k = 3)
print(soil_data_mnar_imputed)

Note: Solutions to missing data problems always come with their assumptions and limitations. Especially for MNAR, there's no perfect solution. In real-world scenarios, domain expertise and understanding of the context play a significant role in choosing the most appropriate method.

A few more examples:

A hypothetical dataset:

set.seed(123)

Soil_Depth <- runif(100, 0, 100)  # Depth in cm

Carbon_Content <- 0.5 * Soil_Depth + rnorm(100, mean=0, sd=5)

Carbon_Content[sample(1:100, 15)] <- NA  # Introducing missing values

soil_data <- data.frame(Soil_Depth, Carbon_Content)

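Visualizing the original dataset:

library(ggplot2)

# Rows with a missing Carbon_Content have no y value to plot, so their depths are marked with a red rug on the x-axis
ggplot(soil_data, aes(Soil_Depth, Carbon_Content)) +
  geom_point(na.rm = TRUE) +
  geom_rug(data = soil_data[is.na(soil_data$Carbon_Content), ], aes(x = Soil_Depth), inherit.aes = FALSE, color = "red", sides = "b") +
  ggtitle("Soil Depth vs. Organic Carbon Content")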

Listwise deletion:

soil_data_listwise <- soil_data[complete.cases(soil_data), ]

ggplot(soil_data_listwise, aes(Soil_Depth, Carbon_Content)) + geom_point() + ggtitle("Listwise Deletion")
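
Listwise deletion can be contrasted with pairwise deletion once there are more than two variables. The sketch below uses a hypothetical three-variable data frame (the values are invented) and base R's `cor()`, whose `use` argument switches between the two behaviours.

```r
# Hypothetical three-variable soil dataset (values invented)
soil <- data.frame(
  pH       = c(6.1, 5.8, NA, 7.0, 6.4, 6.9),
  Moisture = c(21, NA, 19, 24, 22, 23),
  Nitrogen = c(1.2, 1.0, 0.9, NA, 1.3, 1.5)
)

# Listwise: only rows complete on every variable are used
n_listwise   <- sum(complete.cases(soil))
cor_listwise <- cor(soil, use = "complete.obs")

# Pairwise: each correlation uses all rows complete for that pair of variables
cor_pairwise <- cor(soil, use = "pairwise.complete.obs")

# Number of rows available for each pair (diagonal = rows observed per variable)
obs <- ifelse(is.na(as.matrix(soil)), 0, 1)
n_pairwise <- crossprod(obs)

print(n_listwise)
print(n_pairwise)
```

Here listwise deletion keeps 3 of the 6 rows, while each pairwise correlation is based on 4 rows. The trade-off is that different entries of the pairwise matrix rest on different subsets of the data.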

Mean imputation:

soil_data_mean <- soil_data

soil_data_mean$Carbon_Content[is.na(soil_data_mean$Carbon_Content)] <- mean(soil_data_mean$Carbon_Content, na.rm = TRUE)

ggplot(soil_data_mean, aes(Soil_Depth, Carbon_Content)) + geom_point() + ggtitle("Mean Imputation")
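
The cost of mean imputation can be checked numerically. The sketch below rebuilds the same hypothetical dataset and compares the standard deviation of carbon content, and its correlation with depth, before and after imputation. The variable names `observed` and `imputed` are introduced here only for illustration.

```r
set.seed(123)
Soil_Depth <- runif(100, 0, 100)
Carbon_Content <- 0.5 * Soil_Depth + rnorm(100, mean = 0, sd = 5)
Carbon_Content[sample(1:100, 15)] <- NA
soil_data <- data.frame(Soil_Depth, Carbon_Content)

observed <- soil_data$Carbon_Content[!is.na(soil_data$Carbon_Content)]
imputed  <- soil_data$Carbon_Content
imputed[is.na(imputed)] <- mean(observed)

# The mean is preserved, but the spread shrinks because 15 points now sit exactly on the mean
sd_observed <- sd(observed)
sd_imputed  <- sd(imputed)

# The relationship with depth is weakened for the same reason
cor_observed <- cor(soil_data$Soil_Depth, soil_data$Carbon_Content, use = "complete.obs")
cor_imputed  <- cor(soil_data$Soil_Depth, imputed)

print(c(sd_observed = sd_observed, sd_imputed = sd_imputed))
print(c(cor_observed = cor_observed, cor_imputed = cor_imputed))
```

In the plot this shows up as a horizontal band of imputed points at the mean, cutting across the depth trend.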

Linear Regression Imputation:

model <- lm(Carbon_Content ~ Soil_Depth, data = soil_data, na.action = na.exclude)

soil_data_regression <- soil_data

soil_data_regression$Carbon_Content[is.na(soil_data_regression$Carbon_Content)] <- predict(model, newdata = soil_data_regression[is.na(soil_data_regression$Carbon_Content), ])

ggplot(soil_data_regression, aes(Soil_Depth, Carbon_Content)) + geom_point() + ggtitle("Linear Regression Imputation")
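
Regression imputation places every imputed point exactly on the fitted line, which makes the data look more certain than it is. One way to carry that uncertainty forward is the multiple imputation idea listed among the techniques. The following is a simplified base R sketch, not a full implementation: it uses a bootstrap refit plus residual noise for each imputation and pools the slope with Rubin's rules. The number of imputations `m = 20` and all object names are assumptions made for this example.

```r
set.seed(123)
Soil_Depth <- runif(100, 0, 100)
Carbon_Content <- 0.5 * Soil_Depth + rnorm(100, mean = 0, sd = 5)
Carbon_Content[sample(1:100, 15)] <- NA
soil_data <- data.frame(Soil_Depth, Carbon_Content)

miss     <- is.na(soil_data$Carbon_Content)
obs_rows <- soil_data[!miss, ]
m        <- 20                      # number of imputed datasets (arbitrary choice)
slopes     <- numeric(m)
slope_vars <- numeric(m)

for (i in seq_len(m)) {
  # Refit on a bootstrap resample so each imputation reflects model uncertainty
  boot <- obs_rows[sample(nrow(obs_rows), replace = TRUE), ]
  fit  <- lm(Carbon_Content ~ Soil_Depth, data = boot)

  # Predict the missing values and add residual noise instead of placing them on the line
  completed <- soil_data
  completed$Carbon_Content[miss] <-
    predict(fit, newdata = soil_data[miss, ]) + rnorm(sum(miss), mean = 0, sd = sigma(fit))

  # Analyse each completed dataset separately
  fit_i <- lm(Carbon_Content ~ Soil_Depth, data = completed)
  slopes[i]     <- coef(fit_i)[["Soil_Depth"]]
  slope_vars[i] <- vcov(fit_i)["Soil_Depth", "Soil_Depth"]
}

# Rubin's rules: average the estimates, combine within- and between-imputation variance
pooled_slope <- mean(slopes)
within_var   <- mean(slope_vars)
between_var  <- var(slopes)
pooled_se    <- sqrt(within_var + (1 + 1 / m) * between_var)
print(c(slope = pooled_slope, se = pooled_se))
```

The pooled standard error is larger than the one from a single imputed dataset because it includes the variation between imputations.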

KNN imputation:

soil_data_knn <- DMwR::knnImputation(soil_data, k = 5)

ggplot(soil_data_knn, aes(Soil_Depth, Carbon_Content)) + geom_point() + ggtitle("KNN Imputation")
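
To make the KNN idea concrete without relying on a package, here is a minimal base R sketch. The helper `knn_impute()` is hypothetical and written only for this example: it assumes a single, fully observed numeric predictor and takes the unweighted mean of the k nearest observed neighbours, so its results may differ from what a package function produces.

```r
set.seed(123)
Soil_Depth <- runif(100, 0, 100)
Carbon_Content <- 0.5 * Soil_Depth + rnorm(100, mean = 0, sd = 5)
Carbon_Content[sample(1:100, 15)] <- NA
soil_data <- data.frame(Soil_Depth, Carbon_Content)

# Hypothetical helper: x is a fully observed numeric predictor, y contains NAs
knn_impute <- function(x, y, k = 5) {
  miss <- is.na(y)
  y_filled <- y
  for (i in which(miss)) {
    d <- abs(x[!miss] - x[i])               # distance to every observed row
    nearest <- order(d)[seq_len(k)]         # positions of the k closest observed rows
    y_filled[i] <- mean(y[!miss][nearest])  # unweighted mean of their y values
  }
  y_filled
}

soil_data_knn_sketch <- soil_data
soil_data_knn_sketch$Carbon_Content <-
  knn_impute(soil_data$Soil_Depth, soil_data$Carbon_Content, k = 5)

print(head(soil_data_knn_sketch))
```

Because each imputed value is a local average, KNN follows the depth trend more closely than mean imputation, but like regression imputation it still smooths away some of the natural scatter.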

___________________________________________________________________

I will update this post next week with a time series analysis and another test, covering temporal decomposition, additive models to predict the missing values, linear interpolation, and RNNs.

R for Soil Science
