Data Cleaning
Dr. Saurav Das
Research Director | Farming Systems Trial | Rodale Institute | Soil Health, Biogeochemistry of Carbon & Nitrogen, Environmental Microbiology, and Data Science | Outreach & Extension
Data cleaning is an essential step in the data analysis process. Ensuring the quality and consistency of your data can lead to more reliable and interpretable results.
A. Removing all the NA columns from your dataset
# Remove columns that are entirely NA
cleaned_data <- data[, colSums(is.na(data)) < nrow(data)]
B. Removing all the character vector column
data_cleaned <- data[, !sapply(data, is.character)]
C. Filtering data with conditions
library(dplyr)
# Filtering rows where 'column_name' is equal to a specific value
filtered_data <- data %>% filter(column_name == "specific_value")
data %>%: This is a part of the pipe (%>%) operator provided by the magrittr package (which is also imported when you use dplyr). The pipe operator takes the left-hand side (in this case, data) and uses it as the first argument to the function on the right-hand side.
filter(column_name == "specific_value"): This is the filter() function. It's used to filter rows of a data frame (or tibble). In this case, it keeps only the rows where the value in the column_name is equal to "specific_value".
# Filtering rows where 'column_name' is greater than a certain threshold
filtered_data <- data %>% filter(column_name > 100)
filter(column_name > 100): This filters the rows based on a numerical condition. Only the rows where the value in column_name is greater than 100 will be retained in filtered_data.
# Combining conditions with & (and) or | (or)
filtered_data <- data %>% filter(column_name > 100 & another_column == "specific_value")
filter(column_name > 100 & another_column == "specific_value"): This is a compound condition. column_name > 100: This is the first condition that checks if the value in column_name is greater than 100. another_column == "specific_value": This is the second condition that checks if the value in another_column is equal to "specific_value". &: This is a logical AND operator. For a row to be retained in filtered_data, it must satisfy BOTH of the conditions. If you wanted to retain rows that satisfy either of the conditions (but not necessarily both), you would use the logical OR operator | instead of &.
D. Removing Duplicate Records
Using Base R
In Base R, you can use the duplicated() function to identify and remove duplicates.
Removing Duplicate Rows:
data <- data[!duplicated(data), ]
If you have a dataset with multiple columns and want to check duplicates based only on specific columns, you can specify those columns:
data <- data[!duplicated(data[c("column1", "column2")]), ]
Using dplyr
Removing Duplicate Rows:
领英推荐
library(dplyr) data <- data %>% distinct()
Removing Duplicates based on Specific Columns: If you only want to consider certain columns for detecting duplicates, you can specify those columns:
data <- data %>% distinct(column1, column2, .keep_all = TRUE)
Here, .keep_all = TRUE ensures that you keep all columns in your original dataset, but duplicates are identified based on only column 1 and column 2.
E. Correcting Data Types
Correcting data types is essential for data analysis, as wrong data types can lead to errors or incorrect results.
Convert to Numeric:
data$column_name <- as.numeric(data$column_name)
Convert to Character:
data$column_name <- as.character(data$column_name)
Convert to Factor:
data$column_name <- as.factor(data$column_name)
Convert to Date:
For this, you may need to specify the format if it's not the default "YYYY-MM-DD".
data$column_name <- as.Date(data$column_name, format="%d/%m/%Y")
Dealing with Issues during Conversion
Sometimes, you might encounter warnings or errors when converting types, especially when trying to convert character columns to numeric or date. Common issues include non-convertible characters or unexpected formats.
For instance, if a character column meant to be numeric has a non-numeric entry like "N/A", converting directly will produce NAs for the entire column. You'd need to handle these non-numeric values first, either by replacing or removing them.
Handle Non-Numeric Values before Conversion:
# Replace "N/A" with NA, then convert to numeric data$column_name[data$column_name == "N/A"] <- NA
data$column_name <- as.numeric(data$column_name)
Verify the Conversion
After conversion, it's good to check the structure of your data to ensure the conversion took place:
str(data)
This will show you the structure of your data frame, including the data type of each column.