Spot the difference - comparing tables in R
Samantha Bell
Veterinary Data Analysis | Dashboards & Reporting | LVT | E-commerce | Bioinformatics
Ever wondered how to compare code output without looking over each row and column by hand? This handy use of all_equal() from dplyr can save you time and ensure your reports are consistent!
What questions can all_equal() answer?
- Do the two tables have the same number of columns and rows?
- Are the column class types the same in both tables?
- Do the column names match? If not, which column names are problematic in each table?
- Is there different data content? If so, find out which rows in which tables contain differences.
(Documentation can be found here.)
1) Using a built-in dataset to create sample tables
For this example, we will make use of one of the datasets that are included with R, named ChickWeight.
To see a list of all default datasets that come with base R, run the command data()
ChickWeight is a table tracking the weights of chicks fed different diets, along with their ages.
We start by loading the dataset and the two packages we will use to manipulate it:
data(ChickWeight) # This creates a dataframe of the same name library(tidyverse) library(dplyr)
Let's pretend that our report is a summary of the number of chick records by age. If we have multiple employees who do not all use the same code to create the report, there is potential for different row or column format between tables. This can make quality control difficult.
We can create two different example summary tables which contain the same information, using different methods:
# 1 byTime1 <- ChickWeight %>% group_by(Time) %>% summarise(Number_of_chicks=n()) %>% mutate(Time = as.character(Time))
# 2 byTime2 <- tibble( # Set up tibble Number_of_chicks = 0, Time = as.factor(unique(ChickWeight$Time)) ) for(i in dim(byTime2)[1]:1){ # Add the count T <- byTime2$Time[i] byTime2$Number_of_chicks[i] <- as.numeric(dim(ChickWeight[ChickWeight$Time==T,])[1])
}
The resulting tables look like this:
2) Comparing the structure of the tables
Let's compare the information we can gain by running the comparison with defaults, with the options to ignore_row_order, ignore_row_order, or convert (will convert some column types):
Default
all_equal(byTime1, byTime2)
PRINTOUT : "Different types for column `Time`: character vs factor. Different types for column `Number_of_chicks`: integer vs double"
The default for all_equal() will tell the user if class types are not the same between the columns of the same names, but does not inform us that the two tables have columns in a different order.
Converting the column classes
# What if we convert the column types automatically? all_equal(byTime1, byTime2, convert = TRUE)
PRINTOUT : " TRUE"
Using the option convert=TRUE will ignore column classes, and return TRUE as long as the column names and content are the same.
This option does have some limitations. Currently this will only convert factor to character and integer to double.
Ignoring column or row order
# Does it matter what order the columns appear in? all_equal(byTime1, byTime2, convert = TRUE, ignore_col_order = FALSE)
PRINTOUT : "Same column names, but different order"
Using the options ignore_row_order=TRUE or ignore_row_order=TRUE will inform you of row/column order even when the cell content and column names are the same.
3) Comparing the content of the tables
Let's identify some occasions when the data itself may differ between tables.
Different values in a cell
byTime2$Number_of_chicks[3] <- 77 # Change one number in row 3 all_equal(byTime1, byTime2, convert = TRUE) # Identify which row has the difference
PRINTOUT: "Rows in x but not in y: 3. Rows in y but not in x: 3"
all_equal() will tell the user which row in each table number contains the difference.
In this case, we can learn what rows are in the 1st table that do not match in the second, and vice versa.
Extra columns
byTime2$Pairs_of_chicks <- byTime2$Number_of_chicks/2 # Add a new column all_equal(byTime1, byTime2, convert = TRUE) # Identify differences in column count
PRINTOUT: "different number of columns: 2 vs 3"
all_equal() will tell the user if the number of columns is not the same, and will not compare the tables further until the number of columns matches.
Even though we still have different data in row 3, all_equal() stops comparing after noticing that the number of columns is different. We must fix this error before checking for more differences
Different column names
byTime2$Number_of_chicks <- NULL # remove a column all_equal(byTime1, byTime2, convert = TRUE) # Identify mismatched column names
PRINTOUT: "not compatible: Cols in y but not x: `Pairs_of_chicks`. Cols in x but not y: `Number_of_chicks` "
all_equal() will tell the user if the names of columns are not the same, and will not compare the tables further until this is fixed.
Since the column names are different in the two tables all_equal() does not compare their rows, knowing that they are likely different variables. It conveniently provides us with the names of the offending columns.