Simplifying and Grouping Address Fields Using R
Samantha Bell
Veterinary Data Analysis | Dashboards & Reporting | LVT | E-commerce | Bioinformatics
Grouping records by street address can be a daunting task. Hotspot analyses like this come up across many fields of study, yet many analysts don't feel comfortable matching addresses when entry formats vary. Today we look at one way of doing simple grouping using str_match() and gsub() in R. (If you are completely new to REGEX, take a look at this introduction and this cheat sheet.)
The situation.
Imagine that you work as the data analyst for a chain of banana stands. You are given a file containing one row for each week of a 3-month period, with the address of the stand that had the most sales that week. Your boss asks that you provide a table of address locations and the number of times each achieved top-sales status. Sounds easy enough, right?
But when you open the file, you see a problem: the addresses were entered by hand each time, with the format varying depending on the data entry clerk. Sometimes street names contain "North", "South", "East", "West", and sometimes they do not. Sometimes there are periods. Sometimes "Avenue" was entered, and other times "Ave." was used. Hmm, what to do now?
A less ambitious analyst might be tempted to group the addresses by hand; it's not that long of a list, right? But not you. You know that writing good code now will save you time in the future, when you are inevitably asked to analyze much longer lists. You also know that the code you write can be modified to suit future projects with different goals. You have a thirst for knowledge and the ability to learn anything you set your mind to! So let's get started!
Think about what needs to be done.
To be successful, the code needs to make five things happen:
- Identify addresses with NSEW letters or words in them and grab the street number, NSEW, and street name.
- Reformat NSEW letters/words to one standard format.
- For addresses without NSEW letters or words in them, grab the street number and name.
- Leave off any trailing words, apartment numbers, or other qualifiers.
- Group the simplified addresses and count the repeats.
Find addresses with NSEW, and capture a simplified version.
Steps 1 & 2 involve writing REGEX patterns that will differentiate between addresses with NSEW and those without. Use anchors, character classes and quantifiers (refresh your memory here).
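As a quick warm-up, base R's grepl() shows each building block in action (the strings here are invented for illustration):

```r
# Anchors, character classes, and quantifiers in miniature (toy strings):
grepl("^\\d+", "123 Main St")    # TRUE:  ^ anchors the digits (\\d+) to the start
grepl("^\\d+", "Main St 123")    # FALSE: the digits are not at the start
grepl("[NSEW]", "45 N Oak Ave")  # TRUE:  [NSEW] matches any one of N, S, E, W
```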
We can write one flexible pattern for each type of address.
The first pattern will look for addresses containing NSEW. Let's break it down:
pat1 <- "^\\d+\\s+[NSEW].{0,5}\\s*\\w+"
"^" is an anchor, indicating we want what follows to be at the start of the character string.
"\\d" is a character class meaning numeric "digit", and "+" is a quantifier showing that we want at least 1 of what precedes it. Together they say, "At least one digit".
"\\s" is a character class meaning "space". Together with "+" this says "At least one space".
The square brackets in "[NSEW]" define a character class, which matches any single one of the characters inside it. This is saying to look for either N, S, E, or W.
"." is a character class meaning "any character". The brackets that follow contain a quantifier with the start and end count of how many times the item immediately before should occur. Together ".{0,5}" means "Zero to five characters, which can be any type". Since this comes right after our "(N|S|E|W)", this means a word starts with N, S, E, or W and then may have up to 5 letters following it (for the entire word) or could also have another character, such as a period. This allows for matching "North", "N", or "N." formatting.
The next part is another "\\s" space, but this time accompanied by the quantifier "*", meaning it can occur zero or more times. We might not have another space, but it's OK if we do.
Finally, "\\w" is a character class meaning "word". Together with "+" this looks for "At least one character of a letter or number combination.
The entire expression says, "Starting at the beginning of the string, look for at least one number, followed by at least one space, followed by N, S, E, or W and optionally zero to five trailing characters, then an optional space, and finally at least one letter or number for the street name."
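To see the pattern at work before touching the real data, we can try it on a few invented addresses (str_match() comes from the stringr package):

```r
library(stringr)

pat1 <- "^\\d+\\s+[NSEW].{0,5}\\s*\\w+"

# Invented addresses showing what the pattern keeps and drops:
str_match("12 N. Main Street Apt 4", regex(pat1, ignore_case = TRUE))[1]  # "12 N. Main"
str_match("45 West Elm Road",        regex(pat1, ignore_case = TRUE))[1]  # "45 West Elm"
str_match("600 Banana Ave",          regex(pat1, ignore_case = TRUE))[1]  # NA: no NSEW
```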
Once our table is loaded into R as a tibble named myData, we can use pattern #1 with stringr's str_match() to check each address. When a string matches, str_match() returns only the portion that matches the pattern exactly, with no trailing information.
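If you want to follow along without the original file, a small mock tibble is enough (the addresses below are invented for illustration, not the article's real data):

```r
library(tibble)

# Hypothetical stand-in for the real weekly file (invented addresses):
myData <- tibble(
  Week = 1:6,
  Street_Address = c("12 N. Main Street",
                     "12 North Main St.",
                     "45 West Elm Road",
                     "45 W. Elm Rd",
                     "600 Banana Ave",
                     "600 Banana Avenue")
)
```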
str_match() returns a matrix: the first column holds the full match, and any further columns would hold capture-group matches (our pattern has none). We only want the full match, so we take index #1 using [1]. Wrapping the pattern in regex() with ignore_case = TRUE allows for variations in capitalization of the addresses. This reads, "For each row of the tibble, pull out the string match from each street address, using pattern #1 as a REGEX and ignoring capitalization. Give me only the 1st element".
library(stringr)  # str_match() and regex() come from the stringr package

for(i in 1:16){
  print(str_match(myData$Street_Address[i], regex(pat1, ignore_case = TRUE))[1])
}
Now we have some addresses that look like this (NA for addresses without NSEW):
But we are not quite ready to group them just yet. First we need to reformat the NSEW pieces. Using the pipe "%>%" we can send our matches directly into gsub(), which will replace text for us. The way to use gsub() is:
gsub("pattern", "replacement", x = yourData, ignore.case = TRUE)
Adding this to the str_match() loop looks like this:
for(i in 1:16){
  y <- str_match(myData$Street_Address[i], regex(pat1, ignore_case = TRUE))[1] %>%
    gsub("N\\s|N\\.\\s|North\\s", "North ", x = ., ignore.case = TRUE) %>%
    gsub("S\\s|S\\.\\s|South\\s", "South ", x = ., ignore.case = TRUE) %>%
    gsub("E\\s|E\\.\\s|East\\s",  "East ",  x = ., ignore.case = TRUE) %>%
    gsub("W\\s|W\\.\\s|West\\s",  "West ",  x = ., ignore.case = TRUE)
  print(y)
}
And results in:
If an address does not contain NSEW, simplify it in a different way.
For addresses without a match to pattern #1, we create pattern #2.
pat2 <- "^\\d+\\s+\\w+"
Use what you learned from making the first pattern to read this one in plain English. You will find that it says roughly, "Starting at the beginning of the string, look for at least one number, followed by at least one space, and finally at least one letter or number for the street name."
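Trying pattern #2 on a couple of invented strings confirms the reading:

```r
library(stringr)

pat2 <- "^\\d+\\s+\\w+"

str_match("600 Banana Ave", pat2)[1]  # "600 Banana"
str_match("Banana Ave 600", pat2)[1]  # NA: the digits are not at the start
```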
If we put the two patterns together, we can make a loop that looks first for pattern #1, and if not found will look for pattern #2. For pattern #1 matches, the NSEW characters are reformatted. The simple address is saved for each row.
myData$Address_Simple <- NA_character_  # Initialize empty column

for(i in 1:dim(myData)[1]){  # for each row
  if(!is.na(myData$Street_Address[i])){  # if the address field is not NA
    myData$Address_Simple[i] <- ifelse(
      # Check for an NSEW-type match (a match result that is not NA)
      !is.na(str_match(myData$Street_Address[i], regex(pat1, ignore_case = TRUE))[1]),
      # If found, grab the matched text and pipe it into the reformatting code
      str_match(myData$Street_Address[i], regex(pat1, ignore_case = TRUE))[1] %>%
        gsub("N\\s|N\\.\\s|North\\s", "North ", x = ., ignore.case = TRUE) %>%
        gsub("S\\s|S\\.\\s|South\\s", "South ", x = ., ignore.case = TRUE) %>%
        gsub("E\\s|E\\.\\s|East\\s",  "East ",  x = ., ignore.case = TRUE) %>%
        gsub("W\\s|W\\.\\s|West\\s",  "West ",  x = ., ignore.case = TRUE),
      # If no NSEW match, grab the simplified version using pattern #2
      str_match(myData$Street_Address[i], regex(pat2, ignore_case = TRUE))[1]
    )
  }  # rows with no address keep NA for the simple address
}
Now we have a completed list of simplified addresses!
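As an aside, the same logic can be written without an explicit loop, since str_match() is vectorized; this sketch is an alternative to the article's loop, not a replacement for it, and uses invented example addresses:

```r
library(stringr)
library(dplyr)

pat1 <- "^\\d+\\s+[NSEW].{0,5}\\s*\\w+"
pat2 <- "^\\d+\\s+\\w+"

# Invented example addresses, standing in for myData$Street_Address:
addresses <- c("12 N. Main Street", "600 Banana Ave")

# str_match() is vectorized: column 1 holds the full match for every address
m1 <- str_match(addresses, regex(pat1, ignore_case = TRUE))[, 1]
m2 <- str_match(addresses, regex(pat2, ignore_case = TRUE))[, 1]

# coalesce() keeps the pattern #1 match where it exists, else the pattern #2 match
simple <- coalesce(m1, m2) %>%
  gsub("N\\s|N\\.\\s|North\\s", "North ", x = ., ignore.case = TRUE) %>%
  gsub("S\\s|S\\.\\s|South\\s", "South ", x = ., ignore.case = TRUE) %>%
  gsub("E\\s|E\\.\\s|East\\s",  "East ",  x = ., ignore.case = TRUE) %>%
  gsub("W\\s|W\\.\\s|West\\s",  "West ",  x = ., ignore.case = TRUE)

simple  # "12 North Main" "600 Banana"
```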
Group the cleaned addresses.
Using tidyverse grouping commands, we can group the addresses by the lowercase version of their simplified address (to ignore capitalization), and count the repeats:
hotspots <- myData %>%
  group_by(tolower(Address_Simple)) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
This assigns the resulting count table to a variable named hotspots, and reads, "Take myData, group the rows by the lowercase version of the Address_Simple column, and summarize with a count of the number of occurrences. Arrange the resulting table in descending order."
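On a tiny invented table, the grouping step looks like this (here the grouping column is named Address for readability, an optional tweak):

```r
library(dplyr)

# Toy cleaned table (invented values) standing in for myData:
myData <- tibble::tibble(
  Address_Simple = c("12 North Main", "12 north main", "600 Banana")
)

hotspots <- myData %>%
  group_by(Address = tolower(Address_Simple)) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

hotspots  # "12 north main" counted twice, "600 banana" once
```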
The final result is a nice, clean table to show your boss the number of times each location appeared in the raw dataset: