Simplifying and Grouping Address Fields Using R
Samantha Bell
Veterinary Data Analysis | Dashboards & Reporting | LVT | E-commerce | Bioinformatics
Grouping records by street address can be a daunting task. Hotspot analyses like this come up across many fields of study, yet many analysts don't feel comfortable matching addresses when entry formats vary. Today we look at one way of doing simple grouping using str_match() and gsub() in R. (If you are completely new to REGEX, take a look at this introduction and this cheat sheet.)
The situation.
Imagine that you work as the data analyst for a chain of banana stands. You are given a file containing one row for each week of a 3-month period, with the address of the stand that had the most sales that week. Your boss asks that you provide a table of address locations and the number of times each achieved top-sales status. Sounds easy enough, right?
But when you open the file, you see a problem: the addresses were entered by hand each time, with the format varying depending on the data entry clerk. Sometimes street names contain "North", "South", "East", "West", and sometimes they do not. Sometimes there are periods. Sometimes "Avenue" was entered, and other times "Ave." was used. Hmm, what to do now?
A less ambitious analyst might be tempted to group the addresses by hand; it's not that long of a list, right? But not you. You know that writing good code now will save you time in the future, when you are inevitably asked to analyze much longer lists. You also know that the code you write can be modified to suit future projects with different goals. You have a thirst for knowledge and the ability to learn anything you set your mind to! So let's get started!
Think about what needs to be done.
To be successful, the code needs to make five things happen:
- Identify addresses with NSEW letters or words in them and grab the street number, NSEW, and street name.
- Reformat NSEW letters/words to one standard format.
- For addresses without NSEW letters or words in them, grab the street number and name.
- Leave off any trailing words, apartment numbers, or other qualifiers.
- Group the simplified addresses and count the repeats.
Find addresses with NSEW, and capture a simplified version.
Steps 1 & 2 involve writing REGEX patterns that will differentiate between addresses with NSEW and those without. Use anchors, character classes and quantifiers (refresh your memory here).
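As a quick warm-up, base R's grepl() shows each building block in action (the strings here are invented for illustration):

```r
# Anchors, character classes, and quantifiers in miniature (toy strings):
grepl("^\\d+", "123 Main St")    # TRUE:  ^ anchors the digits (\\d+) to the start
grepl("^\\d+", "Main St 123")    # FALSE: the digits are not at the start
grepl("[NSEW]", "45 N Oak Ave")  # TRUE:  [NSEW] matches any one of N, S, E, W
```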
We can write one flexible pattern for each type of address.
The first pattern will look for addresses containing NSEW. Let's break it down:
pat1 <- "^\\d+\\s+[NSEW].{0,5}\\s*\\w+"
"^" is an anchor, indicating we want what follows to be at the start of the character string.
"\\d" is a character class meaning numeric "digit", and "+" is a quantifier showing that we want at least 1 of what precedes it. Together they say, "At least one digit".
"\\s" is a character class meaning "space". Together with "+" this says "At least one space".
The square brackets in "[NSEW]" define a character class, which matches any single one of the characters inside it. This is saying to look for either N, S, E, or W.
"." is a character class meaning "any character". The brackets that follow contain a quantifier with the start and end count of how many times the item immediately before should occur. Together ".{0,5}" means "Zero to five characters, which can be any type". Since this comes right after our "(N|S|E|W)", this means a word starts with N, S, E, or W and then may have up to 5 letters following it (for the entire word) or could also have another character, such as a period. This allows for matching "North", "N", or "N." formatting.
The next part is another "\\s" space, but this time accompanied by the quantifier "*", meaning it can occur zero or more times. We might not have another space, but it's OK if we do.
Finally, "\\w" is a character class meaning "word". Together with "+" this looks for "At least one character of a letter or number combination.
The entire expression says, "Starting at the beginning of the string, look for at least one number, followed by at least one space, followed by N, S, E, or W and optionally zero to five trailing characters, then an optional space, and finally at least one letter or number for the street name."
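To see the pattern at work before touching the real data, we can try it on a few invented addresses (str_match() comes from the stringr package):

```r
library(stringr)

pat1 <- "^\\d+\\s+[NSEW].{0,5}\\s*\\w+"

# Invented addresses showing what the pattern keeps and drops:
str_match("12 N. Main Street Apt 4", regex(pat1, ignore_case = TRUE))[1]  # "12 N. Main"
str_match("45 West Elm Road",        regex(pat1, ignore_case = TRUE))[1]  # "45 West Elm"
str_match("600 Banana Ave",          regex(pat1, ignore_case = TRUE))[1]  # NA: no NSEW
```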
Once our table is loaded into R as a tibble named myData, we can use pattern #1 with stringr's str_match() to check each address. When a string matches, str_match() returns only the portion that matches the pattern exactly, with no trailing information.
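If you want to follow along without the original file, a small mock tibble is enough (the addresses below are invented for illustration, not the article's real data):

```r
library(tibble)

# Hypothetical stand-in for the real weekly file (invented addresses):
myData <- tibble(
  Week = 1:6,
  Street_Address = c("12 N. Main Street",
                     "12 North Main St.",
                     "45 West Elm Road",
                     "45 W. Elm Rd",
                     "600 Banana Ave",
                     "600 Banana Avenue")
)
```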
str_match() returns a matrix: the first column holds the full match, and any further columns would hold capture-group matches (our pattern has none). We only want the full match, so we take index #1 using [1]. Wrapping the pattern in regex() with ignore_case = TRUE allows for variations in capitalization of the addresses. This reads, "For each row of the tibble, pull out the string match from each street address, using pattern #1 as a REGEX and ignoring capitalization. Give me only the 1st element".
library(stringr)  # str_match() and regex() come from the stringr package

for(i in 1:16){
  print(str_match(myData$Street_Address[i], regex(pat1, ignore_case = TRUE))[1])
}
Now we have some addresses that look like this (NA for addresses without NSEW):
But we are not quite ready to group them just yet. First we need to reformat the NSEW pieces. Using the pipe "%>%" we can send our matches directly into gsub(), which will replace text for us. The way to use gsub() is:
gsub("pattern", "replacement", x = yourData, ignore.case = TRUE)
Adding this to the str_match() loop looks like this:
for(i in 1:16){
  y <- str_match(myData$Street_Address[i], regex(pat1, ignore_case = TRUE))[1] %>%
    gsub("N\\s|N\\.\\s|North\\s", "North ", x = ., ignore.case = TRUE) %>%
    gsub("S\\s|S\\.\\s|South\\s", "South ", x = ., ignore.case = TRUE) %>%
    gsub("E\\s|E\\.\\s|East\\s",  "East ",  x = ., ignore.case = TRUE) %>%
    gsub("W\\s|W\\.\\s|West\\s",  "West ",  x = ., ignore.case = TRUE)
  print(y)
}
And results in:
If an address does not contain NSEW, simplify it in a different way.
For addresses without a match to pattern #1, we create pattern #2.
pat2 <- "^\\d+\\s+\\w+"
Use what you learned from making the first pattern to read this one in plain English. You will find that it says roughly, "Starting at the beginning of the string, look for at least one number, followed by at least one space, and finally at least one letter or number for the street name."
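Trying pattern #2 on a couple of invented strings confirms the reading:

```r
library(stringr)

pat2 <- "^\\d+\\s+\\w+"

str_match("600 Banana Ave", pat2)[1]  # "600 Banana"
str_match("Banana Ave 600", pat2)[1]  # NA: the digits are not at the start
```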
If we put the two patterns together, we can make a loop that looks first for pattern #1, and if not found will look for pattern #2. For pattern #1 matches, the NSEW characters are reformatted. The simple address is saved for each row.
myData$Address_Simple <- NA_character_  # Initialize empty column

for(i in 1:dim(myData)[1]){  # for each row
  if(!is.na(myData$Street_Address[i])){  # if the address field is not NA
    myData$Address_Simple[i] <- ifelse(
      # Check for an NSEW-type match (a match result that is not NA)
      !is.na(str_match(myData$Street_Address[i], regex(pat1, ignore_case = TRUE))[1]),
      # If found, grab the matched text and pipe it into the reformatting code
      str_match(myData$Street_Address[i], regex(pat1, ignore_case = TRUE))[1] %>%
        gsub("N\\s|N\\.\\s|North\\s", "North ", x = ., ignore.case = TRUE) %>%
        gsub("S\\s|S\\.\\s|South\\s", "South ", x = ., ignore.case = TRUE) %>%
        gsub("E\\s|E\\.\\s|East\\s",  "East ",  x = ., ignore.case = TRUE) %>%
        gsub("W\\s|W\\.\\s|West\\s",  "West ",  x = ., ignore.case = TRUE),
      # If no NSEW match, grab the simplified version using pattern #2
      str_match(myData$Street_Address[i], regex(pat2, ignore_case = TRUE))[1]
    )
  }  # rows with no address keep NA for the simple address
}
Now we have a completed list of simplified addresses!
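As an aside, the same logic can be written without an explicit loop, since str_match() is vectorized; this sketch is an alternative to the article's loop, not a replacement for it, and uses invented example addresses:

```r
library(stringr)
library(dplyr)

pat1 <- "^\\d+\\s+[NSEW].{0,5}\\s*\\w+"
pat2 <- "^\\d+\\s+\\w+"

# Invented example addresses, standing in for myData$Street_Address:
addresses <- c("12 N. Main Street", "600 Banana Ave")

# str_match() is vectorized: column 1 holds the full match for every address
m1 <- str_match(addresses, regex(pat1, ignore_case = TRUE))[, 1]
m2 <- str_match(addresses, regex(pat2, ignore_case = TRUE))[, 1]

# coalesce() keeps the pattern #1 match where it exists, else the pattern #2 match
simple <- coalesce(m1, m2) %>%
  gsub("N\\s|N\\.\\s|North\\s", "North ", x = ., ignore.case = TRUE) %>%
  gsub("S\\s|S\\.\\s|South\\s", "South ", x = ., ignore.case = TRUE) %>%
  gsub("E\\s|E\\.\\s|East\\s",  "East ",  x = ., ignore.case = TRUE) %>%
  gsub("W\\s|W\\.\\s|West\\s",  "West ",  x = ., ignore.case = TRUE)

simple  # "12 North Main" "600 Banana"
```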
Group the cleaned addresses.
Using tidyverse grouping commands, we can group the addresses by the lowercase version of their simplified address (to ignore capitalization), and count the repeats:
hotspots <- myData %>%
  group_by(tolower(Address_Simple)) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
This assigns the resulting count table to a variable named hotspots, and reads, "Take myData, group the rows by the lowercase version of the Address_Simple column, and summarize with a count of the number of occurrences. Arrange the resulting table in descending order."
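On a tiny invented table, the grouping step looks like this (here the grouping column is named Address for readability, an optional tweak):

```r
library(dplyr)

# Toy cleaned table (invented values) standing in for myData:
myData <- tibble::tibble(
  Address_Simple = c("12 North Main", "12 north main", "600 Banana")
)

hotspots <- myData %>%
  group_by(Address = tolower(Address_Simple)) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

hotspots  # "12 north main" counted twice, "600 banana" once
```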
The final result is a nice, clean table to show your boss the number of times each location appeared in the raw dataset: