UPDATE - Cleaning addresses (with a new package)
Samantha Bell
Veterinary Data Analysis | Dashboards & Reporting | LVT | E-commerce | Bioinformatics
If you have written code to simplify, clean, or geocode addresses, or to round their coordinates, this package may be the one for you!
You may have seen my previous article on getting started with simplifying street address fields prior to grouping them. Since then, I have worked on improving the code's functionality and flexibility, which resulted in the creation of a package I have named "cleanAddresses".
In addition to the function for simplifying street addresses, I chose to include a function to geocode addresses and a function to round the geocoded coordinates (to protect individuals' private information).
Let's take a look
The package can be installed from GitHub using the "remotes" package:
install.packages("remotes") # install remotes if needed
library(remotes)
remotes::install_github("bell-samantha/Packages/cleanAddresses")
library(cleanAddresses)
Once installed, you can access all three of the functions within cleanAddresses:
- simplify_street()
- add_coord()
- round_coord()
simplify_street()
This function takes in a character vector of address text and returns a same-length vector of simplified address text.
simplify_street(street, numWords)
myData$newField <- simplify_street(street = myData$rawStreetField, numWords = 2)
Each input address MUST start with a street number; City, State, and Zip fields are optional. Any City, State, or Zip will be cut off in the simplified version. The result can be applied directly as a new column in a dataset if desired.
- The parameter "street" takes in the vector of character street names (usually a column from a dataset)
- The parameter "numWords" takes in the number of full words the user would like to allow to follow the street number and direction.
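To make the idea concrete, here is a minimal base R sketch of this kind of simplification. It is not the code inside cleanAddresses: the function name simplify_street_sketch, the list of direction abbreviations, and the rule of dropping everything after the first comma are all assumptions made for illustration. It keeps the street number, an optional direction, and then up to numWords further words.

```r
# Hypothetical stand-in for illustration only; NOT the cleanAddresses implementation.
simplify_street_sketch <- function(street, numWords = 2) {
  # Assumed set of direction abbreviations that may follow the street number
  directions <- c("N", "S", "E", "W", "NE", "NW", "SE", "SW")

  vapply(street, function(addr) {
    if (is.na(addr)) return(NA_character_)

    # Drop anything after the first comma (e.g. city, state, zip), then split into words
    first_part <- trimws(sub(",.*$", "", addr))
    words <- strsplit(first_part, "\\s+")[[1]]
    if (length(words) == 0) return(NA_character_)

    # Always keep the street number; also keep a direction if one follows it
    keep <- 1
    if (length(words) >= 2 && toupper(gsub("\\.", "", words[2])) %in% directions) {
      keep <- 2
    }

    # Allow numWords further words after the number (and direction)
    end <- min(length(words), keep + numWords)
    paste(words[seq_len(end)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

simplify_street_sketch("123 N Main Street Apt 4, Lansing, MI 48933", numWords = 2)
# "123 N Main Street"
```

Cutting off trailing words like this means "123 N Main Street Apt 4" and "123 N Main Street" end up as the same text, which is what makes grouping easier afterwards.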
add_coord()
This function creates a simple address tibble that can be passed through censusxy::cxy_geocode() from the package "censusxy" to get x and y coordinates for each record.
This is best used after cleaning the street field with cleanAddresses::simplify_street(). The result can be joined directly into a dataset.
- The parameter "identifier" takes in the column containing unique record ids
- The parameter "street" takes in the column containing street name and number
- The parameter "city" takes in the column containing city name
- The parameter "state" takes in the column containing state name
- The parameter "zip" takes in the column containing zip codes
myCoordinates <- add_coord(
  street = myData$newField,
  city = myData$cityField,
  state = myData$stateField,
  zip = myData$zipField,
  identifier = myData$Id
)
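For readers curious about the geocoding step itself, here is a rough sketch of building a small address table and passing it to censusxy::cxy_geocode() directly. The example addresses, the column names, and the wrapper geocode_sketch are made up for illustration, and this is not necessarily how add_coord() is written internally. The geocoding call needs the censusxy package and an internet connection, so it is left commented out.

```r
# Hypothetical example data; addresses and column names are illustrative only.
# Zip codes are stored as text so that any leading zeros are preserved.
addresses <- data.frame(
  id     = c(1, 2),
  street = c("123 N Main Street", "456 Oak Ave"),
  city   = c("Lansing", "Lansing"),
  state  = c("MI", "MI"),
  zip    = c("48933", "48912"),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around censusxy::cxy_geocode(); column names are passed as strings
geocode_sketch <- function(address_table) {
  if (!requireNamespace("censusxy", quietly = TRUE)) {
    stop("The 'censusxy' package is required for geocoding.")
  }
  censusxy::cxy_geocode(
    address_table,
    id     = "id",
    street = "street",
    city   = "city",
    state  = "state",
    zip    = "zip",
    class  = "dataframe",
    output = "simple"
  )
}

# Requires internet access to reach the US Census Bureau geocoder:
# geocoded <- geocode_sketch(addresses)
```

Keeping a unique id column in the address table is what makes it possible to join the coordinates back to the original records afterwards.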
round_coord()
This function takes in two vectors of coordinate values: one for latitude and one for longitude. The result is a 4-column tibble with the same length and order as the input values. The new coordinates can be accessed in columns 3 and 4.
- The parameter "lattitude" takes in the vector of latitude values (usually a column from a dataset).
- The parameter "longitude" takes in the vector of longitude values (usually a column from a dataset).
- The parameter "distance" takes in the user's choice of the number of decimal places (in degrees) to round the coordinates to. Zero decimal places, or a whole degree, rounds to a precision of approximately 111 km. Each additional decimal place is 10 times more precise in distance.
# To round coordinates to an accuracy of approximately 1.11km.
# Returns 4 columns (two original, and two rounded)
round_coord(lattitude = myData$lat, longitude = myData$long, distance = 2)
# To get the rounded latitude column only
round_coord(lattitude = myData$lat, longitude = myData$long, distance = 2)[,4]
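The arithmetic behind the rounding can be sketched in a few lines of base R. The helper names round_coord_sketch and approx_precision_km, the output column names, and the sample coordinates are assumptions for illustration and are not taken from the package. The 111 km figure is the approximate length of one degree of latitude; a degree of longitude covers less distance the further you are from the equator.

```r
# Hypothetical helper for illustration only; not part of cleanAddresses.
# Returns the two original columns followed by the two rounded columns.
round_coord_sketch <- function(lat, long, decimals = 2) {
  data.frame(
    lat          = lat,
    long         = long,
    lat_rounded  = round(lat, decimals),
    long_rounded = round(long, decimals)
  )
}

# Approximate size of one rounding step in km, using ~111 km per degree of latitude
approx_precision_km <- function(decimals) 111 / 10^decimals

coords <- round_coord_sketch(
  lat      = c(42.73254, 42.96336),
  long     = c(-84.55553, -85.66809),
  decimals = 2
)
coords$lat_rounded      # 42.73 42.96
coords$long_rounded     # -84.56 -85.67
approx_precision_km(2)  # 1.11
```

Two decimal places is often a reasonable middle ground: points still land in the right neighborhood on a map, but can no longer be tied to a single household.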
I hope this helps!
Let me know if you find this package useful. Perhaps sharing my thoughts will inspire you to make some functions of your own!
Feedback and suggestions are much appreciated, as this is always a work in progress :-)