Standardize and clean those phone numbers using the new CleanPhoneNumbers R package!
@mike_meyers via Unsplash

Have some dirty phone numbers in your data? This package can help!

THE TASK

Many data analysts will encounter projects involving phone numbers at some point in their career. This might mean:

  • Grouping records by phone number
  • Using the phone number field to deduplicate data
  • Creating a list of valid phone numbers to be contacted
  • And more...


THE COMPLICATIONS

But what if your data collection had no standard for phone number entry? When phone numbers are collected in free text format, you might end up with a variety of issues:

  • Parentheses, dashes, or periods used inconsistently
  • Letters or complete words being entered
  • Incomplete phone numbers
  • Numbers with too many digits
  • Repeated digits entered when the phone number is unknown


THE SOLUTION

The new R package CleanPhoneNumbers has a function, clean_numbers(), that takes care of all of this for you!

What does it do?

  • Removes any non-digits
  • Checks that each number either has 10 digits, or starts with a country code and has 11 digits
  • Checks for missing or empty numbers
  • Ensures that digits are not repeating. This happens when entry clerks press the same key repeatedly to fill in a missing number with a dummy value, so the code discards any phone number that does not have at least 4 unique digits
  • Looks for "123456789" in sequence, which indicates a dummy number
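To make those checks concrete, here is a rough base R sketch of the same rules. This is not the package's source code: sketch_clean_numbers() is a hypothetical name, and details such as leaving the leading country code in place are my assumptions, so the real clean_numbers() may behave differently.

```r
# Hypothetical re-implementation of the checks listed above.
# NOT the CleanPhoneNumbers source; names and details are assumptions.
sketch_clean_numbers <- function(phone, country = 1) {
  # Strip everything that is not a digit (NA inputs stay NA)
  digits <- gsub("[^0-9]", "", as.character(phone))
  n <- nchar(digits)
  country <- as.character(country)

  # Valid length: 10 digits, or 11 digits that start with the country code
  has_code <- !is.na(digits) & n == 11 & startsWith(digits, country)
  ok_length <- !is.na(digits) & (n == 10 | has_code)

  # Count the distinct digits in each number
  n_unique <- vapply(
    strsplit(digits, ""),
    function(d) length(unique(d)),
    integer(1)
  )

  # Flag the sequential dummy pattern
  is_dummy <- grepl("123456789", digits, fixed = TRUE)

  # Keep numbers that pass every check; everything else becomes NA
  keep <- ok_length & n_unique >= 4 & !is_dummy
  ifelse(keep, digits, NA_character_)
}
```

Under these assumptions, sketch_clean_numbers(c("(555) 867-5309", "555-1234", "9999999999")) returns "5558675309", NA, NA: the first entry is reduced to its 10 digits, the second is too short, and the third has only one unique digit.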

How is it used?

Simply supply your column or vector of dirty numbers and your preferred country code. clean_numbers() will return a vector of the same length containing the cleaned numbers; any entry that fails the checks is set to NA.

Let's take a look!

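First, we need to install the package from GitHub using the remotes package:

# Install remotes from CRAN (only needed once)
install.packages("remotes")
library(remotes)
# Install CleanPhoneNumbers from the CleanPhoneNumbers subdirectory of the bell-samantha/Packages repository
remotes::install_github("bell-samantha/Packages/CleanPhoneNumbers")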

You will then need data in R which contains phone numbers. Let's try this example:

[Image: example data frame with a dirty_phone column]
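To follow along without the original data, a small made-up data frame can stand in. Every value below is invented for illustration, and the id column is just a hypothetical record identifier; only the names myNum and dirty_phone come from this article.

```r
# Hypothetical sample data; all values are invented for illustration
myNum <- data.frame(
  id = 1:6,
  dirty_phone = c(
    "(555) 867-5309",   # parentheses, space, and dash
    "1.555.234.9876",   # periods plus a leading country code
    "555-1234",         # incomplete number
    "9999999999",       # one key repeated
    "call after 5pm",   # free text instead of a number
    NA                  # missing entry
  ),
  stringsAsFactors = FALSE
)

# Inspect the messy column
print(myNum)
```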


If this dataframe is loaded into R as "myNum", we can run the "dirty_phone" column through the clean_numbers() function to standardize and filter our messy data:


CleanPhoneNumbers::clean_numbers(phone = myNum$dirty_phone, country = 1)

  • phone is the vector of phone numbers you want to clean
  • country is the country code that can appear as the first digit of numbers in your area

Assigning the results to a new column is as easy as this:

myNum$clean_phone <-
   CleanPhoneNumbers::clean_numbers(phone = myNum$dirty_phone, country = 1)        


THE RESULTS

Now we have a clean set of phone numbers!

[Image: the myNum data frame with the new clean_phone column]
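Once the clean column exists, the tasks from the start of this article take only a few lines of base R. The sketch below builds its own stand-in for the cleaned data, so the clean_phone values are invented rather than real clean_numbers() output.

```r
# Hypothetical cleaned data; in practice clean_phone would come from clean_numbers()
myNum <- data.frame(
  id = 1:5,
  clean_phone = c("5558675309", "15552349876", NA, "5558675309", NA),
  stringsAsFactors = FALSE
)

# Keep only the rows that have a valid number
valid <- myNum[!is.na(myNum$clean_phone), ]

# Group: count how many records share each number
counts <- table(valid$clean_phone)

# Deduplicate: keep the first record for each number
deduped <- valid[!duplicated(valid$clean_phone), ]

# Contact list: one entry per distinct valid number
contact_list <- unique(valid$clean_phone)
```

Note that the comparison is on exact strings, so a number stored with its country code and the same number stored without it would count as two different numbers here.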

Have fun using this simple method to clean and group your phone numbers :-)

HAPPY PROGRAMMING!

