Address Cleansing, Fuzzy Matching and Levenshtein Distance

#python #dataengineering #standardization #datacleansing #fuzzy

Address cleansing and standardization always gets tricky and complicated when sources provide address information with:

  • Address lines without any order
  • Address lines with missing information
  • Misspelled address lines, especially Country or State

This makes the standardization/normalization solution more complex. In Python, several open-source packages implementing different NLP/data science approaches are available in the PyPI repo for standardization purposes. Some of them are:

  • pycountry: pycountry is a Python package that provides data for countries and subdivisions (e.g. states, provinces, territories) from the ISO 3166 standard. This package can be used to look up countries by various criteria such as name, alpha-2 code, or alpha-3 code, and also provides data for languages, states, provinces, etc. For example, you can convert a source record with the country name 'US' or 'USA' to 'United States' (see the lookup sketch after this list).
  • usaddress: This package uses machine learning to parse and standardize addresses in the United States.
  • pyap: This package uses a library of regular expressions to detect and parse addresses for several countries, including the United States, Canada, and the United Kingdom.
  • address-parser: This package uses a combination of heuristics and machine learning to parse and standardize addresses for several countries, including the United States, Canada, and the United Kingdom.
  • geopy: This package can be used to geocode addresses, which can be useful for address cleansing as it can verify that the address exists and is formatted correctly.
  • address: This package can be used to validate and standardize addresses, including postal codes, street names and house numbers, city, and country.
  • address-standardizer: This package can be used to standardize and validate addresses, including street names, house numbers, city, and postal codes.
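
As a quick, minimal sketch of the pycountry lookups described above (the 'US-WA' subdivision code is just an example value):

```python
# Minimal sketch of pycountry lookups against ISO 3166 data.
import pycountry

# Name, alpha-2 and alpha-3 codes all resolve to the same country record.
print(pycountry.countries.lookup("US").name)   # United States
print(pycountry.countries.lookup("USA").name)  # United States

# Subdivisions (states, provinces, territories) can be fetched by code.
print(pycountry.subdivisions.get(code="US-WA").name)  # Washington
```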


Still, none of these packages is 100% dependable, and they can often produce false-positive results. Most of them are also limited to the geographical information in the lookup datasets they rely on as a source of truth. Even so, it is common to use these packages to support decision making while normalizing address data, with the help of additional static lookup files.

In one of my previous organizations, we used pycountry with fuzzy matching to support our address standardization process. For the fuzzy matching, we used the fuzzywuzzy library. We check a misspelled or improperly represented State/Province/Country name from the source against the standard values in pycountry (or extra standard values in our static lookup files) and rewrite the value if fuzzy matching finds a match of 80% or better, as in the sketch below.
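
A minimal sketch of that idea, assuming a plain country-name rewrite with an 80% score cutoff (the standardize_country helper and the sample inputs are illustrative, not our production code):

```python
# A minimal sketch: fuzzy-match a messy country name against pycountry's
# ISO 3166 country names and rewrite it only when the score is >= 80.
import pycountry
from fuzzywuzzy import process

# Build the reference list of standard country names once.
COUNTRY_NAMES = [c.name for c in pycountry.countries]

def standardize_country(raw_value, score_cutoff=80):
    """Return the standardized name, or the raw value when no match
    clears the cutoff."""
    match = process.extractOne(raw_value, COUNTRY_NAMES,
                               score_cutoff=score_cutoff)
    return match[0] if match else raw_value

print(standardize_country("Untied States"))  # United States
print(standardize_country("Germny"))         # Germany
```

In a real pipeline, the static lookup file values would simply be appended to the reference list before matching.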

fuzzywuzzy uses the Levenshtein distance algorithm (aka edit distance), which is a measure of the similarity between two strings. It is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The Levenshtein distance algorithm is widely used in text processing, including spelling correction, natural language processing, and bioinformatics.

For example, the Levenshtein distance between the words "kitten" and "sitting" is 3, because three edits are required to change one into the other. The algorithm has a time complexity of O(n*m), where n and m are the lengths of the two strings being compared, and can be implemented with a dynamic programming approach, as in the sketch below.
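
To make that concrete, here is a minimal dynamic-programming implementation (the levenshtein function name is just for illustration), verified against the kitten/sitting example:

```python
# O(n*m) dynamic-programming Levenshtein distance, keeping only one
# row of the DP table at a time to save memory.
def levenshtein(a: str, b: str) -> int:
    n, m = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[m]

print(levenshtein("kitten", "sitting"))  # 3
```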

Below is a very basic example showing how we can utilize pycountry + fuzzywuzzy + usaddress together. Please note that, in real implementations, this will become more complex with additional lookups and refinement.
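
Since the original image example is not reproduced here, the following is a minimal sketch of the combination, assuming a single-line US address and an 80% score cutoff (the cleanse_us_address helper and the sample address are illustrative):

```python
# A very basic sketch combining the three libraries: usaddress parses the
# raw string into labeled components, then fuzzywuzzy matches the (possibly
# misspelled) state against pycountry's US subdivision names.
import pycountry
import usaddress
from fuzzywuzzy import process

# Standard US state/territory names from ISO 3166-2.
US_STATE_NAMES = [s.name for s in pycountry.subdivisions.get(country_code="US")]

def cleanse_us_address(raw_address, score_cutoff=80):
    # usaddress.tag returns an (OrderedDict of labeled parts, address type)
    # pair; it may raise RepeatedLabelError on ambiguous input.
    parsed, _ = usaddress.tag(raw_address)
    state = parsed.get("StateName")  # assumes the parser labels the state token
    if state:
        # Rewrite the state only when the fuzzy match clears the cutoff.
        match = process.extractOne(state, US_STATE_NAMES,
                                   score_cutoff=score_cutoff)
        if match:
            parsed["StateName"] = match[0]
    return dict(parsed)

print(cleanse_us_address("123 Main St, Seattle, Washingtn 98101"))
# e.g. {..., 'PlaceName': 'Seattle', 'StateName': 'Washington', 'ZipCode': '98101'}
```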

