Address Cleansing, Fuzzy Matching and Levenshtein Distance
Akhil Pathirippilly mana
Data Engineering / Data Warehousing / Azure / AWS /CDP/Databricks
Address cleansing and standardization always gets tricky and complicated when sources provide address information in inconsistent or free-form representations.
This makes the standardization/normalization solution more complex. When it comes to Python, there are several open-source packages available in the PyPI repo that implement different NLP/data-science approaches for standardization purposes; usaddress, which is used later in this article, is one example.
But still, none of these packages is 100% dependable, and they can produce false positives many times. Most of them are also limited by the geographical reference/lookup datasets they rely on as a source of truth. Even so, it is common to use these packages to support some decision making while normalizing address data, with the help of additional static lookup files.
In one of my previous organizations, we used pycountry with fuzzy matching to support our address standardization process. For the fuzzy matching, we used the fuzzywuzzy library. We checked a misspelled or improperly represented State/Province/Country name from the source against the standard values in pycountry (plus extra standard values in our static lookup files) and rewrote the value if fuzzy matching found an 80% or better match.
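To make that rewrite rule concrete, here is a minimal, self-contained sketch. The real process used pycountry and fuzzywuzzy; to keep this snippet runnable with only the standard library, it substitutes difflib.SequenceMatcher for fuzzywuzzy's ratio and a small hypothetical lookup list for the pycountry country names, while keeping the same 80%-match decision:

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the standard values that would normally come
# from pycountry plus the static lookup files.
STANDARD_NAMES = ["United States", "United Kingdom", "Germany", "India"]

def best_match(raw: str, choices: list[str]) -> tuple[str, float]:
    """Return the choice most similar to `raw` and its similarity ratio (0..1)."""
    scored = [(name, SequenceMatcher(None, raw.lower(), name.lower()).ratio())
              for name in choices]
    return max(scored, key=lambda pair: pair[1])

def standardize(raw: str, threshold: float = 0.80) -> str:
    """Rewrite `raw` with the standard value if it matches at >= 80% similarity."""
    name, score = best_match(raw, STANDARD_NAMES)
    return name if score >= threshold else raw
```

For example, `standardize("Untied States")` returns `"United States"`, while a value that matches nothing above the threshold is passed through unchanged for downstream handling.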
fuzzywuzzy is built on the Levenshtein distance algorithm (aka edit distance), which is a measure of the similarity between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The Levenshtein distance algorithm is widely used in text processing, including spelling correction, natural language processing, and bioinformatics.
For example, the Levenshtein distance between the words "kitten" and "sitting" is 3, because three edits are required to change one into the other. The algorithm has a time complexity of O(n*m), where n and m are the lengths of the two strings being compared, and is typically implemented with a dynamic programming approach.
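As a sketch of that dynamic programming approach: conceptually we fill an (n+1) x (m+1) table where cell (i, j) holds the edit distance between the first i characters of one string and the first j characters of the other, keeping only one row at a time:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b` (O(n*m) time)."""
    n, m = len(a), len(b)
    prev = list(range(m + 1))      # row i = 0: distance from "" to b[:j] is j
    for i in range(1, n + 1):
        curr = [i] + [0] * m       # distance from a[:i] to "" is i
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # substitute (or keep) a[i-1]
        prev = curr
    return prev[m]

print(levenshtein("kitten", "sitting"))  # -> 3
```

This is exactly the "kitten" -> "sitting" example above: substitute k->s, substitute e->i, and insert g, for a distance of 3.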
Below is a very basic example showing how we can utilize pycountry + fuzzywuzzy + usaddress together. Please note that in real implementations this will become more complex, with additional lookups and refinement.
Single-pass approach