Address Cleansing, Fuzzy Matching and Levenshtein Distance
Akhil Pathirippilly mana
Data Engineering / Data Warehousing / Azure / AWS /CDP/Databricks
Address cleansing and standardization always gets tricky and complicated when sources provide address information in inconsistent or free-form representations.
This makes the standardization/normalization solution more complex. When it comes to Python, there are several open-source packages available in the PyPI repo that implement different NLP/data-science approaches for standardization purposes; usaddress, which is used later in this article, is one example.
But still, none of these packages is 100% dependable, and they can produce false positives many times. Most of them are also limited by the geographical reference/lookup datasets they rely on as a source of truth. Even so, it is common to use these packages to support some decision making while normalizing address data, with the help of additional static lookup files.
In one of my previous organizations, we used pycountry with fuzzy matching to support our address standardization process. For the fuzzy matching, we used the fuzzywuzzy library. We checked a misspelled or improperly represented State/Province/Country name from the source against the standard values in pycountry (plus extra standard values in our static lookup files) and rewrote the value if fuzzy matching found an 80% or better match.
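To make that rewrite rule concrete, here is a minimal, self-contained sketch. The real process used pycountry and fuzzywuzzy; to keep this snippet runnable with only the standard library, it substitutes difflib.SequenceMatcher for fuzzywuzzy's ratio and a small hypothetical lookup list for the pycountry country names, while keeping the same 80%-match decision:

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the standard values that would normally come
# from pycountry plus the static lookup files.
STANDARD_NAMES = ["United States", "United Kingdom", "Germany", "India"]

def best_match(raw: str, choices: list[str]) -> tuple[str, float]:
    """Return the choice most similar to `raw` and its similarity ratio (0..1)."""
    scored = [(name, SequenceMatcher(None, raw.lower(), name.lower()).ratio())
              for name in choices]
    return max(scored, key=lambda pair: pair[1])

def standardize(raw: str, threshold: float = 0.80) -> str:
    """Rewrite `raw` with the standard value if it matches at >= 80% similarity."""
    name, score = best_match(raw, STANDARD_NAMES)
    return name if score >= threshold else raw
```

For example, `standardize("Untied States")` returns `"United States"`, while a value that matches nothing above the threshold is passed through unchanged for downstream handling.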
fuzzywuzzy is built on the Levenshtein distance algorithm (aka edit distance), which is a measure of the similarity between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. The Levenshtein distance algorithm is widely used in text processing, including spelling correction, natural language processing, and bioinformatics.
For example, the Levenshtein distance between the words "kitten" and "sitting" is 3, because three edits are required to change one into the other. The algorithm has a time complexity of O(n*m), where n and m are the lengths of the two strings being compared, and is typically implemented with a dynamic programming approach.
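As a sketch of that dynamic programming approach: conceptually we fill an (n+1) x (m+1) table where cell (i, j) holds the edit distance between the first i characters of one string and the first j characters of the other, keeping only one row at a time:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b` (O(n*m) time)."""
    n, m = len(a), len(b)
    prev = list(range(m + 1))      # row i = 0: distance from "" to b[:j] is j
    for i in range(1, n + 1):
        curr = [i] + [0] * m       # distance from a[:i] to "" is i
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # substitute (or keep) a[i-1]
        prev = curr
    return prev[m]

print(levenshtein("kitten", "sitting"))  # -> 3
```

This is exactly the "kitten" -> "sitting" example above: substitute k->s, substitute e->i, and insert g, for a distance of 3.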
Below is a very basic example showing how we can utilize pycountry + fuzzywuzzy + usaddress together. Please note that in real implementations this will become more complex, with additional lookups and refinement.
Single-pass approach