How to Use ElasticSearch for Cleaning Customer Addresses
Navdeep Dhuti
Product Engineering & Innovation Manager @ Carbon Underwriting | Entrepreneur | UK Delegate: G20 YEA Summit
Plot
There are many types of customer data to clean. In this article, I’ll focus on the most common one: addresses. There are many reasons why you would need to clean your postal addresses, but most of the time it is because they are not consistent across different data sources. This makes it hard to merge your customer databases and send the right advertising messages to your customers. Here I show how to use ElasticSearch to solve this problem quickly and efficiently [This text is the augmented version of the auto-generated text via gpt-2].
Address cleansing is an old and common problem for almost every business dealing with customer addresses. Any platform interested in my home address will often show me 3 input fields asking for the first line, the second line and the postcode of my address. Often only the first line is marked as “required”, so I have to fill something in there. Mostly, I will skip the second line and, if there is no dedicated postcode field, put the postcode in wherever and however I like. It’s funny how too much flexibility on an interface later leaves the business cleaning up the data mess created by unstructured user inputs.
This gets challenging when you have multiple source systems, such as a website, mobile app, APIs and chatbots, all collecting customer addresses in all shapes and forms.
Also to note, this problem is massive for old businesses. The majority of these big sharks have decades of data points that require modern solutions to clean and transform them at scale, as well as to support the ongoing, incremental needs of the business.
Initiation
For now, let’s stick to UK-based addresses. To start the cleaning process, a trusted reference source of addresses is required, so that every customer address can be checked against a known-valid one. Royal Mail maintains exactly that: the Postcode Address File (PAF), a regularly updated database of UK delivery addresses. This is a reliable source for cleansing.
Now, to get this exercise up and running in a reliable fashion, a lookup mechanism is required, i.e. one that looks up the best match for a single record against 40+ million PAF addresses and repeats that for over a million customer addresses.
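To make the lookup idea concrete, here is a minimal sketch of how one messy customer record could be turned into a search request. It is not the exact implementation used in this exercise: the field name `full_address`, the helper names and the choice of a fuzzy `match` query are all assumptions made for illustration. The code only builds the request body, so it runs without a cluster.

```python
import re

# Hypothetical field name; a real PAF export has its own column layout.
ADDRESS_FIELD = "full_address"


def normalise_postcode(raw: str) -> str:
    """Upper-case a UK postcode and put one space before the 3-character inward code."""
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    if len(compact) < 5:
        return compact  # too short to be a full postcode; leave it untouched
    return f"{compact[:-3]} {compact[-3:]}"


def normalise_address(*parts: str) -> str:
    """Join free-text address lines into one lower-cased, whitespace-collapsed string."""
    joined = " ".join(p for p in parts if p)
    joined = re.sub(r"[^\w\s]", " ", joined)  # punctuation becomes a space
    return re.sub(r"\s+", " ", joined).strip().lower()


def build_lookup_query(line1: str, line2: str, postcode: str, size: int = 1) -> dict:
    """Build a search body asking for the best-scoring reference address(es)."""
    text = normalise_address(line1, line2, normalise_postcode(postcode))
    return {
        "size": size,  # assumption: keep only the top hit per customer record
        "query": {
            "match": {
                ADDRESS_FIELD: {
                    "query": text,
                    "fuzziness": "AUTO",  # assumption: tolerate small typos
                }
            }
        },
    }


body = build_lookup_query("10, Downing St.", "", "sw1a1aa")
```

The resulting `body` can be sent to the search endpoint of whichever index holds the reference addresses. Whether the top hit is good enough to accept automatically is a separate decision, usually made with a score threshold tuned on a sample of known matches.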
The angels gave the idea to choose ElasticSearch! It is a reliable, trusted and powerful full-text search engine built on top of Apache Lucene. It has all the features you would expect in a search engine, like relevance ranking, hit highlighting, faceted search and a scroll-based query interface.
A vector representation of the addresses was indexed onto 2 distributed nodes, and a master node was used to orchestrate the searching. The solution was able to clean millions of addresses in a matter of minutes.
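As a rough illustration of the indexing step, the sketch below prepares index settings and a bulk-load payload for reference addresses. I am not reproducing the vector representation here, so this version indexes plain text instead, which is a simplification. The index name `paf_addresses`, the field names, and the choice of two primary shards (to mirror the two data nodes) are assumptions for illustration only.

```python
import json

INDEX_NAME = "paf_addresses"  # hypothetical index name

# Assumed settings: two primary shards, one per data node, no replicas.
INDEX_SETTINGS = {
    "settings": {"number_of_shards": 2, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "full_address": {"type": "text"},   # analysed for full-text matching
            "postcode": {"type": "keyword"},    # exact-match filtering
        }
    },
}


def bulk_payload(records, index=INDEX_NAME):
    """Render (id, document) pairs as NDJSON for the _bulk endpoint.

    Each document takes two lines: an action line, then the document itself.
    """
    lines = []
    for doc_id, record in records:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


# Made-up sample rows standing in for reference addresses.
sample = [
    ("1", {"full_address": "10 downing street london", "postcode": "SW1A 2AA"}),
    ("2", {"full_address": "221b baker street london", "postcode": "NW1 6XE"}),
]
payload = bulk_payload(sample)
```

With tens of millions of reference rows, the payload would be generated and sent in batches rather than as one string; the batch size is something to tune against the cluster.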
Moral
The moral of the story: do not reinvent the wheel!