How to Use ElasticSearch for Cleaning Customer Addresses
Photo by Oliver Hale on Unsplash


Plot

There are many types of customer data to clean. In this article, I’ll focus on the most common one: addresses. There are many reasons why you would need to clean your postal addresses. Most of the time, it is because they are not consistent across different data sources, which makes it hard to merge your customer databases and send the right advertising messages to your customers. Here I show how to use ElasticSearch to solve this problem quickly and efficiently [This text is an augmented version of text auto-generated via GPT-2].



Address cleansing is an old and common problem for almost every business dealing with customer addresses. Any platform that wants my home address typically shows me three input fields: the first line, the second line and the postcode. The first line is usually marked as “required”, so I have to fill something in; I will mostly skip the second line and, if there is no dedicated postcode field, write the postcode in whichever format I like. It’s funny how too much flexibility in an interface later leaves the business cleaning up the data mess created by unstructured user input.

This gets challenging when you have multiple source systems collecting customer addresses in all shapes and forms: a website, a mobile app, APIs, chatbots and so on.

It is also worth noting that this problem is massive for long-established businesses. The majority of these big sharks have decades of data points that require modern solutions to clean and transform them at scale, as well as to support the ongoing, incremental needs of the business.

Initiation

For now, let’s stick to UK-based addresses. To start the cleaning process, a valid source of addresses is required to ensure the accuracy of the customer address. Royal Mail provides an up-to-date export of UK addresses as the Postcode Address File (PAF), which is a reliable source to cleanse against.

Now, to get this exercise up and running reliably, a lookup mechanism is required: finding the best match for a single record against 40+ million PAF addresses, and repeating this for over a million customer addresses.

The angels gave me the idea to choose ElasticSearch, a reliable and trusted search engine. ElasticSearch is a full-text search engine built on top of Apache Lucene, and it has all the features you would expect: relevance ranking, hit highlighting, faceted search and a scroll-based query interface.
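To make this concrete, here is a minimal sketch of how a messy customer address could be matched against an index of PAF records, using the official Python client (8.x-style keyword arguments). The index name paf-addresses, the full_address field and the example input are illustrative assumptions, not details from the article.

```python
# pip install elasticsearch
from elasticsearch import Elasticsearch

# Hypothetical cluster endpoint and index name, used for illustration only.
es = Elasticsearch("http://localhost:9200")
INDEX = "paf-addresses"

def best_paf_match(raw_address: str):
    """Return the highest-scoring PAF record for a messy input address."""
    response = es.search(
        index=INDEX,
        query={
            "match": {
                # Assumes each PAF record was indexed with a single
                # concatenated 'full_address' text field.
                "full_address": {
                    "query": raw_address,
                    "fuzziness": "AUTO",  # tolerate typos in user input
                    "operator": "or",
                }
            }
        },
        size=1,
    )
    hits = response["hits"]["hits"]
    return hits[0]["_source"] if hits else None

# Example: a badly formatted user entry
print(best_paf_match("10 downing st, london sw1a2aa"))
```

Relevance ranking does the heavy lifting here: the top hit is taken as the cleansed address, and the score can be used as a confidence threshold before accepting a match.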

A vector representation of the addresses was indexed across two distributed data nodes, with a master node orchestrating the searching. The solution is able to clean millions of addresses in a matter of minutes.
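As an illustration of that setup, below is a sketch of how such an index might be created with shards spread across the data nodes, and how PAF records could be bulk-loaded. The shard and replica counts, index name, field mapping and the load_paf_records() reader are assumptions for this example, not the article’s actual configuration.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
INDEX = "paf-addresses"

# Spread the index over the two data nodes; one replica for resilience.
# These numbers are illustrative, not the configuration from the article.
es.indices.create(
    index=INDEX,
    settings={"number_of_shards": 2, "number_of_replicas": 1},
    mappings={
        "properties": {
            "full_address": {"type": "text"},
            "postcode": {"type": "keyword"},
        }
    },
)

def paf_actions(records):
    """Yield bulk-index actions for an iterable of PAF records."""
    for rec in records:
        yield {
            "_index": INDEX,
            "_source": {
                "full_address": rec["full_address"],
                "postcode": rec["postcode"],
            },
        }

# 'load_paf_records()' is a hypothetical reader over the PAF export:
# helpers.bulk(es, paf_actions(load_paf_records()))
```

With the reference data indexed once, cleansing a million customer records reduces to a million lightweight search requests, which parallelise well across the cluster.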

Moral

The moral of the story: do not reinvent the wheel!
