Record Linkage - Join based on String Similarity
Record linkage is the act of connecting disparate sources of information about a single entity. It can be compared to a normal join operation; however, unlike a join, record linkage does not require exact matches on common columns between the two datasets; instead, it uses string similarity to find near matches. This makes it useful for evaluating and mapping inconsistent data sources.
The record linkage operation consists of 5 steps:
a) Load and clean data
b) Generate index pairs
c) Configure the Compare object
d) Score the pairs
e) Link the data sources
All the above steps can be achieved by using Python's 'recordlinkage' package.
This concept will be illustrated with the restaurants dataset, which comes in a clean and a dirty (incremental) version. We can consider the clean version to be the production data and the dirty version to be the raw file that gets dumped daily on the organization's FTP server. Our task is to load the data from the new file while ignoring the duplicates, which are not obvious at first glance.
a. Load and explore the datasets: As the code snippet below shows, both data frames have the same structure.
Datasets:
a. Clean: https://assets.datacamp.com/production/repositories/5737/datasets/e993edc10bc1f1ae458a560a9c41617bc2e259ae/restaurants_L2.csv
b. Dirty: https://assets.datacamp.com/production/repositories/5737/datasets/b2d11cce6759875e60d27b5a7062747b7ada77d0/restaurants_L2_dirty.csv
import pandas as pd

# restaurant_csv and restaurant_dirty_csv hold the clean and dirty CSV locations listed above
def load_data():
    restaurants = pd.read_csv(restaurant_csv, index_col=0)
    restaurants_dirty = pd.read_csv(restaurant_dirty_csv, index_col=0)
    return restaurants, restaurants_dirty

Explore:
>restaurants.info()
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    336 non-null    object
 1   addr    336 non-null    object
 2   city    336 non-null    object
 3   phone   336 non-null    int64
 4   type    336 non-null    object
dtypes: int64(1), object(4)
memory usage: 15.8+ KB

>restaurants_dirty.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 82 entries, 0 to 81
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    82 non-null     object
 1   addr    82 non-null     object
 2   city    82 non-null     object
 3   phone   82 non-null     int64
 4   type    82 non-null     object
dtypes: int64(1), object(4)
memory usage: 3.8+ KB
The output below indicates that there are no exactly matching restaurant names between the two datasets.
>restaurants[restaurants.name.isin(restaurants_dirty.name)]
Out[15]:
Empty DataFrame
Columns: [name, addr, city, phone, type]
Index: []
However, after inspecting the restaurant names, it becomes clear that the same restaurants are indeed present in both datasets, just with misspelled values such as the name in this case. Manually identifying and replacing such data would be a tedious, brute-force exercise, since we have no primary key such as restaurant_id to identify each restaurant uniquely.
>restaurants.name[0]
Out[23]: "arnie morton's of chicago"
>restaurants_dirty.name[40]
Out[24]: "arni morton's of chicago"
Therefore, string similarity, rather than exact matching, should work here to identify the same records. For example, the string similarity score for the two strings above is 98/100.
>from fuzzywuzzy import fuzz
>fuzz.WRatio(restaurants.name[0], restaurants_dirty.name[40])
Out[26]: 98
Now we can compare strings across the records of both datasets and apply a threshold to identify the duplicates and ignore them, as in the sketch below.
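Before turning to recordlinkage, here is a minimal sketch of what that brute-force comparison could look like with fuzzywuzzy (the helper name and the 90-point cutoff are arbitrary choices for illustration):

from fuzzywuzzy import fuzz, process

def find_fuzzy_duplicates(restaurants, restaurants_dirty, min_score=90):
    # For every dirty name, find its closest clean name and keep near-exact matches
    clean_names = restaurants['name'].tolist()
    duplicates = []
    for dirty_idx, dirty_name in restaurants_dirty['name'].items():
        best_match, score = process.extractOne(dirty_name, clean_names, scorer=fuzz.WRatio)
        if score >= min_score:
            duplicates.append((dirty_idx, best_match, score))
    return duplicates

This compares every dirty name against every clean name, which is exactly the quadratic blow-up that the index pairs and blocking of the next step are designed to avoid.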
b. Generate Index Pairs: This is where recordlinkage comes in handy. It maps each record of the first dataset to every record of the second dataset. These mappings are called index pairs - they are generally multi-level indices, where the first level holds the row index of the first source and the second level holds the row index of the second source. They are created using recordlinkage's Index() object.
This works fine for small datasets, but on large datasets it can create millions, if not billions, of index pairs. To avoid this heavy operation and optimize index-pair generation, we should find a column that narrows down the pairing. This technique is called blocking. In this case, we can use the city column as the blocking column, since its value stays consistent across both datasets (if that were not the case, we would first have to normalize the blocking column's values).
import recordlinkage as rl

def generate_pairs(restaurants, restaurants_dirty):
    # Generate pairs, blocking on the city column
    indexer = rl.Index()
    indexer.block(['city'])
    pairs = indexer.index(restaurants, restaurants_dirty)
    return pairs
>print(pairs[:5])
MultiIndex([(0,  3),
            (0, 26),
            (0, 40),
            (0, 43),
            (0, 50)],
           )
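To see why blocking matters, here is a rough sketch comparing a full index with a blocked one; the pair count follows from the 336 and 82 rows shown earlier, and full() is recordlinkage's method for pairing every row of one frame with every row of the other:

full_indexer = rl.Index()
full_indexer.full()                 # pair every restaurants row with every restaurants_dirty row
full_pairs = full_indexer.index(restaurants, restaurants_dirty)

blocked_indexer = rl.Index()
blocked_indexer.block(['city'])     # only pair rows that share the same city
blocked_pairs = blocked_indexer.index(restaurants, restaurants_dirty)

print(len(full_pairs))              # 336 * 82 = 27,552 candidate pairs
print(len(blocked_pairs))           # far fewer, since only same-city rows are paired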
c. Configure the Compare Object: Now that we have the index pairs, we can start comparing them. To do this, we configure recordlinkage's Compare object. Two kinds of comparison are configured here: exact matching and string similarity. We use the city column for exact matches, and the restaurant type, address, and restaurant name for string similarity matches - the threshold of 0.8 means a pair only counts as a match on that column when its similarity score is above 0.8.
def configure_compare_object():
    compare_cl = rl.Compare()
    compare_cl.exact('city', 'city', label='city')
    compare_cl.string('type', 'type', label='cuisine_type', threshold=0.8)
    compare_cl.string('addr', 'addr', label='address', threshold=0.8)
    compare_cl.string('name', 'name', label='restaurant_name', threshold=0.8)
    return compare_cl
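As a side note, Compare.string() is not limited to a single similarity algorithm; recordlinkage accepts a method argument (for example 'jarowinkler' or 'levenshtein'). A hedged sketch of an alternative configuration, with the same column names as above and thresholds chosen only for illustration:

compare_alt = rl.Compare()
compare_alt.exact('city', 'city', label='city')
compare_alt.string('name', 'name', method='jarowinkler', threshold=0.85, label='restaurant_name')
compare_alt.string('addr', 'addr', method='levenshtein', threshold=0.8, label='address')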
d. Score the pairs to find the potential matches: To score the index pairs, we call compute() on the Compare object. The result contains, for every pair, a 1 (match) or 0 (no match) for each column configured on the Compare object - city, cuisine type, address, and restaurant name. For example, index 0 of the restaurants data frame and index 50 of the restaurants_dirty data frame match on 2 of the 4 columns.
# 3. Score and fetch the potential matches
pot_matches = compare_cl.compute(pairs, restaurants, restaurants_dirty)

>print(pot_matches[:5])
      city  cuisine_type  address  restaurant_name
0 3      1           0.0      0.0              0.0
  26     1           0.0      0.0              0.0
  40     1           1.0      1.0              1.0
  43     1           1.0      0.0              0.0
  50     1           1.0      0.0              0.0
Now, to decide whether a row is a duplicate, we can say that a pair is a duplicate if at least 3 of its 4 column scores are matches. To perform this filtering, we take a row-wise sum and keep the pairs whose sum is >= 3.
dup_rows = pot_matches[pot_matches.sum(axis=1) >= 3]
>print(dup_rows[:5])
      city  cuisine_type  address  restaurant_name
0 40     1           1.0      1.0              1.0
2 74     1           1.0      1.0              1.0
4 53     1           1.0      1.0              1.0
5 65     1           1.0      1.0              1.0
6 73     1           1.0      1.0              1.0

# Fetch the duplicate row index of restaurants_dirty - using get_level_values(1)
# because the restaurants_dirty index sits at the 2nd level of the multi-level index
dup_rows_index = dup_rows.index.get_level_values(1)

>print(dup_rows_index[:5])
Int64Index([40, 74, 53, 65, 73], dtype='int64')
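As a sanity check on the >= 3 cutoff, it can help to look at how the summed scores are distributed over all candidate pairs; a quick sketch:

score_counts = pot_matches.sum(axis=1).value_counts().sort_index()
print(score_counts)   # how many candidate pairs matched on 1, 2, 3, or 4 columns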
Note: Verifying duplicates - we can see that the duplicates have now been identified.
def verify_duplicates(dup_rows, restaurants, restaurants_dirty):
    # Verify the duplicate rows/matches - print the matched rows from both frames
    orig_dup_rows_index = dup_rows.index.get_level_values(0)
    dirty_dup_rows_index = dup_rows.index.get_level_values(1)
    print(restaurants[restaurants.index.isin(orig_dup_rows_index)][['name']])
    print(restaurants_dirty[restaurants_dirty.index.isin(dirty_dup_rows_index)][['name']])

name    arnie morton's of chicago
Name: 0, dtype: object
name    arni morton's of chicago
Name: 40, dtype: object
e. Link the Data Frames: Now that we have the duplicate row indices, the genuinely new restaurant rows can be isolated from the restaurants_dirty data frame and appended to the production restaurants data frame.
# Isolate the non-duplicate rows of the dirty data frame
dup_rows_index = dup_rows.index.get_level_values(1)
restaurants_new = restaurants_dirty[~restaurants_dirty.index.isin(dup_rows_index)]

def link_df(restaurants, restaurants_dirty, restaurants_new):
    # Link the data frames (pd.concat is used here since DataFrame.append was removed in pandas 2.0)
    full_restaurants = pd.concat([restaurants, restaurants_new])
    print(restaurants.shape, restaurants_new.shape)
    print(full_restaurants.shape)
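The multi-index of dup_rows can also be used to link each duplicate dirty row back to its production counterpart, for example to carry corrected values across. A small sketch (the dictionary name is hypothetical):

dirty_to_clean = dict(zip(dup_rows.index.get_level_values(1),
                          dup_rows.index.get_level_values(0)))
print(dirty_to_clean[40])   # -> 0, i.e. dirty row 40 maps to production row 0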
Although record linkage appears complex at first, it is quite intuitive once you understand why plain joins fail in the first place and how the minimum edit distance (Levenshtein) algorithm behind the string comparison works. The 5-step record linkage process can be used to merge or join two data frames that hold similar values for the same entities.
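For intuition, the minimum edit distance counts the fewest insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch (not the optimized python-Levenshtein implementation):

def edit_distance(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                             # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                             # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(a)][len(b)]

print(edit_distance("arnie morton's of chicago", "arni morton's of chicago"))   # 1, a single deleted 'e'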
It would be greatly appreciated if you could provide some constructive criticism. Thank you for reading this article.
Please refer to the complete code below for reference.
import pandas as pd
import recordlinkage as rl


# Step 0
def load_data():
    restaurant_csv = "https://assets.datacamp.com/production/repositories/5737/datasets/e993edc10bc1f1ae458a560a9c41617bc2e259ae/restaurants_L2.csv"
    restaurant_dirty_csv = "https://assets.datacamp.com/production/repositories/5737/datasets/b2d11cce6759875e60d27b5a7062747b7ada77d0/restaurants_L2_dirty.csv"
    restaurants = pd.read_csv(restaurant_csv, index_col=0)
    restaurants_dirty = pd.read_csv(restaurant_dirty_csv, index_col=0)
    return restaurants, restaurants_dirty


# Step 1
def generate_pairs(restaurants, restaurants_dirty):
    # Generate pairs, blocking on the city column
    indexer = rl.Index()
    indexer.block(['city'])
    # indexer.block(['city', 'phone'])
    pairs = indexer.index(restaurants, restaurants_dirty)
    return pairs


# Step 2
def configure_compare_object():
    compare_cl = rl.Compare()
    compare_cl.exact('city', 'city', label='city')
    compare_cl.string('type', 'type', label='cuisine_type', threshold=0.8)
    compare_cl.string('addr', 'addr', label='address', threshold=0.8)
    compare_cl.string('name', 'name', label='restaurant_name', threshold=0.8)
    return compare_cl


# Step 4
def treat_duplicates(pot_matches, restaurants_dirty):
    # Find the duplicate pairs (at least 3 of the 4 columns match)
    dup_rows = pot_matches[pot_matches.sum(axis=1) >= 3]
    dup_rows_index = dup_rows.index.get_level_values(1)
    # Isolate the non-duplicate rows of the dirty data frame
    restaurants_new = restaurants_dirty[~restaurants_dirty.index.isin(dup_rows_index)]
    return dup_rows, restaurants_new


# Step 4.1
def verify_duplicates(dup_rows, restaurants, restaurants_dirty):
    # Verify the duplicate rows/matches - print the matched rows from both frames
    orig_dup_rows_index = dup_rows.index.get_level_values(0)
    dirty_dup_rows_index = dup_rows.index.get_level_values(1)
    print(restaurants[restaurants.index.isin(orig_dup_rows_index)][['name']])
    print(restaurants_dirty[restaurants_dirty.index.isin(dirty_dup_rows_index)][['name']])


# Step 5
def link_df(restaurants, restaurants_dirty, restaurants_new):
    # Link the data frames (pd.concat replaces the removed DataFrame.append)
    full_restaurants = pd.concat([restaurants, restaurants_new])
    print(restaurants.shape, restaurants_new.shape)
    print(full_restaurants.shape)


def run():
    # 0. Load the datasets
    restaurants, restaurants_dirty = load_data()
    # 1. Generate pairs
    pairs = generate_pairs(restaurants, restaurants_dirty)
    # 2. Configure the Compare object
    compare_cl = configure_compare_object()
    # 3. Score the pairs and fetch the potential matches
    pot_matches = compare_cl.compute(pairs, restaurants, restaurants_dirty)
    # 4. Treat duplicates and return the unique rows of the dirty data frame
    dup_rows, restaurants_new = treat_duplicates(pot_matches, restaurants_dirty)
    # verify_duplicates(dup_rows, restaurants, restaurants_dirty)
    # 5. Link the new data frame
    link_df(restaurants, restaurants_dirty, restaurants_new)


run()