Record Linkage - Join based on String Similarity
Record linkage is the act of connecting disparate sources of information about a single entity. It can be compared to a normal join operation; however, unlike a join, record linkage does not require exact matches on common columns between the two datasets; instead, it uses string similarity to find near matches. This makes it useful for evaluating and mapping inconsistent data sources.
The record linkage operation consists of 5 steps:
a) Load and clean data
b) Generate index pairs
c) Configure the Compare object
d) Score the pairs
e) Link the data sources
All the above steps can be achieved by using Python's 'recordlinkage' package.
This concept will be illustrated with the restaurants dataset, which comes in a clean and a dirty (incremental) version. We can consider the clean version to be the production data and the dirty version to be the raw file that gets dumped daily on the organization's FTP server. Our task is to load the data from the new file while ignoring the duplicates, which are not obvious at first glance.
a. Load and explore the datasets: As the code snippet below shows, both data frames have the same structure.
Datasets:
a. Clean: https://assets.datacamp.com/production/repositories/5737/datasets/e993edc10bc1f1ae458a560a9c41617bc2e259ae/restaurants_L2.csv
b. Dirty: https://assets.datacamp.com/production/repositories/5737/datasets/b2d11cce6759875e60d27b5a7062747b7ada77d0/restaurants_L2_dirty.csv
import pandas as pd

# restaurant_csv and restaurant_dirty_csv hold the clean and dirty CSV locations listed above
def load_data():
    restaurants = pd.read_csv(restaurant_csv, index_col=0)
    restaurants_dirty = pd.read_csv(restaurant_dirty_csv, index_col=0)
    return restaurants, restaurants_dirty

Explore:
>restaurants.info()
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    336 non-null    object
 1   addr    336 non-null    object
 2   city    336 non-null    object
 3   phone   336 non-null    int64
 4   type    336 non-null    object
dtypes: int64(1), object(4)
memory usage: 15.8+ KB

>restaurants_dirty.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 82 entries, 0 to 81
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    82 non-null     object
 1   addr    82 non-null     object
 2   city    82 non-null     object
 3   phone   82 non-null     int64
 4   type    82 non-null     object
dtypes: int64(1), object(4)
memory usage: 3.8+ KB
The output below indicates that there are no exactly matching restaurant names between the two datasets.
>restaurants[restaurants.name.isin(restaurants_dirty.name)]
Out[15]:
Empty DataFrame
Columns: [name, addr, city, phone, type]
Index: []
However, after inspecting the restaurant names, it becomes clear that the same restaurants are indeed present in both datasets, just with misspelled values such as the name in this case. Manually identifying and replacing such data would be a tedious, brute-force exercise, since we have no primary key such as restaurant_id to identify each restaurant uniquely.
>restaurants.name[0]
Out[23]: "arnie morton's of chicago"
>restaurants_dirty.name[40]
Out[24]: "arni morton's of chicago"
Therefore, string similarity, rather than exact matching, should work here to identify the same records. For example, the string similarity score for the two strings above is 98/100.
>from fuzzywuzzy import fuzz
>fuzz.WRatio(restaurants.name[0], restaurants_dirty.name[40])
Out[26]: 98
Now we can compare strings across the records of both datasets and apply a threshold to identify the duplicates and ignore them, as in the sketch below.
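Before turning to recordlinkage, here is a minimal sketch of what that brute-force comparison could look like with fuzzywuzzy (the helper name and the 90-point cutoff are arbitrary choices for illustration):

from fuzzywuzzy import fuzz, process

def find_fuzzy_duplicates(restaurants, restaurants_dirty, min_score=90):
    # For every dirty name, find its closest clean name and keep near-exact matches
    clean_names = restaurants['name'].tolist()
    duplicates = []
    for dirty_idx, dirty_name in restaurants_dirty['name'].items():
        best_match, score = process.extractOne(dirty_name, clean_names, scorer=fuzz.WRatio)
        if score >= min_score:
            duplicates.append((dirty_idx, best_match, score))
    return duplicates

This compares every dirty name against every clean name, which is exactly the quadratic blow-up that the index pairs and blocking of the next step are designed to avoid.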
b. Generate Index Pairs: This is where recordlinkage comes in handy. It maps each record of the first dataset to every record of the second dataset. These mappings are called index pairs - they are generally multi-level indices, where the first level holds the row index of the first source and the second level holds the row index of the second source. They are created using recordlinkage's Index() object.
This works fine for small datasets, but on large datasets it can create millions, if not billions, of index pairs. To avoid this heavy operation and optimize index-pair generation, we should find a column that narrows down the pairing. This technique is called blocking. In this case, we can use the city column as the blocking column, since its value stays consistent across both datasets (if that were not the case, we would first have to normalize the blocking column's values).
import recordlinkage as rl

def generate_pairs(restaurants, restaurants_dirty):
    # Generate pairs, blocking on the city column
    indexer = rl.Index()
    indexer.block(['city'])
    pairs = indexer.index(restaurants, restaurants_dirty)
    return pairs
>print(pairs[:5])
MultiIndex([(0,  3),
            (0, 26),
            (0, 40),
            (0, 43),
            (0, 50)],
           )
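To see why blocking matters, here is a rough sketch comparing a full index with a blocked one; the pair count follows from the 336 and 82 rows shown earlier, and full() is recordlinkage's method for pairing every row of one frame with every row of the other:

full_indexer = rl.Index()
full_indexer.full()                 # pair every restaurants row with every restaurants_dirty row
full_pairs = full_indexer.index(restaurants, restaurants_dirty)

blocked_indexer = rl.Index()
blocked_indexer.block(['city'])     # only pair rows that share the same city
blocked_pairs = blocked_indexer.index(restaurants, restaurants_dirty)

print(len(full_pairs))              # 336 * 82 = 27,552 candidate pairs
print(len(blocked_pairs))           # far fewer, since only same-city rows are paired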
c. Configure the Compare Object: Now that we have the index pairs, we can start comparing them. To do this, we configure recordlinkage's Compare object. Two kinds of comparison are configured here: exact matching and string similarity. We use the city column for exact matches, and the restaurant type, address, and restaurant name for string similarity matches - the threshold of 0.8 means a pair only counts as a match on that column when its similarity score is above 0.8.
def configure_compare_object():
    compare_cl = rl.Compare()
    compare_cl.exact('city', 'city', label='city')
    compare_cl.string('type', 'type', label='cuisine_type', threshold=0.8)
    compare_cl.string('addr', 'addr', label='address', threshold=0.8)
    compare_cl.string('name', 'name', label='restaurant_name', threshold=0.8)
    return compare_cl
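As a side note, Compare.string() is not limited to a single similarity algorithm; recordlinkage accepts a method argument (for example 'jarowinkler' or 'levenshtein'). A hedged sketch of an alternative configuration, with the same column names as above and thresholds chosen only for illustration:

compare_alt = rl.Compare()
compare_alt.exact('city', 'city', label='city')
compare_alt.string('name', 'name', method='jarowinkler', threshold=0.85, label='restaurant_name')
compare_alt.string('addr', 'addr', method='levenshtein', threshold=0.8, label='address')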
d. Score the pairs to find the potential matches: To score the index pairs, we call compute() on the Compare object. The result contains, for every pair, a 1 (match) or 0 (no match) for each column configured on the Compare object - city, cuisine type, address, and restaurant name. For example, index 0 of the restaurants data frame and index 50 of the restaurants_dirty data frame match on 2 of the 4 columns.
# 3. Score and fetch the potential matches
pot_matches = compare_cl.compute(pairs, restaurants, restaurants_dirty)

>print(pot_matches[:5])
      city  cuisine_type  address  restaurant_name
0 3      1           0.0      0.0              0.0
  26     1           0.0      0.0              0.0
  40     1           1.0      1.0              1.0
  43     1           1.0      0.0              0.0
  50     1           1.0      0.0              0.0
Now, to decide whether a row is a duplicate, we can say that a pair is a duplicate if at least 3 of its 4 column scores are matches. To perform this filtering, we take a row-wise sum and keep the pairs whose sum is >= 3.
dup_rows = pot_matches[pot_matches.sum(axis=1) >= 3]
>print(dup_rows[:5])
      city  cuisine_type  address  restaurant_name
0 40     1           1.0      1.0              1.0
2 74     1           1.0      1.0              1.0
4 53     1           1.0      1.0              1.0
5 65     1           1.0      1.0              1.0
6 73     1           1.0      1.0              1.0

# Fetch the duplicate row index of restaurants_dirty - using get_level_values(1)
# because the restaurants_dirty index sits at the 2nd level of the multi-level index
dup_rows_index = dup_rows.index.get_level_values(1)

>print(dup_rows_index[:5])
Int64Index([40, 74, 53, 65, 73], dtype='int64')
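As a sanity check on the >= 3 cutoff, it can help to look at how the summed scores are distributed over all candidate pairs; a quick sketch:

score_counts = pot_matches.sum(axis=1).value_counts().sort_index()
print(score_counts)   # how many candidate pairs matched on 1, 2, 3, or 4 columns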
Note: Verifying duplicates - we can see that the duplicates have now been identified.
def verify_duplicates(dup_rows, restaurants, restaurants_dirty):
    # Verify the duplicate rows/matches - print the matched rows from both frames
    orig_dup_rows_index = dup_rows.index.get_level_values(0)
    dirty_dup_rows_index = dup_rows.index.get_level_values(1)
    print(restaurants[restaurants.index.isin(orig_dup_rows_index)][['name']])
    print(restaurants_dirty[restaurants_dirty.index.isin(dirty_dup_rows_index)][['name']])

name    arnie morton's of chicago
Name: 0, dtype: object
name    arni morton's of chicago
Name: 40, dtype: object
e. Link the Data Frames: Now that we have the duplicate row indices, the genuinely new restaurant rows can be isolated from the restaurants_dirty data frame and appended to the production restaurants data frame.
# Isolate the non-duplicate rows of the dirty data frame
dup_rows_index = dup_rows.index.get_level_values(1)
restaurants_new = restaurants_dirty[~restaurants_dirty.index.isin(dup_rows_index)]

def link_df(restaurants, restaurants_dirty, restaurants_new):
    # Link the data frames (pd.concat is used here since DataFrame.append was removed in pandas 2.0)
    full_restaurants = pd.concat([restaurants, restaurants_new])
    print(restaurants.shape, restaurants_new.shape)
    print(full_restaurants.shape)
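The multi-index of dup_rows can also be used to link each duplicate dirty row back to its production counterpart, for example to carry corrected values across. A small sketch (the dictionary name is hypothetical):

dirty_to_clean = dict(zip(dup_rows.index.get_level_values(1),
                          dup_rows.index.get_level_values(0)))
print(dirty_to_clean[40])   # -> 0, i.e. dirty row 40 maps to production row 0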
Although record linkage appears complex at first, it is quite intuitive once you understand why plain joins fail in the first place and how the minimum edit distance (Levenshtein) algorithm behind the string comparison works. The 5-step record linkage process can be used to merge or join two data frames that hold similar values for the same entities.
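For intuition, the minimum edit distance counts the fewest insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch (not the optimized python-Levenshtein implementation):

def edit_distance(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                             # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                             # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(a)][len(b)]

print(edit_distance("arnie morton's of chicago", "arni morton's of chicago"))   # 1, a single deleted 'e'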
It would be greatly appreciated if you could provide some constructive criticism. Thank you for reading this article.
Please refer to the complete code below for reference.
import pandas as pd
import recordlinkage as rl


# Step 0
def load_data():
    restaurant_csv = "https://assets.datacamp.com/production/repositories/5737/datasets/e993edc10bc1f1ae458a560a9c41617bc2e259ae/restaurants_L2.csv"
    restaurant_dirty_csv = "https://assets.datacamp.com/production/repositories/5737/datasets/b2d11cce6759875e60d27b5a7062747b7ada77d0/restaurants_L2_dirty.csv"
    restaurants = pd.read_csv(restaurant_csv, index_col=0)
    restaurants_dirty = pd.read_csv(restaurant_dirty_csv, index_col=0)
    return restaurants, restaurants_dirty


# Step 1
def generate_pairs(restaurants, restaurants_dirty):
    # Generate pairs, blocking on the city column
    indexer = rl.Index()
    indexer.block(['city'])
    # indexer.block(['city', 'phone'])
    pairs = indexer.index(restaurants, restaurants_dirty)
    return pairs


# Step 2
def configure_compare_object():
    compare_cl = rl.Compare()
    compare_cl.exact('city', 'city', label='city')
    compare_cl.string('type', 'type', label='cuisine_type', threshold=0.8)
    compare_cl.string('addr', 'addr', label='address', threshold=0.8)
    compare_cl.string('name', 'name', label='restaurant_name', threshold=0.8)
    return compare_cl


# Step 4
def treat_duplicates(pot_matches, restaurants_dirty):
    # Find the duplicate pairs (at least 3 of the 4 columns match)
    dup_rows = pot_matches[pot_matches.sum(axis=1) >= 3]
    dup_rows_index = dup_rows.index.get_level_values(1)
    # Isolate the non-duplicate rows of the dirty data frame
    restaurants_new = restaurants_dirty[~restaurants_dirty.index.isin(dup_rows_index)]
    return dup_rows, restaurants_new


# Step 4.1
def verify_duplicates(dup_rows, restaurants, restaurants_dirty):
    # Verify the duplicate rows/matches - print the matched rows from both frames
    orig_dup_rows_index = dup_rows.index.get_level_values(0)
    dirty_dup_rows_index = dup_rows.index.get_level_values(1)
    print(restaurants[restaurants.index.isin(orig_dup_rows_index)][['name']])
    print(restaurants_dirty[restaurants_dirty.index.isin(dirty_dup_rows_index)][['name']])


# Step 5
def link_df(restaurants, restaurants_dirty, restaurants_new):
    # Link the data frames (pd.concat replaces the removed DataFrame.append)
    full_restaurants = pd.concat([restaurants, restaurants_new])
    print(restaurants.shape, restaurants_new.shape)
    print(full_restaurants.shape)


def run():
    # 0. Load the datasets
    restaurants, restaurants_dirty = load_data()
    # 1. Generate pairs
    pairs = generate_pairs(restaurants, restaurants_dirty)
    # 2. Configure the Compare object
    compare_cl = configure_compare_object()
    # 3. Score the pairs and fetch the potential matches
    pot_matches = compare_cl.compute(pairs, restaurants, restaurants_dirty)
    # 4. Treat duplicates and return the unique rows of the dirty data frame
    dup_rows, restaurants_new = treat_duplicates(pot_matches, restaurants_dirty)
    # verify_duplicates(dup_rows, restaurants, restaurants_dirty)
    # 5. Link the new data frame
    link_df(restaurants, restaurants_dirty, restaurants_new)


run()