Easy data reconciliation in Python

Easy data reconciliation in Python

In this post, I would like to share all the common methods in python for data reconciliation. Consider scenarios like below: 1. Finding symmetric differences, 2. Combining data with overlaps. All the code below are for illustration purposes only, but the fundamentals can be scaled to more complex datasets.

import pandas as pd

import numpy as np

# turn jupyter notebook warnings off

import warnings

warnings.filterwarnings("ignore")

Working example one: Two dataframes with reconciliation column key containing non-duplicated elements

df1 = pd.DataFrame({'tradeid':range(5),'profit':range(1000,2000,200)})

df2 = pd.DataFrame({'tradeid':range(2,7,1), 'stock':   ['APL','MST','JNJ','TSL','BAB']})

display(df1)

display(df2)
No alt text provided for this image

Solution one: Since ‘tradeid’ is always unique, drop_duplicates() is the best way to spot the “exclusive-or” differences between two dataframes

pd.concat([df1,df2],axis=0).drop_duplicates(subset='tradeid',keep=False)
No alt text provided for this image

Solution two: Since ‘tradeid’ is always unique, we can also use a set function to find our differences

diff = set(df1.tradeid).symmetric_difference(set(df2.tradeid)) 

Then use .isin() method to select the rows from a list. Also python set is not searchable by hash values; thus converting to list

pd.concat([df1.loc[df1['tradeid'].isin(list(diff)),:],df2.loc[df2['tradeid'].isin(list(diff)),:]])
No alt text provided for this image

Solution three: Similar to solution two, this time we are finding symmetric differences using np.array

diff = np.setxor1d(np.array(df1.tradeid), np.array(df2.tradeid)) 

pd.concat([df1.loc[df1['tradeid'].isin(list(diff)),:],df2.loc[df2['tradeid'].isin(list(diff)),:]])
No alt text provided for this image


Working example two: Two dataframes with reconciliation column key containing duplicated elements

df1 = pd.DataFrame({'tradeid':[0,2,2,3,4],'profit':range(1000,2000,200)})

df2 = pd.DataFrame({'tradeid':[2,3,4,5,5], 'stock':['APL','MST','JNJ','TSL','BAB']})

display(df1)

display(df2)
No alt text provided for this image

Solution: Use the pandas merge function with the indicator flag

df3 = df1.merge(df2, on='tradeid',how='outer',indicator=True)

display(df3)

df3 = df3.loc[df3["_merge"] != 'both', :]

# cleanup to make df3 our final result

del df3["_merge"]

df3
No alt text provided for this image


Working example three: Finding symmetric differences for 2 lists containing duplicates

list1 = [0,1,1,2,3]

list2 = [0,1,3,4,4] 

Solution is embedded using functional programming

list3 = list(filter(lambda x:x not in list2, list1))

list4 = list(filter(lambda x:x not in list1, list2))

list3 + list4 

Out: [2, 4, 4]

p.s. list comprehension does not work here somehow [list1.remove(x) for x in list1 if x in list2]


Working example four: Combining data with overlaps

d1 = pd.DataFrame({'tradeid':[0,1,2],'profit':[1000,np.nan,2000],'stock':['APL','JNJ',np.nan]})

d2 = pd.DataFrame({'tradeid':[3,1,2],'profit':[1000,1500,np.nan],'stock':['APL',np.nan,np.nan]})

display(d1)

display(d2) 

Subsequent dataset will combine and overwrite the original

oned2.combine_first(d1)
No alt text provided for this image


要查看或添加评论,请登录

Changyang Xu的更多文章

  • 世界八大定律

    世界八大定律

    第一,墨菲定律(Murphy's Law) 你越害怕,某件事就越可能发生 第二,吉德林定律 (Gidling's Law) 如果你能把一个问题写清楚,那么它就已经被解决了一半 第三,吉尔伯特定律(Gilbert's Law)…

  • 一些重要的投资理念

    一些重要的投资理念

    - 目录 - 1. 在投资领域,显著收益往往源自于少数几次对极端风险的承担。因此,若要稳定地实现复利增长,需要采取长期满仓持有的策略。这种策略要求投资者长期持有资产,同时避免对单一标的或特定风格过度暴露。 2.

  • 这一排企业家,每个都是我们要学习的榜样

    这一排企业家,每个都是我们要学习的榜样

    中共中央总书记、国家主席、中央军委主席习近平2025年2月17日上午在京出席民营企业座谈会。在听取民营企业负责人代表发言后,习近平发表了重要讲话。李强、丁薛祥出席座谈会,王沪宁主持。 第一张图从左至右排序: 曾毓群—宁德时代,汽车动力电池…

  • 招聘信息——投融资董事长助理岗

    招聘信息——投融资董事长助理岗

    任职岗位: 某国内头部芯片上市公司(市值超400亿元)现招聘一名投融资方向的董事长助理,协助公司在投融资领域的相关工作。 岗位职责: -…

  • 什么是比《华尔街之狼》高级一万倍的金融营销?

    什么是比《华尔街之狼》高级一万倍的金融营销?

    金融产品永远不是用来赚钱的。能白手起家,赚到几个亿的人,需要你教他怎么赚钱?!普通大众需要的是收益,而高净值客户的需求是:深层次服务(高净值不差钱,节省能节省的时间成本)、定制化需求、历史留名、守住财产。…

  • DeepSeek抄袭了谁?

    DeepSeek抄袭了谁?

    大纲 DeepSeek是什么 2. DeepSeek成就为什么应该理性看待 3.

  • 掘金之心2025新年目标

    掘金之心2025新年目标

    祝掘金之心读者新年快乐,未来可期! 从我创办社群之初已经过了整整7年。虽不完美,但是我对进步的欲望始终没有停止过。 与普通传递信息的金融博主不同,我真心希望我的读者与我一同进步,不管思想上,或是财富上。我无比认同Ray…

  • 你喝可乐么?

    你喝可乐么?

    前言…

  • 选择比努力重要嘛?

    选择比努力重要嘛?

    石头象征着重要的事物,比如家庭、朋友、健康;而沙子代表着工作应酬、消遣等琐事。如果每天的24小时被不重要的事情填满,那么那些真正重要的目标就永远无法实现。…

  • 金融圈的"崇拜链"

    金融圈的"崇拜链"

    人脉心得分享:从I人到建立百人交流 2024年只剩下最后一个月了。抛开策略和业绩,我想和大家聊聊人脉这件事,分享一些自己的经历和心得,供大家参考学习。…

社区洞察