Four Methods for Merging Dataframes with Pandas
Hey my fellow data scientist! When you use Python to do data cleaning work, there is always a procedure you can not avoid. That is combining multiple data spreadsheets together in one place. Because a single dateset is easier for further analysis. To achieve that purpose, we will use pd.merge() function, which extends concat() function with the ability to align rows using multiple columns.
To demonstrate , let us use the two data files that I created to practice. You may also download the code and data from here.
Part 1. Read the three csv files in Python
As is shown above, we name these two files ‘df1’ and ‘df2’
Part 2. Using 4 merge methods for dataframes combination
We can see that df1 has three columns, and df2 has six columns. What we want to do is merging these two dataframes into one linking Name, Grade,Country, Course,Zip_Code and Major to a given StudentID. That is what we want to combine the dastasheets and do so aligning the StudentID column. We will do this with pd.merge() function
Before we merge datasheets, we need to know the four merge methods below:
1) LEFT Merge
If we use ‘left’ merge, we will keep every row in the left dataframe.
As I explained, since we kept every rows in the left datasheet, NaN values appeared in the right side in the table above after the merge.
2) RIGHT Merge
If we use ‘right’ merge, we will keep every row in the right dataframe.
As it shown above, since we kept every rows in the right datasheet, two rows were gone and there are no NaN values exist after the merge.
3) INNER Merge
Pandas uses ‘inner’ merge by default. This will keep only the common values in both the left and right dataframes. In order to compare the difference with OUTER Merge, this time we will merge on two columns: Student_ID and Zip_Code.
This time, because the first two records have different Zip_Codes in df1 and df2, so they were eventually not merged although their Student_ID are the same. Also in the new datasheet, all the common values are joined together, there are no extra Zip_code columns and no NaN values either.
4) OUTER Merge
Last but not the least, let us see how ‘outer’ merge works:
Cool! We got extra two rows in our new datasheet, and you may also see that Zip_code values merged together. ‘Outer’ merge can combine all the rows for left and right dataframes with NaN values together. Please note, the method ‘Indicator’ is to show where each row comes from.
To summarize, LEFT Merge and Right Merge is interchangeable depending on how you define your "left" and "right". In the above examples, if you have ways to deal with the NaN values, keeping all rows in the longer list will allow you to get a full dataset. Whereas if for the following procedures you will need to remove NaN values anyway, you can keep all rows in the shorter list to clean your dataset while you do the merge. The INNER Merge is helpful when you just need to keep records identical in merging columns, and to the contrast the OUTER Merge is most useful when you need to do some type of data audit.
End here, we have learned how to merge dataframes by using four methods. you can use these magic to combine your data whatever you want for your deep analysis. A clear and tidy datasheet will improve your working efficiency!