
Data Analysis with Pandas: DataFrame Merging Methods You Must Master

Benjamin Bennett Alexander

Published: August 3, 2024

One of the most common operations performed by data analysts and scientists is the merging of data. This is because the data required for analysis is usually spread across different datasets. Pandas, the powerful data manipulation library in Python, provides robust tools for merging data. If you have done some merging operations in SQL, you will find these operations familiar. Understanding how to effectively merge DataFrames can significantly enhance your data processing capabilities. In this article, we will explore four merging operations (inner, left, right, and outer) that you can perform with pandas. Let's get started.

The Inner Join Operation

Before we dive into the mechanics of the inner join, let's load the two datasets (cars_sales_real.csv and car_reviews_real.csv) that we are going to use to demonstrate the different merging operations. You can download the datasets here to follow along.

Now let's load the cars_reviews_real.csv dataset:
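As a minimal sketch of the loading step, assuming the two files contain the columns mentioned in this article ('Car_ID', 'Model', 'Year', 'Sales' for sales and 'Car_ID', 'Model', 'Review_Score', 'Review_Count' for reviews), the code could look like the following. The CSV text and every value in it are made up for illustration and stand in for the real files:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the two files used in the article.
# The rows below are invented; only the column names come from the article.
sales_csv = io.StringIO(
    "Car_ID,Model,Year,Sales\n"
    "1,Alpha,2020,500\n"
    "2,Bravo,2021,450\n"
    "3,Charlie,2019,300\n"
    "4,Delta,2022,200\n"
)
reviews_csv = io.StringIO(
    "Car_ID,Model,Review_Score,Review_Count\n"
    "1,Alpha,4.5,120\n"
    "2,Bravo,4.2,95\n"
    "5,Echo,4.8,300\n"
    "6,Foxtrot,4.0,60\n"
)

# pd.read_csv accepts a file path or any file-like object.
car_sales = pd.read_csv(sales_csv)
car_reviews = pd.read_csv(reviews_csv)

print(car_sales.head())
print(car_reviews.head())
```

With the downloaded files on disk, you would pass the file path to pd.read_csv instead of the in-memory text.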

Now that we have our two tables, let's dive into the inner join operation. The inner join in pandas works similarly to the inner join in SQL. It returns only the rows where there are matching values in both DataFrames based on the specified join key. It essentially finds the intersection of the two DataFrames. We can carry out an inner join operation on these two tables on the 'Car_ID' column. Here is an example:
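A minimal sketch of the inner join, assuming two small made-up tables that share the column names of our datasets (the values are hypothetical, not taken from the CSV files):

```python
import pandas as pd

# Hypothetical stand-in data with the article's column names.
car_sales = pd.DataFrame({
    "Car_ID": [1, 2, 3, 4],
    "Model": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [2020, 2021, 2019, 2022],
    "Sales": [500, 450, 300, 200],
})
car_reviews = pd.DataFrame({
    "Car_ID": [1, 2, 5, 6],
    "Model": ["Alpha", "Bravo", "Echo", "Foxtrot"],
    "Review_Score": [4.5, 4.2, 4.8, 4.0],
    "Review_Count": [120, 95, 300, 60],
})

# Keep only the rows whose Car_ID appears in both DataFrames.
inner_join = car_reviews.merge(car_sales, on="Car_ID", how="inner")
print(inner_join)
```

In this made-up data only Car_ID 1 and 2 exist in both tables, so only those two rows survive the join.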

We are joining the two DataFrames on the 'Car_ID' column, which both DataFrames contain. The 'how' parameter determines the type of join (here, inner). You can see that only the rows with matching 'Car_ID' values in both DataFrames have been returned. You may also have noticed that the column that appears in both tables but is not a join key (Model) has the default suffixes _x and _y appended to its name, so we now have Model_x (from the left table) and Model_y (from the right table). Pandas adds these suffixes so that the two overlapping columns can be told apart in the result. If you do not want to use the default suffixes, you can use the suffixes parameter to specify the left and right suffixes to be appended to the overlapping columns. See below:
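Here is a small sketch of the suffixes parameter, again using made-up tables with our column names (the values are hypothetical). The suffix strings '_reviews' and '_sales' are chosen to match the column names discussed next:

```python
import pandas as pd

# Hypothetical stand-in data with the article's column names.
car_sales = pd.DataFrame({
    "Car_ID": [1, 2, 3, 4],
    "Model": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [2020, 2021, 2019, 2022],
    "Sales": [500, 450, 300, 200],
})
car_reviews = pd.DataFrame({
    "Car_ID": [1, 2, 5, 6],
    "Model": ["Alpha", "Bravo", "Echo", "Foxtrot"],
    "Review_Score": [4.5, 4.2, 4.8, 4.0],
    "Review_Count": [120, 95, 300, 60],
})

# The first suffix applies to the left table (car_reviews),
# the second to the right table (car_sales).
inner_join = car_reviews.merge(
    car_sales,
    on="Car_ID",
    how="inner",
    suffixes=("_reviews", "_sales"),
)
print(inner_join.columns.tolist())
```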

In this code, you can see that we have specified the suffixes to be appended to the columns with the same name. We now have 'Model_reviews' and 'Model_sales' in the DataFrame.

If we want to handle these duplicate columns without using suffixes, we can pass them to the left_on and right_on parameters. Here is an example:
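A minimal sketch of joining on both columns, assuming made-up tables with our column names (the values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in data with the article's column names.
car_sales = pd.DataFrame({
    "Car_ID": [1, 2, 3, 4],
    "Model": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [2020, 2021, 2019, 2022],
    "Sales": [500, 450, 300, 200],
})
car_reviews = pd.DataFrame({
    "Car_ID": [1, 2, 5, 6],
    "Model": ["Alpha", "Bravo", "Echo", "Foxtrot"],
    "Review_Score": [4.5, 4.2, 4.8, 4.0],
    "Review_Count": [120, 95, 300, 60],
})

# Join on Car_ID and Model together; a row matches only if both agree.
inner_join = car_reviews.merge(
    car_sales,
    left_on=["Car_ID", "Model"],
    right_on=["Car_ID", "Model"],
    how="inner",
)
print(inner_join.columns.tolist())
```

Because the key names are the same on both sides, passing on=['Car_ID', 'Model'] would be a shorter way to express the same join. The left_on and right_on parameters are mainly useful when the key columns have different names in the two tables.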

Here, by specifying left_on = ['Car_ID', 'Model'] and right_on = ['Car_ID', 'Model'] in the join, we are telling pandas to join the DataFrames on these two columns. This effectively handles duplicate columns, as the join will only match rows where both 'Car_ID' and 'Model' are identical. So you can see that we end up with one 'Model' column instead of two, as in the previous example.

The example above is the object-oriented style, where we call the merge method on the car_reviews DataFrame. Another way to carry out the inner join is the function-based style, which calls the pd.merge function directly. See below:
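A short sketch comparing the two styles, assuming made-up tables with our column names (the values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in data with the article's column names.
car_sales = pd.DataFrame({
    "Car_ID": [1, 2, 3, 4],
    "Model": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [2020, 2021, 2019, 2022],
    "Sales": [500, 450, 300, 200],
})
car_reviews = pd.DataFrame({
    "Car_ID": [1, 2, 5, 6],
    "Model": ["Alpha", "Bravo", "Echo", "Foxtrot"],
    "Review_Score": [4.5, 4.2, 4.8, 4.0],
    "Review_Count": [120, 95, 300, 60],
})

# Function-based style: the left and right tables are passed as arguments.
function_style = pd.merge(car_reviews, car_sales, on="Car_ID", how="inner")

# Method-based style, for comparison.
method_style = car_reviews.merge(car_sales, on="Car_ID", how="inner")

# Both calls return DataFrames with the same content.
print(function_style.equals(method_style))
```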

In this code, the 'on' parameter specifies the column to join on. You can see that both approaches are valid and produce identical results. An inner join is appropriate when you want to combine information from two DataFrames based on a shared key, but only for rows where there's a match in both DataFrames.

Build the Confidence to Tackle Data Analysis Projects

Ready to go in and do some real data analysis? The main purpose of this book is to ensure that you develop data analysis skills with Python by tackling challenges. By the end, you should be confident enough to take on any data analysis project with Python. Start your journey with "50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners."

Other Resources

Want to learn Python fundamentals the easy way? Check out Master Python Fundamentals: The Ultimate Guide for Beginners.

Challenge yourself with Python challenges. Check out 50 Days of Python: A Challenge a Day.

Looking for practical tips? Check out Python Tips and Tricks: A Collection of 100 Basic & Intermediate Tips & Tricks.

The Left Join Operation

If you have two tables and you want to return all the rows from the left table and matching rows from the right table, you can perform the left join operation. Let's say we want to return all the rows from the car_reviews table and the matching row from the car_sales table. We are going to pass the car_reviews table as the left table and the car_sales table as the right table. Here is how we perform the left join operation using pandas:
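A minimal sketch of the left join, assuming made-up tables with our column names (the values are hypothetical, not taken from the CSV files):

```python
import pandas as pd

# Hypothetical stand-in data with the article's column names.
car_sales = pd.DataFrame({
    "Car_ID": [1, 2, 3, 4],
    "Model": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [2020, 2021, 2019, 2022],
    "Sales": [500, 450, 300, 200],
})
car_reviews = pd.DataFrame({
    "Car_ID": [1, 2, 5, 6],
    "Model": ["Alpha", "Bravo", "Echo", "Foxtrot"],
    "Review_Score": [4.5, 4.2, 4.8, 4.0],
    "Review_Count": [120, 95, 300, 60],
})

# car_reviews is the left table, so every one of its rows is kept.
left_join = car_reviews.merge(car_sales, on="Car_ID", how="left")
print(left_join)
```

In this made-up data, Car_ID 5 and 6 have reviews but no sales record, so their 'Year' and 'Sales' values come back as NaN.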

Here, all the rows from the left table (car_reviews) have been returned. If you look at the 'Year' and 'Sales' columns, you can see that the bottom rows have been filled with NaN values. These are rows from the left table (car_reviews) that have no match in the right table (car_sales), so the columns that come from car_sales have nothing to fill them with.

The Right Join Operation

The right join works similarly to the left join. The difference is that the right join returns all the rows from the right table and only the matching rows from the left table. Where a row in the right table has no match in the left table, the columns that come from the left table are filled with NaN values. Let's take the previous example and make it into a right join operation:
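A minimal sketch of the right join, assuming made-up tables with our column names (the values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in data with the article's column names.
car_sales = pd.DataFrame({
    "Car_ID": [1, 2, 3, 4],
    "Model": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [2020, 2021, 2019, 2022],
    "Sales": [500, 450, 300, 200],
})
car_reviews = pd.DataFrame({
    "Car_ID": [1, 2, 5, 6],
    "Model": ["Alpha", "Bravo", "Echo", "Foxtrot"],
    "Review_Score": [4.5, 4.2, 4.8, 4.0],
    "Review_Count": [120, 95, 300, 60],
})

# car_sales is the right table, so every one of its rows is kept.
right_join = car_reviews.merge(car_sales, on="Car_ID", how="right")
print(right_join)
```

In this made-up data, Car_ID 3 and 4 have sales but no reviews, so their 'Review_Score' and 'Review_Count' values come back as NaN.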

Have you seen what is happening here? The right join operation returns all the rows from the right table (car_sales). The 'Review_Score' and 'Review_Count' columns come from the car_reviews table. These columns have NaN values in the bottom rows because those car_sales rows have no match in the car_reviews table. So basically, the right join keeps all rows from the right DataFrame (car_sales) and joins them with matching rows from the left DataFrame (car_reviews).

The Outer Join Operation

The last type of join we are going to explore is the outer join. This join keeps all rows from both DataFrames, filling in missing values with NaN. Here is an example:
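A minimal sketch of the outer join, assuming made-up tables with our column names (the values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in data with the article's column names.
car_sales = pd.DataFrame({
    "Car_ID": [1, 2, 3, 4],
    "Model": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [2020, 2021, 2019, 2022],
    "Sales": [500, 450, 300, 200],
})
car_reviews = pd.DataFrame({
    "Car_ID": [1, 2, 5, 6],
    "Model": ["Alpha", "Bravo", "Echo", "Foxtrot"],
    "Review_Score": [4.5, 4.2, 4.8, 4.0],
    "Review_Count": [120, 95, 300, 60],
})

# Keep every row from both tables, matched on Car_ID where possible.
outer_join = car_reviews.merge(car_sales, on="Car_ID", how="outer")
print(outer_join)
```

In this made-up data, two rows in each table have no partner in the other, so the result has six rows: two matched and four padded with NaN.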

Here, all rows from both tables have been returned. Rows that have no matching values in the other table have the other table's columns filled with NaN values. In our data, four rows in each table have no match in the other table. A full outer join is most appropriate when you want to preserve all information from both DataFrames, regardless of whether there's a match.

The most important thing to note about the outer join is that it can result in a large DataFrame, especially if there are many unmatched records. You can see above that the resulting DataFrame has more rows than either of the two DataFrames we merged. The downside is that processing and analyzing large datasets can be computationally expensive. In many cases, a left or right join produces a smaller result and provides sufficient information.
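To make the size difference concrete, here is a small sketch that counts the rows produced by each join type. It assumes made-up tables with our column names (the values are hypothetical), so the counts apply to this toy data only:

```python
import pandas as pd

# Hypothetical stand-in data with the article's column names.
car_sales = pd.DataFrame({
    "Car_ID": [1, 2, 3, 4],
    "Model": ["Alpha", "Bravo", "Charlie", "Delta"],
    "Year": [2020, 2021, 2019, 2022],
    "Sales": [500, 450, 300, 200],
})
car_reviews = pd.DataFrame({
    "Car_ID": [1, 2, 5, 6],
    "Model": ["Alpha", "Bravo", "Echo", "Foxtrot"],
    "Review_Score": [4.5, 4.2, 4.8, 4.0],
    "Review_Count": [120, 95, 300, 60],
})

# Run the same merge with each join type and record the row count.
row_counts = {
    how: len(car_reviews.merge(car_sales, on="Car_ID", how=how))
    for how in ("inner", "left", "right", "outer")
}
print(row_counts)
```

With this toy data the outer join returns six rows, while each input table has only four.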

Conclusion

Merging DataFrames is an essential skill for anyone working with data. By exploring the merging techniques that pandas provides, you can enhance your data processing capabilities. Pandas supports inner, left, right, and outer joins. Understanding how each join operation works will ensure that you pick the right join method for your operation. Like most operations in data analysis, it is only by practicing that you will effectively learn how to use these merging methods. Thanks for reading.

Newsletter Sponsorship

You can reach a highly engaged audience of over 290,000 tech-savvy subscribers and grow your brand with a newsletter sponsorship. Contact me at [email protected] today to learn more about the sponsorship opportunities.


