Data Merging in Pandas: Left & Right Joins with Real-World Use Cases

Data Merging in Pandas: Left & Right Joins with Real-World Use Cases

How to Efficiently Merge Datasets and Handle Missing Data in Pandas

Merging datasets is a crucial step in data analysis and data engineering. Often, we need to combine information from multiple tables while ensuring that no critical data is lost. In this article, we’ll dive into Left Joins and Right Joins in Pandas, explaining how they work, their key differences, and when to use them.

By the end of this guide, you’ll be able to merge datasets efficiently, handle missing records, and apply real-world data quality checks using Pandas' merge function.



Why This Module is Important

Merging datasets correctly ensures data consistency and accuracy for reporting, visualization, and machine learning. Here’s why Left and Right Joins are essential:

? Including All Parent or Child Records – Left Joins retain all records from the left dataset, while Right Joins retain all records from the right dataset.

? Identifying Missing or Orphaned Data – Helps in finding unmatched records and improving data quality.

? Handling Missing Data Efficiently – Understanding how to manage NaN values ensures complete and reliable datasets.

? Generating Meaningful Reports – Enables analysis of inactive sales reps, orphaned sales transactions, and other key business insights.

In this module, we’ll introduce various join techniques, giving you a solid foundation for working with relational datasets.


Understanding Left and Right Joins in Pandas

Left Join (how='left')

  • Includes all rows from the left table and only the matching rows from the right table.
  • Unmatched rows from the right table are filled with NaN values.
  • Example Use Case: Getting a complete list of sales reps, including those who haven’t made any sales.

Right Join (how='right')

  • Includes all rows from the right table and only the matching rows from the left table.
  • Unmatched rows from the left table are filled with NaN values.
  • Example Use Case: Getting a full list of sales transactions, including those without an assigned sales rep.


You can download the datasets from the following GitHub link: GitHub Datasets

Step-by-Step Implementation in Pandas

First, let’s create two sample DataFrames to demonstrate Left and Right Joins:

import pandas as pd

# Sales Representatives Table (Parent)
sales_reps = pd.DataFrame({
    "rep_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "region": ["North", "South", "West"]
})

# Sales Data Table (Child)
sales = pd.DataFrame({
    "sale_id": [101, 102, 103, 104],
    "sale_rep_id": [1, 2, None, 4],  # Includes a missing and an unmatched ID
    "sale_amount": [500, 1000, 750, 1200]
})        

Performing a Left Join

import pandas as pd

# Performing a LEFT JOIN
left_join = pd.merge(
    sales_reps,
    sales,
    left_on="rep_id",
    right_on="sale_rep_id",
    how="left"
)        

Expected Output:

?? Observations:

? All sales reps are included, even if they don’t have sales.

? Sales Rep ID 3 appears, but the sale_id and sale_amount fields are NaN.

Performing a Right Join

import pandas as pd

# Performing a RIGHT JOIN
right_join = pd.merge(
    sales_reps,
    sales,
    left_on="rep_id",
    right_on="sale_rep_id",
    how="right"
)        

Expected Output:

?? Observations:

? All sales transactions are included, even if there’s no corresponding sales rep.

? Sales Rep ID 4 and NaN appear, but the name and region fields are NaN.


Practical Use Cases

1?? Identifying Sales Reps Without Sales (Inactive Reps)

To find sales reps who haven’t made any sales, we can filter for NaN values in sale_id:

inactive_reps = left_join[left_join["sale_id"].isnull()]  

print(inactive_reps)          

Expected Output:

2?? Detecting Orphaned Sales Transactions

To find sales transactions that don’t have an assigned sales rep, we can filter for NaN values in rep_id:

orphaned_sales = right_join[right_join["rep_id"].isnull()]  

print(orphaned_sales)          

Expected Output:


Best Practices for Left and Right Joins

? Choose the Right Join Type:

  • Use Left Join when the parent dataset (e.g., sales reps) is more important.
  • Use Right Join when the child dataset (e.g., sales transactions) is more important.

? Inspect NaN Values:

  • Check for missing values using .isnull() to identify unmatched rows.

? Handle Missing Data:

  • Replace NaN values with "Unknown" where necessary.

right_join.fillna({"name": "Unknown", "region": "Unknown"}, inplace=True)        

Expected Output:



Practice Assignment

?? Want to practice? Attempt the Left and Right Joins in Pandas Assignment ?? Click Here.

?? Need help? Leave a comment, and we’ll assist you!


What’s Next?

Now that we have explored Left and Right Joins, we will move on to Full Outer Joins in Pandas.

Next Up: Outer Joins – Merging Complete Datasets While Retaining All Data

We’ll see how to combine datasets while keeping all records from both sources, filling in missing values where necessary.



Click ?? to Enroll in the Python for Beginners: Learn Python with Hands-on Projects. It only costs $10 and you can reach out to us for $10 Coupon.

Conclusion

In this guide, we covered:

? How Left and Right Joins work and when to use them.

? The difference between retaining unmatched rows from left or right datasets.

? How to handle missing values (NaN) effectively.

? Real-world use cases such as finding inactive reps and orphaned sales.

By mastering these techniques, you’ll be able to confidently merge and integrate data for data analysis, reporting, and machine learning workflows.


?? Engage With Us!

? Authored by Siva Kalyan Geddada , Abhinav Sai Penmetsa

?? Share this article with anyone interested in data engineering, Python, or data analysis. ?? Have questions or need help? Comment below! Let's discuss.

?? Follow us for more hands-on data science tutorials!

要查看或添加评论,请登录

ITVersity, Inc.的更多文章

其他会员也浏览了