Real-World Data Merging: Inner Joins & Aggregations in Pandas

Real-World Data Merging: Inner Joins & Aggregations in Pandas

Mastering Data Integration with CSV Files in Pandas

Merging datasets is a key skill in data analysis and engineering. In real-world scenarios, data often comes from multiple sources, and understanding how to integrate and analyze this information is crucial.

In this article, we’ll cover:

  • How to merge real-world datasets using CSV files in Pandas.
  • Performing an Inner Join to combine sales reps and sales data.
  • Aggregating sales totals by sales representatives for insights.
  • Handling missing data & calculating commissions efficiently.

By the end, you’ll be able to confidently integrate and process large datasets, making your analysis more effective and actionable.



Why This Module is Important

Merging and aggregating data enables:

  • Combining multiple datasets to create a structured and unified view.
  • Linking transactional data with relevant details for enhanced insights.
  • Aggregating and summarizing data for trend analysis and reporting.
  • Handling real-world challenges like missing data, performance, and filtering.

In this module, we introduce practical dataset merging techniques, helping you work with real-world CSV files efficiently.


Merging CSV Files Using Inner Joins

The Real-World Scenario

We are working with two datasets:

1?? Sales Representatives Data (sales_reps_data.csv)

2?? Toyota Sales Data (toyota_sales_data.csv)

The goal is to merge these datasets to match sales reps with their sales data and compute aggregated sales totals.


You can download the datasets from the following GitHub link: GitHub Datasets

Step-by-Step Implementation in Pandas

1?? Loading CSV Files as Pandas DataFrames

We are working with two datasets:

  • Sales Representatives Data (sales_reps_data.csv) – Contains information about sales representatives.
  • Toyota Sales Data (toyota_sales_data.csv) – Contains sales transaction details.

import pandas as pd

# Load the CSV files
sales_reps_data = pd.read_csv("data/car_sales/sales_reps_data.csv")
toyota_sales_data = pd.read_csv("data/car_sales/toyota_sales_data.csv")

# Display the column names of the sales representatives dataset
sales_reps_columns = sales_reps_data.columns  
print("Sales Representatives Data Columns:", sales_reps_columns) 

# Display the column names of the Toyota sales dataset
toyota_sales_columns = toyota_sales_data.columns  
print("Toyota Sales Data Columns:", toyota_sales_columns)         

2?? Performing an Inner Join

We match each sales record with the corresponding sales rep by joining on:

  • rep_id from sales_reps_data.csv
  • sale_rep_id from toyota_sales_data.csv

# Merge DataFrames  
merged_data = pd.merge(
    sales_reps_data,
    toyota_sales_data,
    left_on="rep_id",
    right_on="sale_rep_id",
    how="inner"
)

print(merged_data.shape)  # Check dimensions
print(merged_data)         

?? Observations:

  • Only matching records are included.
  • Sales Rep IDs without a match are excluded.


3?? Aggregating Sales by Sales Representatives

After merging, let’s calculate the total sales amount per sales rep:
# Group the merged data by sales representative details and sum the sales amount
sales_by_rep = merged_data. \
    groupby(["rep_id", "first_name", "last_name", "region"])["sale_amount"]. \
    sum()

print(sales_by_rep)  

# Group the merged data again, summing the sales amount, but this time reset the index
sales_by_rep = merged_data. \
    groupby(["rep_id", "first_name", "last_name", "region"])["sale_amount"]. \  
    sum(). \ 
    reset_index()

print(sales_by_rep)        

4?? Adding a Commission Calculation

Now, let's calculate the commission earned by each sales rep using the commission_percentage field.
merged_data["commission_earned"] = merged_data["sale_amount"] * (merged_data["commission_pct"].fillna(0))

print(merged_data[["rep_id", "first_name", "last_name", "sale_amount", "commission_pct", "commission_earned"]]        

?? Handling Missing Data:

? Missing commission percentages are set to 0 using .fillna(0).

? The new column calculates the commission earned per sale.


5?? Handling Missing Data in Sales Records

Missing data can cause problems in calculations. We should fill missing values:
merged_data["commission_pct"].fillna(0, inplace=True)

merged_data["sale_amount"].fillna(0, inplace=True)        

This ensures data integrity when performing calculations.


6?? Full Data Processing Pipeline

Here’s a complete example of how the process works from start to finish:
# Load CSV files
sales_reps = pd.read_csv("data/car_sales/sales_reps_data.csv")  
toyota_sales = pd.read_csv("data/car_sales/toyota_sales_data.csv")  

# Merge DataFrames using Inner Join
merged_data = pd.merge(
    sales_reps_data,
    toyota_sales_data,
    left_on="rep_id",
    right_on="sale_rep_id",
    how="inner"
)

# Fill Missing Values
merged_data["commission_pct"].fillna(0, inplace=True)

# Calculate Commission Earned
merged_data["commission_earned"] = round(
    merged_data["sale_amount"] * merged_data["commission_pct"], 2
)

# Group Sales by Rep
commission_earned_by_rep = merged_data. \
    groupby(["rep_id", "email"])["commission_earned"]. \
    sum(). \
    reset_index()

# Display Results
print(commission_earned_by_rep)

merged_data.isnull().sum()        

This script fully integrates & aggregates sales data into a structured report!


Best Practices for Data Merging & Aggregation

Always inspect data after joins

  • Use .shape and .isnull().sum() to verify row counts and missing data.

Group and aggregate data carefully

  • Ensure numeric fields are correctly aggregated.

Handle missing data

  • Use .fillna() to replace NaN values before performing calculations.


Practice Assignment

?? Want to practice? Attempt the Real-World Data Joining and Aggregation in Pandas Assignment ?? Click Here.

?? Need help? Leave a comment, and we’ll assist you!


What’s Next?

In the next lecture, we will explore Data Transformations in Pandas. Data transformation is an essential step in the data analysis workflow, helping to clean, standardize, and modify data to make it more useful for analysis.


Click ?? to Enroll in the Python for Beginners: Learn Python with Hands-on Projects. It only costs $10 and you can reach out to us for $10 Coupon.

Conclusion

In this guide, we covered:

  • How to merge real-world datasets using CSV files in Pandas.
  • Inner joins for integrating sales reps and sales data.
  • Aggregating sales totals by reps for business insights.
  • Calculating commissions and handling missing data efficiently.


Conclusion: Join or Merge Operations Using Pandas

Congratulations on completing the module on working with multiple DataFrames, specifically focusing on join and merge operations using Pandas. You’ve taken a crucial step toward mastering data integration and transformation techniques, which are essential for real-world data engineering and analytics.

Key Takeaways:

Throughout this module, we explored various techniques for combining datasets efficiently:

  • Overview of Merging and Joining: Explored inner, left, right, and outer joins, along with parent-child relationships between datasets.
  • Inner Joins: Merged datasets while retaining only matching rows, with use cases like matching sales data with sales reps.
  • Left and Right Joins: Retained unmatched rows from either parent or child datasets, identifying inactive entities or orphaned records.
  • Outer Joins: Combined all rows from both datasets, highlighting unmatched records for reconciliation.
  • Concatenation: Stacked datasets vertically (rows) or horizontally (columns) for seamless integration.
  • Real-World Application: Applied these techniques to CSV data for aggregation, grouped analysis, and dataset reconciliation.

Keep in Mind:

Working with multiple DataFrames is a core skill in data engineering and analytics. These techniques are directly applicable in key areas like:

  • Data Integration – Merging data from multiple sources to create a unified view
  • Reporting – Preparing datasets for advanced analysis and visualization
  • Reconciliation – Identifying missing or unmatched records for quality assurance

By mastering these skills, you will be better equipped to handle complex data workflows and support data-driven decision-making.

Thank you for following along! I hope you found it valuable and engaging. See you in the next module, where we’ll continue building your expertise in Pandas.


?? Engage With Us!

? Authored by Siva Kalyan Geddada , Abhinav Sai Penmetsa

?? Share this article with anyone interested in data engineering, Python, or data analysis. ?? Have questions or need help? Comment below! Let's discuss.

?? Follow us for more hands-on data science tutorials!


要查看或添加评论,请登录

ITVersity, Inc.的更多文章

社区洞察

其他会员也浏览了