登录查看更多内容

Real-World Data Merging: Inner Joins & Aggregations in Pandas

ITVersity, Inc.

making IT resourceful (???????? ?????????????????????? ????????)

发布日期: 2025年2月14日

Mastering Data Integration with CSV Files in Pandas

Merging datasets is a key skill in data analysis and engineering. In real-world scenarios, data often comes from multiple sources, and understanding how to integrate and analyze this information is crucial.

In this article, we’ll cover:

How to merge real-world datasets using CSV files in Pandas.
Performing an Inner Join to combine sales reps and sales data.
Aggregating sales totals by sales representatives for insights.
Handling missing data & calculating commissions efficiently.

By the end, you’ll be able to confidently integrate and process large datasets, making your analysis more effective and actionable.

Why This Module is Important

Merging and aggregating data enables:

Combining multiple datasets to create a structured and unified view.
Linking transactional data with relevant details for enhanced insights.
Aggregating and summarizing data for trend analysis and reporting.
Handling real-world challenges like missing data, performance, and filtering.

In this module, we introduce practical dataset merging techniques, helping you work with real-world CSV files efficiently.

Merging CSV Files Using Inner Joins

The Real-World Scenario

We are working with two datasets:

1?? Sales Representatives Data (sales_reps_data.csv)

2?? Toyota Sales Data (toyota_sales_data.csv)

The goal is to merge these datasets to match sales reps with their sales data and compute aggregated sales totals.

You can download the datasets from the following GitHub link: GitHub Datasets

Step-by-Step Implementation in Pandas

1?? Loading CSV Files as Pandas DataFrames

We are working with two datasets:

Sales Representatives Data (sales_reps_data.csv) – Contains information about sales representatives.
Toyota Sales Data (toyota_sales_data.csv) – Contains sales transaction details.

import pandas as pd

# Load the CSV files
sales_reps_data = pd.read_csv("data/car_sales/sales_reps_data.csv")
toyota_sales_data = pd.read_csv("data/car_sales/toyota_sales_data.csv")

# Display the column names of the sales representatives dataset
sales_reps_columns = sales_reps_data.columns  
print("Sales Representatives Data Columns:", sales_reps_columns) 

# Display the column names of the Toyota sales dataset
toyota_sales_columns = toyota_sales_data.columns  
print("Toyota Sales Data Columns:", toyota_sales_columns)

2?? Performing an Inner Join

We match each sales record with the corresponding sales rep by joining on:

rep_id from sales_reps_data.csv
sale_rep_id from toyota_sales_data.csv

# Merge DataFrames  
merged_data = pd.merge(
    sales_reps_data,
    toyota_sales_data,
    left_on="rep_id",
    right_on="sale_rep_id",
    how="inner"
)

print(merged_data.shape)  # Check dimensions
print(merged_data)

?? Observations:

Only matching records are included.
Sales Rep IDs without a match are excluded.

3?? Aggregating Sales by Sales Representatives

After merging, let’s calculate the total sales amount per sales rep:

# Group the merged data by sales representative details and sum the sales amount
sales_by_rep = merged_data. \
    groupby(["rep_id", "first_name", "last_name", "region"])["sale_amount"]. \
    sum()

print(sales_by_rep)  

# Group the merged data again, summing the sales amount, but this time reset the index
sales_by_rep = merged_data. \
    groupby(["rep_id", "first_name", "last_name", "region"])["sale_amount"]. \  
    sum(). \ 
    reset_index()

print(sales_by_rep)

4?? Adding a Commission Calculation

Now, let's calculate the commission earned by each sales rep using the commission_percentage field.

merged_data["commission_earned"] = merged_data["sale_amount"] * (merged_data["commission_pct"].fillna(0))

print(merged_data[["rep_id", "first_name", "last_name", "sale_amount", "commission_pct", "commission_earned"]]

?? Handling Missing Data:

? Missing commission percentages are set to 0 using .fillna(0).

? The new column calculates the commission earned per sale.

领英推荐

PANDAS PROFILING

360DigiTMG 1 年前

From Raw Data to Actionable Insights: The Role of…

Iain Brown PhD 11 个月前

Data Cleaning Techniques to Improve Your Analysis…

Walter Shields 1 个月前

5?? Handling Missing Data in Sales Records

Missing data can cause problems in calculations. We should fill missing values:

merged_data["commission_pct"].fillna(0, inplace=True)

merged_data["sale_amount"].fillna(0, inplace=True)

This ensures data integrity when performing calculations.

6?? Full Data Processing Pipeline

Here’s a complete example of how the process works from start to finish:

# Load CSV files
sales_reps = pd.read_csv("data/car_sales/sales_reps_data.csv")  
toyota_sales = pd.read_csv("data/car_sales/toyota_sales_data.csv")  

# Merge DataFrames using Inner Join
merged_data = pd.merge(
    sales_reps_data,
    toyota_sales_data,
    left_on="rep_id",
    right_on="sale_rep_id",
    how="inner"
)

# Fill Missing Values
merged_data["commission_pct"].fillna(0, inplace=True)

# Calculate Commission Earned
merged_data["commission_earned"] = round(
    merged_data["sale_amount"] * merged_data["commission_pct"], 2
)

# Group Sales by Rep
commission_earned_by_rep = merged_data. \
    groupby(["rep_id", "email"])["commission_earned"]. \
    sum(). \
    reset_index()

# Display Results
print(commission_earned_by_rep)

merged_data.isnull().sum()

This script fully integrates & aggregates sales data into a structured report!

Best Practices for Data Merging & Aggregation

Always inspect data after joins

Use .shape and .isnull().sum() to verify row counts and missing data.

Group and aggregate data carefully

Ensure numeric fields are correctly aggregated.

Handle missing data

Use .fillna() to replace NaN values before performing calculations.

Practice Assignment

?? Want to practice? Attempt the Real-World Data Joining and Aggregation in Pandas Assignment ?? Click Here.

?? Need help? Leave a comment, and we’ll assist you!

What’s Next?

In the next lecture, we will explore Data Transformations in Pandas. Data transformation is an essential step in the data analysis workflow, helping to clean, standardize, and modify data to make it more useful for analysis.

Click ?? to Enroll in the Python for Beginners: Learn Python with Hands-on Projects. It only costs $10 and you can reach out to us for $10 Coupon.

Conclusion

In this guide, we covered:

How to merge real-world datasets using CSV files in Pandas.
Inner joins for integrating sales reps and sales data.
Aggregating sales totals by reps for business insights.
Calculating commissions and handling missing data efficiently.

Conclusion: Join or Merge Operations Using Pandas

Congratulations on completing the module on working with multiple DataFrames, specifically focusing on join and merge operations using Pandas. You’ve taken a crucial step toward mastering data integration and transformation techniques, which are essential for real-world data engineering and analytics.

Key Takeaways:

Throughout this module, we explored various techniques for combining datasets efficiently:

Overview of Merging and Joining: Explored inner, left, right, and outer joins, along with parent-child relationships between datasets.
Inner Joins: Merged datasets while retaining only matching rows, with use cases like matching sales data with sales reps.
Left and Right Joins: Retained unmatched rows from either parent or child datasets, identifying inactive entities or orphaned records.
Outer Joins: Combined all rows from both datasets, highlighting unmatched records for reconciliation.
Concatenation: Stacked datasets vertically (rows) or horizontally (columns) for seamless integration.
Real-World Application: Applied these techniques to CSV data for aggregation, grouped analysis, and dataset reconciliation.

Keep in Mind:

Working with multiple DataFrames is a core skill in data engineering and analytics. These techniques are directly applicable in key areas like:

Data Integration – Merging data from multiple sources to create a unified view
Reporting – Preparing datasets for advanced analysis and visualization
Reconciliation – Identifying missing or unmatched records for quality assurance

By mastering these skills, you will be better equipped to handle complex data workflows and support data-driven decision-making.

Thank you for following along! I hope you found it valuable and engaging. See you in the next module, where we’ll continue building your expertise in Pandas.

?? Engage With Us!

? Authored by Siva Kalyan Geddada , Abhinav Sai Penmetsa

?? Share this article with anyone interested in data engineering, Python, or data analysis. ?? Have questions or need help? Comment below! Let's discuss.

?? Follow us for more hands-on data science tutorials!

Real-World Data Merging: Inner Joins & Aggregations in Pandas

ITVersity, Inc.

making IT resourceful (???????? ?????????????????????? ????????)

Mastering Data Integration with CSV Files in Pandas

Why This Module is Important

Merging CSV Files Using Inner Joins

The Real-World Scenario

1?? Sales Representatives Data (sales_reps_data.csv)

2?? Toyota Sales Data (toyota_sales_data.csv)

Step-by-Step Implementation in Pandas

1?? Loading CSV Files as Pandas DataFrames

2?? Performing an Inner Join

3?? Aggregating Sales by Sales Representatives

4?? Adding a Commission Calculation

领英推荐

5?? Handling Missing Data in Sales Records

6?? Full Data Processing Pipeline

Best Practices for Data Merging & Aggregation

Practice Assignment

What’s Next?

Conclusion

Conclusion: Join or Merge Operations Using Pandas

Key Takeaways:

Keep in Mind:

?? Engage With Us!

AI, Data and Cloud Updates

2,451 位关注者

ITVersity, Inc.的更多文章

社区洞察

其他会员也浏览了

The 16 Best Data Science and Machine Learning Platforms for 2021

My ALX Data Analytics Journey

SAS Data Science Certification Course - Aspire Techsoft Academy

Introduction to Data Science

Introduction to Data Science

Roadmap to Becoming a Data Analyst: Navigate Your Path to Success

Data Merging in Pandas: Left & Right Joins with Real-World Use Cases

Data Science: The 10 Commandments for Performing a Data Science Project

Deep Dive into CSV Data Import with Pandas: Exploring read_csv()

10 Best Data Science Tools for Non-Programmers

Mastering Data Integration with CSV Files in Pandas

Why This Module is Important

Merging CSV Files Using Inner Joins

The Real-World Scenario

1?? Sales Representatives Data (sales_reps_data.csv)

2?? Toyota Sales Data (toyota_sales_data.csv)

Step-by-Step Implementation in Pandas

1?? Loading CSV Files as Pandas DataFrames

2?? Performing an Inner Join

3?? Aggregating Sales by Sales Representatives

4?? Adding a Commission Calculation

领英推荐

5?? Handling Missing Data in Sales Records

6?? Full Data Processing Pipeline

Best Practices for Data Merging & Aggregation

Practice Assignment

What’s Next?

Conclusion

Conclusion: Join or Merge Operations Using Pandas

Key Takeaways:

Keep in Mind:

?? Engage With Us!

AI, Data and Cloud Updates

2,451 位关注者

ITVersity, Inc.的更多文章

Top Resources & Job Updates - Edition - Mar 21, 2025

The Power of Generative AI: What It Is, Why You Should Learn It, and How It’s Changing the World

Descriptive vs Inferential Statistics in Pandas: How to Analyze and Interpret Data Effectively

Introduction to Fundamentals of Statistics for Data Analysis

Monthly Sales Commission Analysis with Pandas - A Complete Workflow

Mastering Advanced Chaining Techniques in Pandas

Efficient Data Processing with Pandas: Chaining Transformations

Adding and Updating Columns in Pandas: A Step-by-Step Guide

Mastering Row-Level Transformations in Pandas with apply()

Advanced Custom Aggregation Functions in Pandas

社区洞察

其他会员也浏览了

The 16 Best Data Science and Machine Learning Platforms for 2021

My ALX Data Analytics Journey

SAS Data Science Certification Course - Aspire Techsoft Academy

Introduction to Data Science

Introduction to Data Science

Roadmap to Becoming a Data Analyst: Navigate Your Path to Success

Data Merging in Pandas: Left & Right Joins with Real-World Use Cases

Data Science: The 10 Commandments for Performing a Data Science Project

Deep Dive into CSV Data Import with Pandas: Exploring read_csv()

10 Best Data Science Tools for Non-Programmers