Data Analysis with Python: Concatenating Datasets with Pandas

Benjamin Bennett Alexander

发布日期: 2024年8月10日

In last week's article, we looked at how we can use pandas to merge DataFrames. The merge() function is not the only pandas function that we can use to combine DataFrames. Among its many other amazing features, pandas provides the concat() function, which stands out as a vital tool for merging datasets. In this article, we’ll explore the ins and outs of concatenating DataFrames using the concat() function. We will explore both vertical and horizontal concatenation. We will demonstrate how this simple yet powerful function can help you analyze data from different sources.

Vertical Concatenation (Stacking)

The first type of concatenation that we are going to explore is called vertical concatenation. Vertical concatenation adds DataFrames on top of each other, increasing the number of rows. First, let's load the two datasets that we are going to use for this article:

Here we have two DataFrames: car_sales and car_reviews. To concatenate the two DataFrames using pandas, we use the concat() function. We pass the two DataFrames to the function as a list. By default, pandas concatenates along the vertical axis (i.e., axis = 0). This means that when we use the concat() function without specifying the axis parameter, pandas will stack the DataFrames on top of each other, combining the rows. See below:

You can see that the two DataFrames have been joined vertically. The car_reviews DataFrame has been added to the end of the car_sales DataFrame. This type of concatenation, also known as concatenating DataFrames along the rows (using axis=0), works similarly to appending elements to a list. When you vertically concatenate two or more DataFrames, you are stacking them on top of each other, much like how elements are added one after another in a list using the append() method. The resulting DataFrame has the length of the two DataFrames combined. The combined DataFrame has all the columns from the two DataFrames. The intersection of the columns and rows that are not shared by the two DataFrames has been given NaN values. You may have noticed that the resulting DataFrame preserves the original indices (highlighted in red) of the DataFrames being concatenated. For this reason, we have overlapping or identical indices in the DataFrame. There are two ways we can deal with overlapping indices:

Using MultiIndexing

One effective strategy that we can use when dealing with overlapping indices is to use MultiIndexing. With MultiIndexing, we create a hierarchical index in the DataFrame, which can help keep track of the source of each row after concatenation, avoiding the ambiguity that arises from overlapping indices. The concat() function has a keys parameter that we can use to create a MultiIndex. For example, we can pass the names of the DataFrames that we are concatenating as arguments to the keys parameter, and they will become the first level of the MultiIndex. See below:

Here, we have added Sales and Reviews as the first level of this MultiIndex. The original indices (0, 4) become the second level of the MultiIndex. The first level of the MultiIndex indicate the source of the DataFrames that we have concatenated. The MultiIndex will make it easy to access the values in the DataFrame. Let's say we want to access the sales value of the 'Ford Focus'. Here is how we do it:

You can see that MultiIndexing makes it easy to access values from the concatenated DataFrame. MultiIndexing keeps the structure of your data organized, which is especially useful when dealing with complex datasets or merging data from multiple sources.

2. Using the ignore_index Parameter

Another method that we can use to deal with overlapping indices is to use the ignore_index parameter of the concat() function. The default value of this parameter is False. If we set the value to True, then pandas will ignore the indices of the DataFrames being concatenated, and the new DataFrame will have a continuous integer index starting from 0, rather than maintaining the original indices of the individual DataFrames. See below:

Here, because we have set the ignore_index parameter to True, the indices of the two DataFrames being concatenated have been ignored. We now have a new DataFrame with a new continuous index from 0 - 9 (highlighted in red). This gets rid of duplicate indices.

When should you use ignore_index=True? If the original indices of the DataFrames do not carry significant meaning, or if you're combining a large number of DataFrames, it's often a good idea to reset them for simplicity.

Benjamin Bennett Alexander 7 个月前

Data Analysis With Python: 5 pandas Column Operations…

Benjamin Bennett Alexander 10 个月前

The lambda() and more

Can Arslan 1 年前

There is Only One Way to Master Data Analysis: Practice, Practice, and Practice.

Dedicate 50 Days of the last half of 2024 to improving your data analysis skill with Python. Start your journey here: 50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners

Horizontal Concatenation (Column-Wise Concatenation)

Horizontal concatenation refers to combining two or more DataFrames side-by-side along the columns rather than stacking them on top of each other. The concatenation is done on axis 1. This is the axis for columns. Since the default value of the axis parameter is zero, we must explicitly set the value to 1 for this operation to work. Horizontal Concatenation will result in a new DataFrame. Here is how we can carryout a horizontal concatenation of the two DataFrames:

To perform a side-by-side concatenation, we set the axis to 1. This indicates that we are joining the two DataFrames on the columns. The resulting DataFrame has all columns from both car_sales and car_reviews, with each row corresponding to the same index in both DataFrames. If the DataFrames have different indexes, the horizontal concatenation will align them based on their indices. Missing values will be filled with NaN. Let's see what happens when we concatenate two DataFrames with different indices. First, we are going to change the index of the car_reviews DataFrame. We will set the Car_ID column as the index.

Now that we have set the 'Car_Id' column as the index, let's concatenate the two DataFrames again:

You can see in the output how pandas behaves when we concatenate two DataFrames on the columns. If one DataFrame has an index that is not present in the other, pandas will insert NaN for missing values in the output DataFrame. By default, the concatenation operation in pandas works like an "outer join," meaning it includes all indices from both DataFrames, filling in NaN where necessary. If you want to keep only indices that are common to both DataFrames, by dropping rows that don't have a match in both DataFrames, you can use the join parameter of the concat() function to set the join method to 'inner'. See below:

In this code, we have changed the join type from the default outer to inner. In the output, we only have indices that are common to both DataFrames. This eliminates NaN values.

Conclusion

Data often comes from from multiple sources. Learning to combine these different datasets, either by stacking them vertically (adding more rows) or horizontally (adding more columns), is an essential skill skill for data analysis. The concat() function is one the most important functions in pandas that you must strive to master if you work with data. Try Day 19 challenge from "50 Days of Data analysis with Python" to test your knowledge on concatenating and melting DataFrames. Thanks for reading.

Newsletter Sponsorship

You can reach a highly engaged audience of over 300,000 tech-savvy subscribers and grow your brand with a newsletter sponsorship. Contact me at [email protected] today to learn more about the sponsorship opportunities.

Python, Data Analytics & AI

327,530 位关注者

Ayush Jaiswal

Data Science Professional || Expertise in Python, SQL, Advanced Excel, Power BI, & Process Optimization

1 个月

Interesting

Bandana Dhungana

MS Computational Data Analytics at Georgia Tech

1 个月

Useful tips

David Ashman

Model Implementation Analyst - Financial Risk & Modelling

1 个月

My apologies, I haven't really got a strong understanding of Python yet, so I am still learning. Reading this article, I am not really not sure what is achieved by this table (does Python have limited joining functionality??). While Out[10] is close, it still isn't really giving the correct result, as Car_ID for records 12-14 are showing as NaN yet the initial table Out[3] does have these values.

Sandeep Kumar

"BTech Student in AI & Computer Science "| python, sql | Data structure | computer science fundamentals, math, statistics , communication, business understanding |

1 个月

50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners,I have a question. Does this book cover all the topics of Python?

EMMANUEL BAAH

1 个月

Very informative

查看更多评论

要查看或添加评论，请登录

查看全部

Data Analysis with Python: Concatenating Datasets with Pandas

Benjamin Bennett Alexander

Vertical Concatenation (Stacking)

领英推荐

There is Only One Way to Master Data Analysis: Practice, Practice, and Practice.

Horizontal Concatenation (Column-Wise Concatenation)

Conclusion

Newsletter Sponsorship

Python, Data Analytics & AI

327,530 位关注者

更多精彩文章

社区洞察

其他会员也浏览了

AIML26-Pandas and Python Tips and Tricks for Data Science and Data Analysis

Data Manipulation with Pandas

Implementing Real-Time Data Analysis with Python and Pandas: A Comprehensive Guide

40 intresting Python packages; Not necessarily the most popular one

Mastering Data Analysis with Pandas Series: A Comprehensive Guide with Examples

Querying Snowflake w/ Python.

Getting Started with NumPy

Comprehensive Guide to Pandas DataFrame Row Operations

Python for Finance in Excel — Moving Averages Chart

Unleashing the Power of Python: Data Manipulation with Lambda, Apply, Group By, and Pivot?Tables

Vertical Concatenation (Stacking)

领英推荐

There is Only One Way to Master Data Analysis: Practice, Practice, and Practice.

Horizontal Concatenation (Column-Wise Concatenation)

Conclusion

Newsletter Sponsorship

Python, Data Analytics & AI

327,530 位关注者

Pandas Vs. SQL: String Formatting and Preprocessing Data

2024年10月5日

Data Analysis with Python: Handling Missing Values with Pandas and Scikit-Learn

2024年9月28日

Master Python Fundamentals: Learn Functions and Classes Today

2024年9月25日

Data Analysis with Python: Stop Reading and Start Doing (Analyzing Financial Data)

2024年9月21日

Master Python Fundamentals: Encapsulation in User-Defined Classes

2024年9月14日

Breaking Into Data Analytics: Alex the Analyst Shares His Journey and Insights

2024年9月11日

50 Days of Data Analysis With Python: One Year Later

2024年9月7日

The Accidental Analyst: How Thais Cooke Found Her Passion in Data

2024年9月4日

How to Leverage Pandas GroupBy for Data Analysis

2024年8月31日

From Occupational Therapy to Data Analytics: Lauren Rosenthal's Journey and Insights

2024年8月28日

社区洞察

其他会员也浏览了

AIML26-Pandas and Python Tips and Tricks for Data Science and Data Analysis

Data Manipulation with Pandas

Implementing Real-Time Data Analysis with Python and Pandas: A Comprehensive Guide

40 intresting Python packages; Not necessarily the most popular one

Mastering Data Analysis with Pandas Series: A Comprehensive Guide with Examples

Querying Snowflake w/ Python.

Getting Started with NumPy

Comprehensive Guide to Pandas DataFrame Row Operations

Python for Finance in Excel — Moving Averages Chart

Unleashing the Power of Python: Data Manipulation with Lambda, Apply, Group By, and Pivot?Tables