Data Analysis with Python: Concatenating Datasets with Pandas
In last week's article, we looked at how we can use pandas to merge DataFrames. The merge() function is not the only pandas function that we can use to combine DataFrames. Among its many other amazing features, pandas provides the concat() function, which stands out as a vital tool for merging datasets. In this article, we’ll explore the ins and outs of concatenating DataFrames using the concat() function. We will explore both vertical and horizontal concatenation. We will demonstrate how this simple yet powerful function can help you analyze data from different sources.
Vertical Concatenation (Stacking)
The first type of concatenation that we are going to explore is called vertical concatenation. Vertical concatenation adds DataFrames on top of each other, increasing the number of rows. First, let's load the two datasets that we are going to use for this article:
Here we have two DataFrames: car_sales and car_reviews. To concatenate the two DataFrames using pandas, we use the concat() function. We pass the two DataFrames to the function as a list. By default, pandas concatenates along the vertical axis (i.e., axis = 0). This means that when we use the concat() function without specifying the axis parameter, pandas will stack the DataFrames on top of each other, combining the rows. See below:
You can see that the two DataFrames have been joined vertically. The car_reviews DataFrame has been added to the end of the car_sales DataFrame. This type of concatenation, also known as concatenating DataFrames along the rows (using axis=0), works similarly to appending elements to a list. When you vertically concatenate two or more DataFrames, you are stacking them on top of each other, much like how elements are added one after another in a list using the append() method. The resulting DataFrame has the length of the two DataFrames combined. The combined DataFrame has all the columns from the two DataFrames. The intersection of the columns and rows that are not shared by the two DataFrames has been given NaN values. You may have noticed that the resulting DataFrame preserves the original indices (highlighted in red) of the DataFrames being concatenated. For this reason, we have overlapping or identical indices in the DataFrame. There are two ways we can deal with overlapping indices:
One effective strategy that we can use when dealing with overlapping indices is to use MultiIndexing. With MultiIndexing, we create a hierarchical index in the DataFrame, which can help keep track of the source of each row after concatenation, avoiding the ambiguity that arises from overlapping indices. The concat() function has a keys parameter that we can use to create a MultiIndex. For example, we can pass the names of the DataFrames that we are concatenating as arguments to the keys parameter, and they will become the first level of the MultiIndex. See below:
Here, we have added Sales and Reviews as the first level of this MultiIndex. The original indices (0, 4) become the second level of the MultiIndex. The first level of the MultiIndex indicate the source of the DataFrames that we have concatenated. The MultiIndex will make it easy to access the values in the DataFrame. Let's say we want to access the sales value of the 'Ford Focus'. Here is how we do it:
You can see that MultiIndexing makes it easy to access values from the concatenated DataFrame. MultiIndexing keeps the structure of your data organized, which is especially useful when dealing with complex datasets or merging data from multiple sources.
2. Using the ignore_index Parameter
Another method that we can use to deal with overlapping indices is to use the ignore_index parameter of the concat() function. The default value of this parameter is False. If we set the value to True, then pandas will ignore the indices of the DataFrames being concatenated, and the new DataFrame will have a continuous integer index starting from 0, rather than maintaining the original indices of the individual DataFrames. See below:
Here, because we have set the ignore_index parameter to True, the indices of the two DataFrames being concatenated have been ignored. We now have a new DataFrame with a new continuous index from 0 - 9 (highlighted in red). This gets rid of duplicate indices.
When should you use ignore_index=True? If the original indices of the DataFrames do not carry significant meaning, or if you're combining a large number of DataFrames, it's often a good idea to reset them for simplicity.
领英推荐
There is Only One Way to Master Data Analysis: Practice, Practice, and Practice.
Dedicate 50 Days of the last half of 2024 to improving your data analysis skill with Python. Start your journey here: 50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners
Horizontal Concatenation (Column-Wise Concatenation)
Horizontal concatenation refers to combining two or more DataFrames side-by-side along the columns rather than stacking them on top of each other. The concatenation is done on axis 1. This is the axis for columns. Since the default value of the axis parameter is zero, we must explicitly set the value to 1 for this operation to work. Horizontal Concatenation will result in a new DataFrame. Here is how we can carryout a horizontal concatenation of the two DataFrames:
To perform a side-by-side concatenation, we set the axis to 1. This indicates that we are joining the two DataFrames on the columns. The resulting DataFrame has all columns from both car_sales and car_reviews, with each row corresponding to the same index in both DataFrames. If the DataFrames have different indexes, the horizontal concatenation will align them based on their indices. Missing values will be filled with NaN. Let's see what happens when we concatenate two DataFrames with different indices. First, we are going to change the index of the car_reviews DataFrame. We will set the Car_ID column as the index.
Now that we have set the 'Car_Id' column as the index, let's concatenate the two DataFrames again:
You can see in the output how pandas behaves when we concatenate two DataFrames on the columns. If one DataFrame has an index that is not present in the other, pandas will insert NaN for missing values in the output DataFrame. By default, the concatenation operation in pandas works like an "outer join," meaning it includes all indices from both DataFrames, filling in NaN where necessary. If you want to keep only indices that are common to both DataFrames, by dropping rows that don't have a match in both DataFrames, you can use the join parameter of the concat() function to set the join method to 'inner'. See below:
In this code, we have changed the join type from the default outer to inner. In the output, we only have indices that are common to both DataFrames. This eliminates NaN values.
Conclusion
Data often comes from from multiple sources. Learning to combine these different datasets, either by stacking them vertically (adding more rows) or horizontally (adding more columns), is an essential skill skill for data analysis. The concat() function is one the most important functions in pandas that you must strive to master if you work with data. Try Day 19 challenge from "50 Days of Data analysis with Python" to test your knowledge on concatenating and melting DataFrames. Thanks for reading.
Newsletter Sponsorship
You can reach a highly engaged audience of over 300,000 tech-savvy subscribers and grow your brand with a newsletter sponsorship. Contact me at [email protected] today to learn more about the sponsorship opportunities.
Data Science Professional || Expertise in Python, SQL, Advanced Excel, Power BI, & Process Optimization
1 个月Interesting
MS Computational Data Analytics at Georgia Tech
1 个月Useful tips
Model Implementation Analyst - Financial Risk & Modelling
1 个月My apologies, I haven't really got a strong understanding of Python yet, so I am still learning. Reading this article, I am not really not sure what is achieved by this table (does Python have limited joining functionality??). While Out[10] is close, it still isn't really giving the correct result, as Car_ID for records 12-14 are showing as NaN yet the initial table Out[3] does have these values.
"BTech Student in AI & Computer Science "| python, sql | Data structure | computer science fundamentals, math, statistics , communication, business understanding |
1 个月50 Days of Data Analysis with Python: The Ultimate Challenge Book for Beginners,I have a question. Does this book cover all the topics of Python?
Junior Data Scientist | YYAS Scholar | Ahaspora Network Mentee| Global Mentorship Initiative Alumni | Aspire Institute Alumni| Open to Work
1 个月Very informative