Data Analysis With Python: 5 pandas Column Operations for Data Analysts
Introduction
Data analysts rely heavily on Python’s pandas library for efficient data manipulation and analysis. Pandas provides a wealth of functionalities to handle datasets, and one of its strengths lies in its ability to perform operations on columns effectively. In this article, we’ll explore five essential pandas column operations that every data analyst should know. These operations are renaming columns, changing the column order, creating a MultiIndex of columns, adding multiple columns, and dropping multiple columns.
1. Renaming Columns
When you are dealing with structured data, there are going to be instances where you may need to rename your columns. Renaming columns makes the dataset more understandable, especially when dealing with complex or abbreviated column names. Pandas provides a very simple and intuitive way to rename DataFrame columns. Below, I have a DataFrame that we are going to use to demonstrate how we can rename columns.
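The original code listing did not survive, so the following is a minimal sketch of such a DataFrame. The column names "Names", "Age", "Profession", and "Salary" come from the text; the list names and all the values are hypothetical placeholders.

```python
import pandas as pd

# Four lists, one per column. The values are made up for illustration;
# the article's original data is not available.
names = ["Alice", "Ben", "Carol", "David"]
age = [28, 35, 42, 31]
profession = ["Engineer", "Teacher", "Doctor", "Analyst"]
salary = [85000, 52000, 120000, 67000]

# Build the DataFrame from the four lists.
df = pd.DataFrame({
    "Names": names,
    "Age": age,
    "Profession": profession,
    "Salary": salary,
})
print(df)
```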
In this code, we have created a DataFrame from four lists using the pandas DataFrame() function. Let’s say we have decided to rename the "Profession" column to “Job_Title.” Here is how we can do it using the rename() method:
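A minimal sketch of this rename, assuming a small DataFrame with hypothetical sample values (only the column names are taken from the text):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the article.
df = pd.DataFrame({
    "Names": ["Alice", "Ben", "Carol", "David"],
    "Age": [28, 35, 42, 31],
    "Profession": ["Engineer", "Teacher", "Doctor", "Analyst"],
    "Salary": [85000, 52000, 120000, 67000],
})

# Key = current column name, value = new column name.
df.rename(columns={"Profession": "Job_Title"}, inplace=True)

print(df.columns.tolist())
# ['Names', 'Age', 'Job_Title', 'Salary']
```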
The rename() method has a parameter called columns. This parameter takes a dictionary object as an argument. In the code above, we have passed the dictionary as an argument. In this dictionary, the key ("Profession") is the current name of the column, and the value ("Job_Title") is the name we want to replace it with. Since we want to modify the DataFrame itself rather than get back a renamed copy, we set the inplace parameter to True. You can see in the output that the name of the column has now changed to "Job_Title."
What if we want to rename multiple columns? We can use the rename() method for that as well. Let’s say we want to rename the column "Names" to "First_Name" and "Salary" to "Salary_Per_Year." Instead of using the columns parameter, I will show you how you can use the mapper parameter to rename multiple columns at the same time. Here is the code below:
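A sketch of the mapper approach, assuming the DataFrame already has the "Job_Title" column from the previous step. The data values are hypothetical, and the variable name mapper is chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample data, in the state left by the previous rename.
df = pd.DataFrame({
    "Names": ["Alice", "Ben", "Carol", "David"],
    "Age": [28, 35, 42, 31],
    "Job_Title": ["Engineer", "Teacher", "Doctor", "Analyst"],
    "Salary": [85000, 52000, 120000, 67000],
})

# Old names as keys, new names as values.
mapper = {"Names": "First_Name", "Salary": "Salary_Per_Year"}

# axis=1 applies the mapping to column labels instead of row labels.
df.rename(mapper=mapper, axis=1, inplace=True)

print(df.columns.tolist())
# ['First_Name', 'Age', 'Job_Title', 'Salary_Per_Year']
```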
In this code, the mapper dictionary maps old column names to new column names: the keys are the old names, and the values are the new names. We pass this dictionary to the mapper parameter and specify axis=1 to indicate that the mapping applies to the column labels rather than the row labels. We set inplace=True because we are modifying the original DataFrame in place.
2. Changing the Column Order
When analyzing data, we may want to change the order in which the columns appear in the DataFrame. Reordering columns in a logical sequence can enhance the readability of the dataset, especially when related columns are grouped together. We will continue to work with the DataFrame above. Now, let's say we want to change the order of the columns. We want the "First_Name" column to be followed by the "Job_Title" and "Salary_Per_Year" columns, respectively. This means that "Age" will become the last column. Here is how we can change the order:
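One way to do this, sketched with hypothetical sample values, is to select the columns with a list in the desired order. The variable name new_order is an assumption, and the original listing may have used a different approach.

```python
import pandas as pd

# Hypothetical sample data with the renamed columns.
df = pd.DataFrame({
    "First_Name": ["Alice", "Ben", "Carol", "David"],
    "Age": [28, 35, 42, 31],
    "Job_Title": ["Engineer", "Teacher", "Doctor", "Analyst"],
    "Salary_Per_Year": [85000, 52000, 120000, 67000],
})

# List the columns in the order we want them to appear.
new_order = ["First_Name", "Job_Title", "Salary_Per_Year", "Age"]

# Selecting with a list returns the columns in the listed order.
df = df[new_order]

print(df.columns.tolist())
# ['First_Name', 'Job_Title', 'Salary_Per_Year', 'Age']
```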
In this code, we first create a list with the desired order of columns. Then we pass this list to the DataFrame to select the columns in that order. You can see in the output that the order of the columns has changed.
3. Creating a MultiIndex of Columns
Using pandas, we can also set multiple columns as the index of the DataFrame, which gives the rows a MultiIndex. Understanding and working with MultiIndex can be powerful for handling complex datasets where data needs to be organized and queried at multiple levels. Let’s say we want to set the "First_Name" and "Age" columns as a MultiIndex. To set multiple columns as an index, we can use the set_index() method. We pass a list containing the names of the columns (First_Name and Age) that we want to set as the MultiIndex to this method. Here is the code below:
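A sketch of this step with hypothetical sample values. Note that the column names go into set_index() together as one list.

```python
import pandas as pd

# Hypothetical sample data with the reordered columns.
df = pd.DataFrame({
    "First_Name": ["Alice", "Ben", "Carol", "David"],
    "Job_Title": ["Engineer", "Teacher", "Doctor", "Analyst"],
    "Salary_Per_Year": [85000, 52000, 120000, 67000],
    "Age": [28, 35, 42, 31],
})

# Move the two columns into the row index, producing a two-level MultiIndex.
df.set_index(["First_Name", "Age"], inplace=True)

print(list(df.index.names))
# ['First_Name', 'Age']
print(df.columns.tolist())
# ['Job_Title', 'Salary_Per_Year']
```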
In this code, the set_index() method is used to set the "First_Name" and "Age" columns as a MultiIndex. The inplace=True parameter modifies the original DataFrame in place. You can see in the output that the two columns have been set as a MultiIndex of the DataFrame.
To reset the index and convert the MultiIndex back to columns, you can use the reset_index() method. Here is the code below:
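A sketch of the reset, again with hypothetical sample values. The index levels come back as ordinary columns at the front of the DataFrame, and the rows get a default integer index.

```python
import pandas as pd

# Hypothetical sample data, indexed by First_Name and Age.
df = pd.DataFrame({
    "First_Name": ["Alice", "Ben", "Carol", "David"],
    "Job_Title": ["Engineer", "Teacher", "Doctor", "Analyst"],
    "Salary_Per_Year": [85000, 52000, 120000, 67000],
    "Age": [28, 35, 42, 31],
}).set_index(["First_Name", "Age"])

# Turn both index levels back into regular columns.
df.reset_index(inplace=True)

print(df.columns.tolist())
# ['First_Name', 'Age', 'Job_Title', 'Salary_Per_Year']
```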
4. Adding Multiple Columns
It is a very common practice to add new columns to a DataFrame during analysis. Adding new columns can enhance the dataset with additional information, derived features, or calculated values that are relevant to the analysis. Let’s continue working with our DataFrame above. We want to add two columns to the DataFrame: the favorite car column and the favorite fruit column. One way we can perform this operation is by using the pandas assign() method. With the assign() method, we can add multiple columns to a DataFrame in a chained manner. Each keyword argument in assign() represents a new column. Here’s how we can add the two columns using assign():
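A minimal sketch of assign() with two new columns. The column names Favorite_Car and Favorite_Fruit and every value in them are hypothetical; the text names the columns only as "favorite car" and "favorite fruit".

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "First_Name": ["Alice", "Ben", "Carol", "David"],
    "Age": [28, 35, 42, 31],
    "Job_Title": ["Engineer", "Teacher", "Doctor", "Analyst"],
    "Salary_Per_Year": [85000, 52000, 120000, 67000],
})

# Each keyword argument becomes a new column; each list has one value per row.
# assign() returns a new DataFrame, so we bind the result back to df.
df = df.assign(
    Favorite_Car=["Toyota", "Ford", "Honda", "Volvo"],
    Favorite_Fruit=["Apple", "Mango", "Banana", "Orange"],
)

print(df.columns.tolist())
# ['First_Name', 'Age', 'Job_Title', 'Salary_Per_Year', 'Favorite_Car', 'Favorite_Fruit']
```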
You can see in the output that we have added two columns to the DataFrame. The assign() method returns a new DataFrame with the new columns in addition to all the existing columns; the original DataFrame is left unchanged unless you assign the result back to it. Note that for this method to work when you pass lists, the length of each added column must match the number of rows in the DataFrame.
5. Dropping Multiple Columns
Another common column operation that you must know is how to drop columns. Sometimes columns that contain redundant or irrelevant information may be dropped to streamline the dataset. Dropping columns can be an essential step in data preprocessing to ensure the quality of the dataset. Using pandas, we can drop multiple columns using the drop() method. Let’s say we want to drop the "First_Name" and "Age" columns. Here is how we can do it using the drop() method:
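A sketch of dropping both columns in one call, using hypothetical sample values. It uses the columns parameter of drop(); passing the list together with axis=1 would be an equivalent form, and the original listing may have used either.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "First_Name": ["Alice", "Ben", "Carol", "David"],
    "Age": [28, 35, 42, 31],
    "Job_Title": ["Engineer", "Teacher", "Doctor", "Analyst"],
    "Salary_Per_Year": [85000, 52000, 120000, 67000],
})

# Pass the names of the columns to remove as a list.
df = df.drop(columns=["First_Name", "Age"])

print(df.columns.tolist())
# ['Job_Title', 'Salary_Per_Year']
```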
You can see that we have dropped the two columns from the DataFrame.
Conclusion
These are some of the common column operations that you can perform on a DataFrame using pandas. These column operations are essential for data analysts to transform and prepare datasets for analysis. The operation you choose will depend on the specific goals of the analysis and the nature of the dataset being analyzed. Thank you for reading this article. You can download the code in this article from GitHub. Please like, share, and subscribe to this newsletter if you are not yet a subscriber. You can also follow me on LinkedIn, where I share more Python-related content.