Data Manipulation with Pandas

Data Manipulation with Pandas


Ways of Manipulating Data with Pandas in Python with Examples

Pandas is a popular data manipulation library for Python. It provides data structures for efficiently storing and manipulating large datasets. In this blog post, we will introduce some of the most common ways of manipulating data with Pandas, along with examples.


Subset Selection

Subset selection is the process of selecting a subset of rows or columns from a DataFrame based on certain criteria. Here are some examples:


Selecting a Single Column

To select a single column from a DataFrame, you can use the column name as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df['name'])        

Output:

0     John
1     Mary
2    Peter
3     Jeff
4     Lisa
Name: name, dtype: object        

Selecting Multiple Columns

To select multiple columns from a DataFrame, you can pass a list of column names as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df[['name', 'age']])        

Output:

name  age
0   John   23
1   Mary   19
2  Peter   42
3   Jeff   31
4   Lisa   24        

Selecting Rows Based on a Condition

To select rows from a DataFrame based on a condition, you can use boolean indexing. For example, to select all rows where the age is greater than 30:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df[df['age'] > 30])        

Output:

name  age    country
2  Peter   42  Australia
3   Jeff   31        USA        

Loc and Iloc in Pandas: Explained with Examples

Two of the most commonly used methods for data selection and manipulation in Pandas are?loc?and?iloc. Let's introduce these methods and provide examples of how they can be used.


Loc

loc?is a label-based data selection method in Pandas. It allows you to select rows and columns from a DataFrame based on their labels. Here are some examples:


Selecting a Single Row

To select a single row based on its label, you can use the row label as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc['a'])        

Output:

name       John
age          23
country     USA
Name: a, dtype: object        

Selecting Multiple Rows

To select multiple rows based on their labels, you can pass a list of row labels as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc[['a', 'c', 'e']])        

Output:

name  age    country
a   John   23        USA
c  Peter   42  Australia
e   Lisa   24     Canada        

Selecting a Single Column

To select a single column based on its label, you can use the column name as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc[:, 'name'])        

Output:

a     John
b     Mary
c    Peter
d     Jeff
e     Lisa
Name: name, dtype: object        

Selecting Multiple Columns

To select multiple columns based on their labels, you can pass a list of column names as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc[:, ['name', 'age']])        

Output:

name  age
a   John   23
b   Mary   19
c  Peter   42
d   Jeff   31
e   Lisa   24        

Selecting Rows and Columns

To select both rows and columns based on their labels, you can pass both row and column labels as indexes. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])
print(df.loc[['a', 'c', 'e'], ['name', 'age']])        

Output:

name  age
a   John   23
c  Peter   42
e   Lisa   24        

Iloc

iloc?is an integer-based data selection method in Pandas. It allows you to select rows and columns from a DataFrame based on their integer positions. Here are some examples:


Selecting a Single Row

To select a single row based on its integer position, you can use the row index as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df.iloc[0])        

Output:

name       John
age          23
country     USA
Name: 0, dtype: object        

Selecting Multiple Rows

To select multiple rows based on their integer positions, you can pass a list of row indexes as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df.iloc[[0, 2, 4]])        

Output:

name  age country
0  John   23     USA
2  Peter  42  Australia
4  Lisa   24  Canada        

Selecting a Single Column

To select a single column based on its integer position, you can use the column index as an index. For example:


import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa'],
        'age': [23, 19, 42, 31, 24],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada']}
df = pd.DataFrame(data)
print(df.iloc[:, 1])        

Groupby and Data Aggregation in Pandas

One of the most important functionalities of Pandas is its ability to group data and perform data aggregation. In this article, Let's 's introduce the?groupby?method and provide examples of how it can be used for data aggregation.


Groupby

The?groupby?method in Pandas allows you to group rows of a DataFrame based on the values in one or more columns. Once the rows are grouped, you can perform various operations on the groups, such as counting, summing, averaging, and more.


Here’s an example of how to use?groupby?to group rows of a DataFrame based on the values in one column:

import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa', 'Ann', 'Bob', 'Alice'],
        'age': [23, 19, 42, 31, 24, 26, 29, 35],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada', 'USA', 'USA', 'Australia'],
        'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'F']}
df = pd.DataFrame(data)
grouped = df.groupby('country')        

In this example, we create a DataFrame with four columns:?name,?age,?country, and?gender. Then, we use the?groupby?method to group the rows of the DataFrame based on the values in the?country?column. The result is a?DataFrameGroupBy?object that contains the groups.

Data Aggregation

Once you have grouped the rows of a DataFrame, you can perform various operations on the groups. These operations are collectively known as data aggregation. Here are some examples of how to perform data aggregation using the?groupby?method:


Counting Rows in Each Group

To count the number of rows in each group, you can use the?size?method:


grouped.size()        

Output:

country
Australia    2
Canada       2
USA          4
dtype: int64        

This returns a Pandas?Series?object that contains the number of rows in each group.

Summing Values in Each Group

To sum the values in each group, you can use the?sum?method:


grouped.sum()        

Output:

age
country
Australia   77
Canada      43
USA        107        

This returns a DataFrame that contains the sum of the?age?column for each group.

Averaging Values in Each Group

To calculate the average of the values in each group, you can use the?mean?method:


grouped.mean()        

Output:

age
country
Australia  38.500000
Canada     21.500000
USA        26.750000        

This returns a DataFrame that contains the average of the?age?column for each group.

Maximum and Minimum Values in Each Group

To find the maximum and minimum values in each group, you can use the?max?and?min?methods:


grouped.max()        

Output:

age gender    name
country
Australia   42      M   Peter
Canada      24      F    Mary
USA         35      M  Robert
grouped.min()        

Output:

age gender   name
country
Australia   35      F  Alice
Canada      19      F   Lisa
USA         23      F    Ann        

These methods return DataFrames that contain the maximum and minimum values for each column in each group.

Applying Custom Functions

You can also apply custom functions to the groups using the?apply?method. For example, to find the range of ages in each group, you can define a custom function that calculates the range and apply it to the?age?column:


def age_range(group):
    return group.max() - group.min()
grouped['age'].apply(age_range)        

Output:

country
Australia     7
Canada        5
USA          12
Name: age, dtype: int64        

This returns a Pandas?Series?object that contains the age range for each group.

Pivot Tables in Pandas: A Comprehensive Guide with Examples

One of the most powerful functionalities of Pandas is its ability to create pivot tables. In this article, let's introduce pivot tables and provide examples of how they can be used for data analysis.


What are Pivot Tables?

A pivot table is a table that summarizes and aggregates data from another table. It allows you to group, filter, and aggregate data based on different criteria. Pivot tables are especially useful for analyzing large datasets with many variables.


Here’s an example of how to create a pivot table in Pandas:

import pandas as pd
data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa', 'Ann', 'Bob', 'Alice'],
        'age': [23, 19, 42, 31, 24, 26, 29, 35],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada', 'USA', 'USA', 'Australia'],
        'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'F']}
df = pd.DataFrame(data)
pivot = pd.pivot_table(df, values='age', index='country', columns='gender', aggfunc='mean')
print(pivot)        

In this example, we create a DataFrame with four columns:?name,?age,?country, and?gender. Then, we create a pivot table using the?pd.pivot_table()?method. We specify the?values?to aggregate (age), the?index?column (country), the?columns?to group by (gender), and the?aggfunc?to use for aggregation (mean). The result is a pivot table that shows the average age of males and females in each country.

Pivot Table Examples

Pivot Table with Multiple Aggregations

You can also create pivot tables with multiple aggregations. For example, to show the total and average age of males and females in each country:


pivot = pd.pivot_table(df, values='age', index='country', columns='gender', aggfunc=['sum', 'mean'])
print(pivot)        

Output:

sum           mean
gender       F    M          F     M
country
Australia   35   42       35.0  42.0
Canada      43   24       21.5  24.0
USA         49  107       24.5  35.7        

This pivot table shows the sum and mean age of males and females in each country.

Pivot Table with Multiple Indexes

You can also create pivot tables with multiple indexes. For example, to show the average age of males and females in each country and state:


data = {'name': ['John', 'Mary', 'Peter', 'Jeff', 'Lisa', 'Ann', 'Bob', 'Alice'],
        'age': [23, 19, 42, 31, 24, 26, 29, 35],
        'country': ['USA', 'Canada', 'Australia', 'USA', 'Canada', 'USA', 'USA', 'Australia'],
        'state': ['NY', 'BC', 'NSW', 'CA', 'ON', 'NY', 'CA', 'VIC'],
        'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M', 'F']}
df = pd.DataFrame(data)

pivot = pd.pivot_table(df, values='age', index=['country', 'state'], columns='gender', aggfunc='mean')
print(pivot)        

Output:

gender            F     M
country   state
Australia NSW   NaN  42.0
          VIC  35.0   NaN
Canada    BC   19.0   NaN
          ON   24.0   NaN
USA       CA    NaN  31.0
          NY   26.0  23.0        

This pivot table shows the average age of males and females in each country and state.

Pivot Table with Custom Aggregation Function

You can also create pivot tables with custom aggregation functions. For example, to show the range of ages of males and females in each country:


def age_range(group):
    return group.max() - group.min()
pivot = pd.pivot_table(df, values='age', index='country', columns='gender', aggfunc=age_range)

print(pivot)        

Output:

gender      F   M
country
Australia NaN  7.0
Canada    5.0  NaN
USA       8.0  12.0        

This pivot table shows the age range of males and females in each country.

Conclusion

In this document, we have introduced Pandas, a popular data manipulation library for Python. We have covered its basic functionalities, such as data structures, creating DataFrames, and reading and writing data. We have also discussed some common ways of manipulating data with Pandas, such as subset selection and the?loc?and?iloc?methods. Overall, Pandas is a powerful tool for working with data in Python, and its capabilities go far beyond what we have covered here. If you are working with data, Pandas is definitely worth exploring further.


#python #pandas #dataanalysis #dataanalytics #datamanipulation #pivottables #dataframes

Arman Tunga

AI Engineer | Generative AI | LLM | LLMOps | Computer Vision

1 年

Congrats Can, great examples I loved it! ??

要查看或添加评论,请登录

社区洞察

其他会员也浏览了