登录查看更多内容

Pandas like an SQL

sakshyam ghimire

Senior Software Engineer

发布日期: 2023年7月30日

Everyone is pretty much aware about the fact that pandas is an amazing python library for data manipulation, but more often than not people associate pandas with vast universe of machine learning overlooking it's simpler use cases. So in this article I will be using pandas like an small SQL database.

DataFrame in pandas

In Pandas, a DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or SQL table

For the simple use case of this article, I downloaded a CSV file from "https://www.kaggle.com/" which features "id, name, age, bloodgroup, city, occupation" columns. Let's load the file as pandas dataframe.

Rows and Columns in pandas

Accessing rows and columns in pandas is quite easy but understanding in what possible ways can square brackets be used to access data sometimes become tricky. Let's see in what possible ways can we use square bracket.

Accessing Columns

In pandas you can use square bracket to specify what columns you want to select.

data[['id','name','occupation']]

An equivalent of SQL statement where you write something like

select id, name, occupation from  <table-name>;

Filtering Rows

The square brackets can also be used to filter rows based on certain given condition. The following snippet will provide only data where age is greater than 30.

condition = (data['age'] > 30)
data[condition]

Needless to say you can apply multiple filter condition o filter for required data. Something like

condition = (data['age'] > 30) | ((data['id'] == 1) & (data['name'] == 'John'))
data[condition]

Will result you

An equivalent SQL statement for this would look something like

SELECT *
FROM your_table
WHERE (age > 30) OR (id = 1 AND name = 'John');

Slicing Rows

The square brackets can also be used to select designated number of rows, something very similar to limit and offset on SQL query.

data.head(5) # select top 5 rows
data[1:5] # select rows from index 1 to 5
data.tail(5) # select last 5 rows

An equivalent SQL query for slicing would look something like

SELECT *
FROM your_table
LIMIT 4 OFFSET 1;

Updating Values

Updating values in pandas is easy as well, but be sure to use .loc method.

领英推荐

Understanding Pandas DataFrames: A Complete Guide with…

ITVersity, Inc. 1 个月前

Unlocking Pandas: Listing Column Names and a Solid…

ITVersity, Inc. 1 个月前

Loading Data into Snowflake using Snowpark DataFrames

Factspan 8 个月前

update_condition = (data['name'] == 'John')
data.loc[update_condition, 'occupation'] = 'Doctor' # to update single row
# to update multiple rows
data.loc[update_condition, ['occupation','age']] = ['Doctor', 40]

Here we can see I used .loc to access the data, previously we were accessing data using only data[update_condition], so what's the difference you might ask.

Well when accessing data via data[update_condition] you will get copy of data but by using .loc you will get the reference of data thus updating the data will update in actual dataframe. We will get to know about the parameters in .loc in the upcoming sections.

An equivalent SQL query for this would look like,

UPDATE your_table
SET occupation = 'Doctor', age = 40
WHERE name='John';

Indexing in pandas

Pandas identify each row as unique row with help of index. The concept of index is very similar to primary key in SQL however it is noteworthy that indexed column in pandas can have duplicate values although not recommended. If indexed column has duplicates and you strictly want to enforce uniqueness you can use '.drop_duplicate' function. When you load a data as pandas dataframe, by default pandas indexes each row which can be checked with 'data.index'

You can see numbers prior to id starting from 0, these number are the indexes in pandas.

You can change the default index in pandas by using the function 'set_index'.

# inplace true will update existing dataframe
data.set_index('name', inplace=True)

You can also have composite key as indexed column,

You can reset index method to reset the indexed column in pandas, this will again by default apply integer based indexing of 'RangeIndex' type.

# you can use drop=True parameter here however that will remove the 
# previously indexed column from dataframe itself
data.reset_index()

.loc and .iloc

The .loc and .iloc indexers in pandas are used to access and manipulate data in a DataFrame or Series.

The basic syntax for .loc and .iloc is, dataframe.loc[row_selection, column_selection] however the .iloc indexer is primarily integer-based and is used to access data by specifying row and column integer positions.

On the first look the functionality between .loc/.iloc and the bracket notation ([]) seems same however, using .loc and .iloc is essential when you want to set values like we did in update operation.

# loc is more generic and accepts convenient row_selector, column_selector
data.loc[data['age']>30, ['name','age']]

# iloc is used for integet based data access
data.iloc[[1,3,4,5], [1,2]]

Often times the we even see people not specifying a clear condition on .loc, in such cases it is worth remembering that if row_selection doesn't have column name for the filter, the value will be matched on the indexed column.

This works because pandas will look for the indexes having value 1,2,3,4 which clarifies the row selection parameter in implicit cases.

In conclusion, leveraging the power of pandas as an SQL tool will be useful in multiple scenarios like, caching entire database, filtering out relevant data from huge CSV etc. Throughout this article, we explored how pandas, a versatile Python library, can be used effectively to handle and manipulate data, providing a seamless SQL-like experience.

要查看或添加评论，请登录

sakshyam ghimire的更多文章

Understanding Python GIL

2024年2月25日

Understanding Python GIL

Python developers often encounter the term "GIL," which is often considered a major drawback of the Python programming…
Delving into memory inspection with python

2023年10月1日

Delving into memory inspection with python

In this article, we set out to investigate Python's complex memory management mechanisms. Our aim is to shed light on…
Demystifying OpenSearch Queries

2023年6月11日

Demystifying OpenSearch Queries

For those who are unaware of what OpenSearch is, OpenSearch is an open-source search and analytics engine that is…
Mathematics and Big O notation

2022年7月26日

Mathematics and Big O notation

Wouldn't it be great if your program works with best efficiency and minimum complexity? But as a programmer how exactly…
Implementing Pub-Sub in Golang

2022年6月18日

Implementing Pub-Sub in Golang

Publish-Subscribe (Pub-Sub) is an messaging pattern where sender of message (Publisher) announce event to multiple…
Concurrency States In Golang

2022年4月28日

Concurrency States In Golang

It’s well-known that writing concurrent code is challenging. Things generally takes a few rounds to get it operating as…

See all articles

Pandas like an SQL

sakshyam ghimire

Senior Software Engineer

DataFrame in pandas

Rows and Columns in pandas

领英推荐

Indexing in pandas

.loc and .iloc

sakshyam ghimire的更多文章

社区洞察

其他会员也浏览了

Ten more random useful things in R you may not know about

Z-Order: Visualization and Implementation

Data Tools: a love-hate affair

Practice Window Functions for Data Analysis with SQLite and Jupyter Notebook

Seaborn

Data Wars: Tools and Technologies Compared – Which Ones to Invest Your Time In?

Pandas - Sort DataFrame

DUCKDB: You Should Be Using This Yesterday

Pandas Data Types: A Detailed Explanation

DataFrame in pandas

Rows and Columns in pandas

领英推荐

Indexing in pandas

.loc and .iloc

sakshyam ghimire的更多文章

Understanding Python GIL

Delving into memory inspection with python

Demystifying OpenSearch Queries

Mathematics and Big O notation

Implementing Pub-Sub in Golang

Concurrency States In Golang

社区洞察

其他会员也浏览了

Ten more random useful things in R you may not know about

Z-Order: Visualization and Implementation

Data Tools: a love-hate affair

Practice Window Functions for Data Analysis with SQLite and Jupyter Notebook

Seaborn

Data Wars: Tools and Technologies Compared – Which Ones to Invest Your Time In?

Pandas - Sort DataFrame

DUCKDB: You Should Be Using This Yesterday

Pandas Data Types: A Detailed Explanation