Pandas like an SQL

Pandas like an SQL

Everyone is pretty much aware about the fact that pandas is an amazing python library for data manipulation, but more often than not people associate pandas with vast universe of machine learning overlooking it's simpler use cases. So in this article I will be using pandas like an small SQL database.

DataFrame in pandas

In Pandas, a DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or SQL table

For the simple use case of this article, I downloaded a CSV file from "https://www.kaggle.com/" which features "id, name, age, bloodgroup, city, occupation" columns. Let's load the file as pandas dataframe.

No alt text provided for this image


Rows and Columns in pandas

Accessing rows and columns in pandas is quite easy but understanding in what possible ways can square brackets be used to access data sometimes become tricky. Let's see in what possible ways can we use square bracket.

  • Accessing Columns

In pandas you can use square bracket to specify what columns you want to select.

data[['id','name','occupation']]         

An equivalent of SQL statement where you write something like

select id, name, occupation from  <table-name>;        
No alt text provided for this image

  • Filtering Rows

The square brackets can also be used to filter rows based on certain given condition. The following snippet will provide only data where age is greater than 30.

condition = (data['age'] > 30)
data[condition]        

Needless to say you can apply multiple filter condition o filter for required data. Something like

condition = (data['age'] > 30) | ((data['id'] == 1) & (data['name'] == 'John'))
data[condition]        

Will result you

No alt text provided for this image

An equivalent SQL statement for this would look something like

SELECT *
FROM your_table
WHERE (age > 30) OR (id = 1 AND name = 'John');        

  • Slicing Rows

The square brackets can also be used to select designated number of rows, something very similar to limit and offset on SQL query.

data.head(5) # select top 5 rows
data[1:5] # select rows from index 1 to 5
data.tail(5) # select last 5 rows        

An equivalent SQL query for slicing would look something like

SELECT *
FROM your_table
LIMIT 4 OFFSET 1;        

  • Updating Values

Updating values in pandas is easy as well, but be sure to use .loc method.

update_condition = (data['name'] == 'John')
data.loc[update_condition, 'occupation'] = 'Doctor' # to update single row
# to update multiple rows
data.loc[update_condition, ['occupation','age']] = ['Doctor', 40]         

Here we can see I used .loc to access the data, previously we were accessing data using only data[update_condition], so what's the difference you might ask.

Well when accessing data via data[update_condition] you will get copy of data but by using .loc you will get the reference of data thus updating the data will update in actual dataframe. We will get to know about the parameters in .loc in the upcoming sections.

An equivalent SQL query for this would look like,

UPDATE your_table
SET occupation = 'Doctor', age = 40
WHERE name='John';        

Indexing in pandas

Pandas identify each row as unique row with help of index. The concept of index is very similar to primary key in SQL however it is noteworthy that indexed column in pandas can have duplicate values although not recommended. If indexed column has duplicates and you strictly want to enforce uniqueness you can use '.drop_duplicate' function. When you load a data as pandas dataframe, by default pandas indexes each row which can be checked with 'data.index'

No alt text provided for this image

You can see numbers prior to id starting from 0, these number are the indexes in pandas.

No alt text provided for this image

You can change the default index in pandas by using the function 'set_index'.

# inplace true will update existing dataframe
data.set_index('name', inplace=True)        

You can also have composite key as indexed column,

No alt text provided for this image

You can reset index method to reset the indexed column in pandas, this will again by default apply integer based indexing of 'RangeIndex' type.

# you can use drop=True parameter here however that will remove the 
# previously indexed column from dataframe itself
data.reset_index()        

.loc and .iloc

The .loc and .iloc indexers in pandas are used to access and manipulate data in a DataFrame or Series.

The basic syntax for .loc and .iloc is, dataframe.loc[row_selection, column_selection] however the .iloc indexer is primarily integer-based and is used to access data by specifying row and column integer positions.

On the first look the functionality between .loc/.iloc and the bracket notation ([]) seems same however, using .loc and .iloc is essential when you want to set values like we did in update operation.

# loc is more generic and accepts convenient row_selector, column_selector
data.loc[data['age']>30, ['name','age']]

# iloc is used for integet based data access
data.iloc[[1,3,4,5], [1,2]]        
No alt text provided for this image
No alt text provided for this image

Often times the we even see people not specifying a clear condition on .loc, in such cases it is worth remembering that if row_selection doesn't have column name for the filter, the value will be matched on the indexed column.

No alt text provided for this image

This works because pandas will look for the indexes having value 1,2,3,4 which clarifies the row selection parameter in implicit cases.


In conclusion, leveraging the power of pandas as an SQL tool will be useful in multiple scenarios like, caching entire database, filtering out relevant data from huge CSV etc. Throughout this article, we explored how pandas, a versatile Python library, can be used effectively to handle and manipulate data, providing a seamless SQL-like experience.


要查看或添加评论,请登录

sakshyam ghimire的更多文章

  • Understanding Python GIL

    Understanding Python GIL

    Python developers often encounter the term "GIL," which is often considered a major drawback of the Python programming…

  • Delving into memory inspection with python

    Delving into memory inspection with python

    In this article, we set out to investigate Python's complex memory management mechanisms. Our aim is to shed light on…

  • Demystifying OpenSearch Queries

    Demystifying OpenSearch Queries

    For those who are unaware of what OpenSearch is, OpenSearch is an open-source search and analytics engine that is…

  • Mathematics and Big O notation

    Mathematics and Big O notation

    Wouldn't it be great if your program works with best efficiency and minimum complexity? But as a programmer how exactly…

  • Implementing Pub-Sub in Golang

    Implementing Pub-Sub in Golang

    Publish-Subscribe (Pub-Sub) is an messaging pattern where sender of message (Publisher) announce event to multiple…

  • Concurrency States In Golang

    Concurrency States In Golang

    It’s well-known that writing concurrent code is challenging. Things generally takes a few rounds to get it operating as…

社区洞察

其他会员也浏览了