Pandas like an SQL
Everyone is pretty much aware about the fact that pandas is an amazing python library for data manipulation, but more often than not people associate pandas with vast universe of machine learning overlooking it's simpler use cases. So in this article I will be using pandas like an small SQL database.
DataFrame in pandas
In Pandas, a DataFrame is a two-dimensional tabular data structure, similar to a spreadsheet or SQL table
For the simple use case of this article, I downloaded a CSV file from "https://www.kaggle.com/" which features "id, name, age, bloodgroup, city, occupation" columns. Let's load the file as pandas dataframe.
Rows and Columns in pandas
Accessing rows and columns in pandas is quite easy but understanding in what possible ways can square brackets be used to access data sometimes become tricky. Let's see in what possible ways can we use square bracket.
In pandas you can use square bracket to specify what columns you want to select.
data[['id','name','occupation']]
An equivalent of SQL statement where you write something like
select id, name, occupation from <table-name>;
The square brackets can also be used to filter rows based on certain given condition. The following snippet will provide only data where age is greater than 30.
condition = (data['age'] > 30)
data[condition]
Needless to say you can apply multiple filter condition o filter for required data. Something like
condition = (data['age'] > 30) | ((data['id'] == 1) & (data['name'] == 'John'))
data[condition]
Will result you
An equivalent SQL statement for this would look something like
SELECT *
FROM your_table
WHERE (age > 30) OR (id = 1 AND name = 'John');
The square brackets can also be used to select designated number of rows, something very similar to limit and offset on SQL query.
data.head(5) # select top 5 rows
data[1:5] # select rows from index 1 to 5
data.tail(5) # select last 5 rows
An equivalent SQL query for slicing would look something like
SELECT *
FROM your_table
LIMIT 4 OFFSET 1;
Updating values in pandas is easy as well, but be sure to use .loc method.
领英推荐
update_condition = (data['name'] == 'John')
data.loc[update_condition, 'occupation'] = 'Doctor' # to update single row
# to update multiple rows
data.loc[update_condition, ['occupation','age']] = ['Doctor', 40]
Here we can see I used .loc to access the data, previously we were accessing data using only data[update_condition], so what's the difference you might ask.
Well when accessing data via data[update_condition] you will get copy of data but by using .loc you will get the reference of data thus updating the data will update in actual dataframe. We will get to know about the parameters in .loc in the upcoming sections.
An equivalent SQL query for this would look like,
UPDATE your_table
SET occupation = 'Doctor', age = 40
WHERE name='John';
Indexing in pandas
Pandas identify each row as unique row with help of index. The concept of index is very similar to primary key in SQL however it is noteworthy that indexed column in pandas can have duplicate values although not recommended. If indexed column has duplicates and you strictly want to enforce uniqueness you can use '.drop_duplicate' function. When you load a data as pandas dataframe, by default pandas indexes each row which can be checked with 'data.index'
You can see numbers prior to id starting from 0, these number are the indexes in pandas.
You can change the default index in pandas by using the function 'set_index'.
# inplace true will update existing dataframe
data.set_index('name', inplace=True)
You can also have composite key as indexed column,
You can reset index method to reset the indexed column in pandas, this will again by default apply integer based indexing of 'RangeIndex' type.
# you can use drop=True parameter here however that will remove the
# previously indexed column from dataframe itself
data.reset_index()
.loc and .iloc
The .loc and .iloc indexers in pandas are used to access and manipulate data in a DataFrame or Series.
The basic syntax for .loc and .iloc is, dataframe.loc[row_selection, column_selection] however the .iloc indexer is primarily integer-based and is used to access data by specifying row and column integer positions.
On the first look the functionality between .loc/.iloc and the bracket notation ([]) seems same however, using .loc and .iloc is essential when you want to set values like we did in update operation.
# loc is more generic and accepts convenient row_selector, column_selector
data.loc[data['age']>30, ['name','age']]
# iloc is used for integet based data access
data.iloc[[1,3,4,5], [1,2]]
Often times the we even see people not specifying a clear condition on .loc, in such cases it is worth remembering that if row_selection doesn't have column name for the filter, the value will be matched on the indexed column.
This works because pandas will look for the indexes having value 1,2,3,4 which clarifies the row selection parameter in implicit cases.
In conclusion, leveraging the power of pandas as an SQL tool will be useful in multiple scenarios like, caching entire database, filtering out relevant data from huge CSV etc. Throughout this article, we explored how pandas, a versatile Python library, can be used effectively to handle and manipulate data, providing a seamless SQL-like experience.