Polars vs Pandas: Benchmarking Performance and Beyond

by Arlind Avdullahi

Introduction

If you have ever experimented with data science, you have almost certainly heard of Pandas. To quote its GitHub documentation, Pandas is a “Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more”. The library is widely popular in the data science community because it offers a broad set of relevant functionality, is easy and intuitive to use, and is reasonably fast.

However, a more recent package promises highly competitive performance. It is called Polars, and it describes itself as a “Blazingly Fast DataFrame Library”. It claims to be nothing less than “one of the best performing solutions available”.

In this article, we put this claim to the test by comparing Pandas and Polars in a scenario close to our data science use cases. For this, we use a large dataset from Kaggle.

In the first section of this article, we review the performance and efficiency of the most essential data science operations, such as reading, selecting, filtering, sorting and grouping. We also assess how difficult it is to transfer prior knowledge of Pandas syntax to Polars. The syntax of Pandas is familiar and easy to use for most data scientists, so it is important to understand whether the Polars syntax entails additional complexity.

After this side-by-side comparison, we dive deeper into the Polars world and investigate its promising Lazy API.

Finally, we complete our study with a more qualitative comparison, considering the resources available, such as documentation, community support, extensions and integrations. These aspects tend to be overlooked in regular benchmarking because they cannot be quantified. However, they matter a great deal to developers, who all know the immense value of communities such as Stack Overflow.

Before diving in, let us take a brief step back and start by introducing Polars and the reasons why the tool is so promising.

The strength of Polars

If Polars brings such a performance boost, it is due to some key technical features.

  1. Written in Rust: Polars offers a Python API, but its core is written in Rust. This means that the heavy lifting runs as compiled native code instead of being interpreted, as pure Python code would be.

  2. Out of Core: Polars supports out of core data transformation with its streaming API. This allows you to process your results without requiring all your data to be in memory at the same time.

  3. Parallelization: Polars fully utilizes the power of your machine by dividing the workload among the available CPU cores without any additional configuration.

  4. Vectorized Query Engine: Polars uses Apache Arrow, a columnar data format, to process your queries in a vectorized manner.

  5. Lazy Evaluation API: With the lazy API, Polars does not run each query line-by-line but instead processes the full query end-to-end.


Benchmarking on a large data set from Kaggle

We consider a very large dataset, consisting of millions of rows. For such a volume of data, we roughly expect that a complete data processing workflow in Pandas takes several hours. When presenting results, each measurement is repeated 10 times, and we report both the mean and the standard deviation. All the experiments were conducted on an Apple M1 Pro chip with Python 3.9.13.

In the following, we compare key functionalities of Pandas and Polars, focusing on two aspects. The first is efficiency: is Polars faster, and by how much? The second is ease of use: how easy is it to write the Polars equivalent of Pandas commands?
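The "mean and standard deviation over 10 repetitions" protocol can be sketched with the standard library alone. The helper name `time_repeated` and the toy workload are illustrative assumptions; this is not the code that produced the numbers reported in this article.

```python
import statistics
import time


def time_repeated(func, repetitions=10):
    """Call func `repetitions` times; return (mean, std) of wall time in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Toy workload standing in for a dataframe operation (an assumption).
mean_s, std_s = time_repeated(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms")
```

In practice, `func` would wrap the operation under test, for example a call that reads the CSV file or sorts the dataframe.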

  • Installing and importing the packages

We install the required packages using pip. At the time of writing, pandas version 2.0.0 and polars version 0.17.0 were used.

pip install pandas==2.0.0  
pip install polars==0.17.0           

After installation, we import the packages into our runtime environment:

import pandas as pd
import polars as pl

1. Comparing reading time of the dataset

The US Accidents dataset from Kaggle is used for all the experiments presented below. It is stored in a CSV file and contains ~17 million rows and 46 columns.

The code to read such a dataset looks very similar for both Pandas and Polars:

#pandas
pd_df = pd.read_csv("US_Accidents.csv", nrows=1000)

#polars
pl_df = pl.read_csv("US_Accidents.csv", n_rows=1000)

Figure: Performance comparison of reading 1000 rows from a CSV file

Using Polars, it takes 4.36 ms ± 3.55 ms to read the first 1000 rows of the CSV file. By contrast, it takes Pandas 10.7 ms ± 549 μs. While this indicates that Polars is faster on average, its timings are far less reproducible than those of Pandas: the standard deviation is almost as large as the mean. In our runs Polars was always faster, but with such a spread we cannot entirely exclude that Polars is in some cases slower than Pandas, nor can we control the lack of reproducibility on a small sample of the dataset.

The difference in reading csv files becomes more apparent when the full dataset is read.

Figure: Performance comparison of reading the full CSV file with ~17 million rows

In our example, Pandas takes 1 min 27 s ± 2.43 s to read the full file, while Polars takes 7.79 s ± 1.11 s. That is a speedup of over 10 times, which becomes very significant in data science use cases.

We also notice that the large relative standard deviation observed on the small sample becomes much less pronounced once the entire dataset is considered. This is an important observation: even in a testing phase, we must consider a significant sample of the overall dataset. As we tend to develop PoCs based on small samples of the data, this observation on the stability of Polars is noteworthy.

Another interesting observation is the following: reading a dataset with Polars and then converting it to a Pandas dataframe takes on average 29.4 s ± 2.75 s. Therefore, it is faster to read a dataset with Polars and convert it to a Pandas dataframe than to read the same dataset using only Pandas.

2. Selecting data columns

The syntax for selecting a subset of the columns using the Polars package is again almost identical to the syntax of pandas.

#pandas
pd_df_selected = pd_df[['Severity', 'Start_Time', 'End_Time', 'Station', 'Stop', 'Traffic_Signal']]

#polars
pl_df_selected = pl_df[['Severity', 'Start_Time', 'End_Time', 'Station', 'Stop', 'Traffic_Signal']]

Figure: Performance comparison of selecting 6 out of 46 columns

As with reading the data, Polars performs a lot better when it comes to selecting certain columns of the dataframe. On average, it takes Pandas 256 ms ± 406 ms to return a dataframe of the selected columns, while it takes Polars only 573 μs ± 1.68 ms. Based on the means, this is a speedup of several hundred times, although both measurements show a standard deviation larger than their mean.

3. Filtering data

Filtering based on column values, on the other hand, has a slightly different syntax in Polars: instead of indexing with a boolean mask, we call filter with an expression built from pl.col.

#pandas
filter_pd_df = pd_df[pd_df['Traffic_Signal']==True]

#polars
filter_pl_df = pl_df.filter(pl.col('Traffic_Signal')==True)

Figure: Performance comparison of filtering based on Traffic_Signal==True

When it comes to filtering data, Pandas and Polars are much closer in terms of performance, although Polars is still faster by a factor of roughly 2.8. On average, it takes Pandas 3.8 s ± 2.45 s to filter the rows based on the specified condition, while it takes Polars 1.37 s ± 1.18 s to perform the same transformation.

4. Sorting

The syntax to sort the data presents minor differences between the two packages. The sorting function in Pandas is called sort_values, whereas in Polars it is called sort. Besides the function name, if you want to sort values in descending order, in Pandas you need to use ascending=False, while in Polars you have to use descending=True.

#pandas
sorted_pd_df = pd_df.sort_values(by='Humidity(%)', ascending=False)

#polars
sorted_pl_df = pl_df.sort("Humidity(%)", descending=True)

Figure: Performance comparison of sorting based on Humidity(%) column


As with filtering, the gain from Polars is more modest for sorting: it is about 1.5 times faster than Pandas. On average, Pandas takes 10.1 s ± 813 ms, while Polars does the sorting in 6.61 s ± 786 ms.

5. Grouping

#pandas
grouped_pd_df = pd_df.groupby(['State'])['ID'].agg('count')

#polars
grouped_pl_df = pl_df.groupby('State').agg(pl.col('ID').count())

Figure: Performance comparison of grouping based on State

When aggregating a dataframe, Polars once again outperforms Pandas. It takes Pandas an average of 606 ms ± 19.4 ms to run the grouping command provided above, while it takes Polars only 107 ms ± 12.1 ms to perform the same task, which is more than 5 times faster.


6. Conclusion on benchmarking

Figure: Performance comparison on the Kaggle dataset with ~17 million rows and 46 columns
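As a quick summary, the speedup factors implied by the mean timings reported in the sections above can be computed directly. The short operation labels are our own shorthand, and the calculation ignores the standard deviations.

```python
# Mean timings in seconds (pandas, polars), copied from the sections above.
timings = {
    "read full CSV": (87.0, 7.79),
    "select 6 columns": (0.256, 0.000573),
    "filter": (3.8, 1.37),
    "sort": (10.1, 6.61),
    "group by State": (0.606, 0.107),
}

# Speedup = pandas mean time divided by polars mean time.
speedups = {op: pandas_s / polars_s for op, (pandas_s, polars_s) in timings.items()}
for op, factor in speedups.items():
    print(f"{op:<18} {factor:7.1f}x")
```

The factors range from about 1.5 for sorting to several hundred for column selection, which shows how strongly the gain depends on the operation.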



Lazy API

Polars offers another opportunity to further enhance the performance of the dataframe operations by using the Lazy API. Through the Lazy API, polars does not run each query line-by-line, but instead processes the full query end-to-end. According to the Polars website, it is important to use the Lazy API because:

1. The lazy API allows Polars to apply automatic query optimization with the query optimizer.

2. The lazy API allows you to work with larger than memory datasets using streaming.

3. The lazy API can catch schema errors before processing the data.

Polars supports two modes of operation: the eager API and the lazy API. As an example, let's say that we want to find the number of accidents of high severity (Severity==4) per county. Using the eager API, the code reads as follows:

pl_df = (
    pl.read_csv("US_Accidents.csv")
    .filter(pl.col('Severity')==4)
    .groupby(['State', 'County'])
    .agg(pl.col('ID').count().alias("Count Severity"))
    .sort("Count Severity", descending=True)
)

In order to use the lazy API, we need a few syntax changes to the code above. Most importantly, we replace the function “read_csv” with “scan_csv”. This function returns a LazyFrame instead of a DataFrame, which ensures that the lazy API is being used. The query is then only executed when collect() is called at the end of the chain.

Going back to our example, the code that uses lazy API now reads:

q1 = (
    pl.scan_csv("US_Accidents.csv")
    .filter(pl.col('Severity')==4)
    .groupby(['State', 'County']).agg(pl.col('ID').count().alias("Count Severity"))
    .sort("Count Severity", descending=True)
    .collect()
)

Figure: Performance comparison between eager and lazy API

As displayed in the figure above, the lazy API greatly outperforms the eager API. The lazy API performs the task in 1.27 s ± 203 ms on average, compared to 8.42 s ± 1.19 s for the eager API. That is a speedup of more than 6 times, achieved with minimal changes to the source code.


Qualitative Comparison

So far, our comparison of Pandas and Polars has favored Polars. The size of the performance gain depends on the operation, but Polars was faster in every test we ran on the full dataset.

This point is made even clearer when considering the potential offered by the lazy API, which can make Polars even more efficient. Letting Polars optimize the full query opens up further room for optimization with only minimal changes to the code.

However, a data science project is not only about performance. It is essential to consider the broader context when selecting a tool for your data analysis needs. Looking at the statistics of the most used programming languages among developers worldwide as of 2023, Python accounts for 49.28% while Rust only amounts to 13.05% (and this number was lower in previous years). In short, Pandas boasts a significantly larger and more mature community, which translates into a wealth of online resources and a vast ecosystem of extensions and integrations. For users who prioritize readily available code examples, troubleshooting resources, and a large, well-documented user base, Pandas offers a clear advantage.


Conclusion

Ultimately, the choice between Pandas and Polars hinges on your unique project requirements and priorities. If raw performance and efficiency are paramount, Polars seems to be the superior choice. On the other hand, if you value the extensive support and resources offered by a larger community, Pandas remains a dependable and widely adopted tool. Striking the right balance between these factors is a decision that should be made with careful consideration of your specific use case and objectives.

Andres Aranda

Software Engineering

6 months ago

Created a gist out of this for comparing approx. performance for 1-time run on full dataset here https://gist.github.com/ankandrew/47fa7bc73984981d54839dab57949f4c Result on M1 chip:

Prashanth Yennampelli

Data & Analytics Consultant | Data Architect | AWS Solution Architect | Cloud Migration Consultant

7 months ago

Results may vary based on the type of data source utilized for benchmarking. Did you check the results for parquet data as input for both polars and pandas?
