DataFrames Battle Royale | Pandas vs Polars vs Spark
Olivier Soucy
Founder @ okube.ai | Fractional Data Platform Engineer | Open-source Developer | Databricks Partner
Usually, when I sit down to write these blog posts, I have a clear direction in mind. This time, however, it's a bit different. Polars is a technology I'm not very familiar with, but its recent surge in popularity makes it worth exploring. To understand its place in the realm of data manipulation, let's compare it with the two established giants: Pandas and Spark.
Preface
Believe it or not, more than ten years ago, I started building a custom DataFrame library on top of NumPy for CAE, the company I was working for at the time. Why? Simply because I didn't know any better. Coming from a MATLAB background, it was one of my first Python projects and, back in 2012, Pandas wasn't as well-known as it is today. In fact, I didn't even know what a DataFrame was. It was an interesting endeavor, and I learned a lot, but as you might guess, it was soon replaced by Pandas—a much more powerful and well-designed library. I evolved into a master of the black-and-white bear, even using its obscure block structure and block manager
df._mgr.blocks
to build a real-time application with a rolling buffer. Fast forward to today, and I only use Pandas occasionally. Now, as I manage large (and not-so-large) datasets on the cloud, Apache Spark has become my primary DataFrame engine.
Overview
Back to the task at hand, let's discuss what each DataFrame technology has to offer.
Pandas operates with an in-memory, single-threaded architecture ideal for small to medium datasets, providing simplicity and immediate feedback. Polars, built with Rust, offers multi-threaded, in-memory processing and supports both eager and lazy execution, optimizing performance for larger datasets. Apache Spark uses a distributed computing architecture with lazy execution, designed for processing massive datasets across clusters, ensuring scalability and fault tolerance.
Architecture
Pandas
Pandas operates entirely in-memory, storing data in a DataFrame structure similar to a table in a relational database. Each operation is executed immediately (eager execution), which makes it simple and intuitive but limits its scalability to the memory available on a single machine. It is generally best suited for datasets up to a few gigabytes.
Polars
Polars is built for speed and efficiency, utilizing Rust's performance capabilities and supporting multi-threaded execution. With its lazy execution model, Polars can optimize the entire workflow, making it highly efficient for complex operations on larger datasets. It outperforms Pandas significantly in terms of speed and memory usage, especially when handling larger-than-memory data.
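To make this concrete, here is a minimal sketch of a lazy Polars query; the file name and column names are made up for illustration, and the streaming flag assumes a reasonably recent Polars version. The query is only planned at first, and the streaming collect lets Polars work through the data in chunks rather than loading it all at once.

```python
import polars as pl

# Build a lazy query: nothing is read or computed yet, only a plan is recorded.
lazy_query = (
    pl.scan_parquet("events.parquet")  # hypothetical file
    .filter(pl.col("amount") > 100)
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# Execute the optimized plan; streaming mode processes the data in batches,
# which is what makes larger-than-memory datasets tractable.
result = lazy_query.collect(streaming=True)
```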
Spark
Apache Spark excels in distributed data processing, making it ideal for massive datasets spread across a cluster of machines. Its lazy execution model allows for complex query optimization and efficient resource management. Spark's ability to handle large-scale data processing with fault tolerance and scalability makes it the go-to choice for big data environments, though it requires more setup, overhead and resources compared to Pandas and Polars.
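As a rough sketch of what that looks like in practice (the paths and column names are illustrative, and on a managed platform like Databricks the session is usually created for you):

```python
from pyspark.sql import SparkSession, functions as F

# Create (or connect to) a Spark session.
spark = SparkSession.builder.appName("battle-royale").getOrCreate()

# Reading and transforming only build a logical plan; nothing runs yet.
df = (
    spark.read.parquet("s3://my-bucket/events/")  # hypothetical path
    .filter(F.col("amount") > 100)
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# The write is an action: it triggers the optimized, distributed computation
# across the cluster's worker nodes.
df.write.mode("overwrite").parquet("s3://my-bucket/aggregates/")
```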
Installation
Pandas and Polars use backend implementations like C, PyArrow, and Rust to enhance performance, yet they remain pure Python packages that can be installed with the usual pip install. In contrast, Spark offers a Python package (PySpark) but also requires the installation of Java components for the worker nodes to perform computations. This installation is not trivial and may also necessitate configuring and optimizing clusters to achieve the best performance. The recent introduction of Spark Connect—a lightweight client that allows you to analyze, optimize, and schedule your transformations—is a welcome addition. However, you'll still need to set up a server somewhere to handle actual data transformations.
In this context, Pandas and Polars are much more approachable than Spark. Most people don't set up a Spark environment themselves but instead rely on cloud-managed services like Databricks or AWS EMR.
APIs
While Pandas, Polars, and Spark offer similar functionalities for common data operations, they differ significantly in syntax and performance characteristics (more on that later). Polars and Spark are quite similar in that their operations are chainable (each call returns a new DataFrame) and often involve using a column object. On the other hand, Pandas methods typically return a new column that needs to be assigned back to the DataFrame. Let's illustrate this with an example where we add two columns, x and y, to create a new column; a minimal sketch is shown below.
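Here is a minimal sketch, assuming two numeric columns x and y and a new column z (the data itself is made up):

```python
import pandas as pd
import polars as pl
from pyspark.sql import SparkSession, functions as F

data = {"x": [1, 2, 3], "y": [10, 20, 30]}

# Pandas: the new column is assigned directly onto the existing DataFrame.
pdf = pd.DataFrame(data)
pdf["z"] = pdf["x"] + pdf["y"]

# Polars: the expression only describes the column; with_columns returns a new DataFrame.
pldf = pl.DataFrame(data).with_columns((pl.col("x") + pl.col("y")).alias("z"))

# Spark: same idea; withColumn returns a new DataFrame with the column definition applied.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ["x", "y"]).withColumn(
    "z", F.col("x") + F.col("y")
)
```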
In Pandas, adding df["x"] to df["y"] creates a new column directly within the DataFrame, as it already stores its data. In contrast, with Polars and Spark, pl.col("x") + pl.col("y") and F.col("x") + F.col("y") simply return the definition of a column, which then needs to be assigned to the DataFrame.
While the Pandas approach might be more intuitive and less verbose, Spark and Polars facilitate chaining multiple operations, as each method explicitly returns a new DataFrame. There’s a good reason why the respective APIs were designed this way...
Execution: Lazy vs Eager
One of the key differences between these frameworks is that Pandas adheres to the eager execution model, while Polars and Spark follow the lazy execution model. This largely explains the design of their respective APIs as explained above.
Lazy execution delays the computation of operations until the final result is needed, allowing for potential optimization and efficient execution of the entire workflow. This approach can combine multiple operations into a single, more efficient task, reducing unnecessary computations and memory usage.
Eager execution, on the other hand, performs each operation immediately as it is called, providing immediate feedback and making the code easier to debug and understand. However, this can lead to inefficiencies, as each operation is performed independently, potentially resulting in redundant computations and higher memory consumption.
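A small sketch of the difference, using Pandas and Polars (the column names and the filter are arbitrary):

```python
import pandas as pd
import polars as pl

# Eager (Pandas): every line executes immediately and materializes a result.
pdf = pd.DataFrame({"x": list(range(1_000)), "y": list(range(1_000))})
pdf = pdf[pdf["x"] % 2 == 0].copy()   # filtered right away
pdf["z"] = pdf["x"] + pdf["y"]        # computed right away

# Lazy (Polars): the same steps are merely recorded in a query plan.
lf = (
    pl.LazyFrame({"x": list(range(1_000)), "y": list(range(1_000))})
    .filter(pl.col("x") % 2 == 0)
    .with_columns((pl.col("x") + pl.col("y")).alias("z"))
)
print(lf.explain())    # inspect the optimized plan; nothing has been computed yet
result = lf.collect()  # the whole pipeline runs once, as a single optimized job
```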
This contrast highlights how Pandas is designed for user-friendliness, while Polars and Spark focus on performance by optimizing each operation and executing calculations only when necessary.
Hands-on Experiment
So, in theory, Pandas is best for small datasets, Polars for medium-sized ones, and Spark for large-scale data. But how does this hold up in practice? Let's find out! We'll build DataFrames of varying sizes, apply some transformations, save them to disk, and monitor the performance.
Setup
Here is the setup of the experiment: DataFrames of increasing row counts are generated, a few simple transformations are applied, and the results are written to disk while the execution time is recorded.
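As a rough illustration, a benchmark of this kind could look like the following sketch; the schema, the aggregation, and the output file names are assumptions chosen for illustration. A PySpark variant would follow the same pattern, with the write triggering the actual computation.

```python
import time
import numpy as np
import pandas as pd
import polars as pl

def make_data(n_rows: int) -> dict:
    # Random numeric columns; the real experiment may have used a different schema.
    rng = np.random.default_rng(42)
    return {"id": rng.integers(0, 1_000, n_rows), "value": rng.random(n_rows)}

def bench_pandas(n_rows: int) -> float:
    data = make_data(n_rows)
    start = time.perf_counter()
    df = pd.DataFrame(data)
    out = df.groupby("id", as_index=False)["value"].mean()
    out.to_parquet("pandas_out.parquet")  # requires pyarrow or fastparquet
    return time.perf_counter() - start

def bench_polars(n_rows: int) -> float:
    data = make_data(n_rows)
    start = time.perf_counter()
    out = (
        pl.LazyFrame(data)
        .group_by("id")
        .agg(pl.col("value").mean())
        .collect()
    )
    out.write_parquet("polars_out.parquet")
    return time.perf_counter() - start

for n in [10_000, 1_000_000, 10_000_000]:
    print(f"{n:>12,} rows | pandas: {bench_pandas(n):.2f}s | polars: {bench_polars(n):.2f}s")
```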
Results
The figure below shows the processing time for each type of DataFrame as a function of the number of rows:
Caveats
The results discussed above provide a general idea of the optimal target size for each DataFrame, but many factors need to be considered. The type of data (floats, strings, etc.), sparsity, partitioning (for Spark), cluster size and the nature of the transformations all significantly impact computation time.
We haven't discussed memory usage extensively, but it's also something to consider. Memory usage, and especially how it is distributed, differs significantly between frameworks. From a cost point of view, using multiple small workers, which is only possible with Spark, might actually be cheaper than using a single worker with a massive amount of memory.
In other words, performance also has to be balanced with operation costs.
Summary
Placing the three frameworks on a performance vs. friendliness chart would look something like this:
The increase in performance generally comes at the cost of increased complexity, whether through the offered API or the installation and execution process.
Spark should probably not be considered for DataFrames with fewer than 10 million rows, while Pandas starts to show signs of slowing down at around 1 to 5 million rows. Beyond that, it really comes down to the nature of your data and transformations, and to how much setup complexity and operating cost you are willing to take on.
What about you? Have you compared these frameworks for your projects? Which one did you end up using, and why? Share your experiences below!
And don't forget to follow me for more great content about data engineering.
Comment from a reader (Data Engineer at Ship Creek Group): My favorite is using ibis as the dataframe API on top of duckdb as the execution engine. It's faster than any of these (except sometimes Polars), scales better to larger-than-memory datasets (except for extremely huge data, where Spark is probably the better choice), and has a very nice API.