DataFrames Battle Royale | Pandas vs Polars vs Spark

Usually, when I sit down to write these blog posts, I have a clear direction in mind. This time, however, it's a bit different. Polars is a technology I'm not very familiar with, but its recent surge in popularity makes it worth exploring. To understand its place in the realm of data manipulation, let's compare it with the two established giants: Pandas and Spark.

Preface

Believe it or not, more than ten years ago, I started building a custom DataFrame library on top of NumPy for CAE, the company I was working for at the time. Why? Simply because I didn't know any better. Coming from a MATLAB background, it was one of my first Python projects and, back in 2012, Pandas wasn't as well-known as it is today. In fact, I didn't even know what a DataFrame was. It was an interesting endeavor, and I learned a lot, but as you might guess, it was soon replaced by Pandas—a much more powerful and well-designed library. I evolved into a master of the black-and-white bear, even using its obscure block structure and block manager

df._mgr.blocks  # Pandas' internal BlockManager blocks (a private API)

to build a real-time application with a rolling buffer. Fast forward to today, and I only use Pandas occasionally. Now, as I manage large (and not-so-large) datasets on the cloud, Apache Spark has become my primary DataFrame engine.

Overview

Back to the task at hand, let's discuss what each DataFrame technology has to offer.

Pandas operates with an in-memory, single-threaded architecture ideal for small to medium datasets, providing simplicity and immediate feedback. Polars, built with Rust, offers multi-threaded, in-memory processing and supports both eager and lazy execution, optimizing performance for larger datasets. Apache Spark uses a distributed computing architecture with lazy execution, designed for processing massive datasets across clusters, ensuring scalability and fault tolerance.

Architecture

Pandas

Pandas operates entirely in-memory, storing data in a DataFrame structure similar to a table in a relational database. Each operation is executed immediately (eager execution), which makes it simple and intuitive but limits its scalability to the memory available on a single machine. It is generally best suited for datasets up to a few gigabytes.

Polars

Polars is built for speed and efficiency, utilizing Rust's performance capabilities and supporting multi-threaded execution. With its lazy execution model, Polars can optimize the entire workflow, making it highly efficient for complex operations on larger datasets. It outperforms Pandas significantly in terms of speed and memory usage, especially when handling larger-than-memory data.

Spark

Apache Spark excels in distributed data processing, making it ideal for massive datasets spread across a cluster of machines. Its lazy execution model allows for complex query optimization and efficient resource management. Spark's ability to handle large-scale data processing with fault tolerance and scalability makes it the go-to choice for big data environments, though it requires more setup, overhead and resources compared to Pandas and Polars.

Installation

Pandas and Polars rely on compiled backends (C, PyArrow, Rust) for performance, yet they install like any other Python package with the usual pip install. In contrast, Spark offers a Python package (PySpark) but also requires a Java runtime for the worker nodes that perform the computations. This installation is not trivial and may also require configuring and tuning clusters to achieve the best performance. The recent introduction of Spark Connect, a lightweight client API that sends your transformations to a remote Spark server for analysis, optimization, and scheduling, is a welcome addition. However, you still need to set up a server somewhere to handle the actual data transformations.

In this context, Pandas and Polars are much more approachable than Spark. Most people don't set up a Spark environment themselves but instead rely on cloud-managed services like Databricks or AWS EMR.

APIs

While Pandas, Polars, and Spark offer similar functionalities for common data operations, they differ significantly in syntax and performance characteristics (more on that later). Polars and Spark are quite similar in that their operations are chainable (each call returns a new DataFrame) and often involve using a column object. On the other hand, Pandas methods typically return a new column that needs to be assigned back to the DataFrame. Let's illustrate this with an example where we:

  1. Select columns x, y, and z.
  2. Create a new column, xy, by adding x and y.

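Here is a minimal sketch of that workflow in each framework (the DataFrames pdf, pldf and sdf are assumed to already exist with columns x, y and z):

    # Pandas: eager, the new column is assigned directly on the DataFrame
    pdf = pdf[["x", "y", "z"]]
    pdf["xy"] = pdf["x"] + pdf["y"]

    # Polars: chainable, pl.col builds an expression that with_columns attaches
    import polars as pl
    pldf = pldf.select("x", "y", "z").with_columns((pl.col("x") + pl.col("y")).alias("xy"))

    # Spark: chainable, F.col builds a column expression that withColumn attaches
    from pyspark.sql import functions as F
    sdf = sdf.select("x", "y", "z").withColumn("xy", F.col("x") + F.col("y"))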

In Pandas, adding df["x"] to df["y"] creates a new column directly within the DataFrame, as it already stores its data. In contrast, with Polars and Spark, pl.col("x") + pl.col("y") and F.col("x") + F.col("y") simply return the definition of a column, which then needs to be assigned to the DataFrame.

While the Pandas approach might be more intuitive and less verbose, Spark and Polars facilitate chaining multiple operations, as each method explicitly returns a new DataFrame. There’s a good reason why the respective APIs were designed this way...

Execution: Lazy vs Eager

One of the key differences between these frameworks is that Pandas follows the eager execution model, while Polars and Spark follow the lazy execution model. This largely explains the API design choices described above.

Lazy execution delays the computation of operations until the final result is needed, allowing for potential optimization and efficient execution of the entire workflow. This approach can combine multiple operations into a single, more efficient task, reducing unnecessary computations and memory usage.

Eager execution, on the other hand, performs each operation immediately as it is called, providing immediate feedback and making the code easier to debug and understand. However, this can lead to inefficiencies, as each operation is performed independently, potentially resulting in redundant computations and higher memory consumption.
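
As a rough illustration with Polars (the file name and column names are made up for this example), the lazy API builds a query plan that only runs when collect() is called, while the eager API executes every step immediately:

    import polars as pl

    # Lazy: scan_parquet returns a LazyFrame; nothing is computed yet
    lazy_plan = (
        pl.scan_parquet("events.parquet")
        .filter(pl.col("value") > 0)
        .group_by("category")
        .agg(pl.col("value").mean())
    )
    result = lazy_plan.collect()  # the whole plan is optimized and executed here

    # Eager: read_parquet returns a DataFrame; each step runs immediately
    eager = pl.read_parquet("events.parquet")
    eager = eager.filter(pl.col("value") > 0)
    result = eager.group_by("category").agg(pl.col("value").mean())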

This contrast highlights how Pandas is designed for user-friendliness, while Polars and Spark focus on performance by optimizing each operation and executing calculations only when necessary.

Hands-on Experiment

So, in theory, Pandas is best for small datasets, Polars for medium-sized ones, and Spark for large-scale data. But how does this hold up in practice? Let's find out! We'll build DataFrames of varying sizes, apply some transformations, save them to disk, and monitor the performance.

Setup

Here is the setup of the experiment:

  • Environment: The experiment runs in a notebook attached to a cluster of 8 workers, each with 4 cores and 14 GB of memory. The driver, the only compute resource available to Pandas and Polars, has 8 cores and 56 GB of memory.
  • DataFrames: We will create DataFrames of various sizes (up to 100 million rows) using each framework, with 10 columns containing a mix of floats, integers, strings, and timestamps. Here's sample code for Pandas:
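
A rough sketch of what generating such a DataFrame could look like (column names, value ranges and the number of rows shown are illustrative, not the exact code used in the experiment):

    import numpy as np
    import pandas as pd

    n_rows = 10_000_000  # scaled up to 100 million rows in the experiment

    df = pd.DataFrame({
        "A": np.random.rand(n_rows),                                 # floats
        "B": np.random.rand(n_rows),                                 # floats
        "C": np.random.randint(0, 1_000, n_rows),                    # integers
        "D": np.random.choice(["red", "green", "blue"], n_rows),     # strings
        "E": pd.date_range("2024-01-01", periods=n_rows, freq="s"),  # timestamps
        # ...remaining columns follow the same pattern, 10 in total
    })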

  • Three transformations will be applied to each DataFrame: column selection, addition of an arctan column, and calculation of the average of column A and the standard deviation of column B for each unique value of column C. Here is the corresponding code for Spark:
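
A minimal sketch of these three transformations in PySpark (output column names are illustrative and the exact code may have differed):

    from pyspark.sql import functions as F

    # 1. Column selection
    selected = df.select("A", "B", "C")

    # 2. Addition of an arctan column
    with_atan = selected.withColumn("atan_A", F.atan(F.col("A")))

    # 3. Average of A and standard deviation of B per unique value of C
    aggregated = with_atan.groupBy("C").agg(
        F.avg("A").alias("avg_A"),
        F.stddev("B").alias("std_B"),
    )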

  • Each DataFrame will be written to disk as a Parquet file (the write call for each framework is shown after this list)
  • The total processing time (including generation, transformation, and writing) for each combination of DataFrame type and number of rows will be calculated by averaging over 5 runs.
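
For reference, the Parquet write step looks roughly like this in each framework (df_pd, df_pl and df_spark stand for the transformed DataFrames; file paths are illustrative):

    # Pandas
    df_pd.to_parquet("results_pandas.parquet")

    # Polars
    df_pl.write_parquet("results_polars.parquet")

    # Spark
    df_spark.write.mode("overwrite").parquet("results_spark.parquet")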

Results

Measuring the processing time for each type of DataFrame as a function of the number of rows reveals the following trends:

  • Below 1 million rows, Pandas and Polars offer essentially the same performance, while Spark is significantly slower because of the additional overhead.
  • When approaching 5 million rows, both Spark and Polars start to take the lead over Pandas, offering roughly a 25% (0.75 seconds) reduction in compute time.
  • By 10 million rows, Polars is still about 25% faster than Pandas. Spark has now taken the lead.
  • By 100 million rows, Spark is now 5 times faster than Pandas and 4 times faster than Polars.


Caveats

The results discussed above provide a general idea of the optimal target size for each DataFrame, but many factors need to be considered. The type of data (floats, strings, etc.), sparsity, partitioning (for Spark), cluster size and the nature of the transformations all significantly impact computation time.

We haven't discussed memory usage extensively, but it's also something to consider. Memory usage, and especially how it is distributed, will be quite different for each framework. From a cost point of view, using multiple small workers, which is only possible with Spark, might actually be cheaper than using a single worker with a massive amount of memory.

In other words, performance also has to be balanced with operation costs.

Summary

Placing the three frameworks on a performance vs. friendliness chart would look something like this:

Figure: Pandas, Polars, and Spark on a friendliness vs. performance chart

The increase in performance generally comes at the cost of increased complexity, whether through the offered API or the installation and execution process.

Spark should probably not be considered for DataFrames with fewer than 10 million rows, while Pandas starts to show signs of slowing down at around 1 to 5 million rows. Beyond that, it really comes down to:

  • API preference
  • Balancing cost vs performance
  • Future-proofing yourself for growth in your datasets

What about you? Have you compared these frameworks for your projects? Which one did you end up using and why? Share your experiences below!

And don't forget to follow me for more great content about data engineering.



Nicholas Crews

Data Engineer at Ship Creek Group

2 months ago

My favorite is using ibis as the dataframe API on top of duckdb as the execution engine. It's faster than any of these (except for sometimes polars), scales to larger than memory datasets better (except for extremely huge data, when spark is probably the better choice), and has a very nice API.
