DataFrames Battle Royale | Pandas vs Polars vs Spark
Olivier Soucy
Founder @ okube.ai | Fractional Data Platform Engineer | Open-source Developer | Databricks Partner
Usually, when I sit down to write these blog posts, I have a clear direction in mind. This time, however, it's a bit different. Polars is a technology I'm not very familiar with, but its recent surge in popularity makes it worth exploring. To understand its place in the realm of data manipulation, let's compare it with the two established giants: Pandas and Spark.
Preface
Believe it or not, more than ten years ago, I started building a custom DataFrame library on top of NumPy for CAE, the company I was working for at the time. Why? Simply because I didn't know any better. Coming from a MATLAB background, it was one of my first Python projects and, back in 2012, Pandas wasn't as well-known as it is today. In fact, I didn't even know what a DataFrame was. It was an interesting endeavor, and I learned a lot, but as you might guess, it was soon replaced by Pandas—a much more powerful and well-designed library. I evolved into a master of the black-and-white bear, even using its obscure block structure and block manager
df._mgr.blocks
to build a real-time application with a rolling buffer. Fast forward to today, and I only use Pandas occasionally. Now, as I manage large (and not-so-large) datasets on the cloud, Apache Spark has become my primary DataFrame engine.
Overview
Back to the task at hand, let's discuss what each DataFrame technology has to offer.
Pandas operates with an in-memory, single-threaded architecture ideal for small to medium datasets, providing simplicity and immediate feedback. Polars, built with Rust, offers multi-threaded, in-memory processing and supports both eager and lazy execution, optimizing performance for larger datasets. Apache Spark uses a distributed computing architecture with lazy execution, designed for processing massive datasets across clusters, ensuring scalability and fault tolerance.
Architecture
Pandas
Pandas operates entirely in-memory, storing data in a DataFrame structure similar to a table in a relational database. Each operation is executed immediately (eager execution), which makes it simple and intuitive but limits its scalability to the memory available on a single machine. It is generally best suited for datasets up to a few gigabytes.
Polars
Polars is built for speed and efficiency, utilizing Rust's performance capabilities and supporting multi-threaded execution. With its lazy execution model, Polars can optimize the entire workflow, making it highly efficient for complex operations on larger datasets. It outperforms Pandas significantly in terms of speed and memory usage, especially when handling larger-than-memory data.
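To make this concrete, here is a minimal sketch of a lazy Polars query; the file name and column names are made up for illustration, and the streaming flag assumes a reasonably recent Polars version. The query is only planned at first, and the streaming collect lets Polars work through the data in chunks rather than loading it all at once.

```python
import polars as pl

# Build a lazy query: nothing is read or computed yet, only a plan is recorded.
lazy_query = (
    pl.scan_parquet("events.parquet")  # hypothetical file
    .filter(pl.col("amount") > 100)
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# Execute the optimized plan; streaming mode processes the data in batches,
# which is what makes larger-than-memory datasets tractable.
result = lazy_query.collect(streaming=True)
```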
Spark
Apache Spark excels in distributed data processing, making it ideal for massive datasets spread across a cluster of machines. Its lazy execution model allows for complex query optimization and efficient resource management. Spark's ability to handle large-scale data processing with fault tolerance and scalability makes it the go-to choice for big data environments, though it requires more setup, overhead and resources compared to Pandas and Polars.
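As a rough sketch of what that looks like in practice (the paths and column names are illustrative, and on a managed platform like Databricks the session is usually created for you):

```python
from pyspark.sql import SparkSession, functions as F

# Create (or connect to) a Spark session.
spark = SparkSession.builder.appName("battle-royale").getOrCreate()

# Reading and transforming only build a logical plan; nothing runs yet.
df = (
    spark.read.parquet("s3://my-bucket/events/")  # hypothetical path
    .filter(F.col("amount") > 100)
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# The write is an action: it triggers the optimized, distributed computation
# across the cluster's worker nodes.
df.write.mode("overwrite").parquet("s3://my-bucket/aggregates/")
```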
Installation
Pandas and Polars use backend implementations like C, PyArrow, and Rust to enhance performance, yet they remain pure Python packages that can be installed with the usual pip install. In contrast, Spark offers a Python package (PySpark) but also requires the installation of Java components for the worker nodes to perform computations. This installation is not trivial and may also necessitate configuring and optimizing clusters to achieve the best performance. The recent introduction of Spark Connect—a lightweight client that allows you to analyze, optimize, and schedule your transformations—is a welcome addition. However, you'll still need to set up a server somewhere to handle actual data transformations.
In this context, Pandas and Polars are much more approachable than Spark. Most people don't set up a Spark environment themselves but instead rely on cloud-managed services like Databricks or AWS EMR.
APIs
While Pandas, Polars, and Spark offer similar functionalities for common data operations, they differ significantly in syntax and performance characteristics (more on that later). Polars and Spark are quite similar in that their operations are chainable (each call returns a new DataFrame) and often involve using a column object. On the other hand, Pandas methods typically return a new column that needs to be assigned back to the DataFrame. Let's illustrate this with an example where we add two columns, x and y, to create a new column; a minimal sketch is shown below.
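Here is a minimal sketch, assuming two numeric columns x and y and a new column z (the data itself is made up):

```python
import pandas as pd
import polars as pl
from pyspark.sql import SparkSession, functions as F

data = {"x": [1, 2, 3], "y": [10, 20, 30]}

# Pandas: the new column is assigned directly onto the existing DataFrame.
pdf = pd.DataFrame(data)
pdf["z"] = pdf["x"] + pdf["y"]

# Polars: the expression only describes the column; with_columns returns a new DataFrame.
pldf = pl.DataFrame(data).with_columns((pl.col("x") + pl.col("y")).alias("z"))

# Spark: same idea; withColumn returns a new DataFrame with the column definition applied.
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ["x", "y"]).withColumn(
    "z", F.col("x") + F.col("y")
)
```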
In Pandas, adding df["x"] to df["y"] creates a new column directly within the DataFrame, as it already stores its data. In contrast, with Polars and Spark, pl.col("x") + pl.col("y") and F.col("x") + F.col("y") simply return the definition of a column, which then needs to be assigned to the DataFrame.
While the Pandas approach might be more intuitive and less verbose, Spark and Polars facilitate chaining multiple operations, as each method explicitly returns a new DataFrame. There’s a good reason why the respective APIs were designed this way...
Execution: Lazy vs Eager
One of the key differences between these frameworks is that Pandas adheres to the eager execution model, while Polars and Spark follow the lazy execution model. This largely explains the design of their respective APIs as explained above.
Lazy execution delays the computation of operations until the final result is needed, allowing for potential optimization and efficient execution of the entire workflow. This approach can combine multiple operations into a single, more efficient task, reducing unnecessary computations and memory usage.
Eager execution, on the other hand, performs each operation immediately as it is called, providing immediate feedback and making the code easier to debug and understand. However, this can lead to inefficiencies, as each operation is performed independently, potentially resulting in redundant computations and higher memory consumption.
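A small sketch of the difference, using Pandas and Polars (the column names and the filter are arbitrary):

```python
import pandas as pd
import polars as pl

# Eager (Pandas): every line executes immediately and materializes a result.
pdf = pd.DataFrame({"x": list(range(1_000)), "y": list(range(1_000))})
pdf = pdf[pdf["x"] % 2 == 0].copy()   # filtered right away
pdf["z"] = pdf["x"] + pdf["y"]        # computed right away

# Lazy (Polars): the same steps are merely recorded in a query plan.
lf = (
    pl.LazyFrame({"x": list(range(1_000)), "y": list(range(1_000))})
    .filter(pl.col("x") % 2 == 0)
    .with_columns((pl.col("x") + pl.col("y")).alias("z"))
)
print(lf.explain())    # inspect the optimized plan; nothing has been computed yet
result = lf.collect()  # the whole pipeline runs once, as a single optimized job
```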
This contrast highlights how Pandas is designed for user-friendliness, while Polars and Spark focus on performance by optimizing each operation and executing calculations only when necessary.
Hands-on Experiment
So, in theory, Pandas is best for small datasets, Polars for medium-sized ones, and Spark for large-scale data. But how does this hold up in practice? Let's find out! We'll build DataFrames of varying sizes, apply some transformations, save them to disk, and monitor the performance.
Setup
Here is the setup of the experiment: DataFrames of increasing row counts are generated, a few simple transformations are applied, and the results are written to disk while the execution time is recorded.
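As a rough illustration, a benchmark of this kind could look like the following sketch; the schema, the aggregation, and the output file names are assumptions chosen for illustration. A PySpark variant would follow the same pattern, with the write triggering the actual computation.

```python
import time
import numpy as np
import pandas as pd
import polars as pl

def make_data(n_rows: int) -> dict:
    # Random numeric columns; the real experiment may have used a different schema.
    rng = np.random.default_rng(42)
    return {"id": rng.integers(0, 1_000, n_rows), "value": rng.random(n_rows)}

def bench_pandas(n_rows: int) -> float:
    data = make_data(n_rows)
    start = time.perf_counter()
    df = pd.DataFrame(data)
    out = df.groupby("id", as_index=False)["value"].mean()
    out.to_parquet("pandas_out.parquet")  # requires pyarrow or fastparquet
    return time.perf_counter() - start

def bench_polars(n_rows: int) -> float:
    data = make_data(n_rows)
    start = time.perf_counter()
    out = (
        pl.LazyFrame(data)
        .group_by("id")
        .agg(pl.col("value").mean())
        .collect()
    )
    out.write_parquet("polars_out.parquet")
    return time.perf_counter() - start

for n in [10_000, 1_000_000, 10_000_000]:
    print(f"{n:>12,} rows | pandas: {bench_pandas(n):.2f}s | polars: {bench_polars(n):.2f}s")
```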
Results
The figure below shows the processing time for each type of DataFrame as a function of the number of rows:
Caveats
The results discussed above provide a general idea of the optimal target size for each DataFrame, but many factors need to be considered. The type of data (floats, strings, etc.), sparsity, partitioning (for Spark), cluster size and the nature of the transformations all significantly impact computation time.
We haven't discussed memory usage extensively, but it's also something to consider. Memory usage, and especially how it is distributed, differs significantly between frameworks. From a cost point of view, using multiple small workers, which is only possible with Spark, might actually be cheaper than using a single worker with a massive amount of memory.
In other words, performance also has to be balanced with operation costs.
Summary
Placing the three frameworks on a performance vs. friendliness chart would look something like this:
The increase in performance generally comes at the cost of increased complexity, whether through the offered API or the installation and execution process.
Spark should probably not be considered for DataFrames with fewer than 10 million rows, while Pandas starts to show signs of slowing down at around 1 to 5 million rows. Beyond that, it really comes down to the nature of your data and transformations, and to how much setup complexity and operating cost you are willing to take on.
What about you? Have you compared these frameworks for your projects? Which one did you end up using, and why? Share your experiences below!
And don't forget to follow me for more great content about data engineering.
Comment from a reader (Data Engineer at Ship Creek Group): My favorite is using ibis as the dataframe API on top of duckdb as the execution engine. It's faster than any of these (except sometimes Polars), scales better to larger-than-memory datasets (except for extremely huge data, where Spark is probably the better choice), and has a very nice API.