Bird versus Bear: Comparing DuckDB and Polars

Jorrit Sandbrink

data/software engineer

Published: November 13, 2023

I've been exploring #DuckDB and #Polars as faster alternatives to #PySpark when dealing with non-big data. Modern hardware is so powerful that many data processing tasks can be done on a single node nowadays, which reduces the need for distributed processing. Both DuckDB and Polars build on this notion and aim to utilize the power of your machine to the fullest. Vectorized execution and parallelization across cores are two of the ingredients that make this work. The result? Blazing-fast analytical queries and top-of-chart benchmark results. While both libraries have a lot in common, there are also fundamental differences. In this article I compare DuckDB and Polars by looking at their overlap, their differences, and their popularity.

DuckDB and Polars in one sentence

Before diving into the details, let's quickly characterize DuckDB and Polars in a single sentence:

DuckDB is an in-process OLAP RDBMS modeled after SQLite.

Polars is a DataFrame library focused on fast execution.

Overlap

Let's start with what DuckDB and Polars have in common. Both libraries

are built for analytical workloads – they are optimized for OLAP, not OLTP
internally represent data in columnar format – not record-oriented
run on a single node – no cluster of machines like in Spark
are embedded in a host process – not standalone like Postgres
have no required dependencies (unlike pandas, which needs to install the tzdata, six, numpy, and python-dateutil packages)
make use of a vectorized execution engine
use Single Instruction, Multiple Data (SIMD) instructions
have top-of-chart benchmark performance
support processing larger-than-memory data sets by leveraging data streaming
are available in multiple programming languages
expose multiple APIs (SQL but also DataFrame APIs)
have their origins in The Netherlands (fun fact: DuckDB has been developed at CWI in Amsterdam – the research institute that gave birth to the Python programming language)

Differences

While DuckDB and Polars are similar in many ways, there are also fundamental differences.

RDBMS versus DataFrame library

DuckDB is an actual RDBMS, with native support for persisted storage (in single-file .db files) and ACID transactions. Table constraints (PRIMARY KEY, CHECK, ...) and indexes are also supported. Polars, on the other hand, is a DataFrame library for transient in-memory data processing. Data can be read from and written to external files (think JSON, Parquet, CSV, ...), but there are no transactional guarantees. Note that while DuckDB can persist data in a database, it does not have to: an in-memory database is used if you don't specify a database file path when initializing your DuckDB connection. In that mode of operation, DuckDB is more similar to Polars.

C++ versus Rust

DuckDB is written in C++, Polars is written in Rust. Both C++ and Rust are low-level languages optimized for performance. This makes them an excellent choice for execution engines that require fast computation (much better than Python, which is one reason pandas tends to perform so poorly compared to DuckDB and Polars). While there are interesting differences between C++ and Rust, those go beyond my expertise and the scope of this article.

Custom format versus Arrow

DuckDB uses a custom format for the internal representation of vectors. Polars uses the Apache Arrow Columnar format, a language-agnostic specification that aims to standardize the way data is represented in in-memory analytics applications. Although DuckDB uses its own custom vector format, it has a zero-copy data integration with Apache Arrow. This enables interoperability with the Arrow ecosystem. Since Polars is part of that ecosystem, DuckDB and Polars can easily and efficiently talk to each other. A Polars DataFrame can directly be queried from DuckDB, and a DuckDB relation (table) can be accessed from Polars through Arrow Database Connectivity (ADBC).

SQL versus DataFrame API

As you might expect in a relational database, SQL is the primary interface to interact with DuckDB. Polars takes a different route: its primary interface is a DataFrame API that is centered around "expressions". This expression API is the recommended way of working with Polars. That being said, both DuckDB and Polars provide secondary ways to interface with the execution engine. Polars supports writing queries in SQL, which are translated into expressions in the background and processed using its standard engine. Conversely, DuckDB exposes a "relational API" which can be used to interact with the SQL engine using a DataFrame API. Be aware that the first-class citizens (SQL for DuckDB, DataFrame API for Polars) get the most attention, and secondary APIs typically lag behind in terms of supported features.

Query evaluation

DuckDB lazily evaluates queries to be able to fully optimize their execution plan. Polars does the same. The difference is that Polars supports an eager evaluation mode as well. It's up to the user to decide which evaluation mode is most suitable – typically lazy evaluation is preferred because of computational efficiency, though eager evaluation can be useful for interactive workloads.

Delta support

Nowadays, the Lakehouse data architecture is commonplace. As such, being able to read from and write to a table format like Delta is highly desired (especially within the Databricks and/or Microsoft ecosystem). At the time of writing, DuckDB does not (yet) support this, while Polars has functionality for both reading and writing Delta tables.

Figure: Fundamental differences between DuckDB and Polars

Popularity

It's difficult to assess a technology's popularity. GitHub "stargazing" is one (flawed) way of doing this. The chart below shows DuckDB's and Polars' GitHub star count over time. If anything can be concluded from this, it would be that Polars is more "trending" than DuckDB.

Conclusion

Both DuckDB and Polars are powerful technologies for analytical data processing. Despite what they have in common, there are fundamental differences that can make either DuckDB or Polars the better fit for your workload. I hope this article provides a good overview and starting point for individuals or data teams considering adopting DuckDB or Polars.
