Bird versus Bear: Comparing DuckDB and Polars
I've been exploring #DuckDB and #Polars as faster alternatives to #PySpark when dealing with non-big data. Modern hardware is so powerful that many data processing tasks can be done on a single node nowadays – this reduces the need for distributed processing. Both DuckDB and Polars leverage this notion and aim to utilize the power of your machine to the fullest. Vectorized execution and parallelization across cores are two of the ingredients that make this work. The result? Blazing-fast analytical queries and top-of-chart benchmark results. While both libraries have a lot in common, there are also fundamental differences. In this article I compare DuckDB and Polars by looking at their overlap, their differences, and their popularity.
DuckDB and Polars in one sentence
Before diving into the details, let's quickly characterize DuckDB and Polars in a single sentence:
DuckDB is an in-process OLAP RDBMS modeled after SQLite.
Polars is a DataFrame library focused on fast execution.
Overlap
Let's start with what DuckDB and Polars have in common. Both libraries target analytical workloads on a single machine, use vectorized execution and parallelization across cores, support lazy query evaluation with plan optimization, read and write common file formats such as Parquet, CSV, and JSON, and can talk SQL as well as interoperate with the Apache Arrow ecosystem. The sketch below illustrates this overlap by running the same aggregation in both libraries.
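A minimal sketch, assuming a local Parquet file named sales.parquet with region and amount columns (both names are illustrative) and a recent Polars version where the grouping method is called group_by:

import duckdb
import polars as pl

# DuckDB: query the Parquet file directly with SQL.
duckdb_result = duckdb.sql(
    "SELECT region, SUM(amount) AS total FROM 'sales.parquet' GROUP BY region"
)

# Polars: express the same aggregation with the DataFrame API.
polars_result = (
    pl.read_parquet("sales.parquet")
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
)

Both snippets run entirely on your own machine, using all available cores.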
Differences
While DuckDB and Polars are similar in many ways, there are also fundamental differences.
RDBMS versus DataFrame library
DuckDB is an actual RDBMS, with native support for persistent storage (in single-file .db files) and ACID transactions. Table constraints (PRIMARY KEY, CHECK, ...) and indexes are also supported. Polars, on the other hand, is a DataFrame library for transient in-memory data processing. Data can be read from and written to external files (think JSON, Parquet, CSV, ...), but there are no transactional guarantees. It is important to note that while DuckDB can persist data in a database, it does not have to – an in-memory database is used if you don't specify a database file path when initializing your DuckDB connection. In that modus operandi, DuckDB is more similar to Polars.
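A minimal sketch of the two modes; the file name analytics.db and the events table are purely illustrative:

import duckdb

# Persistent database: tables live in a single file on disk,
# with ACID transactions and constraints.
con = duckdb.connect("analytics.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload VARCHAR)"
)
con.execute("INSERT INTO events VALUES (1, 'hello')")
con.close()

# In-memory database: no file path means nothing is persisted,
# which makes DuckDB behave more like a transient DataFrame library.
mem_con = duckdb.connect()
print(mem_con.execute("SELECT 42").fetchall())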
C++ versus Rust
DuckDB is written in C++, Polars is written in Rust. Both C++ and Rust are low-level languages optimized for performance. This makes these languages an excellent choice for execution engines that require fast computation (much better suited than Python, which is one reason pandas often lags behind DuckDB and Polars). While there are interesting differences between C++ and Rust, those go beyond my expertise and the scope of this article.
Custom format versus Arrow
DuckDB uses a custom format for the internal representation of vectors. Polars uses the Apache Arrow Columnar format, a language-agnostic specification that aims to standardize the way data is represented in in-memory analytics applications. Although DuckDB uses its own custom vector format, it has a zero-copy data integration with Apache Arrow. This enables interoperability with the Arrow ecosystem. Since Polars is part of that ecosystem, DuckDB and Polars can easily and efficiently talk to each other. A Polars DataFrame can directly be queried from DuckDB, and a DuckDB relation (table) can be accessed from Polars through Arrow Database Connectivity (ADBC).
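A minimal sketch of this interoperability; the DataFrame contents are made up, and the conversion back to Polars is shown here via the relation's .pl() helper, which relies on Arrow under the hood:

import duckdb
import polars as pl

# A small Polars DataFrame, backed by Arrow memory.
df = pl.DataFrame({"city": ["Utrecht", "Ghent"], "visits": [10, 20]})

# DuckDB can query the Polars DataFrame directly by name
# (via a replacement scan over the Arrow data), without copying it.
rel = duckdb.sql("SELECT city, visits * 2 AS doubled FROM df")

# ...and the result can be handed back to Polars as a DataFrame.
df_out = rel.pl()
print(df_out)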
SQL versus DataFrame API
As you might expect in a relational database, SQL is the primary interface to interact with DuckDB. Polars takes a different route: their primary interface is a DataFrame API that is centered around "expressions". This expression API is the recommended way of working with Polars. That being said, both DuckDB and Polars provide secondary ways to interface with the execution engine. Polars supports writing queries in SQL, which are translated into expressions in the background and processed using their standard engine. Conversely, DuckDB exposes a "relational API" which can be used to interact with the SQL engine using a DataFrame API. Be aware that the first-class citizens (SQL for DuckDB, DataFrame API for Polars) get the most attention, and secondary APIs typically lag behind in terms of supported features.
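A sketch of the two secondary interfaces, assuming a recent Polars release with SQLContext and a DuckDB version exposing the relational API; column names are illustrative:

import duckdb
import polars as pl

df = pl.DataFrame({"category": ["a", "a", "b"], "value": [1, 2, 3]})

# Polars' secondary interface: SQL, translated to expressions internally.
ctx = pl.SQLContext(frames={"df": df})
sql_result = ctx.execute(
    "SELECT category, SUM(value) AS total FROM df GROUP BY category",
    eager=True,
)

# DuckDB's secondary interface: the relational API, a DataFrame-style
# way of composing queries on top of the SQL engine.
rel = duckdb.sql("SELECT * FROM df")
rel_result = rel.aggregate("category, SUM(value) AS total")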
Query evaluation
DuckDB lazily evaluates queries to be able to fully optimize their execution plan. Polars does the same. The difference is that Polars supports an eager evaluation mode as well. It's up to the user to decide which evaluation mode is most suitable – typically lazy evaluation is preferred because of computational efficiency, though eager evaluation can be useful for interactive workloads.
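A small sketch of Polars' two modes, again assuming a hypothetical sales.parquet file and a recent Polars version:

import polars as pl

# Eager: read_parquet executes immediately and returns a DataFrame.
eager_df = pl.read_parquet("sales.parquet").filter(pl.col("amount") > 100)

# Lazy: scan_parquet only builds a query plan; nothing runs until collect(),
# which lets the optimizer e.g. push the filter down into the Parquet scan.
lazy_result = (
    pl.scan_parquet("sales.parquet")
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum())
    .collect()
)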
Delta support
Nowadays, the Lakehouse data architecture is commonplace. As such, being able to read from and write to a table format like Delta is highly desired (especially within the Databricks and/or Microsoft ecosystem). DuckDB does not (yet) support this, while Polars has functionality for both reading and writing Delta tables.
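A minimal sketch of the Polars side; the paths are placeholders and the functionality requires the deltalake package to be installed:

import polars as pl

# Read a Delta table into a DataFrame.
df = pl.read_delta("path/to/delta_table")

# Write a DataFrame back out as a Delta table.
df.write_delta("path/to/output_table", mode="overwrite")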
Popularity
It's difficult to assess a technology's popularity. GitHub "stargazing" is one (flawed) way of doing this. The chart below shows DuckDB's and Polars' GitHub star count over time. If anything can be concluded from this, it would be that Polars is more "trending" than DuckDB.
Conclusion
Both DuckDB and Polars are powerful technologies for analytical data processing. Despite their commonalities, there are fundamental differences that can make either DuckDB or Polars the better fit for your workload. I hope this article provides a good overview and starting point for individuals or data teams considering adopting DuckDB or Polars.