Bird versus Bear: Comparing DuckDB and Polars

I've been exploring #DuckDB and #Polars as faster alternatives to #PySpark when dealing with non-big data. Modern hardware is so powerful that many data processing tasks can now be done on a single node, which reduces the need for distributed processing. Both DuckDB and Polars build on this idea and aim to utilize the power of your machine to the fullest. Vectorized execution and parallelization across cores are two of the ingredients that make this work. The result? Blazing-fast analytical queries and top-of-chart benchmark results. While both libraries have a lot in common, there are also fundamental differences. In this article I compare DuckDB and Polars by looking at their overlap, their differences, and their popularity.

DuckDB and Polars in one sentence

Before diving into the details, let's quickly characterize DuckDB and Polars in a single sentence:

DuckDB is an in-process OLAP RDBMS modeled after SQLite.

Polars is a DataFrame library focused on fast execution.

Overlap

Let's start with what DuckDB and Polars have in common. Both libraries

  • are built for analytical workloads – they are optimized for OLAP, not OLTP
  • internally represent data in columnar format – not record-oriented
  • run on a single node – no cluster of machines like in Spark
  • are embedded in a host process – not standalone like Postgres
  • have no required dependencies (unlike pandas, which needs to install the tzdata, six, numpy, and python-dateutil packages)
  • make use of a vectorized execution engine
  • use Single Instruction, Multiple Data (SIMD) instructions
  • have top-of-chart benchmark performance
  • support processing larger-than-memory data sets by leveraging data streaming
  • are available in multiple programming languages
  • expose multiple APIs (SQL but also DataFrame APIs)
  • have their origins in the Netherlands (fun fact: DuckDB was developed at CWI in Amsterdam – the research institute that gave birth to the Python programming language)

Differences

While DuckDB and Polars are similar in many ways, there are also fundamental differences.

RDBMS versus DataFrame library

DuckDB is an actual RDBMS, with native support for persisted storage (in single-file .db files) and ACID transactions. Table constraints (PRIMARY KEY, CHECK, ...) and indexes are also supported. Polars, on the other hand, is a DataFrame library for transient in-memory data processing. Data can be read from and written to external files (think JSON, Parquet, CSV, ...), but there are no transactional guarantees. It is important to note that while DuckDB can persist data in a database, it does not have to – an in-memory database is used if you don't specify a database file path when initializing your DuckDB connection. In that mode, DuckDB behaves more like Polars.

C++ versus Rust

DuckDB is written in C++, Polars is written in Rust. Both C++ and Rust are low-level languages optimized for performance. This makes them an excellent choice for execution engines that require fast computation (much better than pure Python, which is part of the reason pandas tends to be slower than DuckDB and Polars). While there are interesting differences between C++ and Rust, those go beyond my expertise and the scope of this article.

Custom format versus Arrow

DuckDB uses a custom format for the internal representation of vectors. Polars uses the Apache Arrow Columnar format, a language-agnostic specification that aims to standardize the way data is represented in in-memory analytics applications. Although DuckDB uses its own custom vector format, it has a zero-copy data integration with Apache Arrow. This enables interoperability with the Arrow ecosystem. Since Polars is part of that ecosystem, DuckDB and Polars can easily and efficiently talk to each other. A Polars DataFrame can directly be queried from DuckDB, and a DuckDB relation (table) can be accessed from Polars through Arrow Database Connectivity (ADBC).

SQL versus DataFrame API

As you might expect in a relational database, SQL is the primary interface to interact with DuckDB. Polars takes a different route: its primary interface is a DataFrame API that is centered around "expressions". This expression API is the recommended way of working with Polars. That being said, both DuckDB and Polars provide secondary ways to interface with the execution engine. Polars supports writing queries in SQL, which are translated into expressions in the background and processed using its standard engine. Conversely, DuckDB exposes a "relational API" which can be used to interact with the SQL engine using a DataFrame API. Be aware that the first-class citizens (SQL for DuckDB, DataFrame API for Polars) get the most attention, and secondary APIs typically lag behind in terms of supported features.

Query evaluation

DuckDB lazily evaluates queries to be able to fully optimize their execution plan. Polars does the same. The difference is that Polars supports an eager evaluation mode as well. It's up to the user to decide which evaluation mode is most suitable – typically lazy evaluation is preferred because of computational efficiency, though eager evaluation can be useful for interactive workloads.

Delta support

Nowadays, the Lakehouse data architecture is commonplace. As such, being able to read from and write to a table format like Delta is highly desirable (especially within the Databricks and/or Microsoft ecosystem). At the time of writing, DuckDB does not (yet) support this, while Polars has functionality for both reading and writing Delta tables.

Fundamental differences between DuckDB and Polars

Popularity

It's difficult to assess a technology's popularity. GitHub "stargazing" is one (flawed) way of doing this. The chart below shows DuckDB's and Polars' GitHub star count over time. If anything can be concluded from this, it would be that Polars is more "trending" than DuckDB.

Popularity of DuckDB and Polars in terms of GitHub stars

Conclusion

Both DuckDB and Polars are powerful technologies for analytical data processing. Despite what they have in common, there are fundamental differences that can make either DuckDB or Polars the better fit for your workload. I hope this article provides a good overview and starting point for individuals or data teams considering adopting DuckDB or Polars.


