How fast can a Duck be?

As a follow-up to the recent comparison of the big three cloud-based data engines (Databricks, Snowflake and MS Fabric), I tried DuckDB, the latest and greatest in-process query engine, with the same data (about 3 billion rows) and the same queries, just to see how fast "no cloud" can go nowadays, and whether my MacBook is good enough.

Data:

  • Used the locations and products files from https://github.com/JosueBogran/coffeeshopdatagenerator/
  • For the base facts I first tried the same generator script on Databricks, which generated a nice 170 GB table across 10,000 files in about 20 minutes or so.
  • I did not want to hog the office wifi just to download random data that can be created locally. As I don't have local Spark easily available, I thought it simpler to convert the generator script to plain Python (Polars) instead. Here is the result; it ran for about 4 hours and generated 200 parquet files (25 GB total): https://gist.github.com/jaakla/a3b02255121c2b15a43cbed1e0277d36 (a rough DuckDB-SQL alternative is sketched right after this list).
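
As an aside, a similar facts file could also be generated locally with DuckDB SQL itself. This is not what the gist above does (the gist is a Polars script); the column names, value ranges and file path below are only assumptions for illustration:

COPY (
    SELECT
        range + 1 AS order_line_id
        , (range // 2) + 1 AS order_id                -- roughly two order lines per order
        , DATE '2023-01-01' + CAST(floor(random() * 365) AS INT) AS order_date
        , ['Morning', 'Afternoon', 'Evening'][CAST(floor(random() * 3) AS INT) + 1] AS time_of_day
        , ['Winter', 'Spring', 'Summer', 'Fall'][CAST(floor(random() * 4) AS INT) + 1] AS season
        , CAST(floor(random() * 100) AS INT) + 1 AS location_id
        , CAST(floor(random() * 50) AS INT) + 1 AS product_id
        , CAST(floor(random() * 5) AS INT) + 1 AS quantity
        , CAST(floor(random() * 20) AS INT) AS discount_rate
    FROM range(15000000)                              -- ~15M rows per file; ~200 such files give ~3b rows
) TO '/Users/jaak/Downloads/jupydata/generated_dataset_sketch_001.parquet' (FORMAT PARQUET);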

Setup:

  • DuckDB 1.1.3 (in DBeaver), MacBook Pro with M1 Pro / 32 GB RAM / 1 TB SSD

Tests:

  • Prep: connected to a :memory: database, attached a file database for the test data, and loaded the dimension tables:

attach '/Users/jaak/Downloads/coffeeshop3b.duckdb' AS sweetcoffeetree;

create table sweetcoffeetree.dim_locations as select * from '/Users/jaak/Downloads/coffeeshopdatagenerator-main/Dim_Locations_Table.csv';
create table sweetcoffeetree.dim_products as select * from '/Users/jaak/Downloads/coffeeshopdatagenerator-main/Dim_Products_Table.csv';        

  • Loaded the real data from parquet into a native DuckDB table, assuming it would be faster than querying the parquet files directly. Loading takes a couple of minutes. Later observation: the native DuckDB file was not universally faster than direct parquet use, so next time I could just skip this step and save some GBs of disk space (a direct-parquet variant is sketched right after the query below).

create table sweetcoffeetree.facts_sales_3b_rows as
SELECT
    a."Order_ID"
    , a.order_line_id
    , a.order_date
    , a.time_of_day
    , a.season
    , b.location_id
    , c.name AS product_name
    , a.quantity
    , (c.standard_price * ((100 - a.discount_rate) / 100)) * a.Quantity AS sales_amount
    , a.discount_rate AS discount_percentage
FROM '/Users/jaak/Downloads/jupydata/generated_dataset_2B_*.parquet' AS a
LEFT JOIN sweetcoffeetree.dim_locations AS b
    ON (a.Location_ID = b.record_id)
LEFT JOIN sweetcoffeetree.dim_products AS c
    ON (a."Product_ID" = c.product_id AND a.Order_Date BETWEEN c.from_date AND c.to_date);

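The direct-parquet re-runs mentioned below kept this same SELECT as a CTE instead of materializing the table. A minimal sketch of that pattern, reusing the paths and columns from above (the aggregate at the end is only an illustrative query, not one of the original benchmark queries):

WITH facts AS (
    SELECT
        a.season
        , (c.standard_price * ((100 - a.discount_rate) / 100)) * a.Quantity AS sales_amount
    FROM read_parquet('/Users/jaak/Downloads/jupydata/generated_dataset_2B_*.parquet') AS a
    LEFT JOIN sweetcoffeetree.dim_locations AS b
        ON (a.Location_ID = b.record_id)              -- kept for parity with the table above, unused by this aggregate
    LEFT JOIN sweetcoffeetree.dim_products AS c
        ON (a."Product_ID" = c.product_id AND a.Order_Date BETWEEN c.from_date AND c.to_date)
)
SELECT season, round(sum(sales_amount), 2) AS total_sales
FROM facts
GROUP BY season
ORDER BY season;
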
  • Real tests: exactly the same SQL as in the post mentioned above; I only removed the backticks (`), which DuckDB SQL does not accept.
  • In the first run I used 1/3-size data to get some idea of scaling. I was also too impatient to wait for the full generation, and a bit nervous that the results would be so horrible that the full test would not work anyway and I would have to cancel it (I did not!).
  • Extra tests: I re-ran some of the tests against the parquet files directly; the SELECT from the "create table" above was used as-is, as a CTE in the test queries (see the CTE sketch above).
  • Added some small extra tests:

11) select * from sweetcoffeetree.facts_sales_3b_rows;
    --> In DuckDB this crashed the whole macOS, trying to allocate ~200 GB of swap; a hardware restart was needed (see the memory-limit note after these tests).
12) select count(distinct order_id) from sweetcoffeetree.facts_sales_3b_rows;
    --> SLOW: took 97 s / 228 s / 261 s. On Databricks Medium (3b rows): 113 s.
13) select approx_count_distinct(order_id) from sweetcoffeetree.facts_sales_3b_rows;
    --> To see how inaccurate (and fast) that is: DuckDB 9 s (3b rows), Databricks 37 s (sic!); both gave a number about 1-2% off the real value (exactly 2b).
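
Side note on extra test 11: the runaway swap might have been contained by capping DuckDB's memory and giving it an explicit spill location before running the heavy queries (untested here, and whether DBeaver itself could handle a 3b-row result is a separate question). These are standard DuckDB settings; the values are illustrative only:

SET memory_limit = '24GB';                               -- cap DuckDB's own memory use
SET temp_directory = '/Users/jaak/Downloads/duckdb_tmp'; -- spill to disk here rather than relying on macOS swap
SET threads = 8;                                         -- optionally limit parallelism as well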

Results:

  • Cost analysis: not done, as I don't know (read: care) what a second of my MacBook's time really costs.
  • Execution times:

(Execution-time comparison was shared as an image in the original post.)

Conclusions:

  • DuckDB for 3b-row scale analytical queries on an M1 Pro / 32 GB machine (which I believe is a fairly average modern dev machine) is in the same ballpark as any of the big engines.
  • I would think of a "local PC" (with decent capacity) as a "Small" cloud instance, and for single-user ad-hoc queries it is just fast enough.
  • If you go to 10b+ row scale, then you would probably need a big data engine (and they all seem to be good enough nowadays, performance-wise) or go through all the pain of setting up whatever Spark flavor you like locally.
