How fast can a Duck be?
Jaak Laineste
Senior Product Lead for Data Product and Sustainability in Supply Chain Tech | Founder & Geotech Innovator | City Council Member
As a follow-up to my recent comparison of the big three cloud-based data engines (Databricks, Snowflake and MS Fabric), I tried DuckDB, the latest and greatest in-process query engine, with the same data (about 3 billion rows) and the same queries, just to see how fast no-cloud can go nowadays, and whether my MacBook is good enough.
Data:
Setup:
Tests:
ATTACH '/Users/jaak/Downloads/coffeeshop3b.duckdb' AS sweetcoffeetree;
CREATE TABLE sweetcoffeetree.dim_locations AS SELECT * FROM '/Users/jaak/Downloads/coffeeshopdatagenerator-main/Dim_Locations_Table.csv';
CREATE TABLE sweetcoffeetree.dim_products AS SELECT * FROM '/Users/jaak/Downloads/coffeeshopdatagenerator-main/Dim_Products_Table.csv';
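When a table is created straight from a CSV path like this, DuckDB's sniffer guesses the column types. If it guesses wrong, individual columns can be pinned with `read_csv` and its `types` parameter; a minimal sketch (the column names and types here are assumptions for illustration, not the actual generator schema):

```sql
-- Override only the columns the sniffer misreads; others stay auto-detected.
CREATE TABLE sweetcoffeetree.dim_locations AS
SELECT * FROM read_csv(
    '/Users/jaak/Downloads/coffeeshopdatagenerator-main/Dim_Locations_Table.csv',
    header = true,
    types  = {'record_id': 'BIGINT'}  -- hypothetical column, shown as an example
);
```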
CREATE TABLE sweetcoffeetree.facts_sales_3b_rows AS
SELECT
    a.order_id
  , a.order_line_id
  , a.order_date
  , a.time_of_day
  , a.season
  , b.location_id
  , c.name AS product_name
  , a.quantity
  , (c.standard_price * ((100 - a.discount_rate) / 100)) * a.quantity AS sales_amount
  , a.discount_rate AS discount_percentage
FROM
  '/Users/jaak/Downloads/jupydata/generated_dataset_2B_*.parquet' AS a
LEFT JOIN sweetcoffeetree.dim_locations AS b ON (a.location_id = b.record_id)
LEFT JOIN sweetcoffeetree.dim_products AS c
  ON (a.product_id = c.product_id AND a.order_date BETWEEN c.from_date AND c.to_date);
11) select * from sweetcoffeetree.facts_sales_3b_rows;
--> crashed the whole of macOS: DuckDB tried to allocate ~200 GB of swap, and a hardware restart was needed.
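The crash above is the classic failure mode of materializing an unbounded 3B-row result in an in-process engine. A sketch of the guard rails I would try first (the limit values are illustrative, not tuned):

```sql
-- Cap DuckDB's memory use and point spill-to-disk at a known location,
-- so a big query degrades to slow instead of taking down the OS.
SET memory_limit   = '24GB';
SET temp_directory = '/tmp/duckdb_spill';

-- And fetch a bounded sample instead of the full 3B-row table:
SELECT * FROM sweetcoffeetree.facts_sales_3b_rows LIMIT 100;
```

`memory_limit` and `temp_directory` are standard DuckDB settings; with them in place, operators that overflow memory spill to the temp directory rather than forcing macOS into runaway swap.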
12) select count(distinct order_id) from sweetcoffeetree.facts_sales_3b_rows;
--> SLOW: took 97 s / 228 s / 261 s across runs. On Databricks Medium (3B rows): 113 s.
13) select approx_count_distinct(order_id) from sweetcoffeetree.facts_sales_3b_rows;
--> to see how inaccurate (and how fast) the approximation is. DuckDB: 9 s (3B rows), Databricks: 37 s (sic!); both returned a number about 1-2% off the real value (exactly 2B distinct order IDs).
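The accuracy/speed trade-off of `approx_count_distinct` (a HyperLogLog-style sketch) is easy to sanity-check on a small synthetic table; a toy example, with illustrative sizes:

```sql
-- Build 10M rows with roughly 1M distinct ids, then compare exact vs. approximate.
CREATE TABLE distinct_demo AS
SELECT (random() * 1000000)::BIGINT AS id
FROM range(10000000);

SELECT
    count(DISTINCT id)        AS exact_cnt,
    approx_count_distinct(id) AS approx_cnt
FROM distinct_demo;
```

On data this small the two numbers land close together; the same few-percent error band held at 3B rows above, which is usually a fine trade for a 10x-25x speedup when the count only feeds a dashboard.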
Results:
Conclusions: