How fast can a Duck be?

As a follow-up to the recent comparison of the big three cloud-based data engines (Databricks, Snowflake and MS Fabric), I tried DuckDB, the latest and greatest in-process query engine, with the same data (about 3 billion rows) and the same queries, just to see how fast "no cloud" can go nowadays, and whether my MacBook is good enough.

Data:

  • Used the locations and products files from https://github.com/JosueBogran/coffeeshopdatagenerator/
  • For the base facts I first tried the same generator script on Databricks, which generated a nice 170 GB table across 10,000 files in about 20 minutes or so.
  • I did not want to hog the office wifi just to download random data that can be created locally. As I don't have local Spark easily available, I thought it simpler to convert the generator script to plain Python (Polars) instead. Here is the result; it ran for about 4 hours and generated 200 parquet files (25 GB total): https://gist.github.com/jaakla/a3b02255121c2b15a43cbed1e0277d36 (a rough DuckDB-SQL alternative is sketched right after this list).
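
As an aside, a similar facts file could also be generated locally with DuckDB SQL itself. This is not what the gist above does (the gist is a Polars script); the column names, value ranges and file path below are only assumptions for illustration:

COPY (
    SELECT
        range + 1 AS order_line_id
        , (range // 2) + 1 AS order_id                -- roughly two order lines per order
        , DATE '2023-01-01' + CAST(floor(random() * 365) AS INT) AS order_date
        , ['Morning', 'Afternoon', 'Evening'][CAST(floor(random() * 3) AS INT) + 1] AS time_of_day
        , ['Winter', 'Spring', 'Summer', 'Fall'][CAST(floor(random() * 4) AS INT) + 1] AS season
        , CAST(floor(random() * 100) AS INT) + 1 AS location_id
        , CAST(floor(random() * 50) AS INT) + 1 AS product_id
        , CAST(floor(random() * 5) AS INT) + 1 AS quantity
        , CAST(floor(random() * 20) AS INT) AS discount_rate
    FROM range(15000000)                              -- ~15M rows per file; ~200 such files give ~3b rows
) TO '/Users/jaak/Downloads/jupydata/generated_dataset_sketch_001.parquet' (FORMAT PARQUET);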

Setup:

  • DuckDB 1.1.3 (in DBeaver), MacBook Pro with M1 Pro / 32 GB RAM / 1 TB SSD

Tests:

  • Prep: connected to a :memory: database, attached a file database for the test data, and loaded the dimension tables:

attach '/Users/jaak/Downloads/coffeeshop3b.duckdb' AS sweetcoffeetree;

create table sweetcoffeetree.dim_locations as select * from '/Users/jaak/Downloads/coffeeshopdatagenerator-main/Dim_Locations_Table.csv';
create table sweetcoffeetree.dim_products as select * from '/Users/jaak/Downloads/coffeeshopdatagenerator-main/Dim_Products_Table.csv';        

  • Loaded the real data from parquet into a native DuckDB table, assuming it would be faster than querying the parquet files directly. Loading takes a couple of minutes. Later observation: the native DuckDB file was not universally faster than direct parquet use, so next time I could just skip this step and save some GBs of disk space (a direct-parquet variant is sketched right after the query below).

create table sweetcoffeetree.facts_sales_3b_rows as
SELECT
    a."Order_ID"
    , a.order_line_id
    , a.order_date
    , a.time_of_day
    , a.season
    , b.location_id
    , c.name AS product_name
    , a.quantity
    , (c.standard_price * ((100 - a.discount_rate) / 100)) * a.Quantity AS sales_amount
    , a.discount_rate AS discount_percentage
FROM '/Users/jaak/Downloads/jupydata/generated_dataset_2B_*.parquet' AS a
LEFT JOIN sweetcoffeetree.dim_locations AS b
    ON (a.Location_ID = b.record_id)
LEFT JOIN sweetcoffeetree.dim_products AS c
    ON (a."Product_ID" = c.product_id AND a.Order_Date BETWEEN c.from_date AND c.to_date);

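The direct-parquet re-runs mentioned below kept this same SELECT as a CTE instead of materializing the table. A minimal sketch of that pattern, reusing the paths and columns from above (the aggregate at the end is only an illustrative query, not one of the original benchmark queries):

WITH facts AS (
    SELECT
        a.season
        , (c.standard_price * ((100 - a.discount_rate) / 100)) * a.Quantity AS sales_amount
    FROM read_parquet('/Users/jaak/Downloads/jupydata/generated_dataset_2B_*.parquet') AS a
    LEFT JOIN sweetcoffeetree.dim_locations AS b
        ON (a.Location_ID = b.record_id)              -- kept for parity with the table above, unused by this aggregate
    LEFT JOIN sweetcoffeetree.dim_products AS c
        ON (a."Product_ID" = c.product_id AND a.Order_Date BETWEEN c.from_date AND c.to_date)
)
SELECT season, round(sum(sales_amount), 2) AS total_sales
FROM facts
GROUP BY season
ORDER BY season;
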
  • Real tests: exactly the same SQL as in the post mentioned above; I only removed the backticks (`), which DuckDB SQL does not accept.
  • In the first run I used 1/3-size data to get some idea of scaling. I was also too impatient to wait for the full generation, and a bit nervous that the results would be so horrible that the full test would not work anyway and I would have to cancel it (I did not!).
  • Extra tests: I re-ran some of the tests against the parquet files directly; the SELECT from the "create table" above was used as-is, as a CTE in the test queries (see the CTE sketch above).
  • Added some small extra tests:

11) select * from sweetcoffeetree.facts_sales_3b_rows;
    --> In DuckDB this crashed the whole macOS, trying to allocate ~200 GB of swap; a hardware restart was needed (see the memory-limit note after these tests).
12) select count(distinct order_id) from sweetcoffeetree.facts_sales_3b_rows;
    --> SLOW: took 97 s / 228 s / 261 s. On Databricks Medium (3b rows): 113 s.
13) select approx_count_distinct(order_id) from sweetcoffeetree.facts_sales_3b_rows;
    --> To see how inaccurate (and fast) that is: DuckDB 9 s (3b rows), Databricks 37 s (sic!); both gave a number about 1-2% off the real value (exactly 2b).
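
Side note on extra test 11: the runaway swap might have been contained by capping DuckDB's memory and giving it an explicit spill location before running the heavy queries (untested here, and whether DBeaver itself could handle a 3b-row result is a separate question). These are standard DuckDB settings; the values are illustrative only:

SET memory_limit = '24GB';                               -- cap DuckDB's own memory use
SET temp_directory = '/Users/jaak/Downloads/duckdb_tmp'; -- spill to disk here rather than relying on macOS swap
SET threads = 8;                                         -- optionally limit parallelism as well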

Results:

  • Cost analysis: not done, as I don't know (read: care) what a second of my MacBook's time really costs.
  • Execution times:

(Execution-time comparison was shared as an image in the original post.)

Conclusions:

  • DuckDB for 3b-row scale analytical queries on an M1 Pro / 32 GB machine (which I believe is a fairly average modern dev machine) is in the same ballpark as any of the big engines.
  • I would think of a "local PC" (with decent capacity) as a "Small" cloud instance, and for single-user ad-hoc queries it is just fast enough.
  • If you go to 10b+ row scale, then you would probably need a big data engine (and they all seem to be good enough nowadays, performance-wise) or go through all the pain of setting up whatever Spark flavor you like locally.
