Querying a Snowflake Data Warehouse

In Snowflake there is a sample database called SNOWFLAKE_SAMPLE_DATA. This database contains six schemas, each holding a sample data warehouse:

One of them is TPCDS_SF100TCL. That is a 100 TB data warehouse:

The TPC-DS data warehouse contains 7 fact tables and 17 dimensions. It stores 5 years of sales data for a department store chain in the US, with 100 million customers and 500 thousand products. The company conducts its business over 3 sales channels: physical stores, online and catalog orders. The largest fact table is Store Sales, with 288 billion rows, which contains the sales data from the physical stores. The second largest fact table is highlighted above: Catalog Sales, which contains 144 billion rows.
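
If you want to check those row counts yourself, a plain COUNT(*) is enough. Snowflake answers an unfiltered COUNT(*) from micro-partition metadata, so these should come back almost instantly, assuming the SNOWFLAKE_SAMPLE_DATA share is available in your account:

-- Row counts are served from table metadata, so these are cheap to run.
SELECT COUNT(*) FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.STORE_SALES;
SELECT COUNT(*) FROM SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CATALOG_SALES;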

The ERD for Catalog Sales is like this:

When querying a data warehouse this size, you can’t just join the fact table with the dimensions and hope for the best. A query like the one below works fine if your data warehouse contains fact tables with a few million rows:

SELECT C.C_FIRST_NAME, C.C_LAST_NAME, C.C_EMAIL_ADDRESS
FROM CATALOG_SALES CS
JOIN CUSTOMER C
    ON CS.CS_SHIP_CUSTOMER_SK = C.C_CUSTOMER_SK;        

But if your fact table contains billions of rows, a query like that would take a long time and cost quite a lot of money.
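
Before running anything against a table this big, it is worth setting your context explicitly and putting a safety cap on runaway statements. The warehouse name MY_WH below is just a placeholder for whichever virtual warehouse you have access to:

-- MY_WH is a placeholder; substitute your own virtual warehouse name.
USE WAREHOUSE MY_WH;
USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL;

-- Safety net: cancel any statement in this session that runs longer than 10 minutes.
ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 600;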

Querying Fact Tables with Billions of Rows

When querying a fact table containing 144 billion rows, like Catalog Sales above, the first thing we need to do is look at the clustering key. You can do that by typing this:

SELECT SYSTEM$CLUSTERING_INFORMATION('CATALOG_SALES')

If you click on the output row, on the right hand side you get the clustering information of that fact table. The red arrow below shows you the clustering key:

As we can see above, for the Catalog Sales fact table the clustering keys are the cs_sold_date_sk and cs_item_sk columns. The “sk” suffix means that these columns are surrogate keys. A surrogate key is the primary key of a dimension, so cs_sold_date_sk points to the primary key of the Date dimension and cs_item_sk points to the primary key of the Item dimension.
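
To get a feel for how a surrogate key works, you can look one up in the Date dimension directly. The date below is just an example value:

-- Find the surrogate key for one calendar date in the Date dimension.
SELECT D_DATE_SK, D_DATE
FROM DATE_DIM
WHERE D_DATE = '2002-11-21';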

The fact tables in a Snowflake data warehouse are divided into something called micro-partitions. Micro-partitions are storage units; each micro-partition contains between 50 and 500 MB of data. That is the uncompressed size, but because the data is always stored compressed, the actual size on disk is smaller.

The micro-partitions are organised along the clustering keys. You can read up on the items below the red arrows above: total partition count, total constant partition count, average overlaps and average depth – they show that the clustering keys were chosen well. The point is: when querying a large table containing billions of rows, we need to follow the clustering keys.
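
If you just want the headline number rather than the full JSON output, SYSTEM$CLUSTERING_DEPTH returns the average depth for a given set of columns. Here I am passing the two clustering key columns we found above:

-- Average clustering depth for the two clustering key columns.
SELECT SYSTEM$CLUSTERING_DEPTH('CATALOG_SALES', '(CS_SOLD_DATE_SK, CS_ITEM_SK)');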

This is what the micro-partitions look like:

The table on the left above contains 4 columns: Type, Name, Country and Date. Snowflake is a column-oriented database. It stores the data column by column, not row by row like a normal transactional database. If we look at Micro-partition 1 above, the values of the Type column are stored together, followed by the Name, Country and Date columns.

The blue row on the left table is stored in Micro-partition 1, with each column stored separately. The red row on the left table is stored in Micro-partition 4, with each column stored separately.

OK. Enough with the theory; back to the “Querying Data Warehouse” business. Now that we know the clustering keys, we can query the Catalog Sales fact table by joining it to the Date dimension on cs_sold_date_sk and filtering on a certain date. Like this:

SELECT CS.* FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
WHERE DD.D_DATE = '2002-11-21'
LIMIT 100;        
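
As a side note, Snowflake also accepts the TOP keyword instead of LIMIT; the query below should be equivalent to the one above:

SELECT TOP 100 CS.* FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
WHERE DD.D_DATE = '2002-11-21';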

Either way, don't forget to cap the result set. When querying a very large table like this we must specify LIMIT (or TOP) in the query; otherwise the query will go on and on, costing a lot of credits. The output of the above query is like below:

It shows all the columns from the Catalog Sales table. The query finished in 3.2 seconds (red arrow above), and returned 100 rows (green arrow).

If you click the “3 dots” (yellow arrow above) and select View Query Profile, you will see something like this:

This is the query profile of the SQL we just ran. The red arrows show that the final output contains 100 rows; before that, 256 rows; and before that it splits into 2 branches: the left branch has 1 row and the right branch has 256 rows. The green arrows show the most expensive node, which is TableScan [7], consuming 92.2% of the total execution time, and the most expensive operation, which is Remote Disk I/O, consuming 87.7% of the total execution time.

If you look at the yellow arrow on the left above, this whole page is in the Query History section of the UI. There are 3 tabs on this page: Query Details, Query Profile and Query Telemetry. We are on the middle tab, which is Query Profile.

When querying a very large table containing billions of rows, we need to look at the Query Profile, to understand what the query is doing.
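
You can also pull the headline numbers with SQL instead of the UI. Here is a minimal sketch using the QUERY_HISTORY table function in the information schema; you may need to switch to a database in your own account for INFORMATION_SCHEMA to resolve, and the columns selected below are just the ones I find most useful:

-- Recent queries visible to you, most recent first.
SELECT QUERY_ID,
       QUERY_TEXT,
       TOTAL_ELAPSED_TIME / 1000 AS ELAPSED_SECONDS,
       BYTES_SCANNED,
       ROWS_PRODUCED
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
ORDER BY START_TIME DESC
LIMIT 5;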

OK. Let’s go back to the Querying Data Warehouse business. We can now add the second dimension on the clustering key, which is the Item dimension.

SELECT I.I_CATEGORY, SUM(CS_NET_PAID) AS AMOUNT
FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
JOIN ITEM AS I ON CS.CS_ITEM_SK = I.I_ITEM_SK
WHERE DD.D_DATE = '2002-11-21'
GROUP BY I.I_CATEGORY;        

The output is like below:

It took just 4.4 seconds to query 144 billion rows, join them to the Item dimension and sum up the amount for each product category. Not bad, hey?
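
A natural variation to try is widening the date filter to a whole month. The query still prunes on cs_sold_date_sk, it just scans more micro-partitions (and therefore more credits):

-- Same aggregation, but over a full month instead of a single day.
SELECT I.I_CATEGORY, SUM(CS_NET_PAID) AS AMOUNT
FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
JOIN ITEM AS I ON CS.CS_ITEM_SK = I.I_ITEM_SK
WHERE DD.D_DATE BETWEEN '2002-11-01' AND '2002-11-30'
GROUP BY I.I_CATEGORY;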

OK. Here's the final SQL for today. It gets the order amount for each product category, for each city in the state of Michigan:

SELECT I.I_CATEGORY, CA.CA_CITY, SUM(CS_NET_PAID) AS AMOUNT
FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
JOIN ITEM AS I ON CS.CS_ITEM_SK = I.I_ITEM_SK
JOIN CUSTOMER AS C
   ON CS.CS_SHIP_CUSTOMER_SK = C.C_CUSTOMER_SK
JOIN CUSTOMER_ADDRESS AS CA
   ON CS.CS_SHIP_ADDR_SK = CA.CA_ADDRESS_SK
WHERE DD.D_DATE = '2002-11-21'
   AND CA.CA_STATE = 'MI'
GROUP BY 1,2;

The output is like this:

It took 7.2 seconds, returning 8000 rows across all cities in Michigan. Now try to query different things yourself, and each time look at the query profile to understand what it is doing under the hood.

Our last query above has a query profile like this:

Explore the profile, understand where it spends most of its time, and understand the various operations.
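
If you prefer doing that in SQL, Snowflake exposes the same operator-level statistics through the GET_QUERY_OPERATOR_STATS table function. Run it right after the query, or pass a specific query ID:

-- Operator-level stats (rows, execution time breakdown, etc.) for the most recent query.
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));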

And that my friends, is how you query a Snowflake data warehouse. Have fun!

As usual, I would appreciate your comments and opinions below.
