Querying a Snowflake Data Warehouse


In Snowflake there is a sample database called Snowflake_Sample_Data. This database contains six sample data warehouses (schemas):

One of them is TPCDS_SF100TCL. That is a 100 TB data warehouse:

The TPC-DS data warehouse contains 7 fact tables and 17 dimensions. It stores 5 years of sales data for a department store company in the US, with 100 million customers and 500 thousand products. The company conducts its business over 3 sales channels: physical stores, the web, and catalog orders. The largest fact table is Store Sales, with 288 billion rows, which holds the sales made in the physical stores. The second largest fact table is highlighted above: Catalog Sales, which contains 144 billion rows.
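
If you want to follow along, point your session at that schema first. The warehouse name below is just a placeholder; use whichever virtual warehouse you have access to:

USE WAREHOUSE MY_WH;   -- placeholder name, substitute your own warehouse
USE SCHEMA SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL;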

The ERD for Catalog Sales is like this:

When querying a data warehouse this size, you can’t just join the fact table to the dimensions and hope for the best. A query like the one below works fine if your fact tables only contain a few million rows:

SELECT C.C_FIRST_NAME, C.C_LAST_NAME, C.C_EMAIL_ADDRESS
FROM CATALOG_SALES CS
JOIN CUSTOMER C
    ON CS.CS_SHIP_CUSTOMER_SK = C.C_CUSTOMER_SK;        

But if your fact table contains billions of rows, a query like that would take a long time and cost quite a lot of money.

Querying Fact Tables with Billions of Rows

When querying a fact table containing 144 billion rows, like Catalog Sales above, the first thing we need to do is look at the clustering key. You can do that by running this:

SELECT SYSTEM$CLUSTERING_INFORMATION('CATALOG_SALES');

If you click on the output row, you get the clustering information of that fact table on the right-hand side. The red arrow below points to the clustering key:

As we can see above, the clustering keys of the Catalog Sales fact table are the cs_sold_date_sk and cs_item_sk columns. The “sk” suffix means that these columns contain surrogate keys. A surrogate key is the primary key of a dimension, so cs_sold_date_sk refers to the primary key of the Date dimension, and cs_item_sk refers to the primary key of the Item dimension.
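
To see what a surrogate key looks like in practice, here is a minimal lookup against the Date dimension (the column names come from the TPC-DS schema; the key value you get back is just an integer that means nothing on its own):

SELECT D_DATE_SK, D_DATE
FROM DATE_DIM
WHERE D_DATE = '2002-11-21';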

The tables in a Snowflake data warehouse are divided into units called micro-partitions. A micro-partition is a storage unit containing between 50 and 500 MB of uncompressed data; because the data is always stored compressed, the actual size on disk is smaller.

The micro-partitions are organized based on the clustering keys. You can read up on the items below the red arrows above – total partition count, total constant partition count, average overlaps and average depth – they show that the clustering keys were chosen well. The point is: when querying a large table containing billions of rows, we need to follow the clustering keys.
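
If you prefer reading those numbers as columns rather than raw JSON, you can parse the output of SYSTEM$CLUSTERING_INFORMATION. This is just a sketch; the field names follow the Snowflake documentation for that function:

WITH CI AS (
    SELECT PARSE_JSON(SYSTEM$CLUSTERING_INFORMATION('CATALOG_SALES')) AS INFO
)
SELECT INFO:cluster_by_keys::STRING                  AS CLUSTER_BY_KEYS,
       INFO:total_partition_count::NUMBER            AS TOTAL_PARTITION_COUNT,
       INFO:total_constant_partition_count::NUMBER   AS TOTAL_CONSTANT_PARTITION_COUNT,
       INFO:average_overlaps::FLOAT                  AS AVERAGE_OVERLAPS,
       INFO:average_depth::FLOAT                     AS AVERAGE_DEPTH
FROM CI;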

This is what the micro-partitions look like:

The table on the left above contains 4 columns: Type, Name, Country and Date. Snowflake is a column-oriented database: it stores data column by column, not row by row like a typical transactional database. If we look at Micro-partition 1 above, the values of the Type column are stored together, followed by the Name, Country and Date columns.

The blue row on the left table is stored in Micro-partition 1, with each column stored separately. The red row on the left table is stored in Micro-partition 4, with each column stored separately.
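
One practical consequence of this columnar layout: columns you don’t select are never read from the micro-partitions, so it pays to list only the columns you actually need instead of SELECT *. For example (an illustrative query, not one of the screenshots in this article):

-- Only these 4 columns are read from each micro-partition that gets scanned
SELECT CS_ORDER_NUMBER, CS_ITEM_SK, CS_QUANTITY, CS_NET_PAID
FROM CATALOG_SALES
LIMIT 100;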

OK. Enough with the theory, and back to the “Querying Data Warehouse” business. Now that we know the clustering keys, we can query the Catalog Sales fact table by joining it to the Date dimension on cs_sold_date_sk and filtering on a specific date, like this:

SELECT CS.* FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
WHERE DD.D_DATE = '2002-11-21'
LIMIT 100;        
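
By the way, Snowflake also supports TOP as an alternative to LIMIT. A minimal equivalent of the query above:

SELECT TOP 100 CS.*
FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
WHERE DD.D_DATE = '2002-11-21';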

Don't forget the LIMIT clause. When querying a very large table like this, we must specify LIMIT (or TOP) in the query. Otherwise the query will go on and on, costing a lot of credits. The output of the above query looks like this:

It shows all the columns from the Catalog Sales table. The query finished in 3.2 seconds (red arrow above), and returned 100 rows (green arrow).

If you click the “3 dots” (yellow arrow above) and select View Query Profile, you will see something like this:

This is the query profile of the SQL we just ran. The red arrows show that the final output contains 100 rows, before that 256 rows, and before that it splits into 2 branches: the left branch has 1 row and the right branch has 256 rows. The green arrows show the most expensive node, TableScan[7], consuming 92.2% of the total execution time, and the most expensive operation, Remote Disk I/O, consuming 87.7% of the total execution time.

If you look at the yellow arrow on the left above, this whole thing is in the Query History section of the UI. There are 3 tabs on this page: Query Details, Query Profile and Query Telemetry. We are on the middle tab, which is Query Profile.

When querying a very large table containing billions of rows, we need to look at the Query Profile, to understand what the query is doing.
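
If you prefer SQL to clicking around the UI, Snowflake can also expose operator-level statistics (similar to what the Query Profile shows) through the GET_QUERY_OPERATOR_STATS table function. A small sketch, run straight after the query you want to profile:

-- Operator statistics for the most recent query in this session
SELECT *
FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));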

OK. Let’s go back to the “Querying Data Warehouse” business. We can now add the second dimension in the clustering key, which is the Item dimension.

SELECT I.I_CATEGORY, SUM(CS_NET_PAID) AS AMOUNT
FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
JOIN ITEM AS I ON CS.CS_ITEM_SK = I.I_ITEM_SK
WHERE DD.D_DATE = '2002-11-21'
GROUP BY I.I_CATEGORY;        

The output is like below:

It took just 4.4 seconds to query 144 billion rows, join them to the Date and Item dimensions, and sum up the amount for each product category. Not bad, hey?
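
Because the date surrogate key is part of the clustering key, you can widen the filter from a single day to a whole month (or year) and still benefit from partition pruning; the query will simply scan proportionally more data. For example, a variation of the query above for the whole of November 2002 (D_YEAR and D_MOY are columns of the Date dimension):

SELECT I.I_CATEGORY, SUM(CS_NET_PAID) AS AMOUNT
FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
JOIN ITEM AS I ON CS.CS_ITEM_SK = I.I_ITEM_SK
WHERE DD.D_YEAR = 2002
   AND DD.D_MOY = 11
GROUP BY I.I_CATEGORY;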

OK. Here’s the final SQL for today. It gets the order amount for each product category, for each city in Michigan state:

SELECT I.I_CATEGORY, CA.CA_CITY, SUM(CS_NET_PAID) AS AMOUNT
FROM CATALOG_SALES CS
JOIN DATE_DIM DD ON CS.CS_SOLD_DATE_SK = DD.D_DATE_SK
JOIN ITEM AS I ON CS.CS_ITEM_SK = I.I_ITEM_SK
JOIN CUSTOMER AS C
   ON CS.CS_SHIP_CUSTOMER_SK = C.C_CUSTOMER_SK
JOIN CUSTOMER_ADDRESS AS CA
   ON CS.CS_SHIP_ADDR_SK = CA.CA_ADDRESS_SK
WHERE DD.D_DATE = '2002-11-21'
   AND CA.CA_STATE = 'MI'
GROUP BY 1,2;

The output is like this:

It took 7.2 seconds, returning 8000 rows across all cities in Michigan. Now try to query different things yourself, and each time look at the query profile to understand what it is doing under the hood.

Our last query above has a query profile like this:

Explore the profile, understand where it spends most of its time, and get to know the various operations.

And that, my friends, is how you query a Snowflake data warehouse. Have fun!

As usual, I would appreciate your comments and opinions below.

