SQL Wizardry — From Hiatus to Mastery | The Snowflake Way

I bet this has happened to most of us. We learnt something back in school, college, or some other educational setting, and then completely lost touch with it. There can be several reasons why that happens: the knowledge wasn't applied or revisited frequently and hence faded away, or the information was memorized without a deep understanding of its concepts or relevance, which makes it less likely to be retained.

One such skill for me has been SQL. I picked it up really quickly back in college (so much so that my RDBMS Professor was quite impressed and suggested taking it up full-time after my graduation; Data Analyst jobs were a trending sensation back then).

Well, I must say my Professor was quite the prophet. While I didn't take up SQL immediately after graduation, it was only recently that I did a project for an ex-employer of mine (seeking trends and querying meaningful insights out of the raw order (transactional) data), and we used SQL for it. The platform we worked on was Snowflake.

The way you approach SQL can change significantly depending upon the database you are using, or the platform you are running your SQL on.

In our case, as I said, we used Snowflake. For those of you who may not know (just in case): Snowflake is a cloud-based data platform designed for data warehousing, analytics, and sharing. It is built on cloud infrastructure (e.g., AWS, Azure, or GCP) and allows businesses to store, manage, and analyze large amounts of data efficiently.

Right at this moment, a really important question to ask is:

What SQL does Snowflake use?

Snowflake actually uses a version of ANSI SQL (American National Standards Institute SQL). It supports most standard SQL commands and provides additional extensions tailored for cloud-based analytics and data warehousing.

Some unique features include:

  • Support for semi-structured data types like JSON, Parquet, and Avro.
  • SQL functions for handling variant and array data types.
  • Time Travel for querying historical data. This one's quite a gig; I will write at length about it later, so hang in there, if you will. I shall also tell you why I find it a really important feature of Snowflake.
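The semi-structured support above is worth a tiny illustration. In Snowflake you would store such payloads in a VARIANT column and read them with a path syntax; as a portable sketch of the same idea, here is a hypothetical example using Python's sqlite3 and SQLite's JSON1 functions (the table, column, and values here are all made up, not from any real dataset).

```python
import sqlite3

# In-memory database standing in for a warehouse table that stores raw JSON.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")  # JSON kept as text
conn.execute(
    "INSERT INTO raw_events VALUES (?)",
    ('{"customer": {"name": "Asha", "age": 31}, "order_total_inr": 2499}',),
)

# json_extract is SQLite's rough analogue of Snowflake's path syntax.
row = conn.execute(
    """
    SELECT json_extract(payload, '$.customer.name'),
           json_extract(payload, '$.order_total_inr')
    FROM raw_events
    """
).fetchone()
print(row)  # ('Asha', 2499)
```

In Snowflake itself, the equivalent read on a VARIANT column would look like `payload:customer.name::STRING`.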

Alright, now let's get to business.

Today, I want to share some of my favorite functions / topics / concepts in ANSI SQL (Snowflake) that I was hands-on with in the project I mentioned above. Before I get to that, though, I really want to mention that there was a long hiatus before I resumed SQL, when asked to lead this project, and it felt really satisfying to see how well things can be turned around when done with focus and perseverance.

And before I explain each of those concepts and the right syntactical / logical approach to using them, it is important that we create some dummy data first, so as to be able to demonstrate how each of those concepts / functions applies to data.

Creating some dummy data [table = DUMMY_ORDERITEM_DATA]

Step 01: Create a dummy table using a CREATE TABLE statement to define the table's schema. (I want to create very transactional data, like e-commerce companies have, keeping the orders' data at an order-item level so as to have decent granularity, to be able to use some of the window functions in their quintessential capacity and strength when I come to demonstrating them.)

Here, I have just created a table (a schema for now, though) called DUMMY_ORDERITEM_DATA. We are yet to insert dummy data, if you will. Let's get to that.

Inserting some dummy data [table = DUMMY_ORDERITEM_DATA]

-- Creating a sequence CTE with numbers from 1 to 10 (an upper bound on items per order)
WITH seq AS (
    SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) AS num FROM TABLE(GENERATOR(ROWCOUNT => 10))
),
orders AS (
    SELECT
        'ORD_' || (10000 + SEQ4()) AS order_id,
        UNIFORM(1, 6, RANDOM()) AS num_items,
        'Customer_' || SEQ4() AS customer_name,
        UNIFORM(18, 65, RANDOM()) AS customer_age,
        'CUST_' || (10000 + SEQ4()) AS customer_id,
        CURRENT_DATE - UNIFORM(0, 365, RANDOM()) AS order_date,
        CASE UNIFORM(0, 2, RANDOM()) 
            WHEN 0 THEN 'Male'
            WHEN 1 THEN 'Female'
            ELSE 'Non-Binary'
        END AS customer_gender
    FROM TABLE(GENERATOR(ROWCOUNT => 1000))
),
items AS (
    SELECT
        o.order_id,
        o.order_id || '_ITEM_' || ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY RANDOM()) AS item_id, -- order_id already carries the 'ORD_' prefix
        UNIFORM(1, 10, RANDOM()) AS order_item_qty,
        o.customer_name,
        o.customer_age,
        o.customer_id,
        o.order_date,
        o.customer_gender,
        ARRAY_CONSTRUCT('Electronics', 'Clothing', 'Books', 'Home', 'Toys')[UNIFORM(0, 4, RANDOM())] AS order_product_type, -- UNIFORM is inclusive of both bounds, so 0-4 for a 5-element array
        ARRAY_CONSTRUCT('Red', 'Blue', 'Green', 'Black', 'White')[UNIFORM(0, 4, RANDOM())] AS order_product_colour,
        CASE UNIFORM(0, 3, RANDOM())
            WHEN 0 THEN 'Male'
            WHEN 1 THEN 'Female'
            WHEN 2 THEN 'Unisex'
            ELSE 'Kids'
        END AS order_product_gender,
        UNIFORM(100, 5000, RANDOM()) AS order_item_net_total_inr,
        ARRAY_CONSTRUCT('Online', 'In-Store', 'Mobile')[UNIFORM(0, 2, RANDOM())] AS order_sales_channel
    FROM orders o
    CROSS JOIN seq
    WHERE seq.num <= o.num_items  -- Ensuring only `num_items` rows are created per order
)
SELECT * FROM items;

And let's look at how the data looks. Let's do a select * from … (the most famous jazz) :p

Actually, in the query above I ended up creating the two CTEs for the orders and items dummy data; however, I did not really insert the data into the DUMMY_ORDERITEM_DATA table.

Let me just do that here, again!

INSERT INTO dummy_orderitem_data
WITH seq AS (
    SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) AS num FROM TABLE(GENERATOR(ROWCOUNT => 10))
),
orders AS (
    SELECT
        'ORD_' || (10000 + SEQ4()) AS order_id,
        UNIFORM(1, 6, RANDOM()) AS num_items,  -- Random number of items per order
        'Customer_' || UNIFORM(1, 300, RANDOM()) AS customer_name, -- Allow repeat customers
        UNIFORM(18, 65, RANDOM()) AS customer_age,
        'CUST_' || (10000 + UNIFORM(1, 300, RANDOM())) AS customer_id, -- Repeat customer IDs
        CURRENT_DATE - UNIFORM(0, 365, RANDOM()) AS order_date,
        CASE UNIFORM(0, 2, RANDOM()) 
            WHEN 0 THEN 'Male'
            WHEN 1 THEN 'Female'
            ELSE 'Non-Binary'
        END AS customer_gender
    FROM TABLE(GENERATOR(ROWCOUNT => 2000)) -- Increased row count to allow duplicate customers
),
items AS (
    SELECT
        o.order_id,
        o.order_id || '_ITEM_' || ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY RANDOM()) AS item_id, -- order_id already carries the 'ORD_' prefix
        UNIFORM(1, 10, RANDOM()) AS order_item_qty,
        o.customer_name,
        o.customer_age,
        o.customer_id,
        o.order_date,
        o.customer_gender,
        ARRAY_CONSTRUCT('Electronics', 'Clothing', 'Books', 'Home', 'Toys')[UNIFORM(0, 4, RANDOM())] AS order_product_type, -- UNIFORM is inclusive of both bounds, so 0-4 for a 5-element array
        ARRAY_CONSTRUCT('Red', 'Blue', 'Green', 'Black', 'White')[UNIFORM(0, 4, RANDOM())] AS order_product_colour,
        CASE UNIFORM(0, 3, RANDOM())
            WHEN 0 THEN 'Male'
            WHEN 1 THEN 'Female'
            WHEN 2 THEN 'Unisex'
            ELSE 'Kids'
        END AS order_product_gender,
        UNIFORM(100, 5000, RANDOM()) AS order_item_net_total_inr,
        ARRAY_CONSTRUCT('Online', 'In-Store', 'Mobile')[UNIFORM(0, 2, RANDOM())] AS order_sales_channel
    FROM orders o
    CROSS JOIN seq
    WHERE seq.num <= o.num_items  -- Ensuring only `num_items` rows are created per order
)
SELECT * FROM items;        

One thing I want to highlight above is that when I did 'Customer_' || UNIFORM(1, 300, RANDOM()) AS customer_name, I ensured that a customer can have multiple orders (customer names now repeat across orders).

This is because earlier I had kept it as 'Customer_' || SEQ4() AS customer_name, which was leading to a very peculiar problem: when I tried to perform some order sequencing on the data via the ROW_NUMBER() function, it gave me messed-up data in the end. For instance, I could see a customer showing the same order ID with different order sequence numbers.
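To make that "same order ID, different sequence numbers" problem concrete, here is a small, hypothetical reproduction in Python's sqlite3 (SQLite supports ROW_NUMBER too; all names and values are made up): sequencing item-level rows directly duplicates each multi-item order, while deduplicating to one row per order first gives one sequence number per order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (order_id TEXT, item_id TEXT, customer_id TEXT, order_date TEXT);
-- One customer, two orders, the first order having two items.
INSERT INTO order_items VALUES
  ('ORD_1', 'ORD_1_ITEM_1', 'CUST_1', '2024-01-10'),
  ('ORD_1', 'ORD_1_ITEM_2', 'CUST_1', '2024-01-10'),
  ('ORD_2', 'ORD_2_ITEM_1', 'CUST_1', '2024-03-05');
""")

# Naive: ROW_NUMBER over item-level rows; ORD_1 shows up with two different sequences.
naive = conn.execute("""
    SELECT order_id,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS seq
    FROM order_items
""").fetchall()

# Fixed: deduplicate to one row per order first, then sequence.
fixed = conn.execute("""
    WITH unique_orders AS (
        SELECT DISTINCT order_id, customer_id, order_date FROM order_items
    )
    SELECT order_id,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS seq
    FROM unique_orders
""").fetchall()

print(naive)  # ORD_1 appears twice, with different sequence numbers
print(fixed)  # one row per order: ORD_1 -> 1, ORD_2 -> 2
```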

Running SELECT * FROM items at this point shows the dummy order-item rows we just generated.

Now let us get down to the concepts I was talking about:

The very first concept I’d like to talk about is a CTE or a Common Table Expression.

CTE: Common Table Expression

If you have been a regular SQL'er, I imagine you must be playing around with a lot of joins, considering most organizations that use SQL databases use a relational one (RDBMS), with data being queried from across a myriad of tables. Haven't you often felt that in your pursuit of applying joins, you rather end up creating a mess in your head and lose the train of where the joins started from and where they led to?

I think, be it any analysis, retaining the chronology of all the steps that go into your analysis is the key. CTEs are the way you can actually draw a chronology in your overall data analysis, where you are using SQL to fetch insights out of your raw / crude data.

Also, one should know that CTEs (Common Table Expressions) are temporary. They only exist for the duration of the query in which they are used. Once the query execution is complete, the CTE is discarded and does not store data permanently.
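A hypothetical way to see that temporariness for yourself, sketched with Python's sqlite3 (the behavior is the same in Snowflake; the table and values are made up): the CTE's name is visible only inside the one statement that defines it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("ORD_1", 500), ("ORD_2", 1500)])

# The CTE exists only for the duration of this one statement.
total = conn.execute("""
    WITH big_orders AS (
        SELECT * FROM orders WHERE amount > 1000
    )
    SELECT COUNT(*) FROM big_orders
""").fetchone()[0]
print(total)  # 1

# Once the statement finishes, big_orders is gone; referencing it fails.
cte_persisted = True
try:
    conn.execute("SELECT * FROM big_orders")
except sqlite3.OperationalError:  # "no such table: big_orders"
    cte_persisted = False
print(cte_persisted)  # False
```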

Syntactically speaking, you define a CTE as follows:

WITH CTE_NAME AS ( … )

And to create multiple CTEs in a single query, you chain them under one WITH: CTE1 AS ( … ), CTE2 AS ( … ), CTE3 AS ( … ), and then, in the final SELECT at the end, you go ahead and marry the CTEs together, joining all of them. Now you know how you would use CTEs.
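Here is a minimal, hypothetical sketch of that chaining (Python's sqlite3 again; the tables, CTE names, and values are mine): two CTEs defined under one WITH, the second built on the first, and the final SELECT joining the result back to a base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id TEXT, customer_name TEXT);
CREATE TABLE orders (order_id TEXT, customer_id TEXT, amount INTEGER);
INSERT INTO customers VALUES ('CUST_1', 'Asha'), ('CUST_2', 'Ravi');
INSERT INTO orders VALUES
  ('ORD_1', 'CUST_1', 700), ('ORD_2', 'CUST_1', 1800), ('ORD_3', 'CUST_2', 300);
""")

rows = conn.execute("""
    WITH spend AS (                    -- CTE1: total spend per customer
        SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id
    ),
    big_spenders AS (                  -- CTE2: built on top of CTE1
        SELECT customer_id, total FROM spend WHERE total > 1000
    )
    SELECT c.customer_name, b.total    -- final SELECT marries the CTEs/tables
    FROM big_spenders b
    JOIN customers c ON c.customer_id = b.customer_id
""").fetchall()
print(rows)  # [('Asha', 2500)]
```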

Some advantages of CTEs:

  • CTEs make complex queries more readable and manageable.
  • They help avoid duplicate subqueries.
  • Recursive CTEs are great for hierarchical data (ok, this one might be a little obscure to explain in one go, but I will try).
  • Some databases (like Snowflake, PostgreSQL, SQL Server) optimize CTEs for better performance.
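Since recursive CTEs came up, here is a tiny, hypothetical org-chart example (sketched with Python's sqlite3; Snowflake supports the same WITH RECURSIVE shape), walking a manager-to-report hierarchy level by level. All names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'Priya', NULL),   -- the root of the hierarchy
  (2, 'Dev',   1),
  (3, 'Meera', 1),
  (4, 'Kabir', 2);
""")

# Anchor member: the root. Recursive member: join children onto rows found so far.
rows = conn.execute("""
    WITH RECURSIVE chain AS (
        SELECT emp_id, name, 0 AS depth
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.emp_id, e.name, chain.depth + 1
        FROM employees e
        JOIN chain ON e.manager_id = chain.emp_id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()
print(rows)  # [('Priya', 0), ('Dev', 1), ('Meera', 1), ('Kabir', 2)]
```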

<---------------------- Demonstration for CTEs ---------------------->

So, today is Jan 31, 2025, and I am back demonstrating CTEs as promised. How is it going?

As you will see in the code below, I am trying to create an analytical use-case wherein I come up with all those customers who placed their first order in the first half of 2024 and came back for an incremental order in the second half of 2024.

Here, I have used the concept of CTEs as such:

  • unique_orders is a CTE which helps me create a distinct set of Order ID, Customer ID, and Order Date, thus eliminating the dimension of items. I created this CTE purely so that the order sequencing done in the subsequent CTE, cte_orders_sequencing, does not get spoiled (in view of the issue I shared above, where the same order ID was repeating with different order sequence numbers).
  • cte_orders_sequencing is a CTE which helps me create the order sequencing for a customer, looking at all his/her orders. Here, I have used the ROW_NUMBER() function with an OVER clause, partitioning the entire unique_orders CTE by customer_id and ordering it by order_date. This spits out an order sequence.
  • Next, I create a first_orderers_first_half_2024 CTE which gives me the Order ID, Customer ID, and Order Date for all the customers who made their first order in the first half of 2024.
  • Next, I create an incremental_orderers_second_half_2024 CTE which gives me the Order ID, Customer ID, and Order Date for all the customers who made their first order in the first half of 2024 and later came back for an incremental order in the second half of 2024.
  • At the end, I do a SELECT DISTINCT customer_id FROM incremental_orderers_second_half_2024; which gives me the list of customers who came back for an incremental order in the later half of 2024, after having placed their first order in the first half of 2024.
  • We are looking at roughly 275 such customers.

-- Demonstrating CTEs for insights discovery
--- Taking a use-case wherein I want to see, of all the customers who placed their first order in the first half of 2024, how many also came back to place an incremental order in the second half of 2024.

WITH unique_orders AS (
    SELECT DISTINCT order_id, customer_id, order_date
    FROM dummy_orderitem_data
),
cte_orders_sequencing AS (
    SELECT 
        order_id,
        customer_id,
        order_date,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_sequence
    FROM unique_orders
),
-- SELECT * FROM cte_orders_sequencing LIMIT 10; --> This was to check the sample records that got created in cte_orders_sequencing CTE


first_orderers_first_half_2024 AS (
    SELECT order_id, customer_id, order_date
    FROM cte_orders_sequencing
    WHERE order_sequence = 1
      AND order_date BETWEEN '2024-01-01' AND '2024-06-30'
),

incremental_orderers_second_half_2024 AS (
    SELECT order_id, customer_id, order_date
    FROM cte_orders_sequencing
    WHERE order_sequence > 1
      AND order_date BETWEEN '2024-07-01' AND '2024-12-31'
      AND customer_id IN (SELECT customer_id FROM first_orderers_first_half_2024)
)

SELECT DISTINCT customer_id FROM incremental_orderers_second_half_2024;

  • There is also another way of zeroing in on these 275 customers: left joining first_orderers_first_half_2024 with cte_orders_sequencing inside the incremental_orderers_second_half_2024_ver2 CTE. This is just me attempting another way of getting to the same thing. Nothing very special about it, though :)

WITH unique_orders AS (
    SELECT DISTINCT order_id, customer_id, order_date
    FROM dummy_orderitem_data
),
cte_orders_sequencing AS (
    SELECT 
        order_id,
        customer_id,
        order_date,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_sequence
    FROM unique_orders
),
-- SELECT * FROM cte_orders_sequencing LIMIT 10; --> This was to check the sample records that got created in cte_orders_sequencing CTE


first_orderers_first_half_2024 AS (
    SELECT order_id, customer_id, order_date
    FROM cte_orders_sequencing
    WHERE order_sequence = 1
      AND order_date BETWEEN '2024-01-01' AND '2024-06-30'
),

incremental_orderers_second_half_2024 AS (
    SELECT order_id, customer_id, order_date
    FROM cte_orders_sequencing
    WHERE order_sequence > 1
      AND order_date BETWEEN '2024-07-01' AND '2024-12-31'
      AND customer_id IN (SELECT customer_id FROM first_orderers_first_half_2024)
),

incremental_orderers_second_half_2024_ver2 AS (
    SELECT
        fofh2.order_id    AS order_id_fofh2,
        fofh2.customer_id AS customer_id_fofh2,
        fofh2.order_date  AS order_date_fofh2,
        cos.order_id      AS order_id_cos,      -- aliased to avoid duplicate column names in the CTE
        cos.customer_id   AS customer_id_cos,
        cos.order_date    AS order_date_cos,
        cos.order_sequence
    FROM first_orderers_first_half_2024 fofh2
    LEFT JOIN cte_orders_sequencing cos
        ON fofh2.customer_id = cos.customer_id
    WHERE cos.order_sequence > 1
      AND cos.order_date BETWEEN '2024-07-01' AND '2024-12-31'
)

SELECT COUNT(DISTINCT customer_id_fofh2) FROM incremental_orderers_second_half_2024_ver2; -- 275

Guess what: the above query gives me 275 too. I did nothing but a left join between first_orderers_first_half_2024 and cte_orders_sequencing.

This was just a way for me to state that one particular thing can be achieved through various means.

----------------------------------------------------------

I shall be back tomorrow (Feb 01, 2025) to take up COALESCE and EXISTS.

----------------------------------------------------------

The other topics I would like to cover are: COALESCE, EXISTS, ROW_NUMBER (window functions, PARTITION BY), DENSE_RANK, RANK, PIVOT.

Then I would also like to talk about CREATING A TEMPORARY TABLE and then the most conventional jazz: JOINS.
