SQL Wizardry — From Hiatus to Mastery | The Snowflake Way
Shivam Dutt Sharma
Developing ?????? (A Conversational AI) | Volunteer for UNICEF Initiatives | Runner | Blogger | Technical Product Manager | Data Science | Python | SAFe 5.0 PO/PM
I bet this has happened with most of us. We learnt something back in school, college, or even later in some or the other educational setting, and then completely lost touch with it (there can be several factors why that happens though?—?that knowledge wasn’t applied or revisited frequently and hence it faded away, or, if the information was memorized without a deep understanding of its concepts or relevance, it is less likely to be retained, and more such reasons).
One such skill for me has been SQL. I picked it up really quick back in college (so much so, that my RDBMS Professor was quite impressed and suggested taking it up full-time post my graduation. Data Analyst jobs were a trending sensation back then).
Well, I must say my Professor was quite a prophecy-er ?? ?? While I didn’t take up SQL immediately after my graduation, however, it was very recent where I did a project for an Ex-employer of mine (seeking trends and to query some meaningful insights out of the raw/crude order (transactional) data), and we used SQL for the same. The platform we worked on, was Snowflake.
The way you will approach SQL can change significantly depending upon the database you are using or the platform, you are running your SQL on.
In our case, as I said, we used Snowflake. For those of you who may not know (just in case)?—?Snowflake is a cloud-based data platform designed for data warehousing, analytics, and sharing. It is built on a cloud infrastructure (e.g., AWS, Azure, or GCP) and allows businesses to store, manage, and analyze large amounts of data efficiently.
Right at this moment, a really important question to ask is?:-
What SQL does Snowflake use?
Snowflake actually uses a version of ANSI SQL (American National Standards Institute SQL). It supports most standard SQL commands and provides additional extensions tailored for cloud-based analytics and data warehousing.
Some unique features include:-
Alright, now?—?let’s get to business.
Today, I wanted to share some of my favorite functions / topics / concepts in ANSI SQL (Snowflake), that I was hands-on with, in the project that I mentioned above. However, before I get to that, I really want to mention that there was a long hiatus before I actually resumed SQL, when asked to lead this project, and sort of felt really content seeing how good things can be turned around, when done with focus and perseverance.
And, also before I explain each of those concepts and the right syntactical / logical approach towards using each, it is important that we create some dummy data first so as to be able to demonstrate how each of those concepts / functions are applied to any data.
Creating some dummy data [table = DUMMY_ORDERITEM_DATA]
Step 01?: Create a dummy table using CREATE TABLE statement to define the table’s schema. (I may want to create a very transactional data (like they have in e-commerce companies and keep the orders’ data at an order-item level so as to have some decent granularity, to be able to use some of the window functions in their quintessential capacity and strength, as and when I come to demonstrating them)).
Here?—?I have just created a table (a schema for now though) called DUMMY_ORDERITEM_DATA.? We are yet to insert dummy data, if you will. Let’s get to that.
Inserting some dummy data [table = DUMMY_ORDERITEM_DATA]
-- Creating a sequence table with numbers from 1 to 10 (assuming max 10 items per order)
WITH seq AS (
SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) AS num FROM TABLE(GENERATOR(ROWCOUNT => 10))
),
orders AS (
SELECT
'ORD_' || (10000 + SEQ4()) AS order_id,
UNIFORM(1, 6, RANDOM()) AS num_items,
'Customer_' || SEQ4() AS customer_name,
UNIFORM(18, 65, RANDOM()) AS customer_age,
'CUST_' || (10000 + SEQ4()) AS customer_id,
CURRENT_DATE - UNIFORM(0, 365, RANDOM()) AS order_date,
CASE UNIFORM(0, 2, RANDOM())
WHEN 0 THEN 'Male'
WHEN 1 THEN 'Female'
ELSE 'Non-Binary'
END AS customer_gender
FROM TABLE(GENERATOR(ROWCOUNT => 1000))
),
items AS (
SELECT
o.order_id,
'ORD_' || o.order_id || '_ITEM_' || ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY RANDOM()) AS item_id,
UNIFORM(1, 10, RANDOM()) AS order_item_qty,
o.customer_name,
o.customer_age,
o.customer_id,
o.order_date,
o.customer_gender,
ARRAY_CONSTRUCT('Electronics', 'Clothing', 'Books', 'Home', 'Toys')[UNIFORM(0, 5, RANDOM())] AS order_product_type,
ARRAY_CONSTRUCT('Red', 'Blue', 'Green', 'Black', 'White')[UNIFORM(0, 5, RANDOM())] AS order_product_colour,
CASE UNIFORM(0, 3, RANDOM())
WHEN 0 THEN 'Male'
WHEN 1 THEN 'Female'
WHEN 2 THEN 'Unisex'
ELSE 'Kids'
END AS order_product_gender,
UNIFORM(100, 5000, RANDOM()) AS order_item_net_total_inr,
ARRAY_CONSTRUCT('Online', 'In-Store', 'Mobile')[UNIFORM(0, 3, RANDOM())] AS order_sales_channel
FROM orders o
CROSS JOIN seq
WHERE seq.num <= o.num_items -- Ensuring only `num_items` rows are created per order
)
And?—?let’s look at how the data looks like.? Let’s do a select * from?…. (the most famous jazz)?:p
Actually, in the above screenshot I ended up creating two CTEs for orders and items dummy data. However, I did not really insert the data inside the DUMMY_ORDERITEM_DATA table.
Let me just do that here, again!
INSERT INTO dummy_orderitem_data
WITH seq AS (
SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) AS num FROM TABLE(GENERATOR(ROWCOUNT => 10))
),
orders AS (
SELECT
'ORD_' || (10000 + SEQ4()) AS order_id,
UNIFORM(1, 6, RANDOM()) AS num_items, -- Random number of items per order
'Customer_' || UNIFORM(1, 300, RANDOM()) AS customer_name, -- Allow repeat customers
UNIFORM(18, 65, RANDOM()) AS customer_age,
'CUST_' || (10000 + UNIFORM(1, 300, RANDOM())) AS customer_id, -- Repeat customer IDs
CURRENT_DATE - UNIFORM(0, 365, RANDOM()) AS order_date,
CASE UNIFORM(0, 2, RANDOM())
WHEN 0 THEN 'Male'
WHEN 1 THEN 'Female'
ELSE 'Non-Binary'
END AS customer_gender
FROM TABLE(GENERATOR(ROWCOUNT => 2000)) -- Increased row count to allow duplicate customers
),
items AS (
SELECT
o.order_id,
'ORD_' || o.order_id || '_ITEM_' || ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY RANDOM()) AS item_id,
UNIFORM(1, 10, RANDOM()) AS order_item_qty,
o.customer_name,
o.customer_age,
o.customer_id,
o.order_date,
o.customer_gender,
ARRAY_CONSTRUCT('Electronics', 'Clothing', 'Books', 'Home', 'Toys')[UNIFORM(0, 5, RANDOM())] AS order_product_type,
ARRAY_CONSTRUCT('Red', 'Blue', 'Green', 'Black', 'White')[UNIFORM(0, 5, RANDOM())] AS order_product_colour,
CASE UNIFORM(0, 3, RANDOM())
WHEN 0 THEN 'Male'
WHEN 1 THEN 'Female'
WHEN 2 THEN 'Unisex'
ELSE 'Kids'
END AS order_product_gender,
UNIFORM(100, 5000, RANDOM()) AS order_item_net_total_inr,
ARRAY_CONSTRUCT('Online', 'In-Store', 'Mobile')[UNIFORM(0, 3, RANDOM())] AS order_sales_channel
FROM orders o
CROSS JOIN seq
WHERE seq.num <= o.num_items -- Ensuring only `num_items` rows are created per order
)
SELECT * FROM items;
One thing that I want to highlight above is that when I did ‘Customer_’ || UNIFORM(1, 300, RANDOM()) AS customer_name, I ensured that for a customer there are multiple orders.
This is because, earlier I had kept it as ‘Customer_’ || SEQ4() AS customer_name, which was leading to a very peculiar problem. The problem being that when I tried to perform some order sequencing on the data via a ROW_NUMBER() function, it gave me a messed up data in the end. ? For instance, I got something in a way where I could see a customer showing the same order ID with different order sequences (numbers).
The following is what the select * from items will give!
领英推荐
Now let us get down to the concepts I was talking about?:-
The very first concept I’d like to talk about is a CTE or a Common Table Expression.
CTE?—?Common Table Expression
If you have been a regular SQL’er, I can imagine that you must be playing around with a lot of joins considering most of the organizations who use SQL databases, use a Relational one (RDBMS) with data being queried from across a myriad tables. Haven’t you often felt that in your pursuit of applying joins, you rather end up creating a mess up in your head and lose the train of where the joins started from and where they led to.
I think, be it any analysis?—?retaining the chronology of all the steps that go in your analysis, is the key. CTEs is the way, you actually can draw a chronology in your overall data analysis where you are using SQL to fetch insights out of your raw / crude data.
Also one should know that?—?CTEs (Common Table Expressions) are temporary. They only exist for the duration of the query in which they are used. Once the query execution is complete, the CTE is discarded and does not store data permanently.
Syntactically speaking,
You define a CTE through the following method?:-
WITH CTE_NAME AS (?…………)
And how can you create multiple CTEs in a single session are?:- CTE1 CTE2 CTE3 and then here?—?—?—?-at this level?—?—?—?-you go ahead marry the CTEs together, joining all of them. Now you know how would you use CTEs.
Some advantages of CTEs?
<?—?—?—?—?—?—?—?—?—?— Demonstration for CTEs?—?— -?—?—?—?—?—?—?—?→
So, today is Jan 31, 2025; and am back demonstrating CTEs as promised. How is it going?
As you will see in the code below, am trying to create an analytical use-case, wherein am trying to come up with all those customers who placed their first order in the first half of 2024 and came back for an incremental order in the second half of 2024.
Here, I have used the concept of CTEs as such?:-
-- Demonstrating CTEs for insights discovery
--- Taking a usecase wherein I want to see that of all the customers who placed their first order in the first half of 2024, how many also came back to place any incremental order within the next 6 months from the first order date.
WITH unique_orders AS (
SELECT DISTINCT order_id, customer_id, order_date
FROM dummy_orderitem_data
),
cte_orders_sequencing AS (
SELECT
order_id,
customer_id,
order_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_sequence
FROM unique_orders
),
-- SELECT * FROM cte_orders_sequencing LIMIT 10; --> This was to check the sample records that got created in cte_orders_sequencing CTE
first_orderers_first_half_2024 AS (
SELECT order_id, customer_id, order_date from cte_orders_sequencing where order_sequence = 1 AND ORDER_DATE BETWEEN '2024-01-01' AND '2024-06-30'
),
incremental_orderers_second_half_2024 AS (
SELECT order_id, customer_id, order_date from cte_orders_sequencing where order_sequence > 1 AND ORDER_DATE BETWEEN '2024-07-01' AND '2024-12-31' AND customer_id IN (select customer_id from first_orderers_first_half_2024)
)
SELECT distinct customer_id from incremental_orderers_second_half_2024;
WITH unique_orders AS (
SELECT DISTINCT order_id, customer_id, order_date
FROM dummy_orderitem_data
),
cte_orders_sequencing AS (
SELECT
order_id,
customer_id,
order_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_sequence
FROM unique_orders
),
-- SELECT * FROM cte_orders_sequencing LIMIT 10; --> This was to check the sample records that got created in cte_orders_sequencing CTE
first_orderers_first_half_2024 AS (
SELECT order_id, customer_id, order_date from cte_orders_sequencing where order_sequence = 1 AND ORDER_DATE BETWEEN '2024-01-01' AND '2024-06-30'
),
incremental_orderers_second_half_2024 AS (
SELECT order_id, customer_id, order_date from cte_orders_sequencing where order_sequence > 1 AND ORDER_DATE BETWEEN '2024-07-01' AND '2024-12-31' AND customer_id IN (select customer_id from first_orderers_first_half_2024)
),
incremental_orderers_second_half_2024_ver2 AS (
SELECT fofh2.order_id, fofh2.customer_id as customer_id_fofh2, fofh2.order_date, cos.order_id, cos.customer_id as customer_id_cos, cos.order_date, cos.order_sequence from first_orderers_first_half_2024 fofh2 left join cte_orders_sequencing cos ON fofh2.customer_id = cos.customer_id where cos.order_sequence > 1 AND cos.ORDER_DATE BETWEEN '2024-07-01' AND '2024-12-31'
)
SELECT COUNT(distinct customer_id_fofh2) from incremental_orderers_second_half_2024_ver2; #275
Guess what?—?the above query is giving me 275 too.? I did nothing but a left join between first_orderers_first_half_2024 and incremental_orderers_second_half_2024_ver2.
This was just a way for me to state that one particular thing can be achieved through various means.
—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?— —
I shall be back tomorrow (Feb 01, 2025) to take up COALESCE and EXISTS
—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?—?— -
The other topics I would like to cover about, are?: COALESCE, EXISTS, ROW NUMBER (WINDOWS FUNCTIONS, PARTITION), DENSE RANK, RANK, PIVOT.
Then I would also like to talk about CREATING A TEMPORARY TABLE? AND THEN THE MOST CONVENTIONAL JAZZ — JOINS.