SQL for Data Engineering: Window Functions and Common Table Expressions (CTEs)
Shanoj Kumar V
VP - Technology Architect & Data Engineering | AWS | AI & ML | Big Data & Analytics | Digital Transformation Leader | Author
This article is inspired by a true story involving one of my close friends. He’s a backend developer, not a database expert, but during a recent interview, he was grilled heavily on SQL. After hearing about his experience, I realized something that might resonate with many of you: the days when SQL knowledge was limited to basic GROUP BY and JOIN operations are long gone. Today, the depth of SQL skills required—especially in data engineering roles—demands much more. If you're preparing for interviews, you’ll need to master more advanced concepts, like window functions and Common Table Expressions (CTEs), to truly stand out.
“In theory, there is no difference between theory and practice. But in practice, there is.”?—?Yogi Berra
Why Focus on Window Functions in?SQL?
As my friend’s story reminded me, SQL interviews aren’t just about basic querying anymore. Window functions, in particular, have become a critical part of data engineering interviews. Whether it’s ranking transactions, calculating rolling metrics, or handling complex partitioning, window functions allow you to perform operations that basic SQL can’t easily handle.
Let’s start by breaking down window functions and why they’re essential in real-world scenarios, especially when working with large-scale data.
What is a Window Function?
A window function is a SQL tool that allows you to perform calculations across a set of rows that are somehow related to the current row. Think of it as looking at a “window” of surrounding rows to compute things like cumulative sums, ranks, or moving averages.
The most common window functions are:
These functions come in handy when dealing with tasks like analyzing customer transactions, calculating running totals, or ranking entries in financial datasets. Now, let’s move into a practical banking example that you might encounter in an interview.
Example: Identifying Top Three Transactions by Amount for Each?Customer
Imagine you’re asked in an interview: “Find the top three largest transactions for each customer in the past year.” Right away, you should recognize that a simple GROUP BY or JOIN won’t work here—you’ll need a window function to rank transactions by amount for each customer.
Here’s how to approach it using the ROW_NUMBER function:
WITH customer_transactions AS (
SELECT customer_id, transaction_id, transaction_date, amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS transaction_rank
FROM transactions
WHERE transaction_date >= DATEADD(year, -1, GETDATE())
)
SELECT customer_id, transaction_id, transaction_date, amount
FROM customer_transactions
WHERE transaction_rank <= 3;
In this query:
This example goes beyond basic SQL skills, showcasing how window functions enable you to solve more complex real-world problems?—?something you’ll encounter frequently in interviews and on the job.
Keywords That Hint at Using Window Functions
In a SQL interview, look out for keywords that signal the need for window functions:
领英推荐
For example, a question like “Give me the top five deposits per account over the past six months” is a clear indication that a window function?—?such as RANK or ROW_NUMBER—is required.
Understanding the Components of a Window?Function
Each window function consists of three main components:
Let’s look at a practical example for a bank that wants to calculate the rolling 30-day balance for each customer’s account:
SELECT customer_id, transaction_date,
SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_balance
FROM transactions;
Common Table Expressions (CTEs): Your Best Friend for Complex?Queries
CTEs are another key tool in advanced SQL, especially for interviews. A CTE allows you to define a temporary result set that can be referenced within the main query, making your code more readable and maintainable.
WITH cte_name AS (
SELECT column1, column2
FROM table
WHERE condition
)
SELECT *
FROM cte_name
WHERE another_condition;
Let’s extend our banking example. Suppose you’re asked to identify customers who have made consecutive late payments. Instead of cluttering your query with subqueries, you can simplify it using a CTE:
WITH customer_late_payments AS (
SELECT customer_id, payment_date,
LAG(payment_status) OVER (PARTITION BY customer_id ORDER BY payment_date) AS previous_payment_status
FROM payments
WHERE payment_status = 'Late'
)
SELECT customer_id, COUNT(*) AS consecutive_late_payments
FROM customer_late_payments
WHERE previous_payment_status = 'Late'
GROUP BY customer_id;
In this case, the LAG() function helps identify whether the previous payment was also marked as "Late." This query identifies customers with consecutive late payments, a typical use case in risk management for banks.
When to Use CTEs vs. Subqueries vs. Temporary Tables
A common question that arises is when to use CTEs over subqueries or temporary tables. Here’s a quick rule of thumb:
For example, if you’re working with millions of financial transactions and need to run multiple calculations, a temporary table could save significant processing time by storing intermediate results that are reused in other queries.
Mastering window functions and CTEs is your secret weapon in SQL interviews. These tools allow you to handle complex tasks like ranking transactions, calculating rolling balances, and identifying consecutive events?—?skills that will set you apart from other candidates.
By mastering these advanced SQL techniques and knowing when to apply them, you will not only excel in interviews but also be prepared for the challenges you will face in real-world data analysis.
Thanks for sharing??