SQL for Data Engineering: Window Functions and Common Table Expressions (CTEs)

SQL for Data Engineering: Window Functions and Common Table Expressions (CTEs)


This article is inspired by a true story involving one of my close friends. He’s a backend developer, not a database expert, but during a recent interview, he was grilled heavily on SQL. After hearing about his experience, I realized something that might resonate with many of you: the days when SQL knowledge was limited to basic GROUP BY and JOIN operations are long gone. Today, the depth of SQL skills required—especially in data engineering roles—demands much more. If you're preparing for interviews, you’ll need to master more advanced concepts, like window functions and Common Table Expressions (CTEs), to truly stand out.

“In theory, there is no difference between theory and practice. But in practice, there is.”?—?Yogi Berra

Why Focus on Window Functions in?SQL?

As my friend’s story reminded me, SQL interviews aren’t just about basic querying anymore. Window functions, in particular, have become a critical part of data engineering interviews. Whether it’s ranking transactions, calculating rolling metrics, or handling complex partitioning, window functions allow you to perform operations that basic SQL can’t easily handle.

Let’s start by breaking down window functions and why they’re essential in real-world scenarios, especially when working with large-scale data.

What is a Window Function?

A window function is a SQL tool that allows you to perform calculations across a set of rows that are somehow related to the current row. Think of it as looking at a “window” of surrounding rows to compute things like cumulative sums, ranks, or moving averages.

The most common window functions are:

  • LAG: Fetches the value from the previous row.
  • LEAD: Fetches the value from the next row.
  • RANK: Assigns ranks to rows, skipping numbers when there are ties.
  • DENSE_RANK: Similar to RANK but doesn’t skip numbers.
  • ROW_NUMBER: Assigns unique sequential numbers to rows, without ties.

These functions come in handy when dealing with tasks like analyzing customer transactions, calculating running totals, or ranking entries in financial datasets. Now, let’s move into a practical banking example that you might encounter in an interview.

Example: Identifying Top Three Transactions by Amount for Each?Customer

Imagine you’re asked in an interview: “Find the top three largest transactions for each customer in the past year.” Right away, you should recognize that a simple GROUP BY or JOIN won’t work here—you’ll need a window function to rank transactions by amount for each customer.

Here’s how to approach it using the ROW_NUMBER function:

WITH customer_transactions AS (
    SELECT customer_id, transaction_id, transaction_date, amount,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS transaction_rank
    FROM transactions
    WHERE transaction_date >= DATEADD(year, -1, GETDATE())
)
SELECT customer_id, transaction_id, transaction_date, amount
FROM customer_transactions
WHERE transaction_rank <= 3;        

In this query:

  • The PARTITION BY clause divides the data into groups by customer.
  • The ORDER BY clause ranks the transactions based on the amount, from highest to lowest.
  • The ROW_NUMBER() function assigns a unique rank to each transaction for each customer, allowing you to filter out the top three for each.

This example goes beyond basic SQL skills, showcasing how window functions enable you to solve more complex real-world problems?—?something you’ll encounter frequently in interviews and on the job.

Keywords That Hint at Using Window Functions

In a SQL interview, look out for keywords that signal the need for window functions:

  • Rolling (e.g., rolling sum or average of balances)
  • Rank (e.g., top transactions, highest loan amounts)
  • Consecutive (e.g., consecutive late payments)
  • De-duplicate (e.g., identifying unique customer transactions)

For example, a question like “Give me the top five deposits per account over the past six months” is a clear indication that a window function?—?such as RANK or ROW_NUMBER—is required.

Understanding the Components of a Window?Function

Each window function consists of three main components:

  1. Function: This could be something like SUM(), RANK(), or LAG().
  2. OVER() Clause: Defines the window, i.e., the rows across which the function is applied. Without this, it’s just a regular aggregate function. This is where PARTITION BY and ORDER BY come into play.
  3. Optional ROWS Clause: Rarely used but powerful when you need to calculate things like rolling averages or sums over a defined number of rows.

Let’s look at a practical example for a bank that wants to calculate the rolling 30-day balance for each customer’s account:

SELECT customer_id, transaction_date, 
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date 
                         ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_balance
FROM transactions;        

  • ROWS BETWEEN 29 PRECEDING AND CURRENT ROW defines a 30-day window for the balance calculation.
  • The result is a rolling sum of account balances over the last 30 days for each customer, a common requirement in banking data analysis.

Common Table Expressions (CTEs): Your Best Friend for Complex?Queries

CTEs are another key tool in advanced SQL, especially for interviews. A CTE allows you to define a temporary result set that can be referenced within the main query, making your code more readable and maintainable.

WITH cte_name AS (
    SELECT column1, column2
    FROM table
    WHERE condition
)
SELECT *
FROM cte_name
WHERE another_condition;        

Let’s extend our banking example. Suppose you’re asked to identify customers who have made consecutive late payments. Instead of cluttering your query with subqueries, you can simplify it using a CTE:

WITH customer_late_payments AS (
    SELECT customer_id, payment_date, 
           LAG(payment_status) OVER (PARTITION BY customer_id ORDER BY payment_date) AS previous_payment_status
    FROM payments
    WHERE payment_status = 'Late'
)
SELECT customer_id, COUNT(*) AS consecutive_late_payments
FROM customer_late_payments
WHERE previous_payment_status = 'Late'
GROUP BY customer_id;        

In this case, the LAG() function helps identify whether the previous payment was also marked as "Late." This query identifies customers with consecutive late payments, a typical use case in risk management for banks.

When to Use CTEs vs. Subqueries vs. Temporary Tables

A common question that arises is when to use CTEs over subqueries or temporary tables. Here’s a quick rule of thumb:

  • CTEs: Ideal for improving readability and maintainability, especially in big data environments like Spark, Snowflake, or Trino.
  • Subqueries: Useful when you need a single scalar value, such as the total sum of loan amounts or average transaction size.
  • Temporary Tables: Best when you need to reuse intermediate results multiple times across queries, often improving performance in complex pipelines.

For example, if you’re working with millions of financial transactions and need to run multiple calculations, a temporary table could save significant processing time by storing intermediate results that are reused in other queries.


Mastering window functions and CTEs is your secret weapon in SQL interviews. These tools allow you to handle complex tasks like ranking transactions, calculating rolling balances, and identifying consecutive events?—?skills that will set you apart from other candidates.

By mastering these advanced SQL techniques and knowing when to apply them, you will not only excel in interviews but also be prepared for the challenges you will face in real-world data analysis.

#BeIndispensable #SQL

要查看或添加评论,请登录

社区洞察

其他会员也浏览了