Learning Window Functions in SQL
If you want to get better at SQL and be able to do more complex data analysis, learning window functions is a must. If you've never heard of a window function, here's a breakdown of what they do
There are several types of window functions. Some of them are more commonly used, like sum(), count(), and avg(). However, there are some advanced functions that let you do some pretty cool things like lead(), lag(), and others.
Here's a quick example that shows the difference between an aggregate sum() function, and a window sum() function. This data set is from google BigQuery, and includes taxi trips in the city of Chicago. In the data set, I'm pulling in...
The image below just shows the first 3 records for 1 specific taxi_id, but the arrow below the table indicates that there are many more records in the table.
Here's the SQL for this data...
For the purposes of this exercise, you can ignore the left() function outside of taxi_id. That function is returning the first 15 characters of the full taxi_id value. Since the full taxi_id value is so long, I want the query results to be more readable. Also, we are ordering these results by the trip_start_timestamp in ascending order.
If I was to do an aggregate sum() function on this data, this is what would be returned from the query. For this specific taxi_id, the total distance of all of their trips is 112,475 miles.
Here's the SQL for the aggregate sum() function...
I'm limiting my select statement to the taxi_id and sum(trip_miles), and I'm using a group by statement to aggregate the data.
If I was to do a window sum() function on this data, this is what would be returned from the query.
The query returns all of the records like we got in our original query, but with an additional column that shows to total_miles value. This value shows for every record.
Here's the SQL for the aggregate sum() function.
I added all the original fields back into my select statement. Rather than use a group by statement to aggregate the data, I'm using over (partition by ). This is essentially the equivalent of a group_by statement for a windows function. Some window functions allow you to use just over (partition by ), while others require you to use over (partition by order by). We'll see an example of that later on.
Now that we've seen an example of an aggregate sum() function vs. a window sum() function, the next step is to figure out a use case where we may use window functions in real life.
Personally, I work with a lot of product data. I've used window functions in the past to figure out when a user did something important in a product/app, to determine segmentation for communications, and more.
In this case, we'll pretend we own a taxi company and we need to segment our data for a marketing list.
Let's assume we have a predictive model that shows that taxi drivers that have a taxi trip of at least 10 miles (trip_miles >= 10) have much higher success and much higher retention with our company than taxi drivers who don't have trip that's at least 10 miles.
Once a taxi driver has a trip that is at least 10 miles, we want to figure the distance of their next trip that is at least 10 miles. We'll use the distance from that next trip for our marketing list segmentation. We have 3 buckets for our segmentation...
We'll use 2 window functions in this use case...
The first step is to find a taxi driver's first trip that's at least 10 miles. The data below is looking at a different taxi driver than our example above.
If you look at the row number all the way to the left, you see that it took this taxi driver 119 trips to finally get a 10 miles trip. Their first trip of at least 10 miles had a distance of 17.1 miles. Their next trip after that was 15.2 miles.
Here's the SQL...
领英推荐
This is the same query we used before, except this is a different taxi_id.
The second step is to limit the data set for this taxi driver, to only show their trips that are at least 10 miles.
If you look at the row number on the left now, you see the data starts with the first 10 mile trip (17.1 miles to be exact). It's followed by their 15.2 mile trip, a well as more records after that.
Here's the SQL...
In this query, we've added a condition to the WHERE clause. Not only are we looking at one taxi driver, but now we're also looking only at trips with a trip_miles value >= 10.
The third step is to add the rank() and lead() values to our data.
The rank() value will rank the records based on the time_start_timestamp. The trip_start_timestamp values are in ascending order, and so are the rank() values.
The lead() values are pulling the trip_miles value from the next record. Since we want the next 10 mile trip after the first 10 mile trip, lead() will help us get the correct data. Look at the red arrow to get a better idea of what lead() does.
Here's the SQL for this step...
In this case, both rank() and lead() use over (partition by order by). We want to pull the rank() and lead() values for each taxi_id. When we add more taxi drivers to our data (in the final step), we want to reset our rank() and lead() values. Using over (partition by order by) lets us do that.
In the forth step, we'll get our final data for this specific taxi driver and do the list segmentation.
We have 1 row for this taxi driver, with their first 10 mile trip (trip_miles = 17.1) as well as their next 10 mile trip (trip_miles_next_record = 15.2). The last column, marketing_list_segment, is based on the trip_miles_next_record value.
Here's the SQL...
For step 4, we now put the SQL SELECT statement we had before into a sub query, and doing a new SELECT statement in an outer query. In the outer query, we're taking everything from our original query (SELECT *), and adding a case statement that will do the marketing list segmentation. Finally, we have a WHERE clause on the outer query that will take the record we need from the subquery (rank_atleast_10_mile = 1). Note that this is the same record as the first row in the step 3 results.
NOTE: You may see this query and ask "Rather using lead() and using the rank_atleast_10_mile = 1 to get the trip_mile_next_record value for the marketing segmentation, couldn't I just get the record where rank_atleast_10_mile = 2, and do the marketing segmentation off the trip_miles value? Technically, yes you could. However, sometimes rank data can get weird if you're ordering by a field with the same exact value (ie. 2+ records in a row with the same exact timestamp). Also, the purpose of this is to show more functionality of window functions.
For the final step, we'll get the segmentation data we need for all taxi drivers.
You can see that we now have a lot more records, with different taxi_id values, as well a different marketing_list segment values.
Here's the SQL for the final step...
The only change here from step 4 is that the WHERE condition that limits the data to 1 specific taxi_id is removed.
Possible Next Steps
At this point, we would take this data, append email addresses to each taxi_id, and send the data off to our marketing team. That could be done via a no-code tools like zapier, or possibly using a direct feed via an API. I may write an article on how to do that, but for now, I'll focus this article strictly on using window functions.
Recap
In this article, you learned what window functions are, how they different from aggregate functions, and how they can be used in a real life scenario. Hopefully you found this helpful, and can take some of this knowledge forward in your work. If you have any questions on this, feel free to message me on LinkedIn or send me an email at [email protected].
I'll be sharing more articles like this on LinkedIn, as well as through a newsletter. The newsletter will have additional content like video walkthroughs, my thoughts on navigating an analytics career, and more. If you'd like to subscribe to that, please fill out the form at the link below. I send high quality content once a week/once every 2 weeks, without spamming your inbox. Thanks!