Learning Window Functions in SQL
What are they? How are they different from aggregate functions? When to use them?

Learning Window Functions in SQL

If you want to get better at SQL and be able to do more complex data analysis, learning window functions is a must. If you've never heard of a window function, here's a breakdown of what they do

No alt text provided for this image

There are several types of window functions. Some of them are more commonly used, like sum(), count(), and avg(). However, there are some advanced functions that let you do some pretty cool things like lead(), lag(), and others.

No alt text provided for this image

Here's a quick example that shows the difference between an aggregate sum() function, and a window sum() function. This data set is from google BigQuery, and includes taxi trips in the city of Chicago. In the data set, I'm pulling in...

  • taxi_id - unique ID for a taxi driver
  • trip_start_timestamp - timestamp value for when a taxi trip started
  • trip_miles - the distance of the taxi trip
  • trip_total - the price of the taxi trip
  • payment_type - whether the trip was paid for by cash or by credit card

The image below just shows the first 3 records for 1 specific taxi_id, but the arrow below the table indicates that there are many more records in the table.

No alt text provided for this image

Here's the SQL for this data...

No alt text provided for this image

For the purposes of this exercise, you can ignore the left() function outside of taxi_id. That function is returning the first 15 characters of the full taxi_id value. Since the full taxi_id value is so long, I want the query results to be more readable. Also, we are ordering these results by the trip_start_timestamp in ascending order.

If I was to do an aggregate sum() function on this data, this is what would be returned from the query. For this specific taxi_id, the total distance of all of their trips is 112,475 miles.

No alt text provided for this image

Here's the SQL for the aggregate sum() function...

No alt text provided for this image

I'm limiting my select statement to the taxi_id and sum(trip_miles), and I'm using a group by statement to aggregate the data.

If I was to do a window sum() function on this data, this is what would be returned from the query.

No alt text provided for this image

The query returns all of the records like we got in our original query, but with an additional column that shows to total_miles value. This value shows for every record.

Here's the SQL for the aggregate sum() function.

No alt text provided for this image

I added all the original fields back into my select statement. Rather than use a group by statement to aggregate the data, I'm using over (partition by ). This is essentially the equivalent of a group_by statement for a windows function. Some window functions allow you to use just over (partition by ), while others require you to use over (partition by order by). We'll see an example of that later on.

Now that we've seen an example of an aggregate sum() function vs. a window sum() function, the next step is to figure out a use case where we may use window functions in real life.

Personally, I work with a lot of product data. I've used window functions in the past to figure out when a user did something important in a product/app, to determine segmentation for communications, and more.

In this case, we'll pretend we own a taxi company and we need to segment our data for a marketing list.

No alt text provided for this image

Let's assume we have a predictive model that shows that taxi drivers that have a taxi trip of at least 10 miles (trip_miles >= 10) have much higher success and much higher retention with our company than taxi drivers who don't have trip that's at least 10 miles.

Once a taxi driver has a trip that is at least 10 miles, we want to figure the distance of their next trip that is at least 10 miles. We'll use the distance from that next trip for our marketing list segmentation. We have 3 buckets for our segmentation...

  • Less than 50 miles: Since we're still looking at trips of at least 10 miles, this really should be 10 - 50 miles
  • 51 - 100 miles: Trips between 51 and 100 miles
  • Over 100 miles: Trips over 100 miles

We'll use 2 window functions in this use case...

  • Rank() - will rank records in ascending order of trip_start_timestamp
  • Lead() - Will pull data from the next record in order based on trip_start_timestamp

The first step is to find a taxi driver's first trip that's at least 10 miles. The data below is looking at a different taxi driver than our example above.

No alt text provided for this image

If you look at the row number all the way to the left, you see that it took this taxi driver 119 trips to finally get a 10 miles trip. Their first trip of at least 10 miles had a distance of 17.1 miles. Their next trip after that was 15.2 miles.

Here's the SQL...

No alt text provided for this image

This is the same query we used before, except this is a different taxi_id.

The second step is to limit the data set for this taxi driver, to only show their trips that are at least 10 miles.

No alt text provided for this image

If you look at the row number on the left now, you see the data starts with the first 10 mile trip (17.1 miles to be exact). It's followed by their 15.2 mile trip, a well as more records after that.

Here's the SQL...

No alt text provided for this image

In this query, we've added a condition to the WHERE clause. Not only are we looking at one taxi driver, but now we're also looking only at trips with a trip_miles value >= 10.

The third step is to add the rank() and lead() values to our data.

No alt text provided for this image

The rank() value will rank the records based on the time_start_timestamp. The trip_start_timestamp values are in ascending order, and so are the rank() values.

The lead() values are pulling the trip_miles value from the next record. Since we want the next 10 mile trip after the first 10 mile trip, lead() will help us get the correct data. Look at the red arrow to get a better idea of what lead() does.

Here's the SQL for this step...

No alt text provided for this image

In this case, both rank() and lead() use over (partition by order by). We want to pull the rank() and lead() values for each taxi_id. When we add more taxi drivers to our data (in the final step), we want to reset our rank() and lead() values. Using over (partition by order by) lets us do that.

In the forth step, we'll get our final data for this specific taxi driver and do the list segmentation.

No alt text provided for this image

We have 1 row for this taxi driver, with their first 10 mile trip (trip_miles = 17.1) as well as their next 10 mile trip (trip_miles_next_record = 15.2). The last column, marketing_list_segment, is based on the trip_miles_next_record value.

Here's the SQL...

No alt text provided for this image

For step 4, we now put the SQL SELECT statement we had before into a sub query, and doing a new SELECT statement in an outer query. In the outer query, we're taking everything from our original query (SELECT *), and adding a case statement that will do the marketing list segmentation. Finally, we have a WHERE clause on the outer query that will take the record we need from the subquery (rank_atleast_10_mile = 1). Note that this is the same record as the first row in the step 3 results.

NOTE: You may see this query and ask "Rather using lead() and using the rank_atleast_10_mile = 1 to get the trip_mile_next_record value for the marketing segmentation, couldn't I just get the record where rank_atleast_10_mile = 2, and do the marketing segmentation off the trip_miles value? Technically, yes you could. However, sometimes rank data can get weird if you're ordering by a field with the same exact value (ie. 2+ records in a row with the same exact timestamp). Also, the purpose of this is to show more functionality of window functions.

For the final step, we'll get the segmentation data we need for all taxi drivers.

No alt text provided for this image

You can see that we now have a lot more records, with different taxi_id values, as well a different marketing_list segment values.

Here's the SQL for the final step...

No alt text provided for this image

The only change here from step 4 is that the WHERE condition that limits the data to 1 specific taxi_id is removed.

Possible Next Steps

At this point, we would take this data, append email addresses to each taxi_id, and send the data off to our marketing team. That could be done via a no-code tools like zapier, or possibly using a direct feed via an API. I may write an article on how to do that, but for now, I'll focus this article strictly on using window functions.

Recap

In this article, you learned what window functions are, how they different from aggregate functions, and how they can be used in a real life scenario. Hopefully you found this helpful, and can take some of this knowledge forward in your work. If you have any questions on this, feel free to message me on LinkedIn or send me an email at [email protected].

I'll be sharing more articles like this on LinkedIn, as well as through a newsletter. The newsletter will have additional content like video walkthroughs, my thoughts on navigating an analytics career, and more. If you'd like to subscribe to that, please fill out the form at the link below. I send high quality content once a week/once every 2 weeks, without spamming your inbox. Thanks!

Subscribe to the Madapup newsletter

要查看或添加评论,请登录

Matt Pupa的更多文章

  • Maxima and Design Sprints

    Maxima and Design Sprints

    One shortcoming of design sprints is that the resulting prototypes can sometimes be underwhelming. A team may start a…

社区洞察

其他会员也浏览了