Efficient count distincts in dashboards
Have you ever wanted to build a dashboard with a count distinct metric? Maybe you want to display how many customers bought in a category. If you have a lot of data, you probably started by creating a table with an aggregation, say category, count(distinct customer_id). But then maybe users also wanted the total across all categories, so you added that as an extra row. If you had multiple dimensions, you might have used a cube aggregation to get the different totals. But what if you want users to be able to define custom groups by combining different categories and see how many distinct customers bought in those? Now what do you do? You can't just add the per-category numbers together: customers may have bought from multiple categories and you'd be double counting them. So you probably went back to reading from the complete table, but running count distincts over a lot of data gets slow. Then you might have looked into approx count distinct.
Both Databricks and Snowflake offer functions for this (as always, Databricks gives you extra control with the relativeSD parameter while Snowflake keeps things simple; the extra control is nice if you understand what you're doing).
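To make the double-counting trap and the approximate alternative concrete, here's a quick sketch against a hypothetical sales table with category and customer_id columns (table and column names are made up; the relativeSD argument is the Databricks form):
-- The naive pre-aggregation: fine per category and for a grand total,
-- but you cannot add two rows together, because a customer who bought
-- in both categories would be counted twice.
select category, count(distinct customer_id) as distinct_customers
from sales
group by category;

-- The approximate version straight on the raw data.
-- relativeSD = 0.01 targets roughly 1% relative error (Databricks syntax).
select category, approx_count_distinct(customer_id, 0.01) as distinct_customers
from sales
group by category;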
HyperLogLog
If you read the descriptions for either of these functions you'll see they mention HyperLogLog. Now what the hell is HyperLogLog? It's some fancy maths that lets you aggregate count distincts as long as you can live with a small error.
If you really want to read more about it, here are some links, but for the rest of this article we'll just assume it works.
Approx count distinct can give you a nice speed boost, but you still have to keep (and scan) all the raw data. Wouldn't it be nice if you could aggregate your table like before?
The basic idea when working with HyperLogLog is that you need a function to create a sketch, a function to combine sketches, and a final function to turn a sketch into a count distinct estimate. Approx count distinct does all of that internally, but it wasn't always exposed. When I first ran into this Databricks article looking for a solution it seemed great, but the functions we'll talk about today weren't available yet. You could use the libraries in Spark to generate a table, but then how could I use the functions from Tableau? I was determined to get it to work, so I created some jars wrapping the Twitter Algebird implementation into functions I could register in Hive. That worked as long as my dashboard ran against an all-purpose cluster in Databricks, but it was a lot of work and there was no support for registering jars in the SQL warehouse. Now you can do it easily, so enjoy!
Create a sketch (do this when creating your aggregated table)
Databricks: use hll_sketch_agg like you would any other aggregation, except the output will be a sketch. Here you can specify lgConfigK (a bigger number takes longer to compute and uses more storage, but the error goes down).
Snowflake: use hll_accumulate the same way, but Snowflake has no parameter you can set to improve accuracy (it's complex and they like to keep things simple...).
Aggregate the sketches and get an estimate (do this at report time)
Databricks: use hll_union_agg and then hll_sketch_estimate.
Snowflake: use hll_combine and then hll_estimate.
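Before we walk through a full Databricks example below, here's a minimal Snowflake sketch of both steps against a hypothetical sales table (database, schema, table and column names are all made up):
-- Step 1 (Snowflake): build the aggregated table with one sketch per group
create or replace table my_db.my_schema.sales_hll as
select
  category,
  hll_accumulate(customer_id) as customers_sketch
from my_db.my_schema.sales
group by category;

-- Step 2 (Snowflake): at report time, combine the sketches and estimate
-- the distinct count across whatever categories the user picked
select
  hll_estimate(hll_combine(customers_sketch)) as distinct_customers
from my_db.my_schema.sales_hll
where category in ('Category A', 'Category B');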
An example implementation
To do a proper example we need a big table, because hll is all about performance. For my setup, using a tiny cluster, I need about 500 million rows to really show a good difference. We're using Databricks for this example, but with the reference above you can convert the functions to Snowflake, or with a little more work to any other hll implementation.
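The demo table itself isn't the point, but if you want something comparable to play with, a generator along these lines works (the values are made up, the group count won't exactly match the numbers quoted below, and you can lower the row count if your cluster is small):
-- Hypothetical generator for a 500M-row demo table
create or replace table alexis.test.one_table_big as
select
  date_add('2024-01-01', cast(rand() * 365 as int)) as transaction_date,
  concat('Product ', cast(floor(rand() * 5) + 1 as string)) as product_name,
  'demo product description' as product_description,
  'demo user description' as user_description,
  rand() * 100 as total_value,
  uuid() as id
from range(500000000);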
Step 1: Create the sketches for our aggregate table.
create or replace table alexis.test.one_table_hll_big as
select
  transaction_date,
  product_name,
  product_description,
  user_description,
  sum(total_value) as total_value,
  hll_sketch_agg(id, 12) as transactions_sketch
from alexis.test.one_table_big
group by all;
With the demo data, one_table_big contains 500M rows but one_table_hll_big only contains 900 rows (that's where the massive performance gain comes from).
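If you want to convince yourself the error really is small, a quick sanity check compares the exact and estimated counts for a single product:
-- Exact vs. estimated distinct count for one product
select
  (select count(distinct id)
   from alexis.test.one_table_big
   where product_name = 'Product 1') as exact_count,
  (select hll_sketch_estimate(hll_union_agg(transactions_sketch))
   from alexis.test.one_table_hll_big
   where product_name = 'Product 1') as estimated_count;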
Step 2: Let's compare the performance on the two tables
For our example we will filter on two products because, if you remember the start of the article, that's exactly the case traditional aggregation techniques can't handle.
Let's start with the regular count distinct on the table containing all of the data.
select product_name, count(distinct id) from alexis.test.one_table_big
where product_name in ('Product 3', 'Product 1')
group by all;
The performance isn't that bad considering how little compute we're using: it finishes in about 14 seconds. Still a little slow for dashboarding! We could speed things up with more compute, but that kind of thinking eventually leads to a big bill!
Let's try to do a bit better with approx count distinct.
select product_name, approx_count_distinct(id) from alexis.test.one_table_big
where product_name in ('Product 3', 'Product 1')
group by all;
It was a little faster, but it still took 12 seconds because there's so much data to read.
Now let's take advantage of our 900 row table to do some magic.
select
  product_name,
  hll_sketch_estimate(hll_union_agg(transactions_sketch)) as distinct_transactions
from alexis.test.one_table_hll_big
where product_name in ('Product 3', 'Product 1')
group by all;
This only took 1 second. It's so fast because after filtering we only have to process 600 rows. HyperLogLog might be some fancy maths, but someone already did the work for us, and running it over 600 rows is a lot faster than retrieving 333 million rows and doing a count distinct. We could shave off the remaining latency with a little more compute and lower-latency storage, like Premium Block Blob Storage Accounts in Azure or the Amazon S3 Express One Zone storage class in AWS, which we can easily afford because our aggregated table is so much smaller. Now our users can have a report with complete freedom to filter dimensions and get count distincts like any other regular metric such as a sum or an average! Happy dashboarding!
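And because sketches union cleanly, the custom-group question from the introduction becomes trivial: leave product_name out of the grouping and you get a single combined count across the selected products.
-- One combined distinct count across the selected products
select
  hll_sketch_estimate(hll_union_agg(transactions_sketch)) as distinct_transactions
from alexis.test.one_table_hll_big
where product_name in ('Product 3', 'Product 1');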
Bonus tip
If you want to set up a simple dashboard in a Databricks notebook to try this, a widget can be nice.
Databricks has great documentation on how to create and work with widgets.
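For reference, creating the widget looks roughly like this (a sketch only; double-check the widget syntax for your runtime version in the documentation above, or create it from Python with dbutils.widgets.multiselect):
-- Create a multiselect widget named "products" (syntax may vary by runtime)
create widget multiselect products default "Product 1" choices
select distinct product_name from alexis.test.one_table_hll_big;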
In our case we want a multiselect widget of strings. With the widget in place, you would think using it like this would work:
select
  product_name
from alexis.test.one_table_hll_big
where product_name in (${products})
Unfortunately it's not that simple, but I figured it out so you don't have to:
select
  product_name
from alexis.test.one_table_hll_big
where array_contains(split("${products}", ","), product_name)
Well, at least you don't have to create account-wide filters and beg for permission to do so like in Snowsight (Snowflake).