Efficient count distincts in dashboards

Have you ever wanted to make a dashboard with a count distinct metric? Maybe you want to display how many customers bought in a category. If you have a lot of data, you probably started by creating a table with an aggregation, say category, count(distinct customer_id). But then maybe users also wanted the total across all categories, so you added that as an extra row. If you had multiple dimensions, you might have used a cube aggregation to get the different totals available. But what if you want users to be able to define custom groups by combining different categories and see how many distinct customers bought in those? Now what do you do? You can't just add the categories together, because customers can have bought from multiple categories and you'd be double counting. So you probably went back to reading from the complete table, but running count distincts on a lot of data gets slow. Then you might have looked into approx count distinct.

Both Databricks and Snowflake offer functions for this (as always, Databricks gives you extra control with the relativeSD parameter while Snowflake keeps things simple, which is nice if you understand what you're doing).

https://docs.snowflake.com/en/sql-reference/functions/approx_count_distinct

https://docs.databricks.com/en/sql/language-manual/functions/approx_count_distinct.html
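To give an idea of how drop-in these functions are, here's a minimal sketch in Databricks SQL. The catalog, schema, table, and column names are made up for illustration, and 0.02 is the optional relativeSD parameter mentioned above.

-- Hypothetical example: estimate distinct customers per category.
-- 0.02 is the optional relativeSD precision parameter (Databricks only).
select
 category,
 approx_count_distinct(customer_id, 0.02) as distinct_customers
from my_catalog.my_schema.purchases
group by all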

HyperLogLog

If you read the descriptions for either of these functions you'll see they mention HyperLogLog. Now what the hell is HyperLogLog? It's some fancy maths that allows aggregating count distincts, as long as you can live with a small error.

If you really want to read more about it, here are some links, but we'll just assume it works.

https://en.wikipedia.org/wiki/HyperLogLog

https://www.databricks.com/blog/2019/05/08/advanced-analytics-with-apache-spark.html

The approx count distinct functions can give you a nice speed boost, but you still have to keep and scan all the data. Wouldn't it be nice if you could aggregate your table like before?

The basic idea when working with HyperLogLog is that you need a function to create a sketch, a function to combine sketches, and a final function to transform a sketch into a count distinct estimate. Approx count distinct has all of that internally, but it wasn't always exposed. When I first ran into this Databricks article looking for a solution it seemed so great, but the functions we'll talk about today weren't yet available. I could use the libraries in Spark to generate a table, but then how could I use the functions from Tableau? I was determined to get it to work, so I created some jars wrapping the Twitter Algebird implementation into functions I could register in Hive. This worked as long as I ran my dashboard against a general purpose cluster in Databricks, but it was a lot of work and there was no support for registering jars in the SQL warehouse. Now you can do it easily, so enjoy!

Create a sketch (do this when creating your aggregated table)

Databricks: use hll_sketch_agg like you would any other aggregation, except the output will be a sketch. Here you can specify lgConfigK (a bigger number will take longer to compute and use more storage, but the error will go down).

Snowflake: use hll_accumulate the same way, but Snowflake has no parameter you can set to improve accuracy (it's complex and they like to keep things simple...).
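For illustration, here's a minimal Snowflake sketch of this step. The database, schema, table, and column names are assumptions for the example, not anything from the article.

-- Hypothetical Snowflake example: store one sketch per category.
create or replace table my_db.my_schema.purchases_hll as
select
 category,
 hll_accumulate(customer_id) as customers_sketch
from my_db.my_schema.purchases
group by category;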

Aggregate the sketches and get an estimate (do this at report time)

Databricks: Use hll_union_agg and then hll_sketch_estimate.

Snowflake: hll_combine and then hll_estimate.
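Continuing the hypothetical Snowflake example from above, report time would look something like this:

-- Combine the per-category sketches and turn the result into an estimate.
select
 hll_estimate(hll_combine(customers_sketch)) as distinct_customers
from my_db.my_schema.purchases_hll
where category in ('Category A', 'Category B');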

An example implementation

To do a proper example, we need a big table because HLL is all about performance. For my setup using a tiny cluster, I need about 500 million rows to really show a good difference. We're using Databricks for this example, but you can swap the functions using the references above to convert it to Snowflake, or with a little more work, to any other HLL implementation.
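If you want to follow along, here's a minimal sketch of how you could generate a demo table of roughly this shape with the range() table function. The column derivations and cardinalities are assumptions for illustration, not the exact data behind the timings below.

-- Hypothetical demo data: ~500M rows with low-cardinality dimensions.
create or replace table alexis.test.one_table_big as
select
 id,
 date_add('2024-01-01', cast(id % 30 as int)) as transaction_date,
 concat('Product ', cast(id % 3 + 1 as string)) as product_name,
 concat('Description of product ', cast(id % 3 + 1 as string)) as product_description,
 concat('User segment ', cast(id % 10 as string)) as user_description,
 round(rand() * 100, 2) as total_value
from range(500000000)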


Step 1: Create the sketches for our aggregate table.

create or replace table alexis.test.one_table_hll_big as 
select
transaction_date,
product_name,
product_description,
user_description,
sum(total_value) as total_value,
hll_sketch_agg(id, 12) as transactions_sketch -- 12 is the lgConfigK precision parameter
from alexis.test.one_table_big
group by all        

With the demo data, one_table_big contains 500M rows, but one_table_hll_big only contains 900 rows (that's where the massive performance gain comes from).
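If you want to convince yourself the sketches are trustworthy before wiring them into a dashboard, an optional sanity check like the one below compares the estimate against the exact count on the same tables. Keep in mind this check is as slow as a regular count distinct since it still scans the big table.

-- Compare the exact distinct count against the sketch-based estimate.
select
 (select count(distinct id) from alexis.test.one_table_big) as exact_count,
 (select hll_sketch_estimate(hll_union_agg(transactions_sketch))
  from alexis.test.one_table_hll_big) as estimated_count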

Step 2: Let's compare the performance on the two tables

For our example we will filter on two products because, if you remember the start of the article, that's a case traditional aggregation techniques can't handle.

Let's start with the regular count distinct on the table containing all of the data.

select product_name, count(distinct id) from alexis.test.one_table_big 
where product_name in ('Product 3', 'Product 1')
group by all;        

The performance isn't that bad considering the little compute we're using: it finishes in about 14 seconds. Still a little bit slow for dashboarding! We could speed things up by using more compute, but that kind of thinking eventually leads to a big bill!

Let's try to do a bit better with approx count distinct.

select product_name, approx_count_distinct(id) from alexis.test.one_table_big 
where product_name in ('Product 3', 'Product 1')
group by all;        

It was a little faster, but it still took 12 seconds because there's so much data to read.

Now let's take advantage of our 900 row table to do some magic.

select
 product_name,
 hll_sketch_estimate(hll_union_agg(transactions_sketch)) as distinct_transactions
 from alexis.test.one_table_hll_big
 where product_name in ('Product 3', 'Product 1')
 group by all        

This only took 1 second. It's so fast because after filtering we only have to process 600 rows; HyperLogLog might be some fancy maths, but someone did the work for us, and running it over 600 rows is a lot faster than retrieving 333 million rows and doing a count distinct. We could shave that down even further with a little more compute and lower-latency storage like Premium Block Blob Storage Accounts in Azure or the Amazon S3 Express One Zone storage class in AWS, which we can easily afford because our aggregated table is so much smaller. Now our users can have a report with complete freedom to filter dimensions and get count distincts like they would any other regular metric such as a sum or an average! Happy dashboarding!

Bonus tip

If you want to set up a simple dashboard in a Databricks notebook to try this, a widget can be nice.

Databricks has great documentation on how to create and work with widgets.

In our case we'll want to create a multiselect widget of strings. You would think using it like this would work:

select
 product_name
 from alexis.test.one_table_hll_big
 where product_name in (${products})        

Unfortunately it's not that simple, but I figured it out so you don't have to:

select
 product_name
 from alexis.test.one_table_hll_big
 where array_contains(split("${products}", ","), product_name)        
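Putting the two pieces together, the dashboard query driven by the multiselect widget would look something like this (same tables as above; "products" is just the widget name used in this example):

select
 hll_sketch_estimate(hll_union_agg(transactions_sketch)) as distinct_transactions
from alexis.test.one_table_hll_big
where array_contains(split("${products}", ","), product_name)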

Well, at least you don't have to create account-wide filters and beg for the permission to do so like in Snowsight (Snowflake).
