Data skipping and zorder in delta
In this post, we take a look at how delta under the hood is capable of sifting through petabytes of data within seconds. Particularly we will talk about Data Skipping and ZORDER clustering. These two features combined enable delta lake to reduce the amount of data that needs to be scanned while querying.
First look at issues within spark. If you are familiar with big data systems (Spark, Hive, Impala, etc.), to improve performance we often perform partitioning. Partition works by having a subdirectory for every distinct value of a partition column. So, when we query with a filter or read a file with a filter, we read only certain partitions. But the problem with this kind of approach in spark is what if we have multiple columns to partition. We cannot decide on which columns to partition. So, a spark cannot partition data as cardinality increases. Another feature that lacks in spark is I/O pruning based on aggregates. Meaning spark cannot keep track of simple statistics such as min and max values of a certain column and use these statistics while querying. This is what the data skipping feature is about.
Let’s say we have four small files with random data (some names, integers, etc.). Now we want to query data and search for Name=’Brad’. Without the data skipping concept, the spark does not have the capability of skipping files. It first checks for ‘Brad’ in file one then goes to file two and so on until it finds ‘Brad’. Opening and closing multiple files causes a delay in querying and more I/O operations are performed which are useless (simply closing and opening file).
Now using the ZORDER concept we order our data based on names. This will create a partition of data so we can perform data skipping. This ZORDER can be built with OPTMIZE command in delta lake. With OPTIMIZE command we are saying compact down files into simpler versions. So now we have two files instead of four and our data is ordered by name. The first file contains data from Andy to Dan and the second file has data from Fred to Tom. Now delta engine knows which files to skip and which files to read. When we run the same query to find ‘Brad’ we end up opening only one file. We can use this optimize command on multiple columns.
?
Syntax for ZORDER
OPTIMIZE table ZORDER BY (column)
?Useful links
ZORDER in delta