Data Skipping and ZORDER in Delta

In this post, we take a look at how Delta Lake, under the hood, is capable of sifting through petabytes of data within seconds. In particular, we will talk about Data Skipping and ZORDER clustering. Combined, these two features enable Delta Lake to reduce the amount of data that needs to be scanned while querying.

First, let’s look at the issues within plain Spark. If you are familiar with big data systems (Spark, Hive, Impala, etc.), you know that partitioning is a common way to improve performance. Partitioning works by creating a subdirectory for every distinct value of a partition column, so when we query with a filter on that column, we read only the matching partitions. The problem with this approach in Spark is choosing the partition columns: if several columns are useful filters, we cannot partition by all of them, and as the cardinality of a column increases, partitioning breaks down because every distinct value produces yet another directory of small files. Another feature Spark lacks is I/O pruning based on aggregates, meaning Spark does not keep track of simple statistics, such as the minimum and maximum values of a column in each file, and use those statistics while querying. This is what the Data Skipping feature is about.
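To make the partitioning idea concrete, here is a minimal PySpark sketch. The paths and the country column are hypothetical, chosen only for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Hypothetical source data with a low-cardinality 'country' column.
df = spark.read.parquet("/data/events")

# Writes one subdirectory per distinct value, e.g. country=US/, country=IN/.
# A query that filters on 'country' then reads only the matching directories.
df.write.partitionBy("country").parquet("/data/events_partitioned")

# Partitioning by a high-cardinality column such as 'user_id' would instead
# create a huge number of tiny directories, which is exactly the problem above.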

Let’s say we have four small files with random data (some names, integers, etc.). Now we want to query the data and search for Name = ‘Brad’. Without the Data Skipping concept, Spark does not have the capability of skipping files: it checks for ‘Brad’ in file one, then goes to file two, and so on until it finds ‘Brad’. Opening and closing multiple files causes a delay in querying, and many of the I/O operations performed are useless (simply opening and closing files that contain no matching data).
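In PySpark, that query might look like the sketch below, reusing the Spark session from the earlier snippet; the table path is hypothetical.

# Read the table and filter on Name; '/data/people' is a hypothetical path.
people = spark.read.format("delta").load("/data/people")
people.filter(people["Name"] == "Brad").show()

# Without per-file min/max statistics, all four underlying files must be
# opened and scanned, even though only one of them contains 'Brad'.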

Now, using the ZORDER concept, we order our data based on names. This clusters the data so we can perform Data Skipping. ZORDER is applied with the OPTIMIZE command in Delta Lake. With the OPTIMIZE command we are telling Delta to compact small files down into fewer, larger ones. So now we have two files instead of four, and our data is ordered by name: the first file contains data from Andy to Dan, and the second file has data from Fred to Tom. Now the Delta engine knows which files to skip and which files to read, and when we run the same query to find ‘Brad’, we end up opening only one file. We can also use this OPTIMIZE command on multiple columns.
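To make the skipping decision concrete, here is a toy illustration in plain Python (not Delta’s actual implementation) of how per-file min/max statistics answer the question “can this file contain ‘Brad’?”; the file names and values mirror the example above.

# Per-file statistics, mirroring the two files from the example.
file_stats = [
    {"path": "part-0", "min_name": "Andy", "max_name": "Dan"},
    {"path": "part-1", "min_name": "Fred", "max_name": "Tom"},
]

target = "Brad"

# A file can contain the target only if it falls within that file's range.
files_to_read = [f["path"] for f in file_stats
                 if f["min_name"] <= target <= f["max_name"]]

print(files_to_read)  # ['part-0'] -- 'part-1' is skipped entirely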

Syntax for ZORDER

OPTIMIZE table ZORDER BY (column)
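For example, with a hypothetical people table, the command could be run from PySpark as below; the second call shows the multi-column variant mentioned above, and the DeltaTable API call is available in newer Delta Lake releases.

# Compact the table and z-order it by Name; 'people' is a hypothetical table.
spark.sql("OPTIMIZE people ZORDER BY (Name)")

# Z-ordering by multiple columns.
spark.sql("OPTIMIZE people ZORDER BY (Name, City)")

# Equivalent call through the Python DeltaTable API (newer Delta releases).
from delta.tables import DeltaTable
DeltaTable.forName(spark, "people").optimize().executeZOrderBy("Name")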


Useful links

ZORDER in Delta
