Data skipping and zorder in delta

pradeep ponduri

Develops on AWS...... Data Engineer @Amazon

Published: August 7, 2021

In this post, we look at how Delta Lake, under the hood, can sift through petabytes of data within seconds. In particular, we cover data skipping and ZORDER clustering. Combined, these two features let Delta Lake reduce the amount of data that has to be scanned while querying.

First, look at the issues within Spark. In big data systems (Spark, Hive, Impala, etc.), we often partition data to improve performance. Partitioning works by creating a subdirectory for every distinct value of the partition column, so a query or file read that filters on that column touches only the matching partitions. The problem with this approach in Spark shows up when we want to filter on multiple columns: it is hard to decide which columns to partition on, and partitioning stops being practical as cardinality increases, because every distinct combination of values gets its own directory. Another feature Spark lacks is I/O pruning based on aggregates. That is, Spark does not keep track of simple statistics, such as the min and max values of a column, and use them while querying. This is what the data skipping feature is about.
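To make the directory-per-value layout concrete, here is a minimal sketch in plain Python. It is not Spark code; the table name `events`, the columns `country` and `year`, and both helper functions are hypothetical and exist only for illustration. It shows how a filter on a partition column selects a subset of directories, and how the directory count grows with every extra partition column.

```python
from itertools import product


def partition_paths(table_root, partition_values):
    """Build one directory path per combination of partition values.

    partition_values maps a column name to its list of distinct values.
    """
    columns = list(partition_values)
    paths = []
    for combo in product(*(partition_values[c] for c in columns)):
        parts = [f"{c}={v}" for c, v in zip(columns, combo)]
        paths.append("/".join([table_root, *parts]))
    return paths


def prune(paths, column, value):
    """Keep only the directories whose path has the segment column=value."""
    segment = f"{column}={value}"
    return [p for p in paths if segment in p.split("/")]


# Hypothetical table partitioned by two columns.
paths = partition_paths(
    "events", {"country": ["US", "IN", "DE"], "year": [2020, 2021]}
)
print(len(paths))                      # 3 countries x 2 years = 6 directories
print(prune(paths, "country", "IN"))   # only the two country=IN directories
```

With 3 countries and 2 years there are already 6 directories. A high-cardinality column such as a user id would multiply that count by the number of distinct ids, which is why partitioning alone does not scale to many filter columns.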

Let's say we have four small files with random data (some names, integers, etc.). Now we want to query the data for Name='Brad'. Without data skipping, Spark has no way to skip files. It checks file one for 'Brad', then file two, and so on, because any of the files might contain a matching row. Opening and closing that many files delays the query, and much of the I/O is wasted on files that hold nothing relevant.
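The cost of scanning without data skipping can be sketched in a few lines of plain Python. The file names and their contents below are made up for illustration; the point is only that every file is opened even though a single file holds the match.

```python
# Hypothetical contents of four small, unordered files.
files = {
    "file1": ["Tom", "Andy"],
    "file2": ["Fred", "Brad"],
    "file3": ["Dan", "Tom"],
    "file4": ["Andy", "Fred"],
}


def full_scan(files, name):
    """Open every file and collect the rows that match the filter."""
    opened = []
    matches = []
    for path, rows in files.items():
        opened.append(path)  # each file is opened, match or no match
        matches.extend((path, row) for row in rows if row == name)
    return opened, matches


opened, matches = full_scan(files, "Brad")
print(len(opened))  # 4 files opened
print(matches)      # [('file2', 'Brad')]
```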

Now, using ZORDER, we order our data by name. This clusters the data so that data skipping becomes possible. The ZORDER is built with the OPTIMIZE command in Delta Lake. With OPTIMIZE we are asking Delta to compact the files into fewer, larger ones. So now we have two files instead of four, and our data is ordered by name. The first file contains names from Andy to Dan and the second file has names from Fred to Tom. Now the Delta engine knows which files to skip and which files to read. When we run the same query to find 'Brad', we end up opening only one file. We can also use ZORDER on multiple columns.
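The compaction-plus-skipping idea above can be sketched as follows. This is a simplified model in plain Python, not how Delta Lake is implemented: the row values are invented, the helper names are hypothetical, and only a single column is ordered. It sorts the rows, splits them into two files, records min/max per file, and uses those statistics to decide which files a query must read.

```python
def compact_sorted(rows, num_files):
    """Sort all rows and split them into num_files contiguous chunks."""
    ordered = sorted(rows)
    size = -(-len(ordered) // num_files)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def min_max(chunks):
    """Record the smallest and largest value of each (sorted) chunk."""
    return [(chunk[0], chunk[-1]) for chunk in chunks]


def files_to_read(stats, value):
    """Return indexes of files whose [min, max] range can contain value."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= value <= hi]


# Hypothetical rows that were spread across four small files.
rows = ["Tom", "Andy", "Fred", "Brad", "Dan", "Tom", "Andy", "Fred"]

chunks = compact_sorted(rows, num_files=2)
stats = min_max(chunks)
print(stats)                          # [('Andy', 'Dan'), ('Fred', 'Tom')]
print(files_to_read(stats, "Brad"))   # [0] -> only the first file is opened
```

Because the ranges no longer overlap, the statistics alone rule out the second file. On unordered files the same min/max check would rarely exclude anything, since each file's range tends to span most of the values.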

Syntax for ZORDER

OPTIMIZE table ZORDER BY (column)
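When ZORDER BY is given more than one column, the general idea behind a Z-order curve is to interleave the bits of the column values so that rows close in every dimension end up close in the sort order. The sketch below illustrates that idea for two small non-negative integers. It is an assumption-laden illustration of the curve itself, not Delta Lake's actual implementation, and `interleave_bits` is a hypothetical helper.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order value.

    Bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Order the points of a 2 x 2 grid along the Z-order curve.
points = [(x, y) for x in range(2) for y in range(2)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
print(ordered)  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Sorting by the interleaved value and then cutting the result into files gives each file a narrow min/max range on both columns, which is what lets data skipping work for filters on either one.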

Useful links

ZORDER in delta
