How to Optimize Your Delta Lake

How to Optimize Your Delta Lake

Tired of waiting for hours and hours before you fetch the data from your delta lake?

Tired of getting continuous calls from business analysts for the provision of data?

Tired of never-ending meetings with stakeholders because of lack of data?

Then this easy and simple guide is for you, by just following these 4 steps, observe a dramatic speed-up in fetching your query results.


Partitioning

Partition your data according to the needs of your stakeholders, for example, if the data is queried/filtered majorly on the basis of date, partition according to the date column.

Partitioning can be done on one or multiple columns while writing the delta table by using the option


.partitionBy('col1','col2')
        


Compaction (Bin-packing)

Compaction coalesces a large number of small files into a small number of larger files so that spark needs to read a lesser number of files while fetching the data.

To execute compaction, run the following command.


OPTIMIZE delta.`<delta-table-path` | delta.`<name>`
        


Z-Ordering

Z-Ordering co-locates the records, for example, it puts all records of a specific date into one file, so that when queried, spark has to read just one file rather than reading and gathering data from multiple files

To Z-Order, run the following command


OPTIMIZE delta.`<delta-table-path>` | delta.`<name>`

ZORDER BY (col>)
        

Z Ordering can also be done on a specific set of records


OPTIMIZE delta.`<delta-table-path>` | delta.`<name>`

WHERE date >= current_timestamp() - INTERVAL 1 day

ZORDER BY (<col>)
        

?

Auto Optimize Write / Auto Compact

If your table is continuously appended and/or modified, then you need to optimize your table, z-order your table and partition it repeatedly, in order for it to keep giving results in the quickest possible time.

To turn on auto optimization, run the following commands


ALTER TABLE [table_name | delta.`<table-path>`] SET TBLPROPERTIES

(delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
        

References

Optimize performance with File Management. Optimize performance with file management | Databricks on AWS. (n.d.). Retrieved October 13, 2022, from https://docs.databricks.com/delta/file-mgmt.html#enhanced-checkpoints-for-low-latency-queries


Written By Muhammad Hasan Malik

要查看或添加评论,请登录

Hasan Javaid Malik的更多文章

社区洞察

其他会员也浏览了