How to Optimize Your Delta Lake
Hasan Javaid Malik
Analytics & Usability @ Jazz | CX | Microsoft & Databricks Certified | xSystems, xMatas
Tired of waiting for hours and hours before you fetch the data from your delta lake?
Tired of getting continuous calls from business analysts for the provision of data?
Tired of never-ending meetings with stakeholders because of lack of data?
Then this easy and simple guide is for you, by just following these 4 steps, observe a dramatic speed-up in fetching your query results.
Partitioning
Partition your data according to the needs of your stakeholders, for example, if the data is queried/filtered majorly on the basis of date, partition according to the date column.
Partitioning can be done on one or multiple columns while writing the delta table by using the option
.partitionBy('col1','col2')
Compaction (Bin-packing)
Compaction coalesces a large number of small files into a small number of larger files so that spark needs to read a lesser number of files while fetching the data.
To execute compaction, run the following command.
OPTIMIZE delta.`<delta-table-path` | delta.`<name>`
领英推荐
Z-Ordering
Z-Ordering co-locates the records, for example, it puts all records of a specific date into one file, so that when queried, spark has to read just one file rather than reading and gathering data from multiple files
To Z-Order, run the following command
OPTIMIZE delta.`<delta-table-path>` | delta.`<name>`
ZORDER BY (col>)
Z Ordering can also be done on a specific set of records
OPTIMIZE delta.`<delta-table-path>` | delta.`<name>`
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (<col>)
?
Auto Optimize Write / Auto Compact
If your table is continuously appended and/or modified, then you need to optimize your table, z-order your table and partition it repeatedly, in order for it to keep giving results in the quickest possible time.
To turn on auto optimization, run the following commands
ALTER TABLE [table_name | delta.`<table-path>`] SET TBLPROPERTIES
(delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)
References
Optimize performance with File Management. Optimize performance with file management | Databricks on AWS. (n.d.). Retrieved October 13, 2022, from https://docs.databricks.com/delta/file-mgmt.html#enhanced-checkpoints-for-low-latency-queries
Written By Muhammad Hasan Malik