登录查看更多内容

How to Optimize Your Delta Lake

Hasan Javaid Malik

Analytics & Usability @ Jazz | CX | Microsoft & Databricks Certified | xSystems, xMatas

发布日期: 2022年10月13日

+ 关注

Tired of waiting for hours and hours before you fetch the data from your delta lake?

Tired of getting continuous calls from business analysts for the provision of data?

Tired of never-ending meetings with stakeholders because of lack of data?

Then this easy and simple guide is for you, by just following these 4 steps, observe a dramatic speed-up in fetching your query results.

Partitioning

Partition your data according to the needs of your stakeholders, for example, if the data is queried/filtered majorly on the basis of date, partition according to the date column.

Partitioning can be done on one or multiple columns while writing the delta table by using the option


.partitionBy('col1','col2')

Compaction (Bin-packing)

Compaction coalesces a large number of small files into a small number of larger files so that spark needs to read a lesser number of files while fetching the data.

To execute compaction, run the following command.


OPTIMIZE delta.`<delta-table-path` | delta.`<name>`

领英推荐

5 Telltale Signs You Don't Understand Big Data

Bernard Marr 8 年前

Snowflake's File Sizing and Loading Tips

Adam Morton 4 个月前

Weekly Newsletter: Data Days 2024 Edition!

IDERA Software 4 个月前

Z-Ordering

Z-Ordering co-locates the records, for example, it puts all records of a specific date into one file, so that when queried, spark has to read just one file rather than reading and gathering data from multiple files

To Z-Order, run the following command


OPTIMIZE delta.`<delta-table-path>` | delta.`<name>`

ZORDER BY (col>)

Z Ordering can also be done on a specific set of records


OPTIMIZE delta.`<delta-table-path>` | delta.`<name>`

WHERE date >= current_timestamp() - INTERVAL 1 day

ZORDER BY (<col>)

?

Auto Optimize Write / Auto Compact

If your table is continuously appended and/or modified, then you need to optimize your table, z-order your table and partition it repeatedly, in order for it to keep giving results in the quickest possible time.

To turn on auto optimization, run the following commands


ALTER TABLE [table_name | delta.`<table-path>`] SET TBLPROPERTIES

(delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)

References

Optimize performance with File Management. Optimize performance with file management | Databricks on AWS. (n.d.). Retrieved October 13, 2022, from https://docs.databricks.com/delta/file-mgmt.html#enhanced-checkpoints-for-low-latency-queries

Written By Muhammad Hasan Malik

要查看或添加评论，请登录

Hasan Javaid Malik的更多文章

How to Deploy any LLM (ChatGPT like) Python App on Azure

2023年12月22日

How to Deploy any LLM (ChatGPT like) Python App on Azure

Hi Folks! ?? Since this is something I've struggled with quite a lot in the past week or so, I thought about writing an…

1 条评论
How to Build ChatGPT like Custom Chatbot in 3 simple steps!

2023年12月4日

How to Build ChatGPT like Custom Chatbot in 3 simple steps!

Hi there! ?? Today, we are going to learn how to make a custom chatGPT-like chatbot in three simple steps, to which you…

1 条评论
Fine Tuning GPT 3.5 Turbo with your own data

2023年8月25日

Fine Tuning GPT 3.5 Turbo with your own data

Yes, you read that right, using OpenAI, now you can fine-tune GPT 3.5 with your own custom data to make GPT respond to…

8 条评论

How to Optimize Your Delta Lake

Hasan Javaid Malik

Analytics & Usability @ Jazz | CX | Microsoft & Databricks Certified | xSystems, xMatas

Partitioning

Compaction (Bin-packing)

领英推荐

Z-Ordering

?

Auto Optimize Write / Auto Compact

References

Hasan Javaid Malik的更多文章

社区洞察

其他会员也浏览了

KIMBALL VS INMON REDUX

Struggling to troubleshoot your code in Snowflake? You need to read this.

Big Data Explained Simply?

CTRL + ALT + Data Security #16

Working with Dataverse data - the 'legacy' of the Common Data Service

CTW's near real-time data pipeline solution

Data News - September

Microsoft Data Platform News 2024 - Week 09

MDS Newsletter #67

Microsoft Data Platform News 2024 - Week 01

Partitioning

Compaction (Bin-packing)

领英推荐

Z-Ordering

?

Auto Optimize Write / Auto Compact

References

Hasan Javaid Malik的更多文章

How to Deploy any LLM (ChatGPT like) Python App on Azure

How to Build ChatGPT like Custom Chatbot in 3 simple steps!

Fine Tuning GPT 3.5 Turbo with your own data

社区洞察

其他会员也浏览了

KIMBALL VS INMON REDUX

Struggling to troubleshoot your code in Snowflake? You need to read this.

Big Data Explained Simply?

CTRL + ALT + Data Security #16

Working with Dataverse data - the 'legacy' of the Common Data Service

CTW's near real-time data pipeline solution

Data News - September

Microsoft Data Platform News 2024 - Week 09

MDS Newsletter #67

Microsoft Data Platform News 2024 - Week 01