Delta Table Performance Is Governed By Transaction Size
Technology
Relational databases are known for their atomicity, consistency, isolation, and durability (ACID) properties. The first version of Apache Spark was released in 2014. However, tables in the Hive catalog did not have any ACID properties at that time, because the Hive catalog only stored the schema used to read the source files at run time. Databricks released the Delta file format in 2019, which changed how big data engineers insert, update, and delete data within a data lake.
In short, a Delta table is composed of Parquet files that contain rows of data and JSON files that keep track of transactions (actions). If we execute large transactions, then the number of files that are generated is kept to a minimum. However, many small transactions can cause performance issues due to the large number of files that are created.
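To make the file growth concrete, the sketch below writes the same rows to a Delta table first as many small appends and then as one large append, and counts the commit files in the _delta_log folder. This is a minimal sketch, assuming the delta-spark pip package is installed (on a Databricks cluster the session is already configured); the table paths, app name, and row counts are illustrative only.

# A minimal sketch, assuming the delta-spark pip package is installed; the
# table paths and row counts below are illustrative only.
import os
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-transaction-size")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

small_path = "/tmp/delta/small_batches"   # hypothetical paths
large_path = "/tmp/delta/large_batch"

# Many small transactions: each append is its own commit, so one JSON log
# entry (plus at least one new Parquet file) per row written.
for n in range(20):
    spark.createDataFrame([(n,)], "value INT") \
         .write.format("delta").mode("append").save(small_path)

# One large transaction: the same 20 rows land in a single commit.
spark.createDataFrame([(n,) for n in range(20)], "value INT") \
     .write.format("delta").mode("append").save(large_path)

def commit_count(table_path: str) -> int:
    # Every *.json file under _delta_log records one committed transaction.
    log_dir = os.path.join(table_path, "_delta_log")
    return len([f for f in os.listdir(log_dir) if f.endswith(".json")])

print(commit_count(small_path))   # 20 commits, 20+ data files
print(commit_count(large_path))   # 1 commit

Comparing the two counts shows why batching matters: the table written with many small appends accumulates a commit file and new Parquet files for every transaction, while the single large append produces one commit and far fewer files to scan.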
Business Problem
Our manager has asked us to research why small batches can cause performance problems with the Delta file format. Next, the same code will be rewritten to use large batches, which eliminates the performance issue.
Technical Solution
I have used the trial by division algorithm to find the prime numbers from 2 to 5 million. This algorithm is a great way to benchmark a system since it exercises both the compute and the storage of the cluster. Please see the Wikipedia page for details on algorithms to find prime numbers. But first, we need to understand how the Delta file format works.
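For reference, here is a short sketch of trial by division in Python; the function name and the small test range are illustrative assumptions, while the article's own benchmark runs the check from 2 to 5,000,000 on the cluster.

# A minimal sketch of trial by division; names and ranges are illustrative.
import math

def is_prime(n: int) -> bool:
    """Return True when n is prime, testing divisors up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True

# Count primes in a small sample range; the benchmark uses 2 to 5,000,000.
primes = [n for n in range(2, 100) if is_prime(n)]
print(len(primes), primes[:10])

Checking each candidate against every odd divisor up to its square root is deliberately CPU-heavy, which is what makes it a useful load generator when the results are then written to Delta tables in batches of different sizes.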
Please see my recent article on SQL Server Central for full details.