Delta Table Performance Is Governed By Transaction Size
Technology
Relational databases are known for their atomicity, consistency, isolation, and durability (ACID) properties. The first version of Apache Spark was released in 2014. However, tables in the Hive catalog did not have any ACID properties at that time, because the Hive catalog only stored the schema used to read the source files at run time. Databricks released the Delta file format in 2019, which changed how big data engineers insert, update, and delete data within a data lake.
In short, a Delta table is composed of Parquet files that contain rows of data and JSON files that keep track of transactions (actions). If we execute large transactions, then the number of files that are generated is kept to a minimum. However, many small transactions can cause performance issues due to the large number of files that are created.
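To make the file growth concrete, the sketch below writes the same rows to a Delta table first as many small appends and then as one large append, and counts the commit files in the _delta_log folder. This is a minimal sketch, assuming the delta-spark pip package is installed (on a Databricks cluster the session is already configured); the table paths, app name, and row counts are illustrative only.

# A minimal sketch, assuming the delta-spark pip package is installed; the
# table paths and row counts below are illustrative only.
import os
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-transaction-size")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

small_path = "/tmp/delta/small_batches"   # hypothetical paths
large_path = "/tmp/delta/large_batch"

# Many small transactions: each append is its own commit, so one JSON log
# entry (plus at least one new Parquet file) per row written.
for n in range(20):
    spark.createDataFrame([(n,)], "value INT") \
         .write.format("delta").mode("append").save(small_path)

# One large transaction: the same 20 rows land in a single commit.
spark.createDataFrame([(n,) for n in range(20)], "value INT") \
     .write.format("delta").mode("append").save(large_path)

def commit_count(table_path: str) -> int:
    # Every *.json file under _delta_log records one committed transaction.
    log_dir = os.path.join(table_path, "_delta_log")
    return len([f for f in os.listdir(log_dir) if f.endswith(".json")])

print(commit_count(small_path))   # 20 commits, 20+ data files
print(commit_count(large_path))   # 1 commit

Comparing the two counts shows why batching matters: the table written with many small appends accumulates a commit file and new Parquet files for every transaction, while the single large append produces one commit and far fewer files to scan.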
Business Problem
Our manager has asked us to research why small batches can cause performance problems with the Delta file format. Next, the same code will be rewritten to use large batches, which eliminates the performance issue.
Technical Solution
I have used the trial by division algorithm to find the prime numbers from 2 to 5 million. This algorithm is a great way to benchmark a system since it exercises both the compute and the storage of the cluster. Please see the Wikipedia page for details on algorithms to find prime numbers. But first, we need to understand how the Delta file format works.
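For reference, here is a short sketch of trial by division in Python; the function name and the small test range are illustrative assumptions, while the article's own benchmark runs the check from 2 to 5,000,000 on the cluster.

# A minimal sketch of trial by division; names and ranges are illustrative.
import math

def is_prime(n: int) -> bool:
    """Return True when n is prime, testing divisors up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for divisor in range(3, math.isqrt(n) + 1, 2):
        if n % divisor == 0:
            return False
    return True

# Count primes in a small sample range; the benchmark uses 2 to 5,000,000.
primes = [n for n in range(2, 100) if is_prime(n)]
print(len(primes), primes[:10])

Checking each candidate against every odd divisor up to its square root is deliberately CPU-heavy, which is what makes it a useful load generator when the results are then written to Delta tables in batches of different sizes.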
Please see my recent article on SQL Server Central for full details.