Delta Lake Liquid Clustering
Have you ever wondered if there’s a dynamic solution to the relentless challenge of data partitioning in the world of data lakehouses?
Well, I did! So let’s talk about it.
The Challenge of Fixed Data Layouts
Have a look at this graph.
This graph plots yearly row counts for a table and reveals a significant skew in the data distribution. The skew is particularly relevant because consumers frequently filter their queries on the year column.
This table was partitioned by the year and month columns when it was created. This is what the DDL looks like:
%sql
CREATE TABLE kaggle_partitioned (
year_month STRING,
exp_imp TINYINT,
hs9 SMALLINT,
Customs SMALLINT,
Country BIGINT,
quantity BIGINT,
value BIGINT,
year STRING,
month STRING
) USING delta PARTITIONED BY (year, month);
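Before looking at the charts, the skew can be surfaced directly with an aggregation over the partition column (a quick sketch, assuming the kaggle_partitioned table above has been loaded with data):

```sql
-- Sketch: row counts per year partition, to surface the skew
SELECT
  year,
  COUNT(*) AS row_count
FROM
  kaggle_partitioned
GROUP BY
  year
ORDER BY
  row_count DESC;
```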
The problem here is that two partitions hold ~83% of the table's total data.
Based on the information provided above, do you think the table is under-partitioned? Or over-partitioned?
Let’s look at the data distribution for this table in more depth. The following chart presents the monthly split of each year’s row count.
Slicing the pictorial representation of the data distribution further, 2020 has most of its data in March, with January and February following suit, followed by a trail of smaller partitions for the remaining months.
So, is the table over-partitioned? Or under-partitioned? Or both?
Does the following picture ring any bells? Does it make sense?
With the current partitioning strategy, the answer is both: a handful of heavy partitions sit alongside a long tail of tiny ones.
Now let’s discuss the data write scenarios for my table. I think the picture summarizes the point I am trying to make, so I don’t need to type it out in the article ;)
Now! Let’s repeat the very first line of this article again!
Have you ever wondered if there’s a dynamic solution to the relentless challenge of data partitioning in the world of data lakehouses?
Enter Liquid Clustering! It simplifies data layout decisions and improves query performance, all without demanding constant monitoring and adjustment.
So how does it do it?
Let’s understand it in a bit more detail. I will use the sample layout diagram from earlier.
So how does Liquid Clustering help here? See for yourself below! Liquid Clustering efficiently balances clustering against file size.
Not only does it automatically handle the smaller partitions, it also subdivides a heavier partition for more efficient queries when one wants to fetch, say, only hourly data from a large partition.
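To make this concrete, here’s the kind of selective query that benefits: with year and month as clustering keys, Delta can use file-level statistics to skip data files whose key ranges fall outside the filter (a sketch against the clustered table created below):

```sql
-- Sketch: a selective query that benefits from clustering on (year, month).
-- File-level min/max statistics let Delta skip files outside the filtered range.
SELECT
  SUM(value) AS total_value
FROM
  kaggle_clustered
WHERE
  year = '2020'
  AND month = '03';
```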
Let’s see it in action! Here’s the file size distribution for the partitioned table.
Let’s create a clustered table from this partitioned table. We’ll use CTAS.
CREATE TABLE kaggle_clustered CLUSTER BY(year, month) AS
SELECT
*
FROM
kaggle_partitioned;
And here’s the file size distribution for the clustered table. It’s clearly evident that most of the smaller files have been merged into larger, more optimized files.
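One way to compare the two layouts yourself is DESCRIBE DETAIL, which reports numFiles and sizeInBytes for a Delta table (clustered tables also report their clusteringColumns):

```sql
-- Compare file counts and total size for the two tables
DESCRIBE DETAIL kaggle_partitioned;
DESCRIBE DETAIL kaggle_clustered;
```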
Both tables have had OPTIMIZE run on them.
Liquid Clustering also aids with ingestion by leveraging partial/lazy clustering to make it more efficient.
Let’s understand how.
So! How do I use it in my tables?
Heads up! Databricks Runtime 13.3 LTS and above is required to create, write, or OPTIMIZE Delta tables with liquid clustering enabled.
First things first! Clustering is not compatible with partitioning or ZORDER, and it requires that the Databricks client manage all layout and optimization operations for the data in your table.
Now let’s see how you can create a Delta table with Liquid Clustering.
--Create an empty table
CREATE TABLE table1(col0 int, col1 string) USING DELTA CLUSTER BY (col0);
--Using a CTAS statement
CREATE EXTERNAL TABLE table2 CLUSTER BY (col0) --specify clustering after table name, not in subquery
LOCATION 'table_location' AS
SELECT
*
FROM
table1;
--Using a LIKE statement to copy configurations
CREATE TABLE table3 LIKE table1;
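Clustering keys can also be changed later without rewriting the table up front; subsequent OPTIMIZE runs recluster data as needed. A sketch of the ALTER TABLE syntax:

```sql
-- Change the clustering columns on an existing clustered table
ALTER TABLE table1 CLUSTER BY (col1);

-- Or turn clustering off entirely
ALTER TABLE table1 CLUSTER BY NONE;
```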
Tables created with liquid clustering enabled have numerous Delta table features enabled at creation and use Delta writer version 7 and reader version 3. Table protocol versions cannot be downgraded, and tables with clustering enabled are not readable by Delta Lake clients that do not support all enabled Delta reader protocol table features. See How does Databricks manage Delta Lake feature compatibility?
How to trigger clustering
Simply use the OPTIMIZE command on your table. See the example below.
OPTIMIZE table_name;
Liquid clustering is incremental, so data is only rewritten as necessary to accommodate data that needs to be clustered. Data files with clustering keys that do not match data to be clustered are not rewritten.
You should run regular OPTIMIZE jobs to cluster data and achieve the best performance. Because liquid clustering is incremental, most OPTIMIZE jobs on clustered tables run quickly.
What is liquid clustering used for?
As per the Databricks documentation, liquid clustering is recommended for all new Delta tables. The following are examples of scenarios that benefit from clustering:
- Tables often filtered by high-cardinality columns
- Tables with significant skew in data distribution
- Tables that grow quickly and require maintenance and tuning effort
- Tables with concurrent write requirements
- Tables with access patterns that change over time
- Tables where a typical partition key could leave the table with too many or too few partitions
Things to consider when using Liquid Clustering