Delta Lake Liquid Clustering
Have you ever wondered if there’s a dynamic solution to the relentless challenge of data partitioning in the world of data lakehouses?
Well, I did! So let’s talk about it.
The Challenge of Fixed Data Layouts
Have a look at this graph.
This graph plots yearly row counts for a table and reveals a significant skew in the data distribution. The skew is particularly relevant because consumers frequently filter their queries on the year column.
This table was partitioned by the year and month columns when it was created. This is what the DDL looks like:
%sql
CREATE TABLE kaggle_partitioned (
year_month STRING,
exp_imp TINYINT,
hs9 SMALLINT,
Customs SMALLINT,
Country BIGINT,
quantity BIGINT,
value BIGINT,
year STRING,
month STRING
) USING delta PARTITIONED BY (year, month);
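Before looking at the charts, the skew can be surfaced directly with an aggregation over the partition column (a quick sketch, assuming the kaggle_partitioned table above has been loaded with data):

```sql
-- Sketch: row counts per year partition, to surface the skew
SELECT
  year,
  COUNT(*) AS row_count
FROM
  kaggle_partitioned
GROUP BY
  year
ORDER BY
  row_count DESC;
```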
The problem here is that two partitions hold ~83% of the table's total data.
Based on the information provided above, do you think the table is under-partitioned? Or over-partitioned?
Let’s look at the data distribution for this table in more depth. The following chart presents the monthly split of each year’s row count.
Slicing the pictorial representation of the data distribution further, 2020 has most of its data in March, with January and February following suit, followed by a trail of smaller partitions for the remaining months.
So, is the table over-partitioned? Or under-partitioned? Or both?
Does the following picture ring any bells? Does it make sense?
With the current partitioning strategy, the answer is both: a handful of heavy partitions sit alongside a long tail of tiny ones.
Now let’s discuss the data write scenarios for my table. I think the picture summarizes the point I am trying to make, so I don’t need to type it out in the article ;)
Now! Let’s repeat the very first line of this article again!
Have you ever wondered if there’s a dynamic solution to the relentless challenge of data partitioning in the world of data lakehouses?
Enter Liquid Clustering! It simplifies data layout decisions and improves query performance, all without demanding constant monitoring and adjustment.
So how does it do it?
Let’s understand it in a bit more detail. I will use the sample layout diagram from earlier.
So how does Liquid Clustering help here? See for yourself below! Liquid Clustering efficiently balances clustering against file size.
Not only does it automatically handle the smaller partitions, it also subdivides a heavier partition for more efficient queries when one wants to fetch, say, only hourly data from a large partition.
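To make this concrete, here’s the kind of selective query that benefits: with year and month as clustering keys, Delta can use file-level statistics to skip data files whose key ranges fall outside the filter (a sketch against the clustered table created below):

```sql
-- Sketch: a selective query that benefits from clustering on (year, month).
-- File-level min/max statistics let Delta skip files outside the filtered range.
SELECT
  SUM(value) AS total_value
FROM
  kaggle_clustered
WHERE
  year = '2020'
  AND month = '03';
```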
Let’s see it in action! Here’s the file size distribution for the partitioned table.
Let’s create a clustered table from this partitioned table. We’ll use CTAS.
CREATE TABLE kaggle_clustered CLUSTER BY(year, month) AS
SELECT
*
FROM
kaggle_partitioned;
And here’s the file size distribution for the clustered table. It’s clearly evident that most of the smaller files have been merged into larger, more optimized files.
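One way to compare the two layouts yourself is DESCRIBE DETAIL, which reports numFiles and sizeInBytes for a Delta table (clustered tables also report their clusteringColumns):

```sql
-- Compare file counts and total size for the two tables
DESCRIBE DETAIL kaggle_partitioned;
DESCRIBE DETAIL kaggle_clustered;
```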
Both tables have had OPTIMIZE run on them.
Liquid Clustering also aids with ingestion by leveraging partial/lazy clustering to make it more efficient.
Let’s understand how.
So! How do I use it in my tables?
Heads up! Databricks Runtime 13.3 LTS and above is required to create, write, or OPTIMIZE Delta tables with liquid clustering enabled.
First things first! Clustering is not compatible with partitioning or ZORDER, and it requires that the Databricks client manage all layout and optimization operations for the data in your table.
Now let’s see how you can create a Delta table with Liquid Clustering.
--Create an empty table
CREATE TABLE table1(col0 int, col1 string) USING DELTA CLUSTER BY (col0);
--Using a CTAS statement
CREATE EXTERNAL TABLE table2 CLUSTER BY (col0) --specify clustering after table name, not in subquery
LOCATION 'table_location' AS
SELECT
*
FROM
table1;
--Using a LIKE statement to copy configurations
CREATE TABLE table3 LIKE table1;
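Clustering keys can also be changed later without rewriting the table up front; subsequent OPTIMIZE runs recluster data as needed. A sketch of the ALTER TABLE syntax:

```sql
-- Change the clustering columns on an existing clustered table
ALTER TABLE table1 CLUSTER BY (col1);

-- Or turn clustering off entirely
ALTER TABLE table1 CLUSTER BY NONE;
```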
Tables created with liquid clustering enabled have numerous Delta table features enabled at creation and use Delta writer version 7 and reader version 3. Table protocol versions cannot be downgraded, and tables with clustering enabled are not readable by Delta Lake clients that do not support all enabled Delta reader protocol table features. See How does Databricks manage Delta Lake feature compatibility?
How to trigger clustering
Simply use the OPTIMIZE command on your table. See the example below.
OPTIMIZE table_name;
Liquid clustering is incremental, so data is only rewritten as necessary to accommodate data that needs to be clustered. Data files with clustering keys that do not match data to be clustered are not rewritten.
You should run regular OPTIMIZE jobs to cluster data and achieve the best performance. Because liquid clustering is incremental, most OPTIMIZE jobs on clustered tables run quickly.
What is liquid clustering used for?
As per the Databricks documentation, liquid clustering is recommended for all new Delta tables. The following are examples of scenarios that benefit from clustering:
- Tables often filtered by high-cardinality columns
- Tables with significant skew in data distribution
- Tables that grow quickly and require maintenance and tuning effort
- Tables with concurrent write requirements
- Tables with access patterns that change over time
- Tables where a typical partition key could leave the table with too many or too few partitions
Things to consider when using Liquid Clustering