Custom Metadata in Delta Table History

If you’ve ever lived in the pre-cloud ETL world, you’ll remember how ETL tools or stored procedures were the backbone of data processing. Back then, there was a popular concept called Audit, Balance, and Control (ABC). It was all about capturing audit information, tracking record counts at every stage of the process, and reconciling them at the end. That data would often be stored in a set of audit tables, control tables, or static reference tables, depending on how the ABC framework was designed.

Why did we need all this? Because observability on tools and platforms wasn’t great back then. These little tricks were our way of keeping things in check.

Fast Forward to Databricks and Delta Lake

Now, let’s jump to today’s world of Databricks and Delta Lake. As clients migrate from legacy data warehouses to Databricks, they often want to replicate those familiar ABC functionalities. But here’s where things get tricky: Delta tables don’t work like your old-school audit tables. Multiple processes writing to the same table can lead to concurrency issues. Sure, you could explore partitioning or liquid clustering, but honestly, for something as small as an audit table, that feels like overkill.

So, what if I told you there’s a much better (and lesser-known) feature that Delta Lake offers? One that’s simple yet powerful enough to handle this need without adding unnecessary complexity.

Enter Custom Metadata in Delta Lake

Delta Lake allows you to attach your own custom metadata to each commit. Yes, you read that right! When you check the table’s history (DESCRIBE HISTORY), you already see useful details like:

  • Operation type: write, merge, etc.
  • Operation metrics: rows impacted, byte size, files affected.

But here’s the game-changer—you can inject your own metadata into that history! Imagine adding pipeline names, source systems, timestamps, or any other context you need. This opens up endless possibilities for automation and observability.

How Do You Set It Up?

It’s surprisingly easy. You can configure custom metadata at two levels: session-wide or for a specific operation.

Session-Level Metadata

If you want metadata applied across all operations in a session:

# Applies to every Delta commit made in this Spark session
spark.conf.set(
  "spark.databricks.delta.commitInfo.userMetadata",
  '{"framework_version": "ETL_v2.1", "owner": "data_engineering_team"}'
)

Operation-Level Metadata

For more granular control over individual writes:

from datetime import datetime
import json

# Context to stamp onto this specific commit
pipeline_metadata = {
    "pipeline_name": "long_running_pipeline",
    "source_system": "CRM",
    "processing_timestamp": datetime.now().isoformat(),
    "process_date": "2025-01-01"
}

# The userMetadata option applies only to this write's commit
# and takes precedence over the session-level setting
(df.write.format("delta")
  .option("userMetadata", json.dumps(pipeline_metadata))
  .mode("append")
  .saveAsTable("catalog.schema.customer_data"))

SQL Configuration

For SQL folks who prefer simplicity:

SET spark.databricks.delta.commitInfo.userMetadata = 'audit_comment';
INSERT OVERWRITE target_table SELECT * FROM data_source;        

What Happens Next?

Once your metadata is set, it becomes part of your Delta table’s history. You can query DESCRIBE HISTORY and filter on the userMetadata column to find exactly the commits you care about.

Now you’ve got not just the default metrics but also your custom metadata right there in the history log.
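
Here’s a minimal sketch of what that lookup could look like in PySpark, assuming the table and pipeline name from the operation-level example above (adapt them to your own tables):

from pyspark.sql import functions as F

# Pull the commit history of the table we wrote to earlier
history = spark.sql("DESCRIBE HISTORY catalog.schema.customer_data")

# Filter on our own metadata and keep the columns we care about
(history
  .filter(F.col("userMetadata").contains("long_running_pipeline"))
  .select("version", "timestamp", "operation", "operationMetrics", "userMetadata")
  .show(truncate=False))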

Why Does This Matter?

This isn’t just about replicating old ABC frameworks—it’s about doing it smarter. With custom metadata:

  • You have richer audit logs for compliance or debugging.
  • You open doors for automation and deeper insights into your processes.
  • When you make one-off manual fixes (ad hoc inserts or deletes), you can stamp the commit with a JIRA ticket number so the context lives in the table’s history itself (see the sketch after this list).
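
For example, here’s a rough sketch of stamping a manual fix with a ticket number; the table, predicate, and ticket ID are purely illustrative:

# Label the next commit in this session with the ticket that prompted the fix
spark.conf.set(
  "spark.databricks.delta.commitInfo.userMetadata",
  '{"change_type": "manual_fix", "jira_ticket": "DATA-1234"}'
)

# The commit produced by this DELETE carries the ticket number in its history
spark.sql("DELETE FROM catalog.schema.customer_data WHERE customer_id = 42")

# Clear the setting so later commits in the session are not mislabeled
spark.conf.unset("spark.databricks.delta.commitInfo.userMetadata")

Now DESCRIBE HISTORY tells you not only that rows were deleted, but why.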

And the best part? It’s all built into Delta Lake—no extra tables or complex frameworks needed.

So go ahead - start experimenting with custom metadata in your Delta workflows, or reach out to Lovelytics. The possibilities are endless, and it’s time to leave those clunky legacy methods behind!
