Custom Metadata in Delta Table History

If you’ve ever lived in the pre-cloud ETL world, you’ll remember how ETL tools or stored procedures were the backbone of data processing. Back then, there was a popular concept called Audit, Balance, and Control (ABC). It was all about capturing audit information, tracking record counts at every stage of the process, and reconciling them at the end. That data would often be stored in a set of audit tables, control tables, or static reference tables, depending on how the ABC framework was designed.

Why did we need all this? Because observability on tools and platforms wasn’t great back then. These little tricks were our way of keeping things in check.

Fast Forward to Databricks and Delta Lake

Now, let’s jump to today’s world of Databricks and Delta Lake. As clients migrate from legacy data warehouses to Databricks, they often want to replicate those familiar ABC functionalities. But here’s where things get tricky: Delta tables don’t work like your old-school audit tables. Multiple processes writing to the same table can lead to concurrency issues. Sure, you could explore partitioning or liquid clustering, but honestly, for something as small as an audit table, that feels like overkill.

So, what if I told you there’s a much better (and lesser-known) feature that Delta Lake offers? One that’s simple yet powerful enough to handle this need without adding unnecessary complexity.

Enter Custom Metadata in Delta Lake

Delta Lake allows you to attach your own custom metadata to each commit. Yes, you read that right! When you check the table’s history (DESCRIBE HISTORY), you already see useful details like:

  • Operation type: write, merge, etc.
  • Operation metrics: rows impacted, byte size, files affected.

But here’s the game-changer—you can inject your own metadata into that history! Imagine adding pipeline names, source systems, timestamps, or any other context you need. This opens up endless possibilities for automation and observability.

How Do You Set It Up?

It’s surprisingly easy. You can configure custom metadata at two levels: session-wide or for a specific operation.

Session-Level Metadata

If you want metadata applied across all operations in a session:

# Applies to every Delta commit made in this Spark session
spark.conf.set(
  "spark.databricks.delta.commitInfo.userMetadata",
  '{"framework_version": "ETL_v2.1", "owner": "data_engineering_team"}'
)

Operation-Level Metadata

For more granular control over individual writes:

from datetime import datetime
import json

# Context to stamp onto this specific commit
pipeline_metadata = {
    "pipeline_name": "long_running_pipeline",
    "source_system": "CRM",
    "processing_timestamp": datetime.now().isoformat(),
    "process_date": "2025-01-01"
}

# The userMetadata option applies only to this write's commit
# and takes precedence over the session-level setting
(df.write.format("delta")
  .option("userMetadata", json.dumps(pipeline_metadata))
  .mode("append")
  .saveAsTable("catalog.schema.customer_data"))

SQL Configuration

For SQL folks who prefer simplicity:

SET spark.databricks.delta.commitInfo.userMetadata = 'audit_comment';
INSERT OVERWRITE target_table SELECT * FROM data_source;        

What Happens Next?

Once your metadata is set, it becomes part of your Delta table’s history. You can query DESCRIBE HISTORY and filter on the userMetadata column to find exactly the commits you care about.

Now you’ve got not just the default metrics but also your custom metadata right there in the history log.
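
Here’s a minimal sketch of what that lookup could look like in PySpark, assuming the table and pipeline name from the operation-level example above (adapt them to your own tables):

from pyspark.sql import functions as F

# Pull the commit history of the table we wrote to earlier
history = spark.sql("DESCRIBE HISTORY catalog.schema.customer_data")

# Filter on our own metadata and keep the columns we care about
(history
  .filter(F.col("userMetadata").contains("long_running_pipeline"))
  .select("version", "timestamp", "operation", "operationMetrics", "userMetadata")
  .show(truncate=False))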

Why Does This Matter?

This isn’t just about replicating old ABC frameworks—it’s about doing it smarter. With custom metadata:

  • You have richer audit logs for compliance or debugging.
  • You open doors for automation and deeper insights into your processes.
  • When you make one-off manual fixes (ad hoc inserts or deletes), you can stamp the commit with a JIRA ticket number so the context lives in the table’s history itself (see the sketch after this list).
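
For example, here’s a rough sketch of stamping a manual fix with a ticket number; the table, predicate, and ticket ID are purely illustrative:

# Label the next commit in this session with the ticket that prompted the fix
spark.conf.set(
  "spark.databricks.delta.commitInfo.userMetadata",
  '{"change_type": "manual_fix", "jira_ticket": "DATA-1234"}'
)

# The commit produced by this DELETE carries the ticket number in its history
spark.sql("DELETE FROM catalog.schema.customer_data WHERE customer_id = 42")

# Clear the setting so later commits in the session are not mislabeled
spark.conf.unset("spark.databricks.delta.commitInfo.userMetadata")

Now DESCRIBE HISTORY tells you not only that rows were deleted, but why.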

And the best part? It’s all built into Delta Lake—no extra tables or complex frameworks needed.

So go ahead - start experimenting with custom metadata in your Delta workflows, or reach out to Lovelytics. The possibilities are endless, and it’s time to leave those clunky legacy methods behind!
