Custom Metadata in Delta Table History
If you’ve ever lived in the pre-cloud ETL world, you’ll remember how ETL tools or stored procedures were the backbone of data processing. Back then, there was a popular concept called Audit, Balance, and Control (ABC). It was all about capturing audit information, tracking record counts at every stage of the process, and reconciling them at the end. This data would often be stored in a set of audit tables, control tables, or static reference tables, depending on how the ABC framework was designed.
Why did we need all this? Because observability on tools and platforms wasn’t great back then. These little tricks were our way of keeping things in check.
Fast Forward to Databricks and Delta Lake
Now, let’s jump to today’s world of Databricks and Delta Lake. As clients migrate from legacy data warehouses to Databricks, they often want to replicate those familiar ABC functionalities. But here’s where things get tricky: Delta tables don’t work like your old-school audit tables. Multiple processes writing to the same table can lead to concurrency issues. Sure, you could explore partitioning or liquid clustering, but honestly, for something as small as an audit table, that feels like overkill.
So, what if I told you there’s a much better (and lesser-known) feature that Delta Lake offers? One that’s simple yet powerful enough to handle this need without adding unnecessary complexity.
Enter Custom Metadata in Delta Lake
Delta Lake allows you to attach your own custom metadata to each commit. Yes, you read that right! When you check the table’s history (DESCRIBE HISTORY), you already see useful details like the commit version, timestamp, user, operation type, operation parameters, and operation metrics for every write.
But here’s the game-changer—you can inject your own metadata into that history! Imagine adding pipeline names, source systems, timestamps, or any other context you need. This opens up endless possibilities for automation and observability.
How Do You Set It Up?
It’s surprisingly easy. You can configure custom metadata at two levels: session-wide or for a specific operation.
Session-Level Metadata
If you want metadata applied across all operations in a session:
# Every Delta commit made in this session will carry this metadata
spark.conf.set(
    "spark.databricks.delta.commitInfo.userMetadata",
    '{"framework_version": "ETL_v2.1", "owner": "data_engineering_team"}'
)
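Once that config is set, every Delta write in the session picks it up automatically. As a minimal sketch (assuming a DataFrame df and the same target table used later in this post), a plain append now gets stamped with the framework metadata, and you can clear the config when you no longer want it applied:
# Any Delta write in this session now records the session-level metadata
df.write.format("delta").mode("append").saveAsTable("catalog.schema.customer_data")

# Clear the config when commits should stop being stamped
spark.conf.unset("spark.databricks.delta.commitInfo.userMetadata")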
Operation-Level Metadata
For more granular control over individual writes:
from datetime import datetime
import json

# Describe the commit: pipeline, source system, and processing context
pipeline_metadata = {
    "pipeline_name": "long_running_pipeline",
    "source_system": "CRM",
    "processing_timestamp": datetime.now().isoformat(),
    "process_date": "2025-01-01"
}

# Attach the metadata to this write only via the userMetadata option
(df.write.format("delta")
    .option("userMetadata", json.dumps(pipeline_metadata))
    .mode("append")
    .saveAsTable("catalog.schema.customer_data"))
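If you want to confirm what landed in the commit, the DeltaTable API can pull just the latest history entry. A minimal sketch, reusing the table name from the example above (and note: when both the session config and the per-write option are set, the per-write option takes precedence):
from delta.tables import DeltaTable

# Inspect the most recent commit; userMetadata holds the JSON string attached above
latest = DeltaTable.forName(spark, "catalog.schema.customer_data").history(1)
latest.select("version", "operation", "userMetadata").show(truncate=False)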
SQL Configuration
For SQL folks who prefer simplicity:
SET spark.databricks.delta.commitInfo.userMetadata = 'audit_comment';
INSERT OVERWRITE target_table SELECT * FROM data_source;
What Happens Next?
Once your metadata is set, it becomes part of your Delta table’s history. You can query that history and filter on your own metadata values, as in the sketch below.
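For example, here’s a rough PySpark sketch that filters the history for commits written by the hypothetical pipeline from the earlier example:
from pyspark.sql import functions as F

history = spark.sql("DESCRIBE HISTORY catalog.schema.customer_data")

# userMetadata is stored as a string, so parse out the JSON fields you care about
pipeline_commits = (history
    .where(F.get_json_object("userMetadata", "$.pipeline_name") == "long_running_pipeline")
    .select("version", "timestamp", "operation", "operationMetrics", "userMetadata"))

pipeline_commits.show(truncate=False)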
Now you’ve got not just the default metrics but also your custom metadata right there in the history log.
Why Does This Matter?
This isn’t just about replicating old ABC frameworks; it’s about doing it smarter. With custom metadata, every commit carries its own audit context: you can see which pipeline wrote which data, reconcile record counts straight from the operation metrics, and filter table history by the attributes that matter to your framework.
And the best part? It’s all built into Delta Lake—no extra tables or complex frameworks needed.
So go ahead - start experimenting with custom metadata in your Delta workflows, or reach out to Lovelytics. The possibilities are endless, and it’s time to leave those clunky legacy methods behind!