Understanding Delta Table Format and Architecture
Arabinda Mohapatra
PySpark, Snowflake, AWS, Stored Procedures, Hadoop, Python, SQL, Airflow, Kafka, Iceberg, Delta Lake, Hive, BFSI, Telecom
- Delta Lake is an open-source storage framework that enables building a format-agnostic Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, Hive, Snowflake, Google BigQuery, Athena, Redshift, Databricks, and Azure Fabric, as well as APIs for Scala, Java, Rust, and Python. With Delta Universal Format (UniForm), you can now read Delta tables with Iceberg and Hudi clients.
- Delta Lake addresses the reliability and consistency challenges of traditional data lakes by introducing an open-source transactional storage layer on top of your data lake.
- Purpose of Each Component
- _last_checkpoint: Keeps track of the last successful checkpoint, so readers can start from it instead of replaying the full log; it is also used for recovery in case of failures.
- COMMITTEDFILES: Lists all the files that have been successfully committed to the table
- INPROGRESS: Stores files that are currently being written and not yet committed
- _SUCCESS: A marker indicating that the associated files have been successfully written and are ready to be committed
- _temp: Used for temporary storage during the write process
- How Does Compaction Work?
- Transaction Log Files: Initially, each transaction in Delta Lake is recorded as a separate JSON file in the _delta_log directory. These files contain detailed information about the changes made during each transaction.
- CRC Files: The .crc files are checksum files that help validate the integrity of the JSON files. They ensure that the transaction log files have not been corrupted during storage or transfer.
- Error Detection: By using CRC (Cyclic Redundancy Check), Delta Lake can detect errors in the transaction log files, ensuring that only valid and uncorrupted data is processed
- Compaction Process: Periodically, Delta Lake compacts these JSON files into Parquet files. Parquet is a columnar storage format that is highly efficient for both storage and retrieval.
- Consolidation: During compaction, multiple JSON files are read and their contents are consolidated into a single Parquet checkpoint file in the _delta_log directory. The older JSON commit files are not deleted immediately; they are cleaned up later according to the log retention settings.
- Checkpointing: Along with compaction, Delta Lake also creates checkpoints. A checkpoint is a snapshot of the transaction log at a specific point in time, stored as a Parquet file. Checkpoints make it faster to read the transaction log because the system can start from the checkpoint rather than reading all the JSON files from the beginning; the sketch below shows how these log files look on disk.
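For a concrete look at this layout, the sketch below lists the _delta_log directory of a table and prints the _last_checkpoint file. It assumes a Databricks notebook (so dbutils is available) and uses the table location from Step 1 below; adjust the path for your own table.
%python
# Sketch: inspect the Delta transaction log (assumes Databricks and the Step 1 path).
log_path = "/user/hive/warehouse/student_info/_delta_log"

# JSON commit files, .crc checksums, and Parquet checkpoint files, e.g.
# 00000000000000000000.json, 00000000000000000000.crc, 00000000000000000010.checkpoint.parquet
# (checkpoints appear only after enough commits, every 10 by default)
for f in dbutils.fs.ls(log_path):
    print(f.name)

# _last_checkpoint is a small JSON file pointing readers at the most recent checkpoint
print(dbutils.fs.head(log_path + "/_last_checkpoint"))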
Step 1—Creating Databricks Delta Table:
CREATE TABLE students_info (
id INT,
name STRING,
age INT
)
USING DELTA
LOCATION '/user/hive/warehouse/student_info'
Step 2—Inserting Data Into Databricks Delta Table
Using SQL:
INSERT INTO students_info
VALUES (1, "Elon", 25),
(2, "Jeff", 30),
(3, "Larry", 35)
Using the DataFrame:
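One way to do the same insert with the DataFrame API is sketched below. It writes to a separate table, students_info_dataframe, which the later DataFrame examples read back; the table name and the default Hive warehouse location are assumptions here.
%python
# Sketch: insert the same rows using the DataFrame API.
# Writes to a separate table (students_info_dataframe) used by the later DataFrame examples;
# with the default warehouse it lands under /user/hive/warehouse/students_info_dataframe.
students_data = [(1, "Elon", 25), (2, "Jeff", 30), (3, "Larry", 35)]
students_df = spark.createDataFrame(students_data, schema="id INT, name STRING, age INT")

students_df.write.format("delta").mode("append").saveAsTable("students_info_dataframe")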
Step 3—Query the Delta Table
Using SQL:
SELECT * FROM students_info;
Using the DataFrame:
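A minimal DataFrame equivalent of the query above might look like this, reading either by table name or directly from the storage path used in Step 1.
%python
# Sketch: read the Delta table with the DataFrame API and display its contents.
students_df = spark.read.format("delta").table("students_info")
students_df.show()

# Or read directly from the table's storage location
students_df_by_path = spark.read.format("delta").load("/user/hive/warehouse/student_info")
students_df_by_path.show()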
Step 4—Perform DML Operations (Update, Delete, Merge)
1) Updating Records in a Delta Table:
Using SQL:
UPDATE students_info
SET name = "Tony Stark"
WHERE id = 1;
Using the DataFrame:
%python
from pyspark.sql.functions import *
# Read the Delta table into a DataFrame
students_df = spark.read.format("delta").load("/user/hive/warehouse/students_info_dataframe")
# Update the DataFrame
updated_df = students_df.withColumn(
    "name",
    when(students_df.id == 1, lit("Tony Stark")).otherwise(students_df.name)
)
# Write the updated DataFrame back to the Delta table
updated_df.write.format("delta").mode("overwrite").option("mergeSchema", "true").saveAsTable("students_info_dataframe")
2) Deleting Records in a Delta Table:
Using SQL:
DELETE FROM students_info
WHERE age < 30;
Using the DataFrame:
%python
# Read the Delta Table into a DataFrame
delta_table_df = spark.read.format("delta").load("/user/hive/warehouse/students_info_dataframe")
# Keep only the rows that should remain (i.e., delete rows with age < 30)
filtered_df = delta_table_df.filter(delta_table_df.age >= 30)
# Write the filtered DataFrame back to the Delta Table
filtered_df.write.format("delta").mode("overwrite").save("/user/hive/warehouse/students_info_dataframe")
3) Merging Records in a Delta Table
Using SQL:
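A minimal sketch of the MERGE statement for this table is shown below, using an inline VALUES source with the same rows as the DataFrame example that follows; adapt the source query to your own staging table or view.
MERGE INTO students_info AS target
USING (
  SELECT * FROM VALUES
    (1, 'Tony Stark', 35),
    (4, 'Bruce Wayne', 40) AS source_rows(id, name, age)
) AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET target.name = source.name, target.age = source.age
WHEN NOT MATCHED THEN
  INSERT (id, name, age) VALUES (source.id, source.name, source.age);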
Using DataFrame:
%python
from pyspark.sql.functions import coalesce
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Source DataFrame with the new and updated records
source_data = [(1, 'Tony Stark', 35), (4, 'Bruce Wayne', 40)]
source_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
source_df = spark.createDataFrame(source_data, schema=source_schema)

# Loading target Databricks Delta table
target_df = spark.read.format("delta").table("students_info_dataframe")

# Merge operation: full outer join on id, preferring source values where present (upsert semantics)
merged_df = (
    target_df.alias("target")
    .join(source_df.alias("source"), target_df["id"] == source_df["id"], "outer")
    .select(
        coalesce(target_df["id"], source_df["id"]).alias("id"),
        coalesce(source_df["name"], target_df["name"]).alias("name"),
        coalesce(source_df["age"], target_df["age"]).alias("age"),
    )
)

# Display resulting DataFrame
merged_df.show()

# Overwrite the target Delta table with the merged result
merged_df.write.format("delta").mode("overwrite").saveAsTable("students_info_dataframe")
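As an alternative to overwriting the whole table, the delta-spark library's DeltaTable API can apply the same upsert in place; a minimal sketch, assuming the same source_df and target table as above:
%python
# Sketch: the same upsert with the DeltaTable merge API (delta-spark),
# which rewrites only the affected files instead of overwriting the whole table.
from delta.tables import DeltaTable

target_table = DeltaTable.forName(spark, "students_info_dataframe")

(target_table.alias("target")
    .merge(source_df.alias("source"), "target.id = source.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())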
Step 5—Using Time Travel Option in Delta Table (Optional)
Databricks Delta Tables offer a robust feature called "time travel," which keeps a comprehensive history of all data changes. This capability allows you to query and revert to earlier versions of your data, making it extremely useful for tasks like auditing, debugging, and replicating experiments or reports.
Using SQL:
To query a previous version of a Delta table, use the VERSION AS OF clause (which takes a version number) or the TIMESTAMP AS OF clause (which takes a timestamp).
SELECT *
FROM students_info
VERSION AS OF -------;
OR
SELECT *
FROM students_info
TIMESTAMP AS OF "-------";
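To find the version numbers and timestamps to use in place of the placeholders above, inspect the table history first:
-- Lists each version, its timestamp, and the operation that produced it
DESCRIBE HISTORY students_info;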
Using the DataFrame:
You can also use time travel from the DataFrame API by passing reader options for a specific version or timestamp.
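A minimal sketch using the versionAsOf and timestampAsOf reader options; the version number and timestamp below are placeholders to replace with values from DESCRIBE HISTORY, and the path is the Step 1 location.
%python
# Sketch: time travel with the DataFrame reader (replace the placeholder values
# with a real version number / timestamp from DESCRIBE HISTORY).
df_v = (spark.read.format("delta")
        .option("versionAsOf", 0)                        # placeholder version number
        .load("/user/hive/warehouse/student_info"))
df_v.show()

df_t = (spark.read.format("delta")
        .option("timestampAsOf", "2024-01-01 00:00:00")  # placeholder timestamp
        .load("/user/hive/warehouse/student_info"))
df_t.show()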
Step 6—Optimizing Databricks Delta Table (Optional)
To be covered in detail in a follow-up post.