Delta Lake

You have heard about the Data Lake, so what is Delta Lake? Don't worry, it is not an entirely new concept, but a data lake with new capabilities.

Let's decode it.

Those of us who have worked with a Data Lake or Big Data know about batch and real-time streaming. Storage such as Hadoop or S3 struggles to match RDBMS performance as far as DML is concerned. For that reason, we moved to in-memory processing with Spark for fast DML and analytics queries. But Spark on its own offers no ACID guarantees, so data loss or corruption from a failed write was a serious concern. There are workarounds for data recovery, but a built-in feature is generally preferable to a custom solution.

Image: https://techcrunch.com/2019/10/15/databricks-brings-its-delta-lake-open-source-project-to-the-linux-foundation/

To start with, you can think of Delta Lake as using in-memory Spark to process data, but with ACID compliance.

At the next level of understanding, Delta Lake was developed by Databricks to address the lack of ACID guarantees in Spark. It is an open-source storage layer that keeps data in Apache Parquet format and sits on top of your existing data lake storage, where Spark reads and writes it. It borrows the database concept of a transaction log: the Delta Log, kept in JSON format, records every change to a table and is what provides the ACID capability. It also has the concepts of Ingestion Tables, Refined Tables and Feature/Agg Data Stores (often called bronze, silver and gold tables).
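To make the Delta Log idea concrete, here is a minimal sketch in plain Python. It is not Delta Lake's actual implementation: it assumes a simplified log in which each commit is one numbered JSON file holding `add` and `remove` actions on data files. The helper names `commit` and `snapshot` and the Parquet file names are hypothetical, chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def commit(log_dir: Path, actions: list) -> int:
    """Write one commit as a new zero-padded JSON file and return its version."""
    version = len(list(log_dir.glob("*.json")))
    target = log_dir / f"{version:020d}.json"
    # Mode "x" fails if the file already exists, so two writers cannot both
    # claim the same version (a simplified stand-in for an atomic commit).
    with open(target, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return version


def snapshot(log_dir: Path, as_of: Optional[int] = None) -> set:
    """Replay commits 0..as_of and return the set of live data files."""
    live = set()
    for path in sorted(log_dir.glob("*.json")):
        if as_of is not None and int(path.stem) > as_of:
            break
        for line in path.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Toy table: three commits, the last one replaces a file in a single step.
log_dir = Path(tempfile.mkdtemp()) / "_delta_log"
log_dir.mkdir()
v0 = commit(log_dir, [{"add": {"path": "part-0000.parquet"}}])
v1 = commit(log_dir, [{"add": {"path": "part-0001.parquet"}}])
v2 = commit(log_dir, [{"remove": {"path": "part-0000.parquet"}},
                      {"add": {"path": "part-0002.parquet"}}])

latest = snapshot(log_dir)          # current state of the table
first = snapshot(log_dir, as_of=0)  # the table as it was at version 0
```

Because readers only ever see whole commits, a rewrite like the third commit appears all at once or not at all, and replaying the log up to an older version gives the earlier state of the table.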

Delta Lake key points:

● Supports ACID in Spark: Brings ACID-compliant transactions to Spark workloads.

● Metadata Handling: Manages table metadata, which assists with data lineage.

● Real-Time and Batch Processing Unification: Processes both real-time (streaming) and periodic (batch) datasets.

● Schema Management: Compares the structure of incoming data with the table structure before loading, and can also change the table structure when required.

● Enables Time Travel: Keeps the history of changes so that you can query or roll back to an earlier state using a timestamp or version number.

● Enables UPSERT and DELETE: Supports insert, update and delete operations.

● Delta Log [Transaction Log]: In contrast with a plain Data Lake, you can now pull CDC (change data capture) data from Delta Lake as well.
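The schema check, upsert and time-travel points above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: the table is a list of dictionaries held in memory, and the names `check_schema`, `upsert` and `history` are hypothetical and not part of any Delta Lake API. It shows the behaviour described, not how Delta Lake implements it.

```python
def check_schema(rows, schema):
    """Reject the batch if any row's columns or types differ from the table schema."""
    for row in rows:
        if set(row) != set(schema):
            raise ValueError(f"column mismatch: {sorted(row)}")
        for column, expected_type in schema.items():
            if not isinstance(row[column], expected_type):
                raise ValueError(f"bad type for column {column!r}")


def upsert(table, updates, key):
    """Return a new table version: matching keys are updated, new keys inserted."""
    merged = {row[key]: row for row in table}
    for row in updates:
        merged[row[key]] = row
    return list(merged.values())


schema = {"id": int, "city": str}
history = [[{"id": 1, "city": "Lahore"}, {"id": 2, "city": "Dubai"}]]  # version 0

# A valid batch: id 2 is updated, id 3 is inserted, and a new version is kept.
batch = [{"id": 2, "city": "Abu Dhabi"}, {"id": 3, "city": "Karachi"}]
check_schema(batch, schema)
history.append(upsert(history[-1], batch, key="id"))  # version 1

# A batch whose "id" is a string is rejected before anything is written.
try:
    check_schema([{"id": "4", "city": "Doha"}], schema)
    rejected = False
except ValueError:
    rejected = True

current = history[-1]   # latest version of the table
as_of_v0 = history[0]   # "time travel" back to version 0
```

Keeping every version in `history` is what makes the rollback trivial here. Delta Lake gets the same effect by recording changes in its transaction log instead of copying the whole table.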

Cheers.
