Shorticle 982 – Schema evolution, time travel and hidden partitioning in data lakes
Traditional data warehouse services built on Apache Hadoop are not user-friendly for querying and data analysis, so a layer such as the Apache Hive framework is conventionally placed on top, exposing a SQL-like query language (HQL) that handles datasets in the Hadoop datastore behind the scenes. In recent times this approach has been enhanced further to provide massive-scale data warehouse and data lake services.
More recently, modern table formats such as Apache Iceberg, Delta Lake and Apache Hudi have been introduced for data lakes. They primarily bring ACID properties to big data services on the Hadoop datastore, which means they support the following (a short sketch follows the list):
Atomicity, where a partially failed transaction does not leave incomplete data behind: the transaction is fully reverted, so corrupted partial data never appears in the datastore.
Consistency, where every committed transaction moves the datastore from one valid state to another, so the results of each operation are reliably visible to subsequent operations.
Isolation, where one user's session against the datastore cannot affect another user's in-flight transactions, ensuring data protection and dependable transaction commits.
Durability, where committed transactions are fully persisted to the physical datastore, so completed commits survive failures and data loss is avoided.
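To make this concrete, here is a minimal sketch of transactional behavior, using Delta Lake with PySpark purely as an illustration; the table path and data are invented for the example, and the delta-spark package is assumed to be installed.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package (and its jars) are available.
spark = (
    SparkSession.builder.appName("acid-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Commit an initial write: the snapshot becomes visible only once it commits.
spark.range(0, 100).write.format("delta").save("/tmp/acid_demo")

# An append whose schema does not match is rejected before commit, so no
# partial rows ever become visible (atomicity and consistency).
try:
    bad = spark.createDataFrame([("bad", "schema")], ["a", "b"])
    bad.write.format("delta").mode("append").save("/tmp/acid_demo")
except Exception as err:
    print("write never committed:", err)

# Readers still see only the last committed snapshot (isolation, durability).
print(spark.read.format("delta").load("/tmp/acid_demo").count())  # 100
```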
Apache Iceberg is a high-performance table format for managing huge data transactions with analytical capability. It works with the most popular compute engines, such as Apache Spark, Trino, Apache Flink and Presto, and can also integrate with Hive for data management services. It supports expressive SQL to merge new data and update existing data for faster transaction processing.
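As a minimal sketch of that expressive SQL, the PySpark snippet below runs an Iceberg MERGE against a local Hadoop-style catalog; the catalog name, table and values are illustrative, and the Iceberg Spark runtime is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar is on the classpath.
spark = (
    SparkSession.builder.appName("iceberg-merge")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg_warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.accounts (id INT, balance DOUBLE)
    USING iceberg
""")
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW updates AS
    SELECT * FROM VALUES (1, 50.0D), (2, 75.0D) AS t(id, balance)
""")

# MERGE upserts in one atomic commit: matching rows are updated in place,
# new keys are inserted.
spark.sql("""
    MERGE INTO local.db.accounts AS target
    USING updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET target.balance = source.balance
    WHEN NOT MATCHED THEN INSERT *
""")
spark.sql("SELECT * FROM local.db.accounts ORDER BY id").show()
```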
Like Apache Iceberg, Delta Lake supports full schema evolution, so columns can be moved, rearranged and renamed easily. Hidden partitioning, an Iceberg feature that Delta Lake approximates with generated columns, helps retrieve focused data values quickly, so queries run faster without extra partition filters.
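The sketch below shows hidden partitioning and schema evolution with Iceberg SQL, where the partition layout is expressed as a transform on a regular column; it also includes a time-travel read, one of the title topics. It reuses the Iceberg-configured Spark session from the MERGE sketch, and the table, column names and timestamps are illustrative (the time-travel timestamp must fall after an existing snapshot).

```python
# Assumes the Iceberg-configured Spark session from the MERGE sketch above.

# Hidden partitioning: the table is partitioned by a transform of event_ts,
# not by an extra user-visible partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id BIGINT, event_ts TIMESTAMP, payload STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Schema evolution: rename a column in place; existing data files are not rewritten.
spark.sql("ALTER TABLE local.db.events RENAME COLUMN payload TO body")

# The query filters on the raw timestamp only; Iceberg maps the predicate onto
# day partitions and prunes files, with no extra partition filter required.
spark.sql("""
    SELECT count(*) FROM local.db.events
    WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
""").show()

# Time travel: read the table as of an earlier point in time (Spark 3.3+).
spark.sql("""
    SELECT * FROM local.db.events TIMESTAMP AS OF '2024-06-01 00:00:00'
""").show()
```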
Apache Hudi is popular for its integration with Apache Spark and Hive, and enables data mutation (upserts and deletes) with consistent performance when handling large volumes of data. It can scan data records incrementally and respond to queries faster.
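A minimal sketch of a Hudi upsert and incremental scan from PySpark follows; the option keys come from Hudi's Spark datasource documentation, while the path, table name, record key and instant time are illustrative, and the hudi-spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Hudi requires Kryo serialization; the hudi-spark bundle is assumed.
spark = (
    SparkSession.builder.appName("hudi-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

updates = spark.createDataFrame(
    [(1, "2024-01-02 00:00:00", 9.5)], ["trip_id", "ts", "fare"]
)

# Upsert: rows whose trip_id already exists are updated, new keys are inserted.
updates.write.format("hudi").options(**hudi_options) \
    .mode("append").save("/tmp/hudi/trips")

# Incremental query: read only records committed after a given instant,
# instead of rescanning the whole table.
incr = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/tmp/hudi/trips")
)
incr.show()
```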
Apache Iceberg, Delta Lake and Apache Hudi each simplify data management and the downstream processing cycle, including reporting and dashboarding. They can also be used together, as a combined framework, to enhance a data warehouse or data lake solution.