Shorticle 982 – Schema evolution, time travel and hidden partitioning in data lakes
Traditional data warehouse services built on Apache Hadoop are not user-friendly for querying and data analysis, so a layer such as the Apache Hive framework is conventionally placed on top, exposing a SQL-like query language (HQL) that handles datasets in the Hadoop datastore behind the scenes. In recent times this approach has been enhanced further to provide massive-scale data warehouse and data lake services.
More recently, modern table formats such as Apache Iceberg, Delta Lake and Apache Hudi have been introduced for data lakes. They primarily bring ACID properties to big data services on the Hadoop datastore, which means they support the following (a short sketch follows the list):
Atomicity, where a partially failed transaction does not leave incomplete data behind: the transaction is fully reverted, so corrupted partial data never appears in the datastore.
Consistency, where every committed transaction moves the datastore from one valid state to another, so the results of each operation are reliably visible to subsequent operations.
Isolation, where one user's session against the datastore cannot affect another user's in-flight transactions, ensuring data protection and dependable transaction commits.
Durability, where committed transactions are fully persisted to the physical datastore, so completed commits survive failures and data loss is avoided.
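To make this concrete, here is a minimal sketch of transactional behavior, using Delta Lake with PySpark purely as an illustration; the table path and data are invented for the example, and the delta-spark package is assumed to be installed.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package (and its jars) are available.
spark = (
    SparkSession.builder.appName("acid-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Commit an initial write: the snapshot becomes visible only once it commits.
spark.range(0, 100).write.format("delta").save("/tmp/acid_demo")

# An append whose schema does not match is rejected before commit, so no
# partial rows ever become visible (atomicity and consistency).
try:
    bad = spark.createDataFrame([("bad", "schema")], ["a", "b"])
    bad.write.format("delta").mode("append").save("/tmp/acid_demo")
except Exception as err:
    print("write never committed:", err)

# Readers still see only the last committed snapshot (isolation, durability).
print(spark.read.format("delta").load("/tmp/acid_demo").count())  # 100
```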
Apache Iceberg is a high-performance table format for managing huge data transactions with analytical capability. It works with the most popular compute engines, such as Apache Spark, Trino, Apache Flink and Presto, and can also integrate with Hive for data management services. It supports expressive SQL to merge new data and update existing data for faster transaction processing.
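As a minimal sketch of that expressive SQL, the PySpark snippet below runs an Iceberg MERGE against a local Hadoop-style catalog; the catalog name, table and values are illustrative, and the Iceberg Spark runtime is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar is on the classpath.
spark = (
    SparkSession.builder.appName("iceberg-merge")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg_warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.accounts (id INT, balance DOUBLE)
    USING iceberg
""")
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW updates AS
    SELECT * FROM VALUES (1, 50.0D), (2, 75.0D) AS t(id, balance)
""")

# MERGE upserts in one atomic commit: matching rows are updated in place,
# new keys are inserted.
spark.sql("""
    MERGE INTO local.db.accounts AS target
    USING updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET target.balance = source.balance
    WHEN NOT MATCHED THEN INSERT *
""")
spark.sql("SELECT * FROM local.db.accounts ORDER BY id").show()
```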
Like Apache Iceberg, Delta Lake supports full schema evolution, so columns can be moved, rearranged and renamed easily. Hidden partitioning, an Iceberg feature that Delta Lake approximates with generated columns, helps retrieve focused data values quickly, so queries run faster without extra partition filters.
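The sketch below shows hidden partitioning and schema evolution with Iceberg SQL, where the partition layout is expressed as a transform on a regular column; it also includes a time-travel read, one of the title topics. It reuses the Iceberg-configured Spark session from the MERGE sketch, and the table, column names and timestamps are illustrative (the time-travel timestamp must fall after an existing snapshot).

```python
# Assumes the Iceberg-configured Spark session from the MERGE sketch above.

# Hidden partitioning: the table is partitioned by a transform of event_ts,
# not by an extra user-visible partition column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id BIGINT, event_ts TIMESTAMP, payload STRING
    ) USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Schema evolution: rename a column in place; existing data files are not rewritten.
spark.sql("ALTER TABLE local.db.events RENAME COLUMN payload TO body")

# The query filters on the raw timestamp only; Iceberg maps the predicate onto
# day partitions and prunes files, with no extra partition filter required.
spark.sql("""
    SELECT count(*) FROM local.db.events
    WHERE event_ts >= TIMESTAMP '2024-01-01 00:00:00'
""").show()

# Time travel: read the table as of an earlier point in time (Spark 3.3+).
spark.sql("""
    SELECT * FROM local.db.events TIMESTAMP AS OF '2024-06-01 00:00:00'
""").show()
```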
Apache Hudi is popular for its integration with Apache Spark and Hive, and enables data mutation (upserts and deletes) with consistent performance when handling large volumes of data. It can scan data records incrementally and respond to queries faster.
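A minimal sketch of a Hudi upsert and incremental scan from PySpark follows; the option keys come from Hudi's Spark datasource documentation, while the path, table name, record key and instant time are illustrative, and the hudi-spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Hudi requires Kryo serialization; the hudi-spark bundle is assumed.
spark = (
    SparkSession.builder.appName("hudi-upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

updates = spark.createDataFrame(
    [(1, "2024-01-02 00:00:00", 9.5)], ["trip_id", "ts", "fare"]
)

# Upsert: rows whose trip_id already exists are updated, new keys are inserted.
updates.write.format("hudi").options(**hudi_options) \
    .mode("append").save("/tmp/hudi/trips")

# Incremental query: read only records committed after a given instant,
# instead of rescanning the whole table.
incr = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/tmp/hudi/trips")
)
incr.show()
```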
Apache Iceberg, Delta Lake and Apache Hudi each simplify data management and the downstream processing cycle, including reporting and dashboarding. They can also be used together, as a combined framework, to enhance a data warehouse or data lake solution.