Lakehouse, make Big Data great again

Maybe the best title for this article would be "The Data Lake is Dead; Long Live the Data Lake!", but Martin Willcox already owns it. Indeed, the Data Lake's predecessor, the Data Warehouse, was killed and buried years ago to give life to a new approach to managing data for analytics. And the same is happening now with the Data Lake. This is not a prophecy or a futurist speech; just google "Lakehouse" and see the recent headlines for yourself. As you may have observed in your career, every new technology solves the previous one's issues, but at some point it hits its own limitations, opening the door for the next thing to come along. This is the fuel for innovation: finding meaningful problems and solving them by creating new solutions. It is an endless cycle.

Data Management is not exempt from this rule. The wave started with the Traditional Data Warehouse, the first approach to making data analytics an enterprise function and organizing the mess. Years later, Big Data required a new way to manage data, and the Data Lake approach gained wide adoption. Now we are seeing the introduction of the Lakehouse as an innovative way to manage data at scale. These cycles are getting ever shorter, which means the speed of innovation is nothing short of astonishing.

Before digging into the Lakehouse approach and navigating its differentiators, it is essential to time travel—one of the most exciting features Lakehouse provides—from the early days of data management to today and understand the type of data problems each solution solved, as well as its limitations. Lakehouse is not just another trendy buzzword.

The Data Lake legacy

(Image source: https://www.qlik.com/us/data-lake/data-lakehouse)

Data Warehouse

The Data Warehouse isn’t dead, but the Traditional Data Warehouse was finally buried. Legacy technologies such as Oracle and Teradata, among others, became very popular by helping companies centralize in one place mostly structured data from a myriad of enterprise systems, including ERP, CRM, and financial ledgers. Back in the '80s, this was the era of the mainframe, when software and hardware were tightly coupled. Kimball proposed a dimensional approach to organizing data in Data Warehouse solutions, while Inmon developed a normalized approach to optimize storage space and reduce data redundancy. For the first time, companies could establish a single source of truth for decision-making and disseminate that information and knowledge through Business Intelligence tools.

This approach worked well for almost 30 years. At the beginning of the 2000s, the internet became very popular, enabling all sorts of new businesses and, consequently, generating new types of high-volume data. This added pressure to Data Warehouses and brought their limitations to light: inflexibility in managing unstructured data, ETL pipelines that were complex to build and maintain, and high TCO. A second factor was the need to evolve analytics beyond Business Intelligence and descriptive analytics to more advanced techniques like Machine Learning and predictive analytics. Even though legacy Data Warehouse implementations are rare nowadays, their fundamentals for organizing and modeling data for decision-making remain in use.

Data Lake

Data deluge: the world is generating more data than ever before. Among other Vs, volume, variety, and velocity changed the rules and opened a new era in the data analytics industry. Traditional Data Warehouses simply could not keep up with the need to store more data in different formats and make it easily accessible to end-users. New frameworks were introduced to address those requirements. Apache Hadoop, a popular open-source project backed by Silicon Valley companies like Cloudera and Hortonworks, provided a platform stack to ingest, store, and analyze petabytes of data. Time travel to 2005: commodity hardware was the significant paradigm shift proposed by Big Data solutions, and the economics of data changed. Moving away from proprietary software and mainframes, storing and analyzing more data eventually became cheaper. HDFS was the preferred storage layer for unstructured and semi-structured data. The open-source community lived its golden years building new frameworks, like Spark, Hive, Kafka, and many others, to solve parts of the data pipeline. The industry evolved from ETL to ELT jobs, enhancing data usage.

(Image source: https://dataedo.com/)

But again, any new technology will suffer Gartner's hype-cycle fate. Quickly, some Data Lakes became true Data Swamps, missing enterprise data management capabilities at scale. Data consumers started to doubt the quality and integrity of the data feeding business-critical analytics, and the idea of a single source of truth gave way to data inconsistency concerns. Some companies implemented a hybrid approach to data management: run massive data ingestion and processing on the Data Lake, gaining flexibility for unstructured data and better TCO, but store the gold layer in the Data Warehouse, gaining integrity and access speed. Data Lakes were also designed for batch processing, but real-time data became mandatory for new use cases, uncovering even more limitations. Object storage (e.g., S3, ADLS, Ozone, and MinIO) gradually replaced HDFS and other legacy storage technologies, providing a managed experience, better TCO, and cloud-native integration.

Lakehouse

The need for a new data management approach was obvious, and Lakehouse immediately gained widespread adoption and community support. However, the challenge to overcome is complex: simplify data management in an era where companies are adopting more cloud-based infrastructure, data analytics spans from Business Intelligence to GenAI, and high volumes of both real-time and unstructured data remain relevant. The perfect storm to bring a new solution to light. Lakehouse is not reinventing the wheel: it combines the best of the Data Warehouse and the Data Lake in one data management solution. It offers the features and capabilities for structured and unstructured data to coexist harmoniously in one single place: a multi-format data layer. Pipelines are also simplified: processing jobs at scale can run on the same layer where data is consumed by analytics and end-users.

(Image source: https://dipankar-tnt.medium.com/onetable-interoperability-for-apache-hudi-iceberg-delta-lake-bb8b27dd288d)

At the time of this writing, three main open-source projects claim to enable Lakehouse solutions for the enterprise: Apache Hudi, Apache Iceberg, and Delta Lake share similar features, but they also have differences. Data catalogs are another critical component of the Lakehouse architecture, with choices ranging from the Hive Metastore to Nessie, Polaris, Unity Catalog, and others. The actual business data is compressed for performance purposes, and formats like Apache Parquet and Apache Avro give the flexibility to address different use cases and scenarios. On the vendor side, Dremio, Cloudera, Databricks, Snowflake, and the cloud service providers offer enterprise-ready Lakehouse solutions.
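To make this concrete, here is a minimal sketch of one of those table formats in action, using PySpark and Apache Iceberg's snapshot and SQL time-travel features (available in recent Spark versions). It assumes the Iceberg Spark runtime jar is on the classpath; the catalog, database, and table names are hypothetical.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("iceberg-time-travel-sketch")
        # Register an Iceberg catalog named "demo" backed by a local warehouse dir.
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "/tmp/warehouse")
        .getOrCreate()
    )

    # Two writes produce two snapshots of the same table.
    spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (id BIGINT, amount DOUBLE) USING iceberg")
    spark.sql("INSERT INTO demo.db.orders VALUES (1, 10.0)")
    spark.sql("INSERT INTO demo.db.orders VALUES (2, 20.0)")

    # Every commit is recorded in the table's snapshots metadata table.
    first = spark.sql(
        "SELECT snapshot_id FROM demo.db.orders.snapshots ORDER BY committed_at"
    ).first().snapshot_id

    # Time travel: query the table as it looked after the first insert.
    spark.sql(f"SELECT * FROM demo.db.orders VERSION AS OF {first}").show()

The same snapshot mechanism is what enables rollbacks and reproducible reads, the "time travel" capability mentioned earlier.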

What is Lakehouse about?

Beyond the problems it solves, Lakehouse is not just a solution; it's a game-changer in data management practice. It proposes fundamental shifts in how data is organized, processed, and served to use cases, offering companies a path to redefine their data strategy around simplicity, modernization, and scalability.

It’s about Analytics

Yes, no one disagrees that Lakehouse is an enabler for analytics. But today, the world is fascinated by GenAI and the whole breadth of new use cases it can enable. Gartner proposed its Data Maturity Framework years ago to help companies measure their analytics capabilities, assess where they are now, and plan the leap to the next phase. With a Lakehouse as part of the data architecture, serving use cases from descriptive analytics to GenAI becomes a smooth task, as all the structured and unstructured data, including its relevant metadata, is seamlessly accessible from one single place for any purpose. From a strategic point of view, Lakehouse sets the new foundation for data preparation, BI, ad-hoc analytics, ML, AI, and, eventually, any new sort of analytics in the future.

(Image source: https://www.dhirubhai.net/pulse/from-business-intelligence-generative-ai-view-alex-campos/)

The second aspect to bear in mind is the Medallion Architecture. The medallion is still considered a design pattern capable of handling most of the typical pipelines companies develop to make data valuable and consumable, from ingestion to analytics. It proposes three phases or stages: Bronze for raw, as-is data; Silver for cleaned, enriched data; and finally, Gold for data that responds to business needs. In the era of “Data Warehouse + Data Lake”, Big Data platforms like Apache Hadoop were positioned to serve both the Bronze and Silver layers due to their capability to manage high volumes of any data format and to scale quickly for processing frameworks like Spark and Hive. The Gold layer continued to be served from traditional Data Warehouses due to their ability to enforce schemas and data integrity and to provide low-latency data access. Lakehouse challenges this hybrid approach, as companies are now confident that Gold data can be served from the same place where the Bronze and Silver layers are stored, streamlining the data pipeline and reducing the time to make consumable data available.

(Image source: https://blog.bismart.com/en/medallion-architecture)
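As an illustration, a medallion pipeline on a Lakehouse can be just three writes to the same storage layer. The sketch below uses PySpark with hypothetical paths, column names, and the "demo" Iceberg catalog configured in the earlier example; it is a minimal outline, not a production pipeline.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # assumes the "demo" catalog is configured

    # Bronze: land the raw events as-is, nothing dropped or renamed.
    bronze = spark.read.json("/landing/events/")
    bronze.writeTo("demo.db.events_bronze").createOrReplace()

    # Silver: clean and conform: drop incomplete rows, fix types, dedupe.
    silver = (
        spark.table("demo.db.events_bronze")
        .where(F.col("event_id").isNotNull())
        .withColumn("event_ts", F.to_timestamp("event_ts"))
        .dropDuplicates(["event_id"])
    )
    silver.writeTo("demo.db.events_silver").createOrReplace()

    # Gold: a business-level aggregate, served from the same storage.
    gold = (
        spark.table("demo.db.events_silver")
        .groupBy(F.to_date("event_ts").alias("day"))
        .agg(F.count("*").alias("events"))
    )
    gold.writeTo("demo.db.events_gold").createOrReplace()

Note that no cross-system copy appears anywhere: Bronze, Silver, and Gold are simply three tables in the same Lakehouse storage.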

It’s about Scalability

Lakehouse is cloud-native. There is no need for self-managed solutions or heavy maintenance tasks. Cloud Service Providers play a pivotal role in providing reliable, secure, and, more importantly, scalable infrastructure on which to architect and run a Lakehouse. Cloud-native Object Storage services are robust solutions for managing and storing virtually unlimited data volumes.

Compute resources also improve in the cloud. Dynamic, container-based resource allocation assigns the right amount of compute to each workload. Scalability means peace of mind for meeting SLAs and ensuring service continuity—no more broken pipelines.
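As a hedged sketch of what that elasticity looks like in practice, these are real Spark dynamic-allocation settings; the values are illustrative, not recommendations.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("elastic-lakehouse-job")
        # Let the cluster manager (e.g., Kubernetes or YARN) grow and shrink
        # the executor pool with the workload instead of sizing it up front.
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "50")
        # Shuffle tracking lets idle executors be reclaimed safely on
        # Kubernetes, where there is no external shuffle service.
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .getOrCreate()
    )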

It’s about Interoperability

Traditional Data Warehouses were about proprietary solutions, Data Lakes were about openness, and now Lakehouses are about interoperability. Open source and open standards are the foundations of the Lakehouse landscape that avoid vendor lock-in. From open data formats (e.g., Parquet, Avro) to open data catalogs, any organization adopting Lakehouse has the flexibility to use its preferred processing layer, format, and metadata provider. Lakehouse solutions like Databricks, Snowflake, Cloudera, and Dremio will continue leveraging open components and standards to ensure interoperability, which translates into freedom of choice.

(Image source: https://dipankar-tnt.medium.com/onetable-interoperability-for-apache-hudi-iceberg-delta-lake-bb8b27dd288d)

Apache XTable (https://xtable.apache.org/) is a great example of interoperability in the Lakehouse landscape. It allows companies to use Apache Iceberg, Apache Hudi, and Delta Lake simultaneously, without any additional processing of the business data, by automatically rewriting the metadata and manifest files. With this “write once, query everywhere” approach, XTable makes data universally available regardless of the engine, facilitating omnidirectional interoperability among the various Lakehouse table formats.
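To give a feel for how lightweight this is, the sketch below follows the shape of XTable's documented dataset configuration at the time of writing; the bucket, table name, and jar version are hypothetical placeholders.

    # my_config.yaml - expose a Delta table's metadata as Iceberg and Hudi
    # without rewriting the underlying data files.
    sourceFormat: DELTA
    targetFormats:
      - ICEBERG
      - HUDI
    datasets:
      - tableBasePath: s3://my-bucket/warehouse/sales
        tableName: sales

    # Run the sync with the bundled utilities jar, e.g.:
    #   java -jar xtable-utilities-<version>-bundled.jar --datasetConfig my_config.yaml

Only metadata and manifests are generated for the target formats; the Parquet data files stay exactly where they are.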

It’s about Efficiency

Managing a Data Lake for all enterprise data plus a Data Warehouse for curated, aggregated data is a challenge, not only from the functional perspective but also from the TCO and maintenance point of view. Lakehouse simplifies and streamlines data pipelines. The Medallion Architecture powered by a Lakehouse avoids data duplication, as all layers live in the same place and require no additional movement or integration.

At the compute level, Apache Spark and Apache Flink deliver the power and scalability to process high volumes of data and the flexibility to work with any format, covering the Data Lake zone. On the consumption side, MPP engines like Apache Impala, Trino, and Snowflake can serve data concurrently while keeping compatibility with business intelligence tools. In practice, Business Intelligence and Artificial Intelligence workloads can run seamlessly on top of the same storage by leveraging the right tool for each task.

(Image source: https://aws.amazon.com/blogs/big-data/build-a-lake-house-architecture-on-aws/)
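Here is a small hedged sketch of that division of labor, reusing the hypothetical "demo" Iceberg catalog and table names from the earlier examples: Spark performs the heavy batch write, and an MPP engine such as Trino can serve the very same table to BI tools, with no copy in between.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # assumes the "demo" catalog is configured

    # Processing side: Spark writes the curated, business-ready table.
    (
        spark.table("demo.db.events_silver")
        .groupBy("country")
        .agg(F.count("*").alias("events"))
        .writeTo("demo.db.events_by_country")
        .createOrReplace()
    )

    # Consumption side: an MPP engine pointed at the same catalog reads the
    # same files. For example, from the Trino CLI (no data copy involved):
    #   SELECT country, events FROM iceberg.db.events_by_country ORDER BY events DESC;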

Data pipeline development and maintenance also improve with everything in one data store. Governance is streamlined, ensuring end-to-end consistency and integrity across structured and unstructured data. Last but not least, security is simplified, increasing trust in the data ecosystem, avoiding breaches, and decreasing risks.

It’s about Cloud

Lakehouse was born in the age of cloud computing and leverages its relevant capabilities to ensure interoperability, efficiency, and scalability to run any analytics workload.

Nowadays, the market is talking about a Lakehouse war. Ultimately, the only winners of this war will be the users and enterprises adopting Lakehouse.

Lakehouse Landscape

As of 2024, the Lakehouse landscape is composed of the following parts:

Format: Open source is predominant, and Iceberg, Hudi, and Delta Lake are leading the enterprise solutions. XTable is worth mentioning for cross-table interoperability.

Catalog: As Lakehouse is a truly decoupled architecture, catalogs play a crucial role in maintaining the current table state and ensuring smooth operability with engines and tools, making them an integral part of the system's functionality.

Data: In the Lakehouse landscape, the actual business data is compressed for performance purposes and, again, interoperability. Parquet, a popular columnar format, leads the race in addressing common use cases (see the sketch after this list).

Engines/Tools: The variety of tools and engines spans data ingestion, processing, integration, ad-hoc querying, and more. Most of these tools were also very popular in the Data Lake era, and many of them are ready to work with Lakehouse.

Vendors: Enterprise support and world-class software are required in most industries, and Data Lake incumbents and new players, alongside cloud service providers, are making their space in the Lakehouse era.
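A quick hedged sketch of why columnar Parquet fits that role, using PyArrow with made-up data: the format compresses well, and readers can fetch only the columns a query needs.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "order_id": [1, 2, 3],
        "amount": [10.0, 20.0, 15.5],
        "country": ["BR", "US", "US"],
    })

    # Columnar layout compresses well; the codec is chosen per use case.
    pq.write_table(table, "orders.parquet", compression="zstd")

    # Readers pull only the columns they need, which is key for analytical scans.
    amounts = pq.read_table("orders.parquet", columns=["amount"])
    print(amounts.to_pydict())  # {'amount': [10.0, 20.0, 15.5]}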

Lakehouse Landscape 2024

Notably, open source and interoperability are predominant in the Lakehouse landscape, avoiding vendor lock-in and giving companies the freedom to use their preferred tools.

