Data Lakehouse 101: The Who, What and Why of Data Lakehouses
Alex Merced
Co-Author of “Apache Iceberg: The Definitive Guide” | Senior Tech Evangelist at Dremio | LinkedIn Learning Instructor | Tech Content Creator
The scale of data grows every day, with storage now reaching petabyte and exabyte levels and data being used in increasingly diverse ways. At this scale, the old paradigm of using data lakes to store both structured and unstructured data, then moving portions of that structured data into data warehouses for reporting, analytics, dashboards, and more, becomes costly and impractical. The friction arises from storing multiple copies of data for each system used, keeping those copies in sync and consistent, and delivering data at the high speeds modern workloads demand. Addressing these challenges is where a new architecture, the data lakehouse, comes into play.
WHAT is a Data Lakehouse?
A data lakehouse aims to bring the performance and ease of use of a data warehouse to the data already stored in your data lake. It establishes your data lake (a storage layer for storing data as files) as the source of truth, with the goal of keeping the bulk of your data on the lake. This is made possible through several key technologies:
1. File formats such as Apache Parquet, which store data in compressed, analytics-optimized files
2. Table formats such as Apache Iceberg, which add database-like tables, with transactions and schema evolution, on top of those files
3. Metadata catalogs such as Nessie and Polaris, which track your tables so any engine can discover and query them
By using a storage layer like Hadoop or object storage and leveraging the technologies above, you can construct a data lakehouse: in effect, a data warehouse assembled from open, modular components.
At this point, you can use various tools like Dremio, Snowflake, Apache Spark, Apache Flink, and more to run workloads on your data lakehouse without duplicating your data across each platform you use.
+----------------------------+
| Data Lakehouse |
+----------------------------+
|
+------------------------------------+
| Storage Layer |
| (Hadoop, Object Storage, etc.) |
+------------------------------------+
|
+------------------------------------+
| File Formats |
| (Apache Parquet, etc.) |
+------------------------------------+
|
+------------------------------------+
| Table Formats |
| (Apache Iceberg, etc.) |
+------------------------------------+
|
+------------------------------------+
| Metadata Catalogs |
| (Nessie, Polaris, etc.) |
+------------------------------------+
|
+------------------------------------+
| Data Processing |
| (Dremio, Snowflake, Apache Spark, |
| Apache Flink, etc.) |
+------------------------------------+
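To make the layering concrete, here is a minimal, hypothetical sketch of the idea behind the table-format layer: a small piece of metadata that tracks which raw data files make up a table. This is deliberately simplified and is not the actual Apache Iceberg metadata schema (real formats use manifests, schema evolution history, and more); the paths and names are invented for illustration.

```python
import json

# Hypothetical, simplified table-format metadata. A real format like
# Apache Iceberg uses a much richer schema; this only illustrates the
# concept of "metadata over files" that makes the lakehouse possible.
snapshot = {
    "table": "sales",
    "snapshot_id": 1,
    "schema": {"order_id": "long", "amount": "double"},
    "data_files": [
        {"path": "s3://lake/sales/part-0000.parquet", "rows": 1_000_000},
        {"path": "s3://lake/sales/part-0001.parquet", "rows": 950_000},
    ],
}

# Any engine that can read this metadata sees the same table state,
# which is why Dremio, Spark, and Flink can share one copy of the data.
metadata_json = json.dumps(snapshot, indent=2)
restored = json.loads(metadata_json)
total_rows = sum(f["rows"] for f in restored["data_files"])
print(total_rows)  # 1950000
```

Because the metadata, not the engine, defines the table, adding a new tool means pointing it at the same catalog rather than copying the data.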
WHY a Data Lakehouse?
Transitioning to a data lakehouse architecture offers significant advantages, particularly by reducing the need for multiple copies of your data. Traditional data architectures often require creating separate copies of data for different systems, leading to increased storage costs and complexities in data management. A data lakehouse centralizes your data storage, allowing various tools and applications to access the same data without duplication. This not only simplifies data governance and consistency but also reduces the storage overhead, resulting in substantial cost savings.
Moreover, a data lakehouse facilitates seamless toolset migration and concurrent use of multiple tools without incurring high migration costs. By leveraging open standards and interoperable technologies like Apache Iceberg and open-source catalogs, you can easily switch between different data processing and analytics tools such as Dremio, Snowflake, Apache Spark, and Apache Flink without needing to move or transform your data. This flexibility reduces the total cost of ownership, as you avoid the egress fees and compute expenses associated with transferring data between platforms. Consequently, you can achieve lower overall compute, storage, and egress costs, while simultaneously benefiting from the strengths of various tools operating on the same unified dataset.
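A back-of-the-envelope calculation makes the storage savings concrete. The prices, volumes, and copy counts below are entirely hypothetical placeholders; substitute your own figures.

```python
# Hypothetical inputs -- not real vendor pricing.
tb_of_data = 100
storage_cost_per_tb_month = 23.0   # assumed object-storage rate
copies_traditional = 3             # e.g. lake + warehouse + downstream mart
copies_lakehouse = 1               # single copy on the lake

traditional = tb_of_data * copies_traditional * storage_cost_per_tb_month
lakehouse = tb_of_data * copies_lakehouse * storage_cost_per_tb_month
savings = traditional - lakehouse
print(f"monthly storage: ${traditional:.2f} vs ${lakehouse:.2f}, "
      f"saving ${savings:.2f}")
```

Note this only counts storage; avoided egress fees and the compute spent keeping copies in sync typically add to the difference.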
HOW to Migrate to a Data Lakehouse
Migrating from existing systems to a data lakehouse is a process where Dremio, the data lakehouse platform, truly excels thanks to its data virtualization features. Dremio provides an easy-to-use interface across all your data, wherever it resides. If Dremio supports your legacy and new data systems, you can follow this migration pattern to ensure a smooth transition.
Step 1: Apply Dremio Over Your Legacy System
Begin by implementing Dremio on top of your existing legacy system. This initial step offers immediate ease-of-use and performance improvements, allowing your teams to familiarize themselves with workflows that will persist throughout the migration process. This approach ensures minimal disruption, as teams can continue their operations seamlessly.
Step 2: Connect Both Old and New Data Systems to Dremio
Next, connect both your legacy and new data systems to Dremio. Start migrating data to your data lakehouse while maintaining minimal disruption to your end users, who will continue using Dremio as their unified interface. This dual connection phase enables a smooth transition by ensuring that data is accessible and manageable from a single point, regardless of its location.
Step 3: Retire Old Systems After Data Migration
Once the data migration is complete, you can retire your old systems. Thanks to Dremio’s unified interface, this step involves no major disruptions. Users will continue to access data seamlessly through Dremio, without adapting to new systems or interfaces. This continuity ensures that your operations remain efficient and uninterrupted.
Dremio’s power lies in providing a central, unified interface across all your data lakes, data warehouses, lakehouse catalogs, and databases. This means your end users don’t have to worry about where the data lives, allowing them to focus on deriving insights and driving value from the data.
+-------------------------------------------------------+
| Migration to a Data Lakehouse |
+-------------------------------------------------------+
| |
| Step 1: Apply Dremio Over Legacy System |
| +---------------------------------------------+ |
| | Legacy System | |
| |---------------------------------------------| |
| | | |
| | +-------------+ +-------------+ | |
| | | Data Store | | Data Store | | |
| | +-------------+ +-------------+ | |
| +---------------------------------------------+ |
| | |
| | |
| v |
| +-------------+ |
| | Dremio | |
| +-------------+ |
| |
+-------------------------------------------------------+
| |
| Step 2: Connect Both Old and New Systems to Dremio |
| +---------------------------------------------+ |
| | Legacy System | |
| |---------------------------------------------| |
| | | |
| | +-------------+ +-------------+ | |
| | | Data Store | | Data Store | | |
| | +-------------+ +-------------+ | |
| +---------------------------------------------+ |
| | |
| v |
| +-------------+ |
| | Dremio | |
| +-------------+ |
| | |
| v |
| +---------------------------------------------+ |
| | New Data Lakehouse | |
| |---------------------------------------------| |
| | | |
| | +-------------+ +-------------+ | |
| | | Data Store | | Data Store | | |
| | +-------------+ +-------------+ | |
| +---------------------------------------------+ |
| |
+-------------------------------------------------------+
| |
| Step 3: Retire Old Systems After Data Migration |
| +---------------------------------------------+ |
| | New Data Lakehouse | |
| |---------------------------------------------| |
| | | |
| | +-------------+ +-------------+ | |
| | | Data Store | | Data Store | | |
| | +-------------+ +-------------+ | |
| +---------------------------------------------+ |
| | |
| v |
| +-------------+ |
| | Dremio | |
| +-------------+ |
| |
+-------------------------------------------------------+
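The three-step pattern above hinges on one idea: end users query a stable interface while the physical source changes underneath. The toy sketch below models that idea in plain Python; it is not Dremio's actual API, and the dataset and source names are invented.

```python
# Toy model of a unified query layer: dataset names stay stable while
# the physical backend is swapped out during migration. Not Dremio's API.
class UnifiedCatalog:
    def __init__(self):
        self._sources = {}

    def register(self, dataset, source):
        """Point a logical dataset name at a physical source."""
        self._sources[dataset] = source

    def query(self, dataset):
        # End users only ever reference the dataset name.
        return f"SELECT * FROM {self._sources[dataset]}.{dataset}"

catalog = UnifiedCatalog()

# Step 1: the unified layer sits over the legacy system.
catalog.register("orders", "legacy_warehouse")
before = catalog.query("orders")

# Steps 2-3: the dataset moves to the lakehouse and the legacy system
# is retired; the user-facing query pattern does not change.
catalog.register("orders", "lakehouse")
after = catalog.query("orders")

print(before)  # SELECT * FROM legacy_warehouse.orders
print(after)   # SELECT * FROM lakehouse.orders
```

Because users depend only on the logical name, retiring the legacy system in Step 3 is a re-registration, not a rewrite of every downstream query.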
WHEN Should You Go for a Data Lakehouse?
Moving to a data lakehouse or staying with your existing systems depends on several critical factors. Here are key parameters to consider:
1. Data Volume and Growth
2. Data Complexity
3. Performance Needs
4. Cost Considerations
5. Tool Compatibility
6. Data Governance and Compliance
7. Business Objectives
By carefully evaluating these parameters, you can decide whether to transition to a data lakehouse or continue with your existing systems. Each organization’s needs are unique, so it’s essential to weigh these factors in the context of your specific requirements and objectives.
WHERE Can I Learn More About Data Lakehouses?
Below are several tutorials you can use to get hands-on with a data lakehouse on your laptop, see the architecture in action, and then apply it to your own use case.