Data Lake vs Data Warehouse Architecture [5 out of 10]
Mahmoud Yassin
Senior Data Manager | Customer Data, Insights and AI | Helping Booking.com to modernize their data architecture on top of public cloud infrastructure
This is a?series?of articles to talk about the importance of having a solid data architecture in your business. The series will include the below articles:
1- Introduction to Data Architecture
2- OLAP vs OLTP
3- Data Warehouse Architecture deep dive
4- Data Lake Architecture deep dive
5- Data Lake vs Data Warehouse Architecture [current article]
6- Cloud computing effect on Data Architecture
7- Rise of Data Mesh Architecture
8- Data Mesh Vs the rest
9- Rise of DeltaLake Architecture
10- Which Data Architecture shall I choose?
A?data warehouse?stores current and historical data from one or more systems in a predefined and fixed schema, which allows business analysts and data scientists to easily analyse the data.
A?data lake?stores current and historical data from one or more systems in its raw form, which allows business analysts and data scientists to easily analyse the data.
Common patters between data lake and data warehouse:
t's important to note that while both data warehouses and data lakes can coexist in an organisations data pipeline, they serve different purposes and are used for different tasks.
Both the data architectures have a lot in common as both of them expecting that the data will be stored centrally in one place.
Data warehouses are optimised for fast querying and analysis of structured data, while data lakes are designed to store vast amounts of raw, unstructured data at low cost and allow for more flexible and cost-effective data processing.
Companies can use both solutions to meet their specific data needs, depending on the requirements of the specific use case.
Main differences between data lake and data warehouse:
1- Purpose
The data stored in a warehouse is processed for a specific application, such as log management, sales reporting, or security analysis. On the other hand, raw data stored in a data lake is kept without a defined purpose, but is preserved in case it becomes useful in the future.
2- Schema on read vs schema on write:
Schema-on-write is the central concept behind data warehouse architecture. It involves creating a schema or structure for the data before storing it in the database. This requires the use of Structured Query Language (SQL) and leads to a highly structured Relational Database Management System (RDBMS). One of the most time-consuming tasks in an RDBMS is performing Extract, Transform, Load (ETL) operations.
In contrast, schema-on-read is the central concept behind data lake architecture. It means creating a schema only when the data is being read, allowing unstructured data to be stored in the database. This makes it easier to incorporate new data sources quickly, as it is not necessary to define the schema beforehand.
The shift towards schema-on-read is driven by the growing amount of unstructured data and the overhead of ETL in an RDBMS. Analysts may not know the insights they can gain from new data sources, making the process of bringing in new data sources time-consuming.
3- Data preparation:
In data warehouse architecture, the data is cleaned, organised, processed, and transformed before being stored in the final layer of the warehouse. In contrast, data transformation is not necessary in data lake architecture, as the data is stored in its original form. The data will only be transformed when it is being read from the curated or consumption layers.
4- Supported data types:
Data warehouses primarily support structured data, as the data is expected to be stored in a structured manner. This creates difficulties when dealing with semi-structured or unstructured data.
领英推荐
In contrast, data lakes allow the storage of data in its original format, regardless of the data type, making it easier to store both structured and unstructured data.
5- Users
Another difference between data lakes and warehouses is the type of user. Data warehouses can often be used by data analysts or business analysts with an understanding of the processed data.
Data lakes, on the other hand, are more complex due to the large amount of unstructured data, and require the expertise of a data scientist or data engineer to interpret and structure the data.
6- Data size & security
Data lakes can store massive amounts of information, with each unit holding 1,000 terabytes. Due to their large size and the fact that they store all types of data, they are less secure than smaller, structured data warehouses.
Furthermore, data warehouse technology is well established and its security measures are mature, while big data security is still evolving. However, it is expected that security measures for data lakes will improve over time.
7- Cost:
The cost of storage in a data lake is usually less expensive compared to a data warehouse. In addition, data lakes are easier to manage and result in lower operational costs.
On the other hand, data warehouses tend to be more expensive and require more time to manage, leading to higher operational costs.
8- Data storage:
A data lake contains all the raw and unstructured data of an organisation and can store it indefinitely for immediate or future use.
A data warehouse, on the other hand, holds structured data that has been processed and cleaned, ready for analysis based on predefined business requirements..
9- Processing time:
Data lakes allow users to access data before it has been transformed, cleaned, and structured, enabling them to reach their results more quickly compared to traditional data warehouses.
Data warehouses, on the other hand, provide insights into predefined questions for predefined data types, and making changes to the data warehouse takes more time.
10- Data time travel:
Data lakes can store all data, including data that is currently in use as well as data that might be useful in the future. The data is kept indefinitely, allowing for historical analysis.
In contrast, the development process of a data warehouse involves significant time spent on analysing various data sources. Once the data is ingested into the warehouse, it is not possible to recreate it in the event of an error, as the original data is lost during the ETL process.
Cloud computing has indeed transformed the way data is stored and processed, offering scalability, accessibility and cost-effectiveness to organisations. I look forward to write about this on my next article on the role of cloud computing in data architecture. I am sure it will be informative and engaging
Feel free to comment and let's engage to exchange opinions and benefit from each-others experiences.?
Join a growing community of 1500+ Data Enthusiasts by subscribing to my ????????????????????:?Data Architecture History
Data Mesh Radio Host - Helping People Understand and Implement Data Mesh Since 2020 ??
2 年Mahmoud Yassin have you gotten in touch with Melissa Logan and team yet about the Data Mesh Learning blog and stuff?
Account executive @ The Data Intelligence Company
2 年Alexandre t'Kint Stijn (Stan) Christiaens
Lead Architect for Data & AI at Microsoft
2 年Mahmoud, next article, combine both!