Data Lake vs Data Warehouse Architecture [5 out of 10]
Data Lake vs Data Warehouse Architecture

Data Lake vs Data Warehouse Architecture [5 out of 10]

This is a?series?of articles to talk about the importance of having a solid data architecture in your business. The series will include the below articles:

1- Introduction to Data Architecture 

2- OLAP vs OLTP 

3- Data Warehouse Architecture deep dive 

4- Data Lake Architecture deep dive 

5- Data Lake vs Data Warehouse Architecture [current article]

6- Cloud computing effect on Data Architecture

7- Rise of Data Mesh Architecture

8- Data Mesh Vs the rest

9- Rise of DeltaLake Architecture

10- Which Data Architecture shall I choose?        

A?data warehouse?stores current and historical data from one or more systems in a predefined and fixed schema, which allows business analysts and data scientists to easily analyse the data.

A?data lake?stores current and historical data from one or more systems in its raw form, which allows business analysts and data scientists to easily analyse the data.


Common patters between data lake and data warehouse:

t's important to note that while both data warehouses and data lakes can coexist in an organisations data pipeline, they serve different purposes and are used for different tasks.

Both the data architectures have a lot in common as both of them expecting that the data will be stored centrally in one place.

No alt text provided for this image

Data warehouses are optimised for fast querying and analysis of structured data, while data lakes are designed to store vast amounts of raw, unstructured data at low cost and allow for more flexible and cost-effective data processing.

Companies can use both solutions to meet their specific data needs, depending on the requirements of the specific use case.


Main differences between data lake and data warehouse:

No alt text provided for this image
Data Lake vs Data Warehouse

1- Purpose

The data stored in a warehouse is processed for a specific application, such as log management, sales reporting, or security analysis. On the other hand, raw data stored in a data lake is kept without a defined purpose, but is preserved in case it becomes useful in the future.

2- Schema on read vs schema on write:

Schema-on-write is the central concept behind data warehouse architecture. It involves creating a schema or structure for the data before storing it in the database. This requires the use of Structured Query Language (SQL) and leads to a highly structured Relational Database Management System (RDBMS). One of the most time-consuming tasks in an RDBMS is performing Extract, Transform, Load (ETL) operations.

In contrast, schema-on-read is the central concept behind data lake architecture. It means creating a schema only when the data is being read, allowing unstructured data to be stored in the database. This makes it easier to incorporate new data sources quickly, as it is not necessary to define the schema beforehand.

The shift towards schema-on-read is driven by the growing amount of unstructured data and the overhead of ETL in an RDBMS. Analysts may not know the insights they can gain from new data sources, making the process of bringing in new data sources time-consuming.

No alt text provided for this image
schema on read VS schema on write

3- Data preparation:

In data warehouse architecture, the data is cleaned, organised, processed, and transformed before being stored in the final layer of the warehouse. In contrast, data transformation is not necessary in data lake architecture, as the data is stored in its original form. The data will only be transformed when it is being read from the curated or consumption layers.

4- Supported data types:

Data warehouses primarily support structured data, as the data is expected to be stored in a structured manner. This creates difficulties when dealing with semi-structured or unstructured data.

In contrast, data lakes allow the storage of data in its original format, regardless of the data type, making it easier to store both structured and unstructured data.

5- Users

Another difference between data lakes and warehouses is the type of user. Data warehouses can often be used by data analysts or business analysts with an understanding of the processed data.

Data lakes, on the other hand, are more complex due to the large amount of unstructured data, and require the expertise of a data scientist or data engineer to interpret and structure the data.

6- Data size & security

No alt text provided for this image

Data lakes can store massive amounts of information, with each unit holding 1,000 terabytes. Due to their large size and the fact that they store all types of data, they are less secure than smaller, structured data warehouses.

Furthermore, data warehouse technology is well established and its security measures are mature, while big data security is still evolving. However, it is expected that security measures for data lakes will improve over time.

7- Cost:

The cost of storage in a data lake is usually less expensive compared to a data warehouse. In addition, data lakes are easier to manage and result in lower operational costs.

On the other hand, data warehouses tend to be more expensive and require more time to manage, leading to higher operational costs.

8- Data storage:

A data lake contains all the raw and unstructured data of an organisation and can store it indefinitely for immediate or future use.

A data warehouse, on the other hand, holds structured data that has been processed and cleaned, ready for analysis based on predefined business requirements..

9- Processing time:

Data lakes allow users to access data before it has been transformed, cleaned, and structured, enabling them to reach their results more quickly compared to traditional data warehouses.

Data warehouses, on the other hand, provide insights into predefined questions for predefined data types, and making changes to the data warehouse takes more time.

10- Data time travel:

Data lakes can store all data, including data that is currently in use as well as data that might be useful in the future. The data is kept indefinitely, allowing for historical analysis.

In contrast, the development process of a data warehouse involves significant time spent on analysing various data sources. Once the data is ingested into the warehouse, it is not possible to recreate it in the event of an error, as the original data is lost during the ETL process.


Cloud computing has indeed transformed the way data is stored and processed, offering scalability, accessibility and cost-effectiveness to organisations. I look forward to write about this on my next article on the role of cloud computing in data architecture. I am sure it will be informative and engaging

Feel free to comment and let's engage to exchange opinions and benefit from each-others experiences.?

Join a growing community of 1500+ Data Enthusiasts by subscribing to my ????????????????????:?Data Architecture History

#data?#writing?#architecture?#cloudcomputing?#warehouse?#data?#dataarchitecture?#datawarehouse?#datalake?#datalakehouse?#datamesh?#datafabric?#dataanalytics?#ai?#bigdata?#knowledgesharing?#article

(on sabbatical) Scott Hirleman (back mid next year maybe but prob not)

Data Mesh Radio Host - Helping People Understand and Implement Data Mesh Since 2020 ??

2 年

Mahmoud Yassin have you gotten in touch with Melissa Logan and team yet about the Data Mesh Learning blog and stuff?

Pieter Pauwels

Account executive @ The Data Intelligence Company

2 年
Hicham Zmarrou, Ph.D

Lead Architect for Data & AI at Microsoft

2 年

Mahmoud, next article, combine both!

要查看或添加评论,请登录

Mahmoud Yassin的更多文章

社区洞察

其他会员也浏览了