{Classical} Data Warehouse Architecture [3 out of 10]
Mahmoud Yassin
Senior Data Manager | Customer Data, Insights and AI | Helping Booking.com to modernize their data architecture on top of public cloud infrastructure
This is a series of articles to talk about the importance of having a solid data architecture in your business. The series will include the below articles:
1- Introduction to Data Architecture [link]
2- OLAP vs OLTP [link]
3- Data Warehouse Architecture deep dive [current article]
4- Data Lake Architecture deep dive
5- Data Lake vs Data Warehouse Architecture
6- Cloud computing effect on Data Architecture
7- Rise of Data Mesh Architecture
8- Data Mesh Vs the rest
9- Rise of DeltaLake Architecture
10- Which Data Architecture shall I choose?
What is Data Warehouse?
Data Warehouse (DW) or sometimes enterprise Data Warehouse (EDW) is a very solid data architecture designed for storing the operational data in a very special/modeled way to support business intelligence and reporting needs.
DWs acts as a central place which helps to keep everything in one place. Data Warehouse ensures that everything remains in one place and not in disparate sources. All current and historical data are stored in one place.
For example, a business stores data about its customer’s information, products, employees and their salaries, sales, and invoices. If we have a use case to analyse the latest cost-reduction measures, and getting answers will require an analysis of all of the previously mentioned data.
Unlike basic operational data storage (OLTP),?Data Warehouses?contains both:
How does data warehouse works?
Data Warehouse has main 4 layers that contribute to the entire architicture:
A- Data sources (OLTP)
B- Staging layer (ETL or ELT)
C- Data Warehouse modeled layer
D- Analytical layer (OLAP)
A) Data sources layer (Operational systems):
This layer includes all the applications/systems that generate data. Systems like: Billing, Finance, CRM, HR ...etc.
Those systems are referred to as OLTP (Online Transactional Processing) systems. I have covered this a previous article "OLTP vs OLAP"
B) Staging layer :
On this layer, the data needs to move from the operational system towards the data warehouse system. The process that is used to move the data is called Extract, Transform and Load (ETL).
There is also another patten called ELT (Extract, Load, Transform): On ELT the extracted data from the OLTP system lands as is on the OLAP system and then data transformations happens on the OLAP system (utilising the target database compute power) itself instead of running the transformation before loading the data (utilising the ETL server compute power).
Examples of some technologies on this area:
Some ETL technologies:
Originally those are DB technologies but they can be used for ELT needs as well by pushing the raw data into the DB and then use the compute side to apply data transformations:
C) Data Warehouse layer :
This is the main layer where the data will be stored and prepared for reporting needs. Having said that, the data needs to be stored in a very structured format in order to derive maximum value out of it. This is important to ensure:
There are two famous approches to store the data in your warehouse:
Star schema VS Snowflake schema
Both the Star schema and Snowflake schema are great approaches to model your data in the data warehouse. It depends on what is your analytical data needs. I will explain the deference between the two and which one fits which use cases.
Star schema:
Star schema is the type of multidimensional model which is used for data warehouse. In star schema, The fact tables and the dimension tables are contained. In this schema fewer foreign-key join is used. This schema forms a star with fact table and dimension tables.
The Star Schema data model is the simplest type of Data Warehouse schema. It is also known as Star Join Schema and is optimised for querying large data sets.
To understand the data modeling behind a star schema, let us look at a retail example by keboola. Imagine you are running an international shopping brand and you want to analyze purchases across your physical locations.
You pull out data from your database as an Excel file:
领英推荐
But you soon realize there are too many rows, and the data needs to be?cleaned?before you can analyse it. You decide to turn the data into a star schema.
A star schema is a data model that stores information in multiple table types: a single fact table and multiple dimensional tables.?
In contrast to the classical database design of normalising tables, star schemas connect dimensional data with fact data in a shape resembling a star (hence the name), as can be seen from the following diagram:
In the diagram, we see a central fact table (holding all the facts of the sales) and four dimension tables - a separate table describing the customer, date (of purchase), store where the purchase happened, and product purchased.?
The fact table is linked via a foreign key relationship to the primary key of each dimension (aka, the id in each dimension table, for example, the customer_id links the customer from the dim_customer table to the fact_sales table).?
This type of data modeling allows us to query data faster and with simpler queries than the normalised database design.
Snowflake schema:
Star schema is the type of multidimensional model which is used for data warehouse. In star schema, The fact tables and the dimension tables are contained. In this schema fewer foreign-key join is used. This schema forms a star with fact table and dimension tables.
A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. The dimension tables are normalised which splits data into additional tables.
For example, the above diagram would show the customer_country field being split into further dimensional tables:
Each dimension is split until it is normalised - aka, there is no redundancy in the dimensional table, no repetition of values (except for identifier values, such as id’s).
Differences between star schema vs snowflake schema:
D) Analytical layer :
This layer is aimed for generating value over the stored data in the data warehouse by developing specific business data marts that can be later queried a cube by various data intelligence tools or by reporting directly on the stored data in the warehouse.
Data Mart
A?data mart?is considered a subset of a data warehouse and is usually oriented to a specific team or business line, such as finance or sales. It is subject-oriented, making specific data available to a defined group of users more quickly, providing them with critical insights. The availability of specific data ensures that they do not need to waste time searching through an entire data warehouse.A?Data Mart?is an area for storing data that serves a particular community or group of workers. They are storage areas with fixed data and deliberately under the control of one department within the organisation.
Data Cube
A?Data Cube?is software that stores data in matrices of three or more dimensions. Any transformations in the data are expressed as tables and arrays of processed information. After tables have matched the rows of data strings with the columns of data types, the data cube then cross-references tables from a single data source or multiple data sources, increasing the detail of each data point. This arrangement provides researchers with the ability to find deeper insights than other techniques.
Difference between data warehouse and?data mart:
Benefits of Data Warehouses
The primary benefit of a data warehouse is storing and analysing large amounts of variant data and extract value from them while keeping historical data for record-keeping.
Bill Inmon , the father of data warehousing, gave four unique characteristics of data warehouses, such as:
Well-designed data warehouses perform queries quickly and deliver high-quality data. It allows end-users to reduce the volume of data in order to examine a certain area closely. Achieving faster decisions is a critical aspect of data warehouses since it provides data in consistent formats ready to be analysed.
Also, data warehouses provide the analytical power and a complete dataset to enable data-driven decision-making based on high-quality data from all business areas.
Challenges of data warehouse:
In the upcoming article, we will deep dive into one of the second main data architecture: The Data Lake Architecture.
Feel free to comment and let's engage to exchange opinions and benefit from each-others experiences.?
Join a growing community of 1500+ Data Enthusiasts by subscribing to my ????????????????????:?Data Architecture History
Data Consultant at Systaems (The KPI Institute)
1 年Thank you for the informative series