{Classical} Data Warehouse Architecture [3 out of 10]

This is a series of articles about the importance of having a solid data architecture in your business. The series includes the following articles:

1- Introduction to Data Architecture [link]

2- OLAP vs OLTP [link]

3- Data Warehouse Architecture deep dive [current article]

4- Data Lake Architecture deep dive 

5- Data Lake vs Data Warehouse Architecture

6- Cloud computing effect on Data Architecture

7- Rise of Data Mesh Architecture

8- Data Mesh Vs the rest

9- Rise of DeltaLake Architecture

10- Which Data Architecture shall I choose?        

What is a Data Warehouse?

A Data Warehouse (DW), or sometimes Enterprise Data Warehouse (EDW), is a well-established data architecture designed to store operational data in a specially modeled way that supports business intelligence and reporting needs.

A DW acts as a central place that keeps everything together: all current and historical data is stored in one place rather than scattered across disparate sources.

For example, a business stores data about its customers, products, employees and their salaries, sales, and invoices. If we need to analyse the impact of the latest cost-reduction measures, getting answers will require an analysis of all of the previously mentioned data together.

Unlike basic operational data storage (OLTP), Data Warehouses contain both:

  • The full historical data (gathered from a variety of sources).
  • The aggregated historical data (highly useful for data analysis at massive scale).


How does a data warehouse work?


A Data Warehouse has four main layers that contribute to the entire architecture:


A- Data sources (OLTP)

B- Staging layer (ETL or ELT)

C- Data Warehouse modeled layer 

D- Analytical layer (OLAP)        
[Image: Data warehouse main stages]

A) Data sources layer (Operational systems):

This layer includes all the applications/systems that generate data: systems like billing, finance, CRM, HR, etc.

Those systems are referred to as OLTP (Online Transactional Processing) systems. I have covered this in a previous article, "OLTP vs OLAP".

B) Staging layer:

In this layer, the data moves from the operational systems towards the data warehouse system. The process used to move the data is called Extract, Transform and Load (ETL).

  • E (Extract): In this step, we connect to the source system, for example an MS-SQL database, and pull out the data.
  • T (Transform): In this step, we apply any required transformations, like converting all dates to a standard format (DD-MM-YYYY) or adding system-date records to capture the date/time of the extraction process. This is a very important step in the ETL process, as it can change/alter the original content of the data or enrich it with more elements or data points.
  • L (Load): In this step, we define the target analytical system where the data will be stored; for example, a target system could be Azure Synapse Analytics.

There is also another pattern called ELT (Extract, Load, Transform). In ELT, the data extracted from the OLTP system lands as-is on the OLAP system, and the transformations then happen on the OLAP system itself (utilising the target database's compute power) instead of running before the data is loaded (utilising the ETL server's compute power). A sketch of both patterns is shown below.
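To make the flow concrete, here is a minimal ETL sketch in Python. It uses SQLite as a stand-in for both the OLTP source and the OLAP target, and the invoices table with its columns is hypothetical; a real pipeline would connect to systems such as an MS-SQL source and an Azure Synapse target instead.

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stands in for the OLTP source and the OLAP target in this sketch.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

# Hypothetical operational data with inconsistent date formats.
source.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT,
                           invoice_date TEXT, amount REAL);
    INSERT INTO invoices VALUES (1, 'ACME', '2023/01/15', 100.0),
                                (2, 'Globex', '15-01-2023', 250.0);
""")
target.execute("""CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT,
                                          invoice_date TEXT, amount REAL,
                                          extracted_at TEXT)""")

# E (Extract): pull the raw rows out of the operational system.
rows = source.execute("SELECT id, customer, invoice_date, amount FROM invoices").fetchall()

# T (Transform): standardise dates to DD-MM-YYYY and stamp the extraction time.
def to_ddmmyyyy(raw: str) -> str:
    for fmt in ("%Y/%m/%d", "%d-%m-%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%d-%m-%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {raw}")

extracted_at = datetime.now(timezone.utc).isoformat()
transformed = [(i, c, to_ddmmyyyy(d), a, extracted_at) for (i, c, d, a) in rows]

# L (Load): write the transformed rows into the analytical system.
target.executemany("INSERT INTO invoices VALUES (?, ?, ?, ?, ?)", transformed)
target.commit()
```

In the ELT variant, the load step would run first with the raw rows landing as-is on the target, and the date standardisation would then be executed as SQL on the target system, using its compute power.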

Examples of some technologies on this area:

Some ETL technologies:

  • Ab Initio Software: The product offers a full range of features including batch processing, data analysis, graphical user interface (GUI) based parallel-processing software, etc. Their approach to data processing reflects their 20 years of experience.
  • IBM DataStage: DataStage is an industry-leading data integration tool that helps you design, develop and run jobs that move and transform data. At its core, the DataStage tool supports extract, transform and load (ETL) and extract, load and transform (ELT) patterns.
  • Informatica: Informatica has a lot of core products including Enterprise Cloud Data Management and Data Integration.

The following are originally database technologies, but they can be used for ELT needs as well, by pushing the raw data into the database and then using its compute power to apply the data transformations:

  • Teradata: Teradata offers a unique feature set for companies that want to make their data usable. They offer analytics data platforms and other crucial functionality required to keep companies competitive.
  • Oracle: Oracle offers a data warehousing service for its customers built to industry standards. The company's main platform for running the warehouse functions is the Oracle Exadata Machine.
  • MarkLogic: MarkLogic uses a NoSQL database for its data warehouse solution. It has recently been getting attention and was also included in Gartner's Magic Quadrant for DBMS.

C) Data Warehouse layer:

This is the main layer, where the data is stored and prepared for reporting needs. The data needs to be stored in a well-structured format in order to derive maximum value out of it. This is important to ensure:

  • Data integrity
  • Data quality
  • Data aggregations

There are two famous approaches to storing the data in your warehouse:

Star schema VS Snowflake schema

[Image: Star schema vs snowflake schema]

Both the Star schema and the Snowflake schema are great approaches to modeling your data in the data warehouse; which one to pick depends on your analytical data needs. I will explain the difference between the two and which one fits which use cases.

Star schema:

The star schema is a type of multidimensional model used for data warehouses. It contains fact tables and dimension tables, and it requires fewer foreign-key joins than a normalised design. The schema forms a star shape, with the fact table at the centre and the dimension tables around it.

The Star Schema data model is the simplest type of Data Warehouse schema. It is also known as the Star Join Schema and is optimised for querying large data sets.

To understand the data modeling behind a star schema, let us look at a retail example by Keboola. Imagine you are running an international shopping brand and you want to analyse purchases across your physical locations.

You pull out data from your database as an Excel file:

[Image] Credits: https://www.keboola.com/blog/star-schema-vs-snowflake-schema

But you soon realise there are too many rows, and the data needs to be cleaned before you can analyse it. You decide to turn the data into a star schema.

A star schema is a data model that stores information in two types of tables: a single fact table and multiple dimensional tables.

In contrast to the classical database design of normalising tables, star schemas connect dimensional data with fact data in a shape resembling a star (hence the name), as can be seen from the following diagram:

[Image: star schema diagram] Credits: https://www.keboola.com/blog/star-schema-vs-snowflake-schema

In the diagram, we see a central fact table (holding all the facts of the sales) and four dimension tables: separate tables describing the customer, the date (of purchase), the store where the purchase happened, and the product purchased.

The fact table is linked via a foreign-key relationship to the primary key of each dimension (i.e., the id in each dimension table; for example, customer_id links the customer from the dim_customer table to the fact_sales table).

This type of data modeling allows us to query data faster and with simpler queries than the normalised database design.
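As a rough illustration, here is a minimal sketch of the retail star schema in Python with SQLite. The table names follow the diagram above (fact_sales, dim_customer, and so on), while the exact columns are assumptions made for the example.

```python
import sqlite3

# One central fact table plus four dimension tables, as in the retail diagram.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store    (store_id    INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

    -- The fact table holds the measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        store_id    INTEGER REFERENCES dim_store(store_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        quantity    INTEGER,
        amount      REAL
    );
""")

# A typical star join: one hop from the fact table to each dimension it needs.
query = """
    SELECT d.year, s.city, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date  d ON d.date_id  = f.date_id
    JOIN dim_store s ON s.store_id = f.store_id
    GROUP BY d.year, s.city
"""
print(db.execute(query).fetchall())
```

Note how the analytical query joins the fact table directly to each dimension it needs, one hop per dimension; this is what keeps star-schema queries simple and fast.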

Snowflake schema:

A Snowflake Schema is an extension of a Star Schema that adds further dimension tables: the dimension tables are normalised, which splits the data into additional tables.

For example, the above diagram would show the customer_country field being split into further dimensional tables:

[Image: snowflake schema diagram] Credits: https://www.keboola.com/blog/star-schema-vs-snowflake-schema

Each dimension is split until it is fully normalised, i.e. there is no redundancy in the dimension tables and no repetition of values (except for identifier values, such as ids). A sketch of this follows below.
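Continuing the same sketch, this is roughly what snowflaking the customer dimension could look like; the dim_country table and its columns are hypothetical.

```python
import sqlite3

# Snowflaking: country moves out of dim_customer into its own table,
# so each country name is stored exactly once and referenced by key.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_country (
        country_id INTEGER PRIMARY KEY,
        name       TEXT UNIQUE
    );
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        country_id  INTEGER REFERENCES dim_country(country_id)
    );
""")

# Reaching the country now costs one extra join compared to the star version.
query = """
    SELECT co.name AS country, COUNT(*) AS customers
    FROM dim_customer cu
    JOIN dim_country  co ON co.country_id = cu.country_id
    GROUP BY co.name
"""
print(db.execute(query).fetchall())
```

The trade-off is visible in the query: the snowflake design removes redundant values from the dimension, but analytical queries pay for it with extra joins.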

Differences between the star schema and the snowflake schema:

[Image: star schema vs snowflake schema comparison table] Source: https://www.guru99.com/star-snowflake-data-warehousing.html

D) Analytical layer:

This layer is aimed at generating value from the data stored in the data warehouse, either by developing specific business data marts that can later be queried as a cube by various business intelligence tools, or by reporting directly on the data stored in the warehouse.

Data Mart

A data mart is considered a subset of a data warehouse and is usually oriented to a specific team or business line, such as finance or sales. It is subject-oriented, making specific data available to a defined group of users more quickly and providing them with critical insights; the availability of specific data ensures that they do not need to waste time searching through an entire data warehouse. A data mart is an area for storing data that serves a particular community or group of workers: a storage area with fixed data, deliberately under the control of one department within the organisation. A minimal sketch of the idea follows below.
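As a rough sketch of the idea, a data mart can be exposed as simply as a view that narrows the warehouse down to one subject area. All names below (mart_sales_emea, the region filter) are hypothetical.

```python
import sqlite3

# A sales data mart carved out of the warehouse as a view: the sales team
# for one region queries the mart, not the full warehouse tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE fact_sales (sale_id INTEGER, store_id INTEGER, amount REAL, year INTEGER);
    CREATE TABLE dim_store  (store_id INTEGER, region TEXT);

    CREATE VIEW mart_sales_emea AS
    SELECT f.sale_id, f.amount, f.year, s.region
    FROM fact_sales f
    JOIN dim_store s ON s.store_id = f.store_id
    WHERE s.region = 'EMEA';
""")
print(db.execute("SELECT COUNT(*) FROM mart_sales_emea").fetchone())
```

In practice a mart is often materialised as its own tables (or even its own database) rather than a view, but the principle is the same: a department-scoped subset of the warehouse.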

Data Cube

A Data Cube is software that stores data in matrices of three or more dimensions. Any transformations in the data are expressed as tables and arrays of processed information. After tables have matched the rows of data strings with the columns of data types, the data cube then cross-references tables from a single data source or multiple data sources, increasing the detail of each data point. This arrangement provides researchers with the ability to find deeper insights than other techniques, as illustrated below.
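As a small illustration in Python with pandas, a cube is just a measure aggregated along several dimensions, and each pivot is a two-dimensional slice of it. The input data below is made up for the example.

```python
import pandas as pd

# Hypothetical sales records with three dimensions: product, store, month.
sales = pd.DataFrame({
    "product": ["shirt", "shirt", "shoes", "shoes"],
    "store":   ["Cairo", "Dubai", "Cairo", "Dubai"],
    "month":   ["Jan",   "Jan",   "Feb",   "Feb"],
    "amount":  [100.0,   150.0,   200.0,   120.0],
})

# The full cube: totals for every (product, store, month) combination.
cube = sales.groupby(["product", "store", "month"])["amount"].sum()

# Slicing the cube: fix the month and look at the product x store face.
jan_slice = sales[sales["month"] == "Jan"].pivot_table(
    index="product", columns="store", values="amount", aggfunc="sum"
)
print(cube, jan_slice, sep="\n\n")
```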

Difference between a data warehouse and a data mart:

[Image: data warehouse vs data mart comparison] Source: https://en.wikipedia.org/wiki/Data_warehouse#:~:text=The%20concept%20of%20data%20warehousing,systems%20to%20decision%20support%20environments.

Benefits of Data Warehouses

The primary benefit of a data warehouse is the ability to store and analyse large amounts of varied data and extract value from it, while keeping historical data for record-keeping.

Bill Inmon, the father of data warehousing, defined four unique characteristics of data warehouses:

  • Being subject-oriented, to focus on a particular area: The data warehouse architecture divides the business into subject areas, so that with a little domain knowledge you can report on each business area in an easy and fast way.
  • The ability to integrate different data types from various sources: You can ingest different data types into your warehouse (mostly structured data). With robust ETL/ELT processes, you can manage the data ingestion and data correctness, even in special cases known as late-arriving records.
  • Being time-variant, which enables analysing changes over time: In a well-designed data warehouse you can easily look into the historical trend of your data, learn from the past, and build analytics based on all the data gathered in your warehouse.
  • Being non-volatile, meaning data is not changed or deleted once loaded: the warehouse keeps a stable historical record instead of being overwritten by operational updates.

Well-designed data warehouses perform queries quickly and deliver high-quality data. They also allow end-users to narrow down the volume of data in order to examine a certain area closely. Enabling faster decisions is a critical aspect of data warehouses, since they provide data in consistent formats, ready to be analysed.

Also, data warehouses provide the analytical power and a complete dataset to enable data-driven decision-making based on high-quality data from all business areas.


Challenges of data warehouse:

  • One of the main challenges with the data warehouse architecture is that the operational data has to fit into a pre-defined data model. This results in losing the original value of the raw data and makes it impossible to get back to the original source form of the data if needed.
  • Schema on write: In the early days of digital data handling, databases worked on a schema-on-write basis. That meant that data pieces had to be tailored to a template or plan at the time of storage. Then, when users went to retrieve the data, it was already in an easily manageable format.
  • The evolution of big data has strained the data warehouse's ability to store all kinds of data sources. This happened mainly because the technologies that traditionally supported data warehouses were not designed to cope with unstructured data like audio, video, complex JSON, complex XML, etc.
  • The integration of data (ETL or ELT) is the most difficult and time-consuming process, as one needs to touch the roots of old corporate legacy systems to derive useful integrated data. It is a painful step, but it is worthwhile.
  • The high volume of data created by data warehousing techniques makes the process tedious, so there is a need to get rid of old data. But for data analysis that data is very valuable and can't simply be ignored.
  • Data warehouses can't be created all at once like other operational applications; they must be developed iteratively, one step at a time. This limitation has opened the door for another data architecture called the Data Lake architecture, which will be covered soon.


In the upcoming article, we will deep-dive into the second main data architecture: the Data Lake architecture.

Feel free to comment and let's engage to exchange opinions and benefit from each other's experiences.

Join a growing community of 1500+ Data Enthusiasts by subscribing to my newsletter: Data Architecture History

#data #writing #architecture #cloudcomputing #warehouse #dataarchitecture #datawarehouse #datalake #datalakehouse #datamesh #datafabric #dataanalytics #ai #bigdata #knowledgesharing #article
