Data Warehousing Defined

What is a data warehouse?

A data warehouse is an electronic store of an organization's historical data, kept for the purpose of data analytics: reporting, analysis and other knowledge discovery activities.

Beyond data analytics, a data warehouse can also be used for data integration, master data management and more.

What is meant by Data Analytics?

Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. A data warehouse is often built to enable data analytics.

What are the benefits of a data warehouse?

A data warehouse integrates data and stores it historically, so that we can analyze different aspects of the business (performance, trends, predictions and so on) over a given time frame and use the results of that analysis to improve the efficiency of business processes.

Why use a Data Warehouse?

Data warehouses are built to facilitate reporting on the key business processes of an organization, typically through measures known as KPIs (Key Performance Indicators). Today the whole process of reporting on data from data warehouses is often called "data analytics". Data warehouses also help integrate data from different sources and present a single point of truth for business measures (for example, by enabling Master Data Management).

A data warehouse can also be used for data mining, which supports trend prediction, forecasting, pattern recognition and more.

What is the difference between OLTP and OLAP?

OLTP (online transaction processing) is the transaction system that collects business data, whereas OLAP (online analytical processing) is the reporting and analysis system built on that data.

OLTP systems are optimized for INSERT and UPDATE operations and are therefore usually highly normalized. OLAP systems, on the other hand, are deliberately denormalized for fast data retrieval through SELECT operations.
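The contrast can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the tables `customer`, `orders` and `sales_report` and their sample rows are hypothetical, not part of any real system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# OLTP-style: normalized tables, each customer attribute stored exactly once.
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(customer_id),
                         amount REAL);
    INSERT INTO customer VALUES (1, 'Alice', 'London'), (2, 'Bob', 'Paris');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# A transactional change touches a single row, however many orders exist.
rows_updated = conn.execute(
    "UPDATE customer SET city = 'Leeds' WHERE customer_id = 1"
).rowcount

# OLAP-style: a denormalized reporting table; customer attributes repeat per order.
conn.execute("""
    CREATE TABLE sales_report AS
    SELECT o.order_id, c.name, c.city, o.amount
    FROM orders o JOIN customer c ON c.customer_id = o.customer_id
""")

# Reporting is a plain SELECT on the wide table; no join is needed at query time.
sales_by_city = conn.execute(
    "SELECT city, SUM(amount) FROM sales_report GROUP BY city ORDER BY city"
).fetchall()
print(rows_updated, sales_by_city)
```

The trade-off is that the wide table stores the city once per order, so it costs more to keep up to date but is cheaper to read.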

What is a data mart?

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments such as Finance, HR and Marketing stored in the data warehouse, and each department may have its own data mart. These data marts can be built on top of the data warehouse.

What is an ER model?

The ER model, or entity-relationship model, is a data modeling methodology in which the goal of modeling is typically to normalize the data by reducing redundancy. This differs from dimensional modeling, where the main goal is to make data retrieval easier and faster.

What is dimensional modeling?

A dimensional model consists of dimension tables and fact tables. Fact tables store transactional measurements together with the foreign keys from the dimension tables that qualify those measurements. The goal of a dimensional model is not to achieve a high degree of normalization but to make data retrieval easier and faster.

Ralph Kimball is one of the strongest proponents of this very popular data modeling technique, which is used in many enterprise-level data warehouses.

What is a dimension?

A dimension is something that qualifies a quantity (measure).

Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

What is a fact?

A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What are additive, semi-additive and non-additive measures?

Non-additive Measures

Non-additive measures are those that cannot meaningfully be used inside any numeric aggregation function. Ratios and percentages are typical examples, such as a 5% profit margin or a revenue-to-asset ratio. Non-numerical data can also be a non-additive measure when it is stored in a fact table, e.g. varchar flags.

Semi-additive Measures

Semi-additive measures are those to which only a subset of aggregation functions can be applied. Take an account balance: a SUM() of balances over time does not give a useful result, but the MAX() or MIN() balance might be useful. Or consider a price rate or currency rate: summing rates is meaningless, but an average might be useful.

Additive Measures

Additive measures can be used with any aggregation function, such as SUM() or AVG(). Sales quantity is an example.
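The three kinds of measure can be compared on a few made-up numbers. This is a minimal sketch: all values and variable names below are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
# Additive: daily sales quantities can simply be summed.
daily_sales_quantity = [3, 5, 2]
total_quantity = sum(daily_sales_quantity)        # 10 units sold in total

# Semi-additive: end-of-day balances of one account on three consecutive days.
daily_balance = [100, 150, 120]
summed_balance = sum(daily_balance)               # 370, an amount the account never held
highest_balance = max(daily_balance)              # 150, a meaningful figure
lowest_balance = min(daily_balance)               # 100, also meaningful

# Non-additive: profit margins of two shops, given as (revenue, profit).
shops = [(100, 10), (900, 180)]
margins = [profit / revenue for revenue, profit in shops]   # 10% and 20%
average_of_margins = sum(margins) / len(margins)            # about 15%
# The true overall margin has to be recomputed from the additive parts.
overall_margin = sum(p for _, p in shops) / sum(r for r, _ in shops)  # 190 / 1000

print(total_quantity, summed_balance, highest_balance, overall_margin)
```

Note that even averaging the stored margins (about 15%) does not reproduce the overall margin (19%), which is why ratios are usually recalculated from their additive components rather than aggregated directly.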

What is a star schema?

A star schema is used in data warehouse models where one central fact table references a number of dimension tables: the primary keys of the dimension tables flow into the fact table as foreign keys, alongside the measures stored there. The resulting entity-relationship diagram looks like a star, hence the name.
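A small star schema can be sketched with Python's built-in sqlite3 module. The table and column names (`fact_sales`, `dim_product`, `dim_customer`, `dim_date` and their keys) and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per product, customer and date.
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, calendar_date TEXT);

    -- Fact table: foreign keys to every dimension plus the measure.
    CREATE TABLE fact_sales (
        product_key    INTEGER REFERENCES dim_product(product_key),
        customer_key   INTEGER REFERENCES dim_customer(customer_key),
        date_key       INTEGER REFERENCES dim_date(date_key),
        sales_quantity INTEGER
    );

    INSERT INTO dim_product  VALUES (1, 'Brown Bread'), (2, 'Chopsticks');
    INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO dim_date     VALUES (20240101, '2024-01-01'), (20240102, '2024-01-02');
    INSERT INTO fact_sales VALUES
        (1, 1, 20240101, 3),
        (1, 2, 20240102, 2),
        (2, 1, 20240102, 4);
""")

# A typical star-schema query: join the fact table to one dimension and aggregate.
quantity_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(quantity_by_product)
```

Every dimension is exactly one join away from the fact table, which keeps reporting queries short.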

What is a snowflake schema?

This is another logical arrangement of tables in dimensional modeling, where a central fact table references a number of dimension tables; however, those dimension tables are further normalized into multiple related tables.

Consider a fact table that stores the sales quantity for each product and customer at a certain time. Sales quantity is the measure here, and keys from the customer, product and time dimension tables flow into the fact table. Additionally, the products can be grouped under product families stored in a separate table, so that the primary key of the product family table goes into the product table as a foreign key. Such a construct is called a snowflake schema, as the product table is further snowflaked into product family.
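The product and product family example might look like this in Python's sqlite3 module. The customer and time dimensions are left out for brevity, and all table names, column names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The product dimension is normalized: family attributes live in their own table.
    CREATE TABLE dim_product_family (family_key INTEGER PRIMARY KEY, family_name TEXT);
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        family_key   INTEGER REFERENCES dim_product_family(family_key)
    );
    CREATE TABLE fact_sales (
        product_key    INTEGER REFERENCES dim_product(product_key),
        sales_quantity INTEGER
    );

    INSERT INTO dim_product_family VALUES (1, 'Bakery'), (2, 'Kitchenware');
    INSERT INTO dim_product VALUES
        (1, 'Brown Bread', 1), (2, 'White Bread', 1), (3, 'Chopsticks', 2);
    INSERT INTO fact_sales VALUES (1, 3), (2, 2), (3, 4), (1, 1);
""")

# Reaching the family name takes two joins: fact -> product -> product family.
quantity_by_family = conn.execute("""
    SELECT pf.family_name, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_product p         ON p.product_key = f.product_key
    JOIN dim_product_family pf ON pf.family_key = p.family_key
    GROUP BY pf.family_name
    ORDER BY pf.family_name
""").fetchall()
print(quantity_by_family)
```

Compared with a star schema, the family name is stored once instead of on every product row, at the cost of an extra join per query.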

What is aggregation and what is the benefit of aggregation?

A data warehouse usually captures data at the same degree of detail as is available in the source. This "degree of detail" is termed granularity. But not all reporting requirements on that data warehouse need the same degree of detail.

To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions for the products they sell, and that data is captured in a data warehouse.

Each shop manager can access the data warehouse and see which products were sold, by whom and in what quantity on any given date. Thus the data warehouse gives the shop managers detail-level data that can be used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care which salesperson in London sold the highest number of chopsticks, or which shop is the best seller of brown bread. All he is interested in is, perhaps, the percentage increase in revenue margin across Europe, or the year-on-year sales growth in Eastern Europe. Such data is aggregated in nature, because sales of goods in Eastern Europe are derived by summing up the individual sales data from each shop in Eastern Europe.

Therefore, to support different levels of data warehouse users, data aggregation is needed.
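As a rough sketch of the idea, the same detail rows can be rolled up to shop level for a manager and to region level for the CEO. The shop names, regions and amounts below are invented, and `aggregate` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Detail-level rows at the finest granularity: (shop, region, product, amount).
detail_sales = [
    ("London-01", "West Europe", "Brown Bread", 120.0),
    ("London-01", "West Europe", "Chopsticks", 30.0),
    ("Warsaw-01", "East Europe", "Brown Bread", 80.0),
    ("Prague-01", "East Europe", "Brown Bread", 50.0),
]

def aggregate(rows, key_index):
    """Sum the amount column, grouped by the column at key_index."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

sales_by_shop = aggregate(detail_sales, 0)     # shop-manager view
sales_by_region = aggregate(detail_sales, 1)   # CEO view
print(sales_by_shop)
print(sales_by_region)
```

In practice such summaries are often precomputed and stored, so that high-level reports do not have to scan every detail row.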

What is slicing-dicing?

Slicing means showing a slice of the data for a given dimension (e.g. Product), value (e.g. Brown Bread) and measure (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.

Slicing and dicing operations are part of pivoting.
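Following the definitions given here, slicing and dicing can be sketched on a handful of rows. The data and the helper `total_by` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical sales rows: (product, region, year, amount).
sales = [
    ("Brown Bread", "East Europe", 2023, 70),
    ("Brown Bread", "East Europe", 2024, 60),
    ("Brown Bread", "West Europe", 2024, 90),
    ("Chopsticks", "West Europe", 2024, 30),
]

def total_by(rows, *dims):
    """Sum the amount column, grouped by the columns at the given indexes."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[3]
    return dict(totals)

# Slice: fix the Product dimension to the value 'Brown Bread'.
brown_bread = [row for row in sales if row[0] == "Brown Bread"]

# Dice: view that slice by region, then at a finer level by region and year.
by_region = total_by(brown_bread, 1)
by_region_year = total_by(brown_bread, 1, 2)
print(by_region)
print(by_region_year)
```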

What is drill-through?

Drill-through is the process of going from summary data to the detail-level data behind it.

Consider the retail example above. If the CEO finds out that sales in Eastern Europe have declined this year compared to last year, he might want to know the root cause of the decrease. For this, he may start drilling through his report to a more detailed level and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him pinpoint the root cause of the declining sales. The method he followed to obtain the details from the aggregated data is called drill-through.
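The CEO's investigation can be mimicked on a few invented rows. The shop names, years and amounts are hypothetical, as is the helper `total_by`.

```python
from collections import defaultdict

# Hypothetical rows: (region, shop, year, amount). One shop has no sales in 2024.
sales = [
    ("East Europe", "Warsaw", 2023, 100),
    ("East Europe", "Prague", 2023, 80),
    ("East Europe", "Istanbul", 2023, 90),
    ("East Europe", "Warsaw", 2024, 110),
    ("East Europe", "Prague", 2024, 95),
]

def total_by(rows, *dims):
    """Sum the amount column, grouped by the columns at the given indexes."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[3]
    return dict(totals)

# Summary level: regional sales fell from 2023 to 2024.
summary = total_by(sales, 0, 2)

# Drill through to shop level: every open shop grew, but one shop is missing.
detail = total_by(sales, 1, 2)
all_shops = {row[1] for row in sales}
shops_without_2024_sales = sorted(s for s in all_shops if (s, 2024) not in detail)
print(summary)
print(detail)
print(shops_without_2024_sales)
```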




