登录查看更多内容

Objectives of Data Warehousing

Vincent Rainardi

Data Architect & Data Engineer

发布日期: 2024年10月14日

We build a data warehouse because we need to integrate data from multiple different sources. We build a data warehouse because the source systems are too volatile to report on. We build a data warehouse because we can’t report historical values from the source systems.

Those objective are best expressed by Bill Inmon’s 1992 definition of a data warehouse: subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions.

Picture this: you work in a bank. You use a credit management system to manage mortgage and loan applications from clients. When you login you can see the customer profiles, their applications and their payments. You also use a account management system to manage savings, current accounts, ISAs, overdrafts. You also have a credit card system to manage credit card applications and operations. Things like balance transfer, check credit score, assess customer risks, approve applications, update customer profiles and increase credit limits.

To report the revenues for the whole bank, you end up extracting data from all 3 systems, then manually combining the data for management reporting. You have 3 people full time doing this, and yet they can only report once a month. The breakdown is also limited: only by product lines and by branches.

This is a perfect case to build a data warehouse for. To manage the bank better you need the cost and revenue breakdown, daily. The word “daily” scares the reporting department because it is not possible for them to do it. It takes them a week to integrate the data, and another week to detail out the breakdown by product lines and by branches. The new requirement is to be able to drill down up to individual accounts and to give breakdown by customer profiles like age and status. And application pipeline analysis, i.e. how many customers currently applied for mortgages and credit cards, and how many are in each stage (as of today). And how many days on average each application stay in each stage (say in the last 3 months, 6 months and 12 months).

That’s banking. Let’s take a look at another case: asset management.

You work in a company who run hundreds of investment funds for tens of thousands of clients. They have 10 different investment desks managing different funds: credit desks, equity desks (US equity, European equities, Asian equities), money market desk, emerging market desk, and multi-asset desk. They use a portfolio management system to manage those funds, placing buy and sell orders, calculate Net Asset Value, review client mandates, run model portfolios, reviewing risks and exposures and tracking the Profit & Loss across all asset classes and investment strategy.

You also use another system to manage subscription and redemption from clients (subscriptions mean client putting money into the funds, redemptions mean clients withdrawing money out from the funds), costs and charges for each client, manage liquidity and client orders. You also use another system to do trades with the brokers. Placing trades, processing transactions and settlements, manage counterparty details and allocation to each portfolio.

It is inevitable that companies use multiple different applications to conduct their daily operations. If you go into hospital, same thing. Multiple different applications. If you look at retails (shops) same thing, multiple different applications. Manufacturing, constructions, mining, aviation, distribution, insurance, publishing, education, same thing. Every company, charity and government institution in every sector use multiple different applications to manage their business operations.

Therefore my friends, we build data warehouses. To integrate data from those applications into one place. This really sound easier than the reality. To integrate customers you need their name, date of birth, address, email and national insurance number. If your customers are companies you need their name and LEI. To integrate financial instruments (called “security”) you need their ISIN, SEDOL, CUSIP, Ticker, Bloomberg ID, Factset ID. That is only to match the customers. To match the securities. Once you match the customers, you need to integrate their attributes. They could have different phone numbers, different job titles, different address, etc. coming from different applications. Which values are you going to take? Are you going to take one value or all values?

Yes you can build another system for this, called Master Data Management (MDM). But your data warehouse need to be able to integrate data from many data sources / applications. In many cases, you can’t afford to defer your warehouse project by another year, just because you need to implement an MDM system first. Practically speaking, this means setting up a waterfall system for customer matching (or security matching). For example, match on Bloomberg ID first, then Factset ID, then CUSIP, then ISIN, then SEDOL, then Ticker. In the case of customer matching, matching on all 5 fields first, then on 4 fields, then on 3 fields. After you do matching, then you define source system hierarchy for attribute sourcing. For example, you take the customer/security attributes from system A first, then system B, then system C. You do this when building the customer dimension, or security dimension. If your source system have a common identifier (like customer ID or security ID) then you are in luck, because your matching is a lot simpler. But you still need to do attribute sourcing.

Anyway, that’s integration. The main purpose for building a data warehouse. The second one is because the source systems are too volatile to report on. Take for example the bank account management system or portfolio management system, in the two examples I outlined above. They change minute by minute. When you query the database of that application, the output of your query is not stable, it keep changing from one minute to the next. That’s why in 1992 Inmon put that “non-volatile” into the definition of a data warehouse. In a data warehouse the data is not changing minute by minute. It is stable. You query it today and you query it tomorrow the output is the same.

领英推荐

Buy-Side Financial Data Models (2) – Financial…

Linxiao Ma 9 个月前

Volume Testing: A Comprehensive Guide

Keploy ?? 2 个月前

Three ways to deal with the concept of data i.r.t…

Jan van Bon 9 个月前

But you have to specify a date of course. You want the revenue breakdown or customer profile as of 30th September 2024 for example. When you specify a date, the output of your query on the data warehouse is consistent. Whether you run it today or tomorrow the output is the same. And this brings us to the 3rd reason why we build a data warehouse, which is to report historical values. In the account management system or portfolio management system in my two examples above, you can’t query it as of 30th September 2024. You can’t ask those apps to give you the revenue breakdown or portfolio exposure for each week in the last 6 months. Or each day in the last 3 months. You can only query it as it stands today. To report historical figures like that you need a data warehouse.

A data warehouse is subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions. Bill Inmon, 1992.

We’ve discussed in great detail about non-volatile, time-variant and integrated. What’s subject-oriented?

Business operations are organised by busienss lines. In insurance the business operations are organised by “underwriting classes” such as marine, auto, life and casualty. In banking the business lines are loans, mortgages, savings, current accounts, credit cards, investments. In asset management, we have equities, fixed income, multi asset and money market.

What “subject-oriented” means is: in a data warehouse the data should be organised by subject areas, not by business lines. In insurance the subject areas are premium, claims, customers, policies, risk, insured, brokers and underwriters. In banking, the subject areas are accounts, customers, transactions, branches, employees and products. And in asset management the subject areas are portfolios, securities, issuers, clients, risk, compliance and performance.

In data warehouse, the data should be subject-oriented. Meaning we have premium fact tables, claims fact tables, policy dimension, insured dimension, broker dimension, underwriter dimension. That’s in an insurance data warehouse. In a banking data warehouse we have account dimension, customer dimension, transaction fact tables, account balance fact table, branch dimension, employee dimension and product dimension. In asset management data warehouse we have portfolio dimension, security dimension, issuer dimension, client dimension, risk fact tables, compliance fact tables, performance fact tables, holdings fact tables, and so on.

In a data warehouse, the data should be subject-oriented.

A data warehouse is subject-oriented, integrated, non-volatile and time-variant collection of data in support of management’s decisions. Bill Inmon, 1992.

What is your reason for building a data warehouse? Is it one of the above? Or is it a different reason? Please let me know using the comment below.

Please let me know if you agree with my view above. Or if you disagree with it. Please correct me if I’m wrong. Thanks in advance folks. Have a nice day.

P. Azhagupandi

Architect of High-Volume, Highly Concurrent Enterprise Systems (OLTP, Batch and Analytical) specialised in Database Solutions, Data Migrations, Data Analytics and Data Governance Practices (Vice-President, CAMS)

1 个月

Vincent, Much precise way of defining data warehouse. I fully agree with you. In recent times, I see the gap between ODS Vs DWH narrowing down and we are talkiing about LakeHouses.

Rio Dwi Putra Perkasa

Data Engineer

4 个月

Such a great article, I can easily understand the concept of data warehousing in just with this short article. Thank you

1 次回应

Fridthjof-G Eriksen

Pragmatic Data Architect with 25+ years experience & Data Vault expert

5 个月

key word that most people miss: Integrated

5 次回应

Eric Janssens

I create Data & Analytics platforms, architectures and solutions - Certified Data Vault 2.0 Practitioner

5 个月

Thanks for sharing. Bill Inmon has extended his original definition to ‘Integrate by Business Key’, since the preferred model for his data warehouse shifted from 3NF to data vault. #datavault2

2 次回应

查看更多评论

要查看或添加评论，请登录

Vincent Rainardi的更多文章

DQ Engineering

2025年3月23日

DQ Engineering

DQ stands for Data Quality. If you don't have a background in data quality, read this first: https://www.

3 条评论
Data Product

2025年3月23日

Data Product

For those of you who don't know what a data product and “data as a product” are, please read this first:…

13 条评论
Snowflake vs SQL Server

2025年3月20日

Snowflake vs SQL Server

Sometimes we need to remind ourselves that Snowflake is not an OLTP database. I know today is the era of Hybrid tables…

6 条评论
Data engineer becoming solution architect

2025年3月18日

Data engineer becoming solution architect

Are you a data engineer thinking about transitioning to a cloud solution architect? Data engineer are good with…

2 条评论
Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

2025年3月17日

Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

If you work in banking or investment or any other sector in financial services, you might be wondering about the above.…
Data Warehousing Basics: Cost

2025年3月16日

Data Warehousing Basics: Cost

If you call yourself a data engineer you need to be aware of 2 additional things compared to a developer. The first one…

2 条评论
My Linkedin post & articles

2025年3月15日

My Linkedin post & articles

The list below goes back to Nov 2024. For older than that see here.

9 条评论
Data Warehousing Basics: Single Customer View

2025年3月15日

Data Warehousing Basics: Single Customer View

Imagine that you work for an insurance company who sell health insurance (HI), life insurance (LI), general insurance…

2 条评论
Data Warehousing Basics: NFR

2025年3月15日

Data Warehousing Basics: NFR

What I’m about to tell you today failed a lot of data warehousing projects which is why it’s worth paying attention so…

1 条评论
ML and AI - What's the difference?

2025年3月13日

ML and AI - What's the difference?

Machine Learning covers about 20-30 algorithms such as Logistic Regression, Decision Tree, Gradient Boosting, Random…

4 条评论

See all articles

Objectives of Data Warehousing

Vincent Rainardi

Data Architect & Data Engineer

领英推荐

Vincent Rainardi的更多文章

社区洞察

其他会员也浏览了

Integrated Reporting Framework (IReF) eventual adoption. Challenges and impact on regulatory reporting.

Mastering Data Quality: A Practical Framework for Business Success (A Comprehensive Approach) ????

'Put data into the hands of people and show them how to use it'

Data Tuesday: Enhancing Financial Services with Data Analytics

Four steps to better data management

How to maximize the value of shipping data!

Aligning Business Goals with Data Strategies

Are you managing your data or is your data managing you?

How to Maintain Data Furnishing Accuracy and Integrity

Episode 11 - Master data quality measuring

领英推荐

Vincent Rainardi的更多文章

DQ Engineering

Data Product

Snowflake vs SQL Server

Data engineer becoming solution architect

Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

Data Warehousing Basics: Cost

My Linkedin post & articles

Data Warehousing Basics: Single Customer View

Data Warehousing Basics: NFR

ML and AI - What's the difference?

社区洞察

其他会员也浏览了

Integrated Reporting Framework (IReF) eventual adoption. Challenges and impact on regulatory reporting.

Mastering Data Quality: A Practical Framework for Business Success (A Comprehensive Approach) ????

'Put data into the hands of people and show them how to use it'

Data Tuesday: Enhancing Financial Services with Data Analytics

Four steps to better data management

How to maximize the value of shipping data!

Aligning Business Goals with Data Strategies

Are you managing your data or is your data managing you?

How to Maintain Data Furnishing Accuracy and Integrity

Episode 11 - Master data quality measuring