Objectives of Data Warehousing

Objectives of Data Warehousing

We build a data warehouse because we need to integrate data from multiple different sources. We build a data warehouse because the source systems are too volatile to report on. We build a data warehouse because we can’t report historical values from the source systems.

Those objective are best expressed by Bill Inmon’s 1992 definition of a data warehouse: subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions.

Picture this: you work in a bank. You use a credit management system to manage mortgage and loan applications from clients. When you login you can see the customer profiles, their applications and their payments. You also use a account management system to manage savings, current accounts, ISAs, overdrafts. You also have a credit card system to manage credit card applications and operations. Things like balance transfer, check credit score, assess customer risks, approve applications, update customer profiles and increase credit limits.

To report the revenues for the whole bank, you end up extracting data from all 3 systems, then manually combining the data for management reporting. You have 3 people full time doing this, and yet they can only report once a month. The breakdown is also limited: only by product lines and by branches.

This is a perfect case to build a data warehouse for. To manage the bank better you need the cost and revenue breakdown, daily. The word “daily” scares the reporting department because it is not possible for them to do it. It takes them a week to integrate the data, and another week to detail out the breakdown by product lines and by branches. The new requirement is to be able to drill down up to individual accounts and to give breakdown by customer profiles like age and status. And application pipeline analysis, i.e. how many customers currently applied for mortgages and credit cards, and how many are in each stage (as of today). And how many days on average each application stay in each stage (say in the last 3 months, 6 months and 12 months).

That’s banking. Let’s take a look at another case: asset management.

You work in a company who run hundreds of investment funds for tens of thousands of clients. They have 10 different investment desks managing different funds: credit desks, equity desks (US equity, European equities, Asian equities), money market desk, emerging market desk, and multi-asset desk. They use a portfolio management system to manage those funds, placing buy and sell orders, calculate Net Asset Value, review client mandates, run model portfolios, reviewing risks and exposures and tracking the Profit & Loss across all asset classes and investment strategy.

You also use another system to manage subscription and redemption from clients (subscriptions mean client putting money into the funds, redemptions mean clients withdrawing money out from the funds), costs and charges for each client, manage liquidity and client orders. You also use another system to do trades with the brokers. Placing trades, processing transactions and settlements, manage counterparty details and allocation to each portfolio.

It is inevitable that companies use multiple different applications to conduct their daily operations. If you go into hospital, same thing. Multiple different applications. If you look at retails (shops) same thing, multiple different applications. Manufacturing, constructions, mining, aviation, distribution, insurance, publishing, education, same thing. Every company, charity and government institution in every sector use multiple different applications to manage their business operations.

Therefore my friends, we build data warehouses. To integrate data from those applications into one place. This really sound easier than the reality. To integrate customers you need their name, date of birth, address, email and national insurance number. If your customers are companies you need their name and LEI. To integrate financial instruments (called “security”) you need their ISIN, SEDOL, CUSIP, Ticker, Bloomberg ID, Factset ID. That is only to match the customers. To match the securities. Once you match the customers, you need to integrate their attributes. They could have different phone numbers, different job titles, different address, etc. coming from different applications. Which values are you going to take? Are you going to take one value or all values?

Yes you can build another system for this, called Master Data Management (MDM). But your data warehouse need to be able to integrate data from many data sources / applications. In many cases, you can’t afford to defer your warehouse project by another year, just because you need to implement an MDM system first. Practically speaking, this means setting up a waterfall system for customer matching (or security matching). For example, match on Bloomberg ID first, then Factset ID, then CUSIP, then ISIN, then SEDOL, then Ticker. In the case of customer matching, matching on all 5 fields first, then on 4 fields, then on 3 fields. After you do matching, then you define source system hierarchy for attribute sourcing. For example, you take the customer/security attributes from system A first, then system B, then system C. You do this when building the customer dimension, or security dimension. If your source system have a common identifier (like customer ID or security ID) then you are in luck, because your matching is a lot simpler. But you still need to do attribute sourcing.

Anyway, that’s integration. The main purpose for building a data warehouse. The second one is because the source systems are too volatile to report on. Take for example the bank account management system or portfolio management system, in the two examples I outlined above. They change minute by minute. When you query the database of that application, the output of your query is not stable, it keep changing from one minute to the next. That’s why in 1992 Inmon put that “non-volatile” into the definition of a data warehouse. In a data warehouse the data is not changing minute by minute. It is stable. You query it today and you query it tomorrow the output is the same.

But you have to specify a date of course. You want the revenue breakdown or customer profile as of 30th September 2024 for example. When you specify a date, the output of your query on the data warehouse is consistent. Whether you run it today or tomorrow the output is the same. And this brings us to the 3rd reason why we build a data warehouse, which is to report historical values. In the account management system or portfolio management system in my two examples above, you can’t query it as of 30th September 2024. You can’t ask those apps to give you the revenue breakdown or portfolio exposure for each week in the last 6 months. Or each day in the last 3 months. You can only query it as it stands today. To report historical figures like that you need a data warehouse.

A data warehouse is subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions. Bill Inmon, 1992.

We’ve discussed in great detail about non-volatile, time-variant and integrated. What’s subject-oriented?

Business operations are organised by busienss lines. In insurance the business operations are organised by “underwriting classes” such as marine, auto, life and casualty. In banking the business lines are loans, mortgages, savings, current accounts, credit cards, investments. In asset management, we have equities, fixed income, multi asset and money market.

What “subject-oriented” means is: in a data warehouse the data should be organised by subject areas, not by business lines. In insurance the subject areas are premium, claims, customers, policies, risk, insured, brokers and underwriters. In banking, the subject areas are accounts, customers, transactions, branches, employees and products. And in asset management the subject areas are portfolios, securities, issuers, clients, risk, compliance and performance.

In data warehouse, the data should be subject-oriented. Meaning we have premium fact tables, claims fact tables, policy dimension, insured dimension, broker dimension, underwriter dimension. That’s in an insurance data warehouse. In a banking data warehouse we have account dimension, customer dimension, transaction fact tables, account balance fact table, branch dimension, employee dimension and product dimension. In asset management data warehouse we have portfolio dimension, security dimension, issuer dimension, client dimension, risk fact tables, compliance fact tables, performance fact tables, holdings fact tables, and so on.

In a data warehouse, the data should be subject-oriented.

A data warehouse is subject-oriented, integrated, non-volatile and time-variant collection of data in support of management’s decisions. Bill Inmon, 1992.

We build a data warehouse because we need to integrate data from multiple different sources. We build a data warehouse because the source systems are too volatile to report on. We build a data warehouse because we can’t report historical values from the source systems.

What is your reason for building a data warehouse? Is it one of the above? Or is it a different reason? Please let me know using the comment below.

Please let me know if you agree with my view above. Or if you disagree with it. Please correct me if I’m wrong. Thanks in advance folks. Have a nice day.

P. Azhagupandi

Architect of High-Volume, Highly Concurrent Enterprise Systems (OLTP, Batch and Analytical) specialised in Database Solutions, Data Migrations, Data Analytics and Data Governance Practices (Vice-President, CAMS)

1 个月

Vincent, Much precise way of defining data warehouse. I fully agree with you. In recent times, I see the gap between ODS Vs DWH narrowing down and we are talkiing about LakeHouses.

回复

Such a great article, I can easily understand the concept of data warehousing in just with this short article. Thank you

Fridthjof-G Eriksen

Pragmatic Data Architect with 25+ years experience & Data Vault expert

5 个月

key word that most people miss: Integrated

Eric Janssens

I create Data & Analytics platforms, architectures and solutions - Certified Data Vault 2.0 Practitioner

5 个月

Thanks for sharing. Bill Inmon has extended his original definition to ‘Integrate by Business Key’, since the preferred model for his data warehouse shifted from 3NF to data vault. #datavault2

要查看或添加评论,请登录

Vincent Rainardi的更多文章

  • DQ Engineering

    DQ Engineering

    DQ stands for Data Quality. If you don't have a background in data quality, read this first: https://www.

    3 条评论
  • Data Product

    Data Product

    For those of you who don't know what a data product and “data as a product” are, please read this first:…

    13 条评论
  • Snowflake vs SQL Server

    Snowflake vs SQL Server

    Sometimes we need to remind ourselves that Snowflake is not an OLTP database. I know today is the era of Hybrid tables…

    6 条评论
  • Data engineer becoming solution architect

    Data engineer becoming solution architect

    Are you a data engineer thinking about transitioning to a cloud solution architect? Data engineer are good with…

    2 条评论
  • Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

    Asset Mgt vs Fund Mgt vs Investment Mgt vs Wealth Mgt: What's the difference?

    If you work in banking or investment or any other sector in financial services, you might be wondering about the above.…

  • Data Warehousing Basics: Cost

    Data Warehousing Basics: Cost

    If you call yourself a data engineer you need to be aware of 2 additional things compared to a developer. The first one…

    2 条评论
  • My Linkedin post & articles

    My Linkedin post & articles

    The list below goes back to Nov 2024. For older than that see here.

    9 条评论
  • Data Warehousing Basics: Single Customer View

    Data Warehousing Basics: Single Customer View

    Imagine that you work for an insurance company who sell health insurance (HI), life insurance (LI), general insurance…

    2 条评论
  • Data Warehousing Basics: NFR

    Data Warehousing Basics: NFR

    What I’m about to tell you today failed a lot of data warehousing projects which is why it’s worth paying attention so…

    1 条评论
  • ML and AI - What's the difference?

    ML and AI - What's the difference?

    Machine Learning covers about 20-30 algorithms such as Logistic Regression, Decision Tree, Gradient Boosting, Random…

    4 条评论

社区洞察

其他会员也浏览了