When do you not need a Data Warehouse?
Image courtesy of pxhere.com

When do you not need a Data Warehouse?

‘Data Warehouse’ (DWH) is the term used for the last 30 years by both technicians and business stakeholders to mean ‘the data system that we get our reports from’, replacing earlier terms such as Decision Support System (DSS) or Management Information System (MIS) in the vernacular. Despite their long-standing success and near ubiquity, there still remains a significant debate about the efficacy and cost-benefit of DWH.

I believe that much of the debate around DWHs and of resistance to the concept stems from a lack of understanding about when use of a DWH is appropriate to address data requirements, and when alternative approaches should be employed. In this article I hope to clarify some of these points and provide some guidance about how to choose.

DWH Benefits

A formal DWH is almost standard for every large enterprise these days, bringing huge benefits to an array of Enterprise-level data activities. The DWH may have started out to "support an organization's decision-making processes", but it has evolved into something much more crucial. A DWH is a central repository of all significant organisational data, with that data cleaned, transformed, conformed and labelled. A DWH is also designed specifically to make controlled access to quality-assured data easy and consistent, and is in a prime position to act as a clearing-house, communicating useful data all over the organisation. It is not just a great source for reports, but for advanced analytics and data science too.

The DWH is also an ideal place for the business to store and maintain critical metadata that is not mastered in any application, such as custom groupings, business rules, and hierarchical views of data, that would otherwise be mastered in spreadsheets. DWHs have survived and thrived because they deliver great business value, year after year. There are few other ways in which data from every operating unit and every application can be viewed and worked with together as one.

No alt text provided for this image

As every other application in the organisation undergoes change and transition over time - from minimalist pilot, to full-featured proprietary, to off-the-shelf, to SaaS, to ERP - the DWH remains the steady fulcrum around which the business is run.

Challenges

However, it can be a big challenge, both technically and organisationally, to bring large quantities of data from multiple, disparate sources into a single ‘integrated’ DWH data model. DWHs require good design up-front, careful development to implement, and diligent maintenance to upkeep afterwards, and costs associated with these efforts can be significant. Making changes to established DWH models is not always easy and it can be problematic to integrate data sources that are immature and/or subject to frequent changes.

The appearance on the scene about 10 years ago of ‘Big Data’ applications such as Hadoop that stored large quantities of data by means other than relational databases offered an alternative. The resultant data lakes can ingest and retain data from almost any source in a reliable manner, and generally require less up-front effort to design and implement than a DWH. They typically use semi-structured files (e.g. JSON or XML) to store data in a more flexible and scalable way. However, the trade-off is that stored data is generally in a pretty raw state, with limited cleaning, quality control or transformation applied. Getting data out of a data lake in the way you want it is usually not nearly so easy as from a DWH, requiring more specialist skills and tools.

Data Warehouses and Data Lakes are different, sometime overlapping, concepts that share the same overall solution space but they are not fundamentally opposed to one another - rather they are complementary, serving different use cases.

No alt text provided for this image

Data warehouses and data lakes are ultimately just components of a larger Enterprise data ecosystem, that includes other systems and applications responsible for data integration, enterprise reporting, data management, data quality, data science and so on.

Systems of Record

Stepping back for a moment, at a higher, more conceptual level, a data system that acts as an authoritative source of ‘truth’ is called a system of record (SoR). A SoR is simply a data system that can be relied upon to have the correct values. SoRs are vital for accurate, reliable reporting, and are essential components for management and financial reporting, and for satisfying regulatory requirements.

There is no standard definition or rule about the form a SoR must take, but there are inherent requirements that should be satisfied around how it is governed. By necessity a SoR should be tightly controlled. Changes must be prepared carefully, adhere to agreed standards, be well documented, and must be thoroughly tested and approved before deployment. If this were not the case then it would not be possible to verify or rely on the data they provide.

In addition to standard reporting, SoRs are a great place for data scientists and other analysts to get reliable, clean consistent data. A well-designed, well-managed relational DWH has a great deal in common with, and will typically qualify as, a ‘SoR’. However, more flexible, ‘agile’ systems are needed for the day-to-day conduct of activities like rapid analysis, prototyping and problem solving.

Systems of Innovation

In practice much of the actual data analysis done to empower management decision-making often happens 'downstream' of the SoR, in applications and systems that are more flexible and 'sandbox' in nature. These are the so-called 'systems of innovation' (SoIs) that allow users to more freely experiment and ‘innovate’ with data.

Common examples would be SAS or PowerBI environments, that draw data from SoRs and other systems into separate data structures and allow analysts to independently ‘wrangle’ the data to generate new outputs. SoIs can share form and infrastructure with SoRs. For example, they could occupy separate schemas on the same database server, allowing analysts a safe space alongside the official data warehouse in which to create data objects and alter data to their heart’s content – without negatively impacting the monthly Board report pack.

No alt text provided for this image

One of the common issues that people take with DWHs occurs when they mistakenly try to use their DWH as the means to build SoI-type solutions. A SoI can certainly be built using a relational, dimensional model and using DWH data as a source, but the fast-evolving and purpose-specific nature of SoIs means that the heavier governance and planning requirements for building DWHs are an unnecessary hindrance.

With a few basic rules to govern them, such as "don't use SoIs for long-term production purposes" and "archive data and code when the temporary need has passed", SoIs can be constructed using whatever approach or technology ticks the right boxes at the time. Once the 'proof of concept' phase of a longer term project is complete, solutions developed in a SoI can be properly productionised into a System of Record.

Choosing Your Approach

These two concepts (SoR/SoI) are very helpful for bringing clarity to the discussion. The flow chart below suggests how to determine which type of system best fits the use case:

No alt text provided for this image

If you have varied requirements around accountability, reliability etc. then it is perfectly acceptable to have both SoRs and SoIs running in parallel, to serve different needs.

Similarly, if your data systems need to serve both stable and variable data, then parallel DWH/DL systems, or a combined data ‘lake/house’ system that can provide both relational and file store data repositories might be in order. Microsoft’s Synapse offering is an example of this, and many vendors now offer products with the ability to cater for different types of storage within the same overall application and interface.

Conclusion

There is an adage that ‘if the only tool you have is a hammer, then every problem looks like a nail’. IT professionals accustomed to working with DWHs often have a strong inclination to always try to add new sources of data to the DWH, even if the data doesn’t really suit this approach, rather than to utilise a separate 'system of innovation' or data lake.

I have fallen into this trap myself once or twice, and learned painfully the error of my ways. Collectively professionals should stop arguing that DWHs are too rigid, or that data lakes are too messy, and instead recognise the advantages and limitations of each. Using the right tool for each job, we can employ different options to craft solutions that are best-suited to the needs of our partners.

Note: The terms SoR and SoI and the view of types of data systems we put forward in this article are inspired in part by the 'Pace layer' application architecture originally published by Gartner in 2012. In keeping with our 'data-centric' philosophy, our view is focused on the requirements that data systems must satisfy, rather than the commonality or otherwise of applications across industries.

Questions on Data Warehousing, Data Integration, Data Quality, Business Intelligence, Data Management or Data Governance??Click Here?to begin a conversation.

John Thompson is a Director with EY's Technology Consulting practice. His primary focus for many years has been the effective design, management and optimal utilisation of large analytic data systems.

Ian Bourne

BI Business Partner to Consumer at Vodafone

2 年

Very good John

回复

要查看或添加评论,请登录

John Thompson的更多文章

  • Enterprise Data - its just plumbing, right?

    Enterprise Data - its just plumbing, right?

    When I started as a data consultant many years ago, my first solo assignment was to resolve a number of issues a small…

    7 条评论
  • After Big Data

    After Big Data

    When Distributed File Systems came on the scene in the late noughties, everyone realised that something big was…

    4 条评论
  • The Big Power of Small Data

    The Big Power of Small Data

    We have all been so bombarded in recent years with information about 'Big Data' that the value of 'Small Data' is…

    1 条评论
  • Becoming Data Centric

    Becoming Data Centric

    I’ve spent the last two decades working with analysts to solve data problems in a systematic way and to create…

  • What is Data Entropy?

    What is Data Entropy?

    There is a common meme that LinkedIn regulars will know well. It shows a series of pictures of Lego, one with lots of…

    6 条评论
  • Schrems II: What Does it mean for EU Data Processors?

    Schrems II: What Does it mean for EU Data Processors?

    The Schrems 2 case has been long running and much discussed and its ultimate findings, while still being digested, will…

  • How is Data Management Different from IT Management?

    How is Data Management Different from IT Management?

    In a season where the Liverpool football team is about to win the Premier League for the first time in 30 years, a…

  • Rise of the (Data Science) Robots

    Rise of the (Data Science) Robots

    I started out at university studying Molecular Genetics and for a long time considered doing a doctorate and building a…

    5 条评论
  • Choosing a BI Tool

    Choosing a BI Tool

    Data reporting and visualisation ‘BI’ tools come in many flavours, with a bewildering variety of features to confuse…

    7 条评论
  • Why Do We Need Analytic Data Platforms?

    Why Do We Need Analytic Data Platforms?

    When talking to customers I often encounter the same questions repeatedly. One of the most common is "Why do we need a…

    3 条评论

社区洞察

其他会员也浏览了