When do you not need a Data Warehouse?
‘Data Warehouse’ (DWH) is the term used for the last 30 years by both technicians and business stakeholders to mean ‘the data system that we get our reports from’, replacing earlier terms such as Decision Support System (DSS) or Management Information System (MIS) in the vernacular. Despite their long-standing success and near ubiquity, there still remains a significant debate about the efficacy and cost-benefit of DWH.
I believe that much of the debate around DWHs and of resistance to the concept stems from a lack of understanding about when use of a DWH is appropriate to address data requirements, and when alternative approaches should be employed. In this article I hope to clarify some of these points and provide some guidance about how to choose.
DWH Benefits
A formal DWH is almost standard for every large enterprise these days, bringing huge benefits to an array of Enterprise-level data activities. The DWH may have started out to "support an organization's decision-making processes", but it has evolved into something much more crucial. A DWH is a central repository of all significant organisational data, with that data cleaned, transformed, conformed and labelled. A DWH is also designed specifically to make controlled access to quality-assured data easy and consistent, and is in a prime position to act as a clearing-house, communicating useful data all over the organisation. It is not just a great source for reports, but for advanced analytics and data science too.
The DWH is also an ideal place for the business to store and maintain critical metadata that is not mastered in any application, such as custom groupings, business rules, and hierarchical views of data, that would otherwise be mastered in spreadsheets. DWHs have survived and thrived because they deliver great business value, year after year. There are few other ways in which data from every operating unit and every application can be viewed and worked with together as one.
As every other application in the organisation undergoes change and transition over time - from minimalist pilot, to full-featured proprietary, to off-the-shelf, to SaaS, to ERP - the DWH remains the steady fulcrum around which the business is run.
Challenges
However, it can be a big challenge, both technically and organisationally, to bring large quantities of data from multiple, disparate sources into a single ‘integrated’ DWH data model. DWHs require good design up-front, careful development to implement, and diligent maintenance to upkeep afterwards, and costs associated with these efforts can be significant. Making changes to established DWH models is not always easy and it can be problematic to integrate data sources that are immature and/or subject to frequent changes.
The appearance on the scene about 10 years ago of ‘Big Data’ applications such as Hadoop that stored large quantities of data by means other than relational databases offered an alternative. The resultant data lakes can ingest and retain data from almost any source in a reliable manner, and generally require less up-front effort to design and implement than a DWH. They typically use semi-structured files (e.g. JSON or XML) to store data in a more flexible and scalable way. However, the trade-off is that stored data is generally in a pretty raw state, with limited cleaning, quality control or transformation applied. Getting data out of a data lake in the way you want it is usually not nearly so easy as from a DWH, requiring more specialist skills and tools.
Data Warehouses and Data Lakes are different, sometime overlapping, concepts that share the same overall solution space but they are not fundamentally opposed to one another - rather they are complementary, serving different use cases.
Data warehouses and data lakes are ultimately just components of a larger Enterprise data ecosystem, that includes other systems and applications responsible for data integration, enterprise reporting, data management, data quality, data science and so on.
Systems of Record
Stepping back for a moment, at a higher, more conceptual level, a data system that acts as an authoritative source of ‘truth’ is called a system of record (SoR). A SoR is simply a data system that can be relied upon to have the correct values. SoRs are vital for accurate, reliable reporting, and are essential components for management and financial reporting, and for satisfying regulatory requirements.
There is no standard definition or rule about the form a SoR must take, but there are inherent requirements that should be satisfied around how it is governed. By necessity a SoR should be tightly controlled. Changes must be prepared carefully, adhere to agreed standards, be well documented, and must be thoroughly tested and approved before deployment. If this were not the case then it would not be possible to verify or rely on the data they provide.
In addition to standard reporting, SoRs are a great place for data scientists and other analysts to get reliable, clean consistent data. A well-designed, well-managed relational DWH has a great deal in common with, and will typically qualify as, a ‘SoR’. However, more flexible, ‘agile’ systems are needed for the day-to-day conduct of activities like rapid analysis, prototyping and problem solving.
领英推荐
Systems of Innovation
In practice much of the actual data analysis done to empower management decision-making often happens 'downstream' of the SoR, in applications and systems that are more flexible and 'sandbox' in nature. These are the so-called 'systems of innovation' (SoIs) that allow users to more freely experiment and ‘innovate’ with data.
Common examples would be SAS or PowerBI environments, that draw data from SoRs and other systems into separate data structures and allow analysts to independently ‘wrangle’ the data to generate new outputs. SoIs can share form and infrastructure with SoRs. For example, they could occupy separate schemas on the same database server, allowing analysts a safe space alongside the official data warehouse in which to create data objects and alter data to their heart’s content – without negatively impacting the monthly Board report pack.
One of the common issues that people take with DWHs occurs when they mistakenly try to use their DWH as the means to build SoI-type solutions. A SoI can certainly be built using a relational, dimensional model and using DWH data as a source, but the fast-evolving and purpose-specific nature of SoIs means that the heavier governance and planning requirements for building DWHs are an unnecessary hindrance.
With a few basic rules to govern them, such as "don't use SoIs for long-term production purposes" and "archive data and code when the temporary need has passed", SoIs can be constructed using whatever approach or technology ticks the right boxes at the time. Once the 'proof of concept' phase of a longer term project is complete, solutions developed in a SoI can be properly productionised into a System of Record.
Choosing Your Approach
These two concepts (SoR/SoI) are very helpful for bringing clarity to the discussion. The flow chart below suggests how to determine which type of system best fits the use case:
If you have varied requirements around accountability, reliability etc. then it is perfectly acceptable to have both SoRs and SoIs running in parallel, to serve different needs.
Similarly, if your data systems need to serve both stable and variable data, then parallel DWH/DL systems, or a combined data ‘lake/house’ system that can provide both relational and file store data repositories might be in order. Microsoft’s Synapse offering is an example of this, and many vendors now offer products with the ability to cater for different types of storage within the same overall application and interface.
Conclusion
There is an adage that ‘if the only tool you have is a hammer, then every problem looks like a nail’. IT professionals accustomed to working with DWHs often have a strong inclination to always try to add new sources of data to the DWH, even if the data doesn’t really suit this approach, rather than to utilise a separate 'system of innovation' or data lake.
I have fallen into this trap myself once or twice, and learned painfully the error of my ways. Collectively professionals should stop arguing that DWHs are too rigid, or that data lakes are too messy, and instead recognise the advantages and limitations of each. Using the right tool for each job, we can employ different options to craft solutions that are best-suited to the needs of our partners.
Note: The terms SoR and SoI and the view of types of data systems we put forward in this article are inspired in part by the 'Pace layer' application architecture originally published by Gartner in 2012. In keeping with our 'data-centric' philosophy, our view is focused on the requirements that data systems must satisfy, rather than the commonality or otherwise of applications across industries.
Questions on Data Warehousing, Data Integration, Data Quality, Business Intelligence, Data Management or Data Governance??Click Here?to begin a conversation.
John Thompson is a Director with EY's Technology Consulting practice. His primary focus for many years has been the effective design, management and optimal utilisation of large analytic data systems.
BI Business Partner to Consumer at Vodafone
2 年Very good John