Why Do We Need Analytic Data Platforms?
The analytic data platform has long been the default approach to delivering on the reporting and analysis requirements of the enterprise.

When talking to customers I encounter the same questions again and again. One of the most common is "Why do we need a specially designed analytic data platform?" To answer it, we first need to understand some of the features of operational data models that make them an awkward foundation for enterprise analytics.

1. Mixed Workload and Operational Performance

It should be no surprise that operational databases are designed to deliver good operational performance. This typically means rapid completion of transactions involving individual records, e.g. amending a customer's billing address, or retrieving a handful of records, e.g. a list of recent transactions for the customer who has just called the help-desk. In these circumstances a delay of even a few seconds can be significant. Analytic queries, on the other hand, tend to sweep across large sets of data and long time periods, touching many objects and records. For example: tell me how many customers in city A between the ages of 18 and 30 bought products in category B last summer but not this summer.
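
In SQL, that last question might look something like the sketch below. The schema (customers, orders, order_lines, products) and the dates are purely illustrative:

    -- Customers in city A, aged 18-30, who bought category-B products
    -- last summer but not this summer (illustrative schema and dates).
    SELECT DISTINCT c.customer_id
    FROM customers c
    JOIN orders o       ON o.customer_id = c.customer_id
    JOIN order_lines ol ON ol.order_id   = o.order_id
    JOIN products p     ON p.product_id  = ol.product_id
    WHERE c.city = 'A'
      AND c.age BETWEEN 18 AND 30
      AND p.category = 'B'
      AND o.order_date BETWEEN DATE '2018-06-01' AND DATE '2018-08-31'
      AND NOT EXISTS (              -- ...but no category-B purchase this summer
        SELECT 1
        FROM orders o2
        JOIN order_lines ol2 ON ol2.order_id  = o2.order_id
        JOIN products p2     ON p2.product_id = ol2.product_id
        WHERE o2.customer_id = c.customer_id
          AND p2.category    = 'B'
          AND o2.order_date BETWEEN DATE '2019-06-01' AND DATE '2019-08-31'
      );

Unlike an operational lookup, a query like this has to scan large portions of four tables, twice over.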

It doesn't take much knowledge of databases to guess that analytic queries are 'heavier' and demand more processing power and time than operational ones. Where there is a 'mixed workload' of operational and analytic activity, analytics will frequently gum up the works, slowing down business critical activities like customer service.

If nothing else, many organisations will at this point realise the need to create a separate copy of the operational database for analytic purposes, in order to protect operational performance.

2. Complexity

While some 'power users' will be competent in SQL or another data access tool and will understand intimately the data models of one or two systems, this knowledge is hard to acquire and is beyond the reach of most business professionals. Data scientists may scoff, but in fact the vast majority of analysis is done by ordinary business users, who don't know SQL and don't know the operational data models.

They do, however, understand their businesses very well and typically just want to get the data they need into Excel, or a similar tool, where their real skills will kick in.

For these users, the complexities typically inherent in operational data models, particularly normalisation and overloading, are often a bridge too far, forcing them to rely on intermediaries to get the data for them.

Normalisation

Normalisation is a set of 'decomposition' techniques used to avoid the repetition of data in a database. Decomposing things into small units means that those small units can be searched more quickly, and that data inserts, deletes or updates will be completed more rapidly. The primacy of performance in high-throughput operational systems makes this approach essential.
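
As a minimal sketch (all table and column names here are hypothetical), a single wide customer record might be decomposed like this:

    -- Before: one wide table repeating address and type details on every row.
    -- customer(id, name, age, street, city, country, type_description, ...)

    -- After normalisation: small, single-purpose tables.
    CREATE TABLE customer (
        customer_id      INT PRIMARY KEY,
        customer_type_id INT NOT NULL      -- references customer_type
    );

    CREATE TABLE customer_name (
        customer_id INT NOT NULL,          -- references customer
        name        VARCHAR(100) NOT NULL
    );

    CREATE TABLE customer_address (
        customer_id INT NOT NULL,          -- references customer
        street      VARCHAR(100),
        city        VARCHAR(50),
        country     VARCHAR(50)
    );

    CREATE TABLE customer_type (
        customer_type_id INT PRIMARY KEY,
        description      VARCHAR(50)
    );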

Unfortunately, normalisation introduces complexity. If we have separate database tables for customer, customer age, customer name, customer address, customer type and so on, the number of tables quickly increases, and so does the number of joins required to bring all of the related data back together.
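
Reassembling a rounded view of one customer then needs a join for every attribute table. Continuing the hypothetical schema above:

    -- One row of business-readable customer data requires three joins.
    SELECT c.customer_id,
           n.name,
           a.street, a.city, a.country,
           t.description AS customer_type
    FROM customer c
    JOIN customer_name    n ON n.customer_id = c.customer_id
    JOIN customer_address a ON a.customer_id = c.customer_id
    JOIN customer_type    t ON t.customer_type_id = c.customer_type_id
    WHERE c.customer_id = 12345;

Add phone numbers, order history and preferences, and the join count climbs quickly.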

Overloading

'Overloading' is the reuse of the same data structures for more than one purpose, based on context. The applications adding to or editing operational data are usually custom-built programs working with only one record at a time. As long as the data structure supports it, each process can store different types of data within the same structure in its own unique way.

For example, if you have a structure for customer address, you could reuse it for employee address, or indeed as a generic address table that serves all address requirements. You could even have a table that stores each component of the address in a single column with some contextual keys, so that a single address is spread over multiple rows in the same table, with one row each for street, city, country and so on. And we don't need to limit this type of structure to address information: if it is a text field, we can store anything in there, right?
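
A minimal sketch of such a generic, overloaded structure (all names hypothetical) makes the problem concrete:

    -- One generic table holds every kind of address, one component per row.
    CREATE TABLE generic_address (
        entity_type     VARCHAR(20),   -- 'CUSTOMER', 'EMPLOYEE', 'SUPPLIER', ...
        entity_id       INT,           -- key of whichever entity the row belongs to
        component_type  VARCHAR(20),   -- 'STREET', 'CITY', 'COUNTRY', ...
        component_value VARCHAR(200)   -- free text: anything at all can go in here
    );

    -- A single customer address becomes three rows:
    INSERT INTO generic_address VALUES ('CUSTOMER', 12345, 'STREET',  '1 Main St');
    INSERT INTO generic_address VALUES ('CUSTOMER', 12345, 'CITY',    'Dublin');
    INSERT INTO generic_address VALUES ('CUSTOMER', 12345, 'COUNTRY', 'Ireland');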

The difficulty of 'unwinding' overloading means that complex code is needed to make sense of each context in which an object is used. For example, multiple self-joins may be required to 'flatten' a self-referential hierarchy, or custom 'role views' must be constructed to surface the different data sets buried in overloaded base tables. You can't just hand a table like this to a business user and say 'the customer address data is all there, knock yourself out'.
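
For the hypothetical address table above, the 'unwinding' query might pivot the rows back into columns, here using conditional aggregation rather than literal self-joins:

    -- Flattening the overloaded table back into one row per customer address.
    SELECT ga.entity_id AS customer_id,
           MAX(CASE WHEN ga.component_type = 'STREET'  THEN ga.component_value END) AS street,
           MAX(CASE WHEN ga.component_type = 'CITY'    THEN ga.component_value END) AS city,
           MAX(CASE WHEN ga.component_type = 'COUNTRY' THEN ga.component_value END) AS country
    FROM generic_address ga
    WHERE ga.entity_type = 'CUSTOMER'
    GROUP BY ga.entity_id;

Every new component type means another line of code, and nothing in the database stops a 'CITY' row from containing a phone number.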

Normalisation, overloading and other design features of operational databases pose significant challenges for business analysis of data.

3. Architectural Scalability

As organisational scale and IT complexity increases, bottlenecks and conflicts begin to arise for business analysts seeking to use data.

  • Stage 1: At the outset in small organisations, the business users have the necessary skills and knowledge to access and use operational data themselves.
  • Stage 2: As the business grows, not all staff will have the skills or knowledge to use the source systems directly, and some will begin to rely on individual 'Power Users' to get data for them.
  • Stage 3: As new source systems come online and business activities increase, additional Power Users are required to meet demand. Reliance on Power Users creates bottlenecks for business users, who have to wait in line to get what they need. Each Power User accesses the source systems in their own way, according to their personal skills and knowledge, and business users increasingly find that the data they have differs from the data that others have, leading to disagreement, confusion and conflict.
  • Stage 4: The introduction of an analytic data platform resolves bottlenecks and disagreements, by making source data available to all users in a consistent and easily understood way. Any new source systems will be integrated into the platform, meaning that this architectural approach can scale almost indefinitely.

Stages of development

4. Combination, Federation and Abstraction

A common activity for analysts is combining data from multiple sources. Imagine you are an analyst creating sales reports for a retail organisation, and you do so by pulling information from a copy of the transactional system into Excel and formatting it there. Then your firm acquires another retailer with a different set of systems, and the CEO wants a single combined report showing all sales.

You can either do the combination manually, by generating separate reports from each system and combining the outputs (e.g. in Excel), or you can employ a programmatic approach, combining the two sources in the data layer by writing code that merges the objects from both systems into a single model. I refer to this approach as 'federation', even though the model is often physical rather than virtual.

The manual approach is laborious, requiring regular, repeated manual effort, and is vulnerable to both human error and resource constraints. The federation approach is technically more challenging at the outset but provides a more stable and responsive long-term solution.

If the analytic data model has been created as a replica of the first source, then incorporating a second source system with a different way of representing data could be very difficult. The model may need to change significantly, and all of the dependent reports and data processes downstream may need to change too - a nightmare.

A better approach is to 'abstract' the analytic model from the source at the outset, so that it is closer to a logical/conceptual model of the business than to any specific technical representation.
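
As a sketch of what this can look like in practice, an abstracted 'sale' model over two acquired systems might be exposed as a single view. The source schemas retail_a and retail_b, and all the column names, are hypothetical:

    -- One business-level view of sales, insulated from two different source models.
    CREATE VIEW sale AS
    SELECT t.trans_id   AS sale_id,
           t.cust_no    AS customer_id,
           t.trans_dt   AS sale_date,
           t.amt        AS sale_amount,
           'RETAIL_A'   AS source_system
    FROM retail_a.transactions t
    UNION ALL
    SELECT s.order_key  AS sale_id,
           s.cust_key   AS customer_id,
           s.order_date AS sale_date,
           s.net_value  AS sale_amount,
           'RETAIL_B'   AS source_system
    FROM retail_b.sales_orders s;

Reports built on the sale view keep working when a third retailer's system arrives; only the view needs another branch.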

Users familiar with the business concepts will quickly be able to understand the abstracted model and find their way around. The specific details and complexities of the source data become immaterial, and therefore new source systems can be integrated in future more easily and with less disruption.

This idea of placing a layer of conceptual separation between the source systems and the downstream outputs is called 'abstraction'. It can empower and stabilise business activities over the long term, and insulate the business from technical and infrastructural change.

Conclusion

It is for all of these reasons and more that the analytic data platform has long been the default approach to delivering on the reporting and analysis requirements of the enterprise.

In subsequent articles I will talk more about different approaches to the architecture and design of analytic data platforms and how best to face the many challenges that analytic workloads generate.

Questions on Data Strategy, Data Warehousing, Data Integration, Data Quality, Business Intelligence, Data Management or Data Governance? Click Here to begin a conversation.

John Thompson is a Managing Partner with Client Solutions Data Insights Division. His primary focus for the past 15 years has been the effective design, management and optimal utilisation of large analytic data systems.

Anne O'Regan

Senior SAP Consultant at Client Solutions

5y

Hi John. Thanks it's very good. The challenge is to have the Analytic Data Platforms updated in real-time.

Great piece John. Very well explained.

Conor Daly

Senior Manager, Data Engineering at EY

5y

Very nice summary John!
