Federate before you Replicate

Duncan Foster

Google Cloud | Africa | AI, Analytics & Data

Published: 22 October 2024

Far more organisations bring data into a central location (whether a data warehouse, data lake or data lakehouse) than need to. Often this is because they take a bottom-up view of their data inventory, seeing their first step as bringing all information together in one place and only afterwards figuring out what to do with it. Frequently it is because technology (or IT) leads the way, rather than business or data.

Unfortunately, the alternative tends towards data chaos. Analytic tools such as Power BI and Tableau (and perhaps even Google Cloud’s Looker Studio) encourage independent data collection, where multiple data sets can be combined in an ad hoc manner. The advantage of this top-down approach is that only pertinent data sets are retrieved: those required for the value-creating output. The problem it creates is data inconsistency, with different people attempting the same activity and ending up with different results because of discrepancies in how they source, retrieve and combine data sets. Distrust of data ensues.

We need both governance and accessibility, which is why organisations typically jump to the conclusion that a centralised, singular data store is required. But they are often wrong.

Certainly, you need to combine relevant data together for analyses. Finding intriguing linkages between disparate information is where novel value is uncovered. However, this does not necessitate that all those data sets be colocated and stored within the same system. It does mean that the knowledge of how those data sets should be harnessed must live in a single source, one that represents the organisation's canonical approach to sourcing, understanding and combining that data.

Business Objects Universes were an early example of this. They did not retain data but defined where data (tables) was located and how each table related (joined) to the others. Google Cloud’s modern equivalent is Looker’s Semantic Layer (using LookML). However, both technologies are designed primarily to describe the organisational understanding of data within the same storage location. They are far less suited to describing how to combine data sets located in different databases (both being structured-data technologies). Whilst Looker does have the option to Merge Results, this is capped at certain data sizes and sits outside its governed semantic layer.

So whilst Universes and LookML tackle the challenge of a governed semantic layer for consistency of information pan-organisation (and provide a platform for self-service in the process), they do so predominantly for a single data location at a time. They are not optimised to present a singular view of all data, irrespective of the underlying data store.

This is where data federation technologies come in. Google Cloud’s BigQuery has this capability when the data resides in Google Cloud sources (e.g. AlloyDB, Cloud SQL, Spanner) and even cross-cloud for open file formats (e.g. Iceberg in AWS S3 or Parquet in Azure Blob Storage). Indeed, for Google Cloud-centric customers, I frequently recommend (especially when their data sets are small) always starting with federated analysis and only progressing to physically ingesting data into BigQuery if federation cannot meet their demands. For customers with data spread more widely across their organisation, third-party capabilities like Denodo offer an alternative approach (and integrate nicely with Looker) to expose the totality of an organisation’s data without needing physical movement.
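
As a minimal sketch of what starting with federation can look like, the Python below assembles a BigQuery query that reads an operational database in place through the `EXTERNAL_QUERY` table function. The connection ID, table and column names are hypothetical, and the helper `federated_sql` is an illustration rather than part of any Google library; it only builds SQL text and does not contact BigQuery.

```python
def federated_sql(connection_id: str, source_sql: str) -> str:
    """Wrap a source-system query in BigQuery's EXTERNAL_QUERY table function."""
    # The inner query is passed as a triple-quoted string literal. This sketch
    # does no escaping, so it simply rejects input that would break the literal.
    if "'''" in source_sql or source_sql.endswith("'") or "\\" in source_sql:
        raise ValueError("unsupported quoting in this sketch")
    if "'" in connection_id:
        raise ValueError("connection_id must not contain quotes")
    return (
        "SELECT *\n"
        f"FROM EXTERNAL_QUERY('{connection_id}',\n"
        f"  '''{source_sql}''')"
    )


# Hypothetical Cloud SQL connection and hypothetical orders table.
query = federated_sql(
    "my-project.us.orders-cloudsql",
    "SELECT customer_id, amount FROM orders WHERE amount > 0",
)
print(query)
```

The inner statement runs on the source database and only its result set travels to BigQuery, so nothing is copied into BigQuery storage unless a later step chooses to persist it.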

Whilst they seem very similar, semantic layers and data federation play different roles. The former provides consistency in downstream information, acting as the canonical definition of information for the entire organisation (and avoiding the data chaos that comes from a plethora of individuals manipulating data according to their own understanding). The latter is the engine that optimises the retrieval of data, wherever it may be located, making it readily available for use (preferably by the semantic layer).
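
To make the division of labour concrete, here is a small sketch in Python. It is an analogy rather than a real deployment: SQLite's `ATTACH DATABASE` stands in for a federation engine, two in-memory databases stand in for separate source systems, and the `SEMANTIC_LAYER` dictionary, the metric name and all table names are hypothetical.

```python
import sqlite3

# Semantic layer (hypothetical): the single governed definition of a metric,
# recording where each table lives and how the tables join. It holds no data.
SEMANTIC_LAYER = {
    "revenue_by_region": {
        "tables": {"orders": "sales.orders", "customers": "crm.customers"},
        "join": "orders.customer_id = customers.id",
        "dimension": "customers.region",
        "measure": "SUM(orders.amount)",
    }
}


def compile_metric(name: str) -> str:
    """Turn a governed metric definition into SQL over the source locations."""
    m = SEMANTIC_LAYER[name]
    t = m["tables"]
    return (
        f"SELECT {m['dimension']} AS region, {m['measure']} AS revenue "
        f"FROM {t['orders']} AS orders "
        f"JOIN {t['customers']} AS customers ON {m['join']} "
        f"GROUP BY {m['dimension']} ORDER BY region"
    )


def open_federated_connection() -> sqlite3.Connection:
    """Federation stand-in: one connection that can reach two separate stores."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS sales")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO sales.orders VALUES (?, ?)",
        [(1, 100.0), (2, 50.0), (1, 25.0)],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "East"), (2, "West")],
    )
    return conn


conn = open_federated_connection()
rows = conn.execute(compile_metric("revenue_by_region")).fetchall()
print(rows)
```

Everyone who asks for `revenue_by_region` gets the same join and the same aggregation, while the data itself stays in the two source databases and is only combined at query time.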

Data warehouses (the same applies to more recent nomenclature like data lakes) are like legacy supply chains. Factories would send their goods to a central location, the warehouse, for onward distribution to their point of eventual consumption. This bulk movement of goods was necessary because supply chains were immature and depended on well-planned, forecasted, large delivery cycles. Data federation is like a modernised supply chain relying on just-in-time drop-shipment. Consumers still request data from a centralised location (or API), but rather than being ‘warehoused’ in advance, the required subset of data is delivered on demand, in real time and just in time from the ‘factory’ (the source system). And as with supply chains, the advance is not just the ability of the factory (source system) to produce its output (data) just in time (performantly) but also the ability of the supply chain (network) to send those goods (data) where they are needed almost as quickly (bandwidth) as if they were coming directly from the warehouse.

Data federation is a better pattern. Not only is data latency improved (data is read directly from the source system on-demand, so is real-time) but complexity is reduced (connectivity is needed to the source system but not connectivity plus a data pipeline into a secondary data store). Federation should always be the starting point for an organisation’s approach to data - for many, the only one they will ever need.

Only if it is truly demanded does data need to flow into another location (e.g. a data lakehouse). There are really only ever two reasons for doing so. One is cost, which usually translates to performance: the originating system may not be optimal for analytical query workloads and so may either perform poorly or require excessive system resources (and hence cost) to deliver acceptable performance. In that case the data must be migrated into an alternative analytic store that provides better performance characteristics (at the given cost), although it is important that the cost and additional complexity be justified (versus, for example, just upgrading the source system to handle these workloads directly).

The second reason is retention. Originating systems may only retain the current state of the information, but the entirety of its historical evolution may have value. Therefore, the data’s entire historical record could be retained in an alternative location to preserve it for posterity (or for regulatory reasons). This has the added benefit of simplifying the future retirement of those source systems, since once their processes have been discontinued, their remaining value lies in their data.

All other reasons are really subsumed into these two. For example, harnessing the data for the latest AI purposes is something that could be done on the original source system - but may be far too slow either to implement (hence development cost) or execute (hence infrastructure cost, or even opportunity cost).
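
The two-reason test above can be written down as a simple check. This is only a sketch of the reasoning in this article; the `SourceAssessment` fields and the function name are hypothetical, and in practice each flag would be the outcome of a proper cost and requirements review.

```python
from dataclasses import dataclass


@dataclass
class SourceAssessment:
    """Hypothetical summary of how a federated source is coping."""
    acceptable_performance_at_acceptable_cost: bool
    retains_required_history: bool


def reasons_to_replicate(source: SourceAssessment) -> list[str]:
    """Return the reasons (if any) that justify copying data elsewhere."""
    reasons = []
    if not source.acceptable_performance_at_acceptable_cost:
        reasons.append("cost/performance")
    if not source.retains_required_history:
        reasons.append("retention")
    return reasons


# An empty list means federation remains the right answer for this source.
healthy = reasons_to_replicate(SourceAssessment(True, True))
strained = reasons_to_replicate(SourceAssessment(False, True))
print(healthy, strained)
```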

The key question as to whether or not federation becomes the preferred data architecture is whether or not data is growing faster than our ability to process it. If data is growing faster, then there will continue to be demand for specialised data processing engines, which can handle the increasingly large data sets demanded of them (far larger than the systems originating that data can handle). If, however, computation speed/cost improves at a faster rate than data is generated, then the ability to compute in place becomes increasingly common and so obviates the need to replicate data into a dedicated location. Related to this are networking capabilities: whilst the speed of light remains a constraint, the faster large amounts of data can be brought together on demand, the less need there is to combine these data sets in advance of retrieval.
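
The comparison of growth rates can be made explicit with a little compound-growth arithmetic. The sketch below is illustrative only: the function name and the annual rates are invented for the example and are not forecasts.

```python
def processing_gap(years: int, data_growth: float, compute_growth: float) -> float:
    """Ratio of data volume to affordable compute after `years`.

    Both start at parity (ratio 1.0) and compound annually at the given rates.
    A result above 1.0 means data has outpaced compute; below 1.0, the reverse.
    """
    return ((1 + data_growth) / (1 + compute_growth)) ** years


# Hypothetical rates: data grows 40% a year, compute per unit cost 25% a year.
widening = processing_gap(5, data_growth=0.40, compute_growth=0.25)

# The same rates swapped: compute improves faster than data accumulates.
narrowing = processing_gap(5, data_growth=0.25, compute_growth=0.40)

print(f"data outpacing compute: {widening:.2f}")
print(f"compute outpacing data: {narrowing:.2f}")
```

When the ratio keeps rising, dedicated analytic engines stay in demand; when it keeps falling, computing in place on the source system becomes progressively easier to justify.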

I cannot state categorically where future data processing trends will take us. But I can state with confidence that the vast majority of organisations should start their data journey with federation, replicating only when the necessity becomes apparent.
