The Error of Data Gravity Dictums

For over a decade, Data Gravity has been a byword for the agglomeration of data stores (primarily analytic stores, like data lakehouses but also application-specific databases) into a common location. Common justifications include minimisation of latency, complexity and (especially in a cloud context) egress. Hence, in a cloud context, the most common IaaS platform for application services becomes the default location for the analytic platform.

Putting aside the challenge of the data function kowtowing to IT (which I have also commented on), and the obstacle that poses to deploying the best analytic platform, is this cloud co-location rationale justified? No, it is not.


First, let’s consider structured data. Often this is the core business data for any organisation, sitting in CRM, ERP, HCM etc. applications. Increasingly, these applications have shifted from legacy on-premises deployments to modern SaaS offerings. Even when a ‘lift-and-shift’ migration has taken place to move those applications onto an organisation’s preferred cloud IaaS platform, subsequent upgrades often tout the need to adopt a newer SaaS model. Today, it is estimated that around 70% of all software in an organisation is SaaS. And that is only projected to grow, with the 2025 figure expected to hit 85%.

The whole point of SaaS is that an organisation does not deal with any of the infrastructure. The SaaS offering may by happenstance be using the same IaaS provider. But it may not. And there are no guarantees that the SaaS vendor will not change its underlying provider in future (indeed, their flexibility in managing and optimising their underlying infrastructure is part of their value proposition). Whilst an organisation’s proportion of SaaS applications may not be identical to their proportion of data - it is easy to conceive of data-heavy applications remaining on-premises - it is reasonable to assume that, at an aggregated organisational level, the share of data tracks the share of applications. Put another way, if 85% of applications are SaaS, then in future perhaps as little as 15% of an organisation's data will originate from their cloud IaaS-hosted applications. Hardly a compelling case for using data gravity as a justification to mandate that same cloud for their analytic platform.

Yet structured data is a small subset of an organisation’s total data. Emails, documents, images, audio files and even social media posts make up the majority - estimated at around 80% of an organisation’s data. Rarely does this unstructured data originate from an organisation’s IaaS applications. More commonly, it surfaces from an entirely separate source (although this may be coincidentally from the same vendor), which may or may not also be SaaS. Furthermore, the proportion of unstructured data is only growing, estimated at 4x the pace of structured data.

If we make the reasonable assumption that a negligible amount of an organisation’s unstructured data originates from their managed IaaS platform (especially when considering sources like emails with attachments, contact centre audio logs or video footage recordings), then only structured data remains, which itself is just 20% of an organisation’s data. Continuing our calculations, 15% of that 20% means only 3% of an organisation’s total data originates from their cloud-hosted IaaS applications.

But even this is only a subset. Hitherto, we’ve discussed structured and unstructured data that originates from an organisation - their first-party data. More and more organisations are harnessing external data from second-party and third-party sources. Google Cloud has one of the largest marketplaces for public and private data. These can dwarf the total data volume of an organisation - for example, Google Earth Engine has over 50 Petabytes of data!

Whilst statistics on what proportion of the data an organisation uses is first-party versus other-party are hard to come by, it is reasonable to assume the external share will only increase with time. Some industries (e.g. financial services with exchange tick data and customer credit bureau details) have been using this for a very long time. Others are more nascent. But given the multitude of potential partners and sources, it is reasonable to assume that - if not now, then at least in future - external data will exceed internal data. For our calculations, let’s assume a balanced 50:50 proportion. Halving the 3% figure, a paltry 1.5% of the data an organisation will analyse originates from their cloud-hosted IaaS applications. So tiny a proportion as to make dictums of co-location for analytic platforms risible.
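For readers who want to vary the inputs, the chain of estimates above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `iaas_data_share` and its parameter names are hypothetical, and the three input shares (85% SaaS, 20% structured, 50:50 first-party to external) are the article's own estimates and assumptions rather than measured values. Unstructured data originating from IaaS is treated as zero, per the assumption stated earlier.

```python
from fractions import Fraction


def iaas_data_share(saas_app_share, structured_share, first_party_share):
    """Return the three successive shares used in the article's estimate."""
    # Assumption: the share of data tracks the share of applications,
    # so non-SaaS applications bound the IaaS-originated structured data.
    non_saas = 1 - saas_app_share
    # Assumption: unstructured data originating from IaaS is negligible (zero),
    # so only the structured slice of first-party data counts.
    of_first_party = non_saas * structured_share
    # Scale by the first-party share of all data the organisation analyses.
    of_all_analysed = of_first_party * first_party_share
    return non_saas, of_first_party, of_all_analysed


# Fractions keep the arithmetic exact (no floating-point rounding noise).
non_saas, of_first_party, of_all_analysed = iaas_data_share(
    saas_app_share=Fraction(85, 100),
    structured_share=Fraction(20, 100),
    first_party_share=Fraction(50, 100),
)

for label, share in [
    ("Non-SaaS share of structured data", non_saas),
    ("IaaS share of first-party data", of_first_party),
    ("IaaS share of all analysed data", of_all_analysed),
]:
    print(f"{label}: {float(share):.1%}")
```

Changing any one input shows how sensitive the conclusion is: even doubling the first-party share to 100% only brings the figure back to 3%.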


Then why does this error persist? Perhaps because so many organisations start with their internal, structured, hosted data - as it is typically the easiest to access. The organisation is responsible for the underlying data store (usually a database) so pulling data from there is somewhat straightforward. Much easier than integrating SaaS APIs or applying Machine Learning to unstructured data or brokering partnerships with second-party data providers via clean rooms. Plus it is the only data many organisations have historically analysed.

So that small 1.5% (and shrinking) of total data dominates the immediate decision-making process. Then once an analytic platform dictum is made to align deployment with the preferred cloud for IaaS applications, organisations find reversing this decision intractable, even when they encounter challenges with integrating SaaS, unstructured and external data into their approach. They coalesce around this local maximum, forgoing the advantages of exploring the search space for a global maximum.

The remedy is forcing a big picture conversation. Examine not just the immediate data an organisation wishes to incorporate into their analytic platform but the long-term totality - especially unstructured and external data (which make up the majority). Consider where data might originate from several years hence, not several weeks. Understand the potential capabilities for deriving insights and value from the totality of data and use this to determine the best analytic platform - treating co-location as the irrelevancy it truly is.


P.S. Despite my personal advocacy for Google Cloud’s data platform, this principle applies in all instances. So even if Google Cloud is your preferred cloud for IaaS applications, that does not mean a dictum to use our data platform is an appropriate decision-making process. Rather, select the best analytic platform for your needs - independent of prior IaaS cloud commitments.

Victor Mumo Titus

Engineering Manager at WorkPay Africa

3 days ago

Very informative Duncan Foster

Meir Fox

Google Cloud | Data Analytics and AI Sales Specialist

4 days ago

Duncan Foster this is beautifully reasoned, even more so when quantified (1.5%!) Egress and latency arguments fall further flat when organizations consider the need of multimodality. If unstructured data constitutes the majority, and is the fuel behind Generative AI/ ML initiatives (which are a priority for many orgs), then any decision around choosing an analytics center of 'gravity' should entail the big picture like you mention- e.g. a platform's ability to integrate, scale, and govern that data alongside structured, second, and third-party data.
