A Simple Guide to Considering Data Gravity in a Hybrid & Edge Computing World
I recently asked a few questions of my incredibly smart LinkedIn community. It started with a discussion about on-prem vs. public cloud, and then I decided to get more specific and ask about Data Gravity. My concern is that some of us (OK, maybe just me) are oversimplifying, in the abstract, the discussion of where data needs to live, especially when you consider the hybridized nature of modern IT. Well, I asked for it and I got it: dozens of great comments and answers, and the following is just a sampling. If you’d like the whole story, which includes contributions I didn't include here, or would like to participate, feel free to visit the discussion on LinkedIn.
The Questions
- What is the best method to review your #Datagravity requirements?
- What process would you use to continue to validate decisions on location or application design?
- Would you consider AI or fixed policies on data replication, latency or location?
- Would you consider having a review process or tool for managing your application ecosystem?
When you think about your data, the following are likely to be your key considerations in deciding placement for best value.
Val Bercovici CEO/Founder (@Valb00)
Data Domesticity Regulations: The Law trumps size, elasticity and performance influences on Data Gravity
Ralph Loura CTO/CIO @RalphLoura
Traffic patterns to/from the data along with latency requirements
Would prefer having a tool to monitor/enforce and potentially automate data management, with humans setting the policy. Not opposed to the idea of AI, but not necessarily in favor yet either
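Ralph’s “humans set the policy, a tool enforces it” could be as simple as a declarative rule set that monitoring evaluates. The sketch below is a hypothetical illustration only; the dataset names, regions, and thresholds are my assumptions, not a reference to any particular product.

```python
# Hypothetical, human-authored placement policy that an enforcement tool could evaluate.
POLICY = {
    "customer_orders": {"region": "eu-west", "max_latency_ms": 50, "replicas": 3},
    "order_archive":   {"region": "any",     "max_latency_ms": 500, "replicas": 2},
}

def check(dataset, observed_region, observed_latency_ms, observed_replicas):
    """Return policy violations for one dataset; a real tool would alert or remediate."""
    rule = POLICY[dataset]
    violations = []
    if rule["region"] != "any" and observed_region != rule["region"]:
        violations.append(f"{dataset}: stored in {observed_region}, policy requires {rule['region']}")
    if observed_latency_ms > rule["max_latency_ms"]:
        violations.append(f"{dataset}: {observed_latency_ms} ms observed, policy allows {rule['max_latency_ms']} ms")
    if observed_replicas < rule["replicas"]:
        violations.append(f"{dataset}: {observed_replicas} replicas, policy requires {rule['replicas']}")
    return violations

print(check("customer_orders", "us-east", 72.0, 3))
```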
Paul Clark Enterprise Data Architect
Regarding review and understanding of data gravity, from a data architecture perspective, we should first understand data as an organism within the enterprise. For this discussion, there are three essential natures of data: pulsatile, persisted, and residual. Pulsatile data, much like blood being pumped to and from the heart, lives close to the action; as Dave mentioned, think of an e-commerce system where customers are placing orders. Persisted data, such as customer and product data, doesn't necessarily have a pulse but is essential for supporting business function. Residual data is the footprint of activity left behind by the running of the business, such as completed orders and stale customer data. If we can accurately classify our enterprise data, we can then know where data needs to live and what performance and accessibility demands are expected. For instance, pulsatile data requires proximity to the action, whereas residual data does not.
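To make a classification like Paul’s actionable, you could tag data sets by how “alive” they are. The Python sketch below is my own illustration of his three natures; the thresholds and field names are assumptions, not his methodology.

```python
from dataclasses import dataclass
from enum import Enum

class DataNature(Enum):
    PULSATILE = "pulsatile"   # in-flight, transaction-driven; keep close to the action
    PERSISTED = "persisted"   # reference data (customers, products) supporting business function
    RESIDUAL = "residual"     # footprint of past activity; can live in cheaper, remote storage

@dataclass
class DataSet:
    name: str
    writes_per_minute: float   # how "alive" the data is
    days_since_last_read: int  # how stale it has become

def classify(ds: DataSet) -> DataNature:
    """Hypothetical thresholds -- tune to your own workload profile."""
    if ds.writes_per_minute > 10:
        return DataNature.PULSATILE
    if ds.days_since_last_read <= 90:
        return DataNature.PERSISTED
    return DataNature.RESIDUAL

print(classify(DataSet("open_orders", writes_per_minute=500, days_since_last_read=0)))      # PULSATILE
print(classify(DataSet("product_catalog", writes_per_minute=0.2, days_since_last_read=3)))  # PERSISTED
print(classify(DataSet("orders_2015", writes_per_minute=0.0, days_since_last_read=400)))    # RESIDUAL
```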
Lori MacVittie Principal Technical Evangelist @lmacvittie
Gravity is affected by the weight (mass) of data sources. The bigger the data store, the less likely it - and highly dependent apps - are going to move off prem. Data on new app dev shows that only 15% or so are cloud native; the rest are traditional architectures. That makes the other 85% harder (but not impossible) to migrate.
Rick Parker Senior Systems Engineer @parkercloud
Data gravity: locate data proportionally closest to its biggest users, and validate the design against cost comparisons with public clouds. Use fluid policies on replication, latency, and location; AI or ML preferred. I use monitoring as the review process and tool.
Dave McCrory VP Software Eng for Machine Learning / IIoT & father of “Data Gravity” @mccrory
First, data in flight, if frequently sampled, has both gravity and inertia, regardless of persistence. The best method of identifying #Datagravity requirements is to measure end-to-end (and point-to-point) request/response latency and bandwidth versus the required/desired latency and bandwidth. This could be API calls, DB requests, app responsiveness, etc. The process answer is somewhat baked in, but additional tools or monitoring could be used. There are also elements of data governance/provenance, costs, and outside factors that need to be accounted for. AI wouldn't be considered, IMO, but yes to easily understood policies regarding replication, latency, bandwidth, and location. A review process is absolutely needed; a tool managing an entire ecosystem? I haven't seen such a tool that doesn't have a crazy amount of overhead and maintenance involved.
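Dave’s “measure end-to-end latency versus what you require” can start with a very small probe. The sketch below is a minimal illustration; the endpoint and the 150 ms target are made-up assumptions, and a real assessment would also capture bandwidth, failures, and point-to-point hops.

```python
import statistics
import time
import urllib.request

# Hypothetical endpoint and target -- substitute your own API calls or DB requests.
ENDPOINT = "https://api.example.com/orders"
TARGET_P95_MS = 150.0   # required/desired response latency

def sample_latency_ms(url, samples=20):
    """Measure end-to-end request/response latency for a single endpoint."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            urllib.request.urlopen(url, timeout=5).read()
        except OSError:
            continue  # a real tool would count and report failures separately
        timings.append((time.perf_counter() - start) * 1000)
    return timings

timings = sample_latency_ms(ENDPOINT)
if len(timings) >= 2:
    p95 = statistics.quantiles(timings, n=20)[-1]  # rough 95th percentile
    print(f"p95={p95:.1f} ms vs target {TARGET_P95_MS} ms")
    if p95 > TARGET_P95_MS:
        print("Latency gap suggests the data (or the app) is in the wrong place.")
```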
Jason Collier Founder @bocanuts
Here is an example from a customer of ours in Europe, a large global retailer with over 6,000 stores worldwide that deploys both private cloud and edge. Each store requires an inventory tracking system, point-of-sale systems, a security camera archive, and other misc. resources. The data that is relevant to corporate goals is processed at the edge and then relayed up to HQ; in their case, the relevancy of the data determines its locality. Also, some of these stores are smaller and in remote regions with very poor connectivity options, making it a requirement that they be able to operate independently of a connection to HQ for extended periods.
Rick Drescher Data Center, Interconnection & Cloud Strategy Consultant @Rick_Drescher
Speaking from the perspective of a recent project, this issue came into play with a massive data warehousing/analytics platform that had hundreds of terabytes of historical data, with the growth of that data dramatically slowing over the past few years. The initial thought was that the behemoth data set, living on a pricey SSD SAN, would be treated like a boat anchor in the enterprise data center, closely coupled with the compute required to run analytics on the data as required by clients for the foreseeable future (this was a SaaS data analytics platform). With the help of data analytics tools and some really smart developers, it was determined that a consolidated data set, less than 10% of the size of the full data set, produced analytics within 0.004% of the accuracy of the full data set, at an almost unbelievable performance improvement of more than 700%. This made it instantly plausible to migrate the platform to the public cloud, avoiding the sizable capital investment the client had been repeating every 5-7 years simply to keep purchasing maintenance on the hardware and to keep the operating system and underlying software current for security.
Ryan Fay CIO @RyancFay
Data gravity is relative to the use case and is often decided based on numerous factors such as compliance and regulations. The same goes for location: my preference is to keep data as close to the edge/fog as physically possible, thus reducing latency and cost. We currently validate our designs against private, public, and hybrid multi-cloud value/cost metrics. We concentrate on automating fluid policies based on replication latency, use case, and location. GCP offers fantastic AI and ML tools that have saved my teams an immense amount of time and energy. As I stated in another comment, the more your applications use native platform or cloud features, the less likely it is that your apps will be easily portable. The reason is that many desirable capabilities (that I truly value) are tied to a specific PaaS, IaaS, SaaS, or TaaS, and those just can't be migrated as-is, or in some use cases should not be relocated, for many reasons.
Yuval Dimnik Co-Founder @yuvaldim
Data gravity can be roughly factored by the following, to compare complexities across organizations and projects:
- Number of data classes you have (this mostly factors into complexity, which affects TCO).
- Net size of each data class: the more data you have, the more gravity you have.
- Required performance, both in data creation and in consumption: higher requirements mean more gravity.
- Generator/consumer locations, to be factored against the next two items.
- Provided performance at each client location: the better the performance you can actually get, the lower the gravity.
- Cost of consumption (egress cost, etc.): if it is expensive to move data to that client, gravity is higher.
- Cost of management to achieve the above, dependent on the number of data classes, clients, locations, vendors, external and internal regulations, and more.
If an org spends 30% of an FTE to optimize data that has $5K a year of egress, it doesn't make sense. If they don't optimize and have $200K of egress... well, same problem.
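Yuval’s factors lend themselves to a back-of-the-envelope score. The sketch below is purely illustrative; the weighting and every number in it are my assumptions, not a formula he provided.

```python
def gravity_score(size_tb, required_iops, provided_iops,
                  egress_cost_per_tb, consumed_tb_per_year):
    """Made-up weighting of Yuval's factors: more data, a bigger gap between required
    and delivered performance, and pricier egress all push the score (gravity) up."""
    performance_gap = max(required_iops / max(provided_iops, 1.0), 1.0)
    yearly_egress = egress_cost_per_tb * consumed_tb_per_year
    return size_tb * performance_gap + yearly_egress / 1000.0

print(gravity_score(size_tb=300, required_iops=50_000, provided_iops=20_000,
                    egress_cost_per_tb=90, consumed_tb_per_year=40))

# His closing point is about management cost: weigh egress spend against the people
# cost of optimizing it (all numbers below are illustrative).
fte_fraction, fte_cost = 0.30, 150_000          # 30% of an FTE at an assumed salary
egress_small, egress_large = 5_000, 200_000     # $5K vs. $200K of egress per year
print(f"Spending ${fte_fraction * fte_cost:,.0f} of labor to trim ${egress_small:,} of egress makes no sense;")
print(f"ignoring ${egress_large:,} of egress is the same problem in reverse.")
```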
Whether you’ve already put a data management plan in place for modern deployment of applications or are just considering one, I would highly recommend taking some of the advice from this incredible group of technologists and strategists.
Why
I asked the questions because I like to challenge myself and my assumptions. There is a ton of FUD out there, and if you haven't looked at how to solve a data gravity issue in the last six months, you've probably missed out on some important strategies and technologies that could help create opportunities for you in new and unique ways.
Shout out to Dave McCrory for the use of the term #DataGravity
Great summary, Mark. BTW, I'm also the founder of NooBaa ;)
IMHO, one large driver of data gravity, or rather data inertia, is that the applications that create and access the data are rarely architected from the onset with planning for how to deal with the ever-increasing amount of data, thus resulting in the proverbial "data lake" that is challenging to migrate. This is somewhat analogous to how Las Vegas was built. Instead of building the roads first, the city (data in this case) was built first, and then the ways to access it were constructed.
Building AI Factories, Open Source & Cloud Native
What a great summary post, Mark! Thanks for level-setting where Cloud is roughly a decade after it entered our consciousness :)
IT Strategy, Architecture & IT-enabled Process Improvement, ITIL & LEAN Six σ accredited
What I truly loved, Mark, is the chorus of "it depends" and "your mileage may vary." I was never fond of the extremist views of 'all cloud all the time' - workloads will vary, and I love the term data gravity because it accurately sums up the issue. With the average large organization running at least hundreds if not thousands of legacy apps, the concept of data gravity will give architecture teams a great pointer on what to look at first.