Should you try to take your cloud with you?
In their 1973 research paper Availability: A Heuristic for Judging Frequency and Probability, Daniel Kahneman and Amos Tversky introduced the availability heuristic to the world. Roughly stated, the availability heuristic says that, when making decisions or judging probabilities, we give disproportionate weight to information that comes readily to mind. For example, if lots of people who live down my street drive a particular make of car, I may judge that that make of car is very popular, when it may just be that I live in a street with people of a certain wealth, age, social class and so on. Or maybe I live next to the owners’ club.
I believe that the availability heuristic may be responsible for some of the ways we think and talk about one of the biggest shifts under way in enterprise technology today: the shift from on-premise infrastructure to public cloud platforms. I think this because much of our talk about public cloud is dominated by concepts of movement: we talk about portability, about vendor lock-in, and about various types of exit strategies, to a much greater degree than we do for our on-premise infrastructure. And this is because, right now, the story of public cloud is a story of migration. The adoption of public cloud is still in its early stages across most industries, and much of the work that is being done is to move workloads and data from one platform to another. The idea is prominent in our minds, and leads us to dwell on the thought: ‘If I have to put this much effort into getting in, then what will I need to do to get out?’
These questions are not unique to public cloud. On-premise, we worry about what happens when a data centre fails, and plan ways to switch work from one data centre to another in the event of a disaster. But when building a new data centre, we don’t typically give priority to the way we’re going to get out of it: we regard it as the new home of our technology for a long time to come.
I believe that seeking a degree of portability, the ability to move workloads and data between cloud providers, is a reasonable way of addressing a particular set of risks. However, I also think that we have to think through this topic very carefully if we want to avoid being too strongly guided by the availability heuristic and mistaking the means for the end.
In particular, I think that we should keep asking ourselves one question: what do we intend to achieve? Portability cannot be a goal in itself: there is no point in moving workloads between cloud providers just for the fun of it (and it’s no fun in any case). Here are three reasons often given for seeking portability, and my view of their importance.
To avoid vendor lock-in: like portability, I don’t think that avoiding vendor lock-in is an end in itself. If we are honest with ourselves, we can admit that most enterprises are already ‘locked in’ to many of their vendors. How easy would it be to change your OS, database or ERP provider? Vendor lock-in for cloud assumes prominence (and triggers the availability heuristic) because it involves a big deal with a big provider, or because it involves procuring services that we used to provide ourselves. But there are established ways of mitigating this concern, through thoughtful and thorough supplier management. If we do that well, vendor lock-in may just be another term for partnership.
To apply commercial leverage: on one hand, there is more than one cloud provider and competition is fierce. On the other hand, if we have invested heavily in getting onto a cloud platform, it might feel as if we are at the mercy of the provider of that platform. Isn’t it sensible to maintain flexibility to get the best price? Once again, this is not an irrational or unreasonable concern. But I believe that it is better to mitigate this risk through commercial and contractual means, rather than through engineering. In my experience of outsourcing deals (and, however we present it, cloud is a form of outsourcing), it’s rare for technology to be the deciding factor in switching vendors. Furthermore, we should remember that the cloud market is still young and unsaturated: even the partnership deals being struck with large enterprises currently only address a fraction of their technology spend. The motivation is still with cloud providers to encourage their customers to put more on their platforms.
To mitigate platform failure: I think that this is the most sensible reason for seeking portability - within limits. The disciplines of reliability (building systems that survive routine failure) and resilience (building systems that survive widespread catastrophes) are ever more important parts of enterprise technology. When we place our systems on a public cloud platform, we make their reliability and resilience dependent on the stability of that platform. It is rational to plan for disaster, and one way of planning for disaster is to have somewhere else to go.
However, when planning for platform failure, we must be careful once more not to be unduly influenced by the availability heuristic. Remember that the heuristic influences our judgement of the probability of events. When we spend our time thinking and talking about portability and exit plans, those ideas are prominent in our minds. But this does not mean that the events are frequent or likely. In fact, despite a small number of well-publicised, widespread outages on global cloud platforms, there has not yet been a failure sufficiently broad or extended to trigger any reasonable exit plan.
I believe that there are two remedies for the undue influence of the availability heuristic on this problem.
One is data. When considering our response to potential failures, we should gather as much data as possible about the performance of cloud platforms. Of course, this data is incomplete: the cloud industry is still new, and we do not have that many years of performance data, and even fewer for cloud platforms operating at their current scale and rate of growth. However, we can supplement actual performance data with data about the ways in which cloud providers and their tenants have assessed and addressed risk.
The second remedy is scenario planning. We can get beyond simple questions such as ‘what if cloud provider x goes down?’ by asking what we mean by that question. What would the precise circumstances be? How widespread would the outage be? What is the likelihood of that? What are the cloud provider’s own recovery plans? A considered set of scenarios gives our assessment of risk a degree of precision we cannot achieve with general concerns.
Of course, there is nothing new here. Risk management entails thoughtful responses to risks on the basis of likelihood and impact. Gathering data about cloud platform outages, risks and mitigants, relative to on-premise outages, risks and mitigants, helps us to judge likelihood. Defining and thinking through scenarios helps us judge impact, both of potential failures, and of our responses to those failures (sometimes the mitigant is worse than the risk). This is a lot of work - but it’s less work, and I hope more valuable work, than pursuing portability without a true understanding of the risks we aim to mitigate.
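To make the likelihood-and-impact reasoning above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the scenario names, the probabilities, the cost figures and the helper names (Scenario, expected_annual_loss, mitigation_is_worthwhile) are invented, and a real assessment would draw its estimates from the data gathering and scenario planning described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One failure scenario in a simple risk register (all fields are estimates)."""
    name: str
    annual_probability: float  # estimated chance of the scenario occurring in a year
    impact: float              # estimated cost if it occurs with no mitigant
    residual_impact: float     # estimated cost if it occurs with the mitigant in place


def expected_annual_loss(scenarios, mitigated=False):
    """Sum of likelihood x impact across all scenarios."""
    return sum(
        s.annual_probability * (s.residual_impact if mitigated else s.impact)
        for s in scenarios
    )


def mitigation_is_worthwhile(scenarios, annual_mitigation_cost):
    """True if the expected loss avoided exceeds the yearly cost of the mitigant."""
    avoided = expected_annual_loss(scenarios) - expected_annual_loss(
        scenarios, mitigated=True
    )
    return avoided > annual_mitigation_cost


# Hypothetical figures, chosen only to show the arithmetic.
SCENARIOS = [
    Scenario("Regional outage lasting hours", 0.2, 500_000, 100_000),
    Scenario("Global outage lasting days", 0.01, 20_000_000, 5_000_000),
]

# Hypothetical yearly cost of keeping workloads portable across providers.
PORTABILITY_COST = 1_000_000

if __name__ == "__main__":
    print("Expected annual loss, unmitigated:", expected_annual_loss(SCENARIOS))
    print("Expected annual loss, mitigated:  ", expected_annual_loss(SCENARIOS, mitigated=True))
    print("Portability worthwhile?", mitigation_is_worthwhile(SCENARIOS, PORTABILITY_COST))
```

With these invented numbers the mitigant costs more than the loss it avoids, which is the sense in which the mitigant can be worse than the risk. Different estimates would give a different answer; the point is only that the comparison is made explicitly rather than by what comes most readily to mind.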
Kahneman and Tversky’s work on the availability heuristic was popularised in Kahneman’s best-selling book, Thinking, Fast and Slow. When considering complex problems such as cloud risk, portability and exit planning, it’s time to think slow.
(Views in this article are my own.)
Banking and Payments Architect
(2 years ago) Interesting perspective on this topic. Thanks for sharing your thoughts.
Industry Lead, Financial Services, Google Cloud
(2 years ago) David, as always, a brilliant, thought-provoking article. One of our customers had this classic vendor lock-in / proprietary vs portability dilemma. It all boiled down to a simple business case of what we get as benefits vs exit costs. And invariably the benefits outweigh the costs when you factor in probability and the need to exit. Edge cases should not be the primary driver of our choices.
Executive Director, Group Head of Data Product & Mesh Enablement @ UBS | Board Trustee @ Community Action Redbridge | Advisory Councillor @ Harvard Business Review Advisory Council
(2 years ago) Thanks for sharing - definitely an interesting place to be, at the rapid ramp-up of what is effectively a high-stakes (in resiliency terms) and high-cost-of-entry industry. I imagine it will take some risk events and corporate failures to flush out unforeseen weaknesses and control gaps.
Chief Data and Analytics Officer at Lloyds Banking Group and Non-Exec Director at the Information Commissioner's Office. #2 in 2024DataIQ Top100
(2 years ago) Alex Nicholson
Payments & Technology Innovation | Banking & Financial Services | Product & Program Management |
(2 years ago) Thanks for reflecting the reality. Thought-provoking.