Highly Available Luck

The established architectural approach when designing infrastructure is to work top-down. For example, the SLAs, RTO and RPO guide the designer in their choice of availability features, such as replication and clustering. This allows you to verify that what you’ve designed will meet the NFRs. Having this level of traceability promotes a high level of confidence in the design and allows Design Authorities to rubber stamp designs as “fit-for-purpose”.
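To make that verification step concrete, an availability SLA can be converted into a downtime budget and compared with the recovery time the design is expected to achieve. The following is only a rough sketch: the function names, the one-year period and the figures in the usage note are illustrative assumptions, not values taken from any particular SLA.

```python
from datetime import timedelta

YEAR = timedelta(days=365)  # assumed measurement period


def downtime_budget(sla_percent: float, period: timedelta = YEAR) -> timedelta:
    """Return the maximum downtime an availability SLA permits over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period * (1 - sla_percent / 100)


def meets_sla(sla_percent: float, expected_outages: int, rto: timedelta,
              period: timedelta = YEAR) -> bool:
    """Check whether the expected outages, each lasting up to the RTO,
    fit inside the downtime budget. The outage count is an assumption
    the designer has to supply."""
    return expected_outages * rto <= downtime_budget(sla_percent, period)
```

For example, `downtime_budget(99.9)` comes to roughly 8.76 hours a year, so two outages with a four-hour RTO would fit, whereas the same two outages would not fit a 99.99% target.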

There’s nothing wrong with this approach; in fact, it’s the way I have been working for over twenty years. But note that I said the SLAs should be a guide, not a rule book. Why is this? I’ve completed my ITIL training, every service is defined by its SLA, right? Well, no, or at least that’s not the whole picture, and if you’re designing something that the owners consider important enough to require high availability, then you need to consider the whole picture. So, what else should be factored in?

Zero-Access

It should come as no surprise to see Oracle’s most recent availability figures, which show that the most likely cause of an Oracle database failure is human error.

We’ve all seen outages caused by simple mistakes: a Sysadmin patches the wrong server, a DBA tries to find the performance bottleneck in a live database and ends up making things worse. This is just bad luck though, isn’t it? Even the most highly trained, highly motivated operational staff will still make mistakes sometimes; we’re all only human after all. Surely you can’t design a system to factor out bad luck?

Well, how about completely denying these people access to the live systems? They can’t bring down a system that they can’t access. Providing minimal access has been a standard aim for quite some time now, but recent developments allow us to get closer to a zero-access model.

Consider a private cloud of Linux servers. The gold build server holds a copy of the latest Linux build, with patches and localisation applied by Puppet. The private cloud servers are then built to “expire”, so that each one rebuilds itself overnight from the gold build once a month. No more outages from patching the wrong server, because Sysadmins no longer patch live servers.
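The expiry idea can be sketched as a nightly job that picks out the servers due for a rebuild. This is a minimal illustration rather than a description of any real tooling: the `Server` record, the 30-day lifetime and the per-night cap are all hypothetical, and the rebuild itself (reprovisioning from the gold build and running Puppet) is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=30)  # assumed meaning of "monthly"


@dataclass
class Server:
    name: str            # hypothetical inventory name
    built_at: datetime   # when the server was last built from the gold build


def is_expired(server: Server, now: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """A server expires once its current build is at least max_age old."""
    return now - server.built_at >= max_age


def nightly_rebuild_batch(servers: list[Server], now: datetime,
                          max_per_night: int = 1) -> list[Server]:
    """Return the oldest expired servers, capped per night (an assumption)
    so that the whole pool is never rebuilding at the same time."""
    expired = sorted((s for s in servers if is_expired(s, now)),
                     key=lambda s: s.built_at)
    return expired[:max_per_night]
```

Capping the batch is a design choice added here for illustration: it keeps a rebuild from becoming an outage in its own right, and any server that misses tonight’s batch is simply picked up on a later night.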

What about our luckless DBA? It’s true that tools such as SQL Tuning Advisor can do more harm than good, but organisations are increasingly examining end-to-end performance away from the affected live systems, dumping collected stats into external data pools for analysis, or getting Splunk to forward performance logs and create dashboards of the results. The net result is little or no access to the live systems for the DBAs.

Luck

So, maybe you can design out bad luck. Maybe the best companies in the world, with the highest levels of availability, are actually just designed to be the luckiest? Google may well have something to say about this statement. The Google cloud lost data last month, when one of their datacentres in Belgium was hit by lightning: four times in one storm! Now that’s bad luck by anyone’s standard!

So, next time you’re designing some highly available infrastructure, don’t just design the technology; design the human interactions with it too. Then print it out and pin a four-leafed clover to the design, just for luck.

Raza Sheikh

Data & Digital Architect | Consultant

1 year ago

Mike, thanks for sharing!

Julian Bedford

Working as contract Business Analyst

9 years ago

Great article mate. Have to say that adequate analytics around risk aversion should mitigate some element of luck you speak of.... If clients are willing to pay for it.

Christiana H.

Head of Enterprise Architecture @ Sciensus | Technology Leader | Continuous Improvement | Soft Systems Methodology | Problem Solving

9 years ago

Nice article :)

