CloudOps, SRE and Uptime
No-one likes paying for insurance because no-one ever thinks it’ll happen to them.
We humans definitely have an optimism bias when it comes to imagining potential future negative events, which might go some way to explaining why IT Operations has traditionally been the poor relation in an organisation... the “we begrudge paying for this but might need it someday” commodity. The plumbing. The tin and wires.
But with the shift to cloud and the coming together of the IT Support and Dev worlds, things started to change. Organisations realised that they could speed up deployment if they merged these two functions - despite how suspicious each team was of the other. After all, infrastructure was just code now too, right?
So what is CloudOps anyway? Well, in our view it can be defined as bringing together the best tooling, the best people and the best techniques to deliver and support the business objectives and outcomes. But isn't that just DevOps? Well, sort of - but CloudOps is a little different to DevOps in that the latter is primarily focused on the shipping of code, incorporating Continuous Integration/Continuous Delivery (CI/CD), whereas CloudOps as we see it here at SPG takes the broader view; DevOps is part of CloudOps.
When I started my career in IT, my first “real” job was in support. It wasn’t even about just resetting passwords back then - nope, my many tasks included cabling, changing 10BASE-T connectors and mostly hovering over someone’s shoulder asking them to “click there”.
But even way back in the heady days of 1994 we still had to worry about uptime - although in those days, outages weren’t just tolerated, they were sort of expected. Over the years we, as an “industry” (although I didn’t see it as that - I was just fixing computers and writing code), got better at keeping things running: first we had replicated data, then clusters and high availability, then virtualisation - and underpinning all of this was network infrastructure designed to be resilient throughout, with huge pairs of multi-rack core switches (one network in particular named “Colossus” - if you know, you know).
In the years since, we’ve had many false dawns in the search for that holy grail: continual, uninterrupted uptime at a price the business can afford. When virtualisation became popular, the sages told us that IT support would die. “AIOps” wasn’t really a thing back then, but the concepts were the same, with terms such as “self-healing” being bandied around. Next up came the move to cloud, whose mass adoption again assured us that IT Operations was dead - why have IT support people around when “it just works”? More recently, we got to the subject of scaling and, of course, everyone’s favourite buzzword (in 2019), Kubernetes - or K8s as the cool kids like to call it. Surely these hugely scalable mass computing nodes wouldn’t need much in the way of IT Ops administration?
Then there’s Site Reliability Engineering (SRE), which, despite not being new (erm, how about 2003?), is a discipline that's gained a lot of traction in the industry recently, especially in cloud-based or cloud-native organisations. At its heart, SRE is all about that holy grail - making systems more reliable and efficient by bridging the gap between development and operations (but wait, isn’t that also DevOps?). Well, again: for us, SRE is part of CloudOps.
The Godfather of SRE, Ben Treynor Sloss (the mastermind at Google, where the discipline was first adopted), says that “SRE is what happens when you ask a software engineer to design an operations team.” I can’t help thinking there’s a bit more to it than that, but who am I to argue?
Google's SRE teams have some key practices that help them manage massive systems. They limit manual work to 50% of their time, spending the rest on engineering activities. If the manual work goes beyond 50%, they hand it over to the product teams. This motivates those teams to build systems that don't rely on manual operations, and to run what they build themselves - or “eating your own dog food”, as it became known.
Google also uses something called an error budget to manage risk. They understand that striving for 100% reliability is impractical and costly…in fact, we all do - though that hasn’t stopped us being sold the dream of zero downtime since my early days on the job in 1994. Realising this was too big an ambition in terms of time and cost, Google instead sets a reliability target for each service within a system being operated. Any downtime within that target is considered to have consumed part of the service’s error budget. This budget can then also be spent on experimenting and innovating, like a slush fund.
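The arithmetic behind an error budget is refreshingly simple. As a minimal sketch - assuming an illustrative 99.9% availability target over a 30-day window, which is a hypothetical figure, not any particular organisation's SLO:

```python
# Error-budget arithmetic: the budget is simply the downtime allowed
# by the reliability target over a given window. Figures here are
# illustrative assumptions, not real SLOs.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowable downtime (in minutes) for the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left after observed downtime (can go negative)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes

# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, 12.5)  # budget left to "spend"
```

Once the remaining budget hits zero, the team stops spending on risky releases and experiments until reliability recovers - which is exactly the slush-fund behaviour described above.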
Of course, monitoring is a big deal in SRE, and this is where we go way back to basics. In the olden days, some of the team would spend their time looking at a TV on a desk (a CRT of course) showing various lines and charts with either mostly red or mostly green items. If they were green? Great - no problems. If they were red? Well everyone was running around trying to fix the root cause. Or rather, if there were too many reds, the alert thresholds were simply increased beyond the point of trigger, just to get rid of the annoying red alerts. All entirely ethical of course.
Google's SRE teams, on the other hand, focus on four golden signals: latency, traffic, errors, and saturation. Their monitoring and alerting is built to identify real problems and to page the on-call engineer only for urgent issues. This makes troubleshooting and debugging more efficient, which in the real world is the key to operating well. It’s rare nowadays for things to simply go pop - most of our “stuff” is in the cloud, running on highly available kit built from solid-state components. But things do go wrong, and given the vast scale and the world’s dependency on technology, these outages can - in some cases - be a matter of life or death.
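To make that concrete, here is a hedged sketch of how the four golden signals might drive a paging decision. The `GoldenSignals` shape and the thresholds are illustrative assumptions, not any real monitoring system's API - the point is simply that only urgent, user-visible symptoms wake someone up:

```python
# Sketch: page the on-call engineer only when a golden signal crosses
# an urgency threshold. All field names and thresholds are invented
# for illustration.
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    latency_p99_ms: float  # latency: how long requests take (tail)
    traffic_rps: float     # traffic: demand on the system
    error_rate: float      # errors: fraction of requests failing (0.0-1.0)
    saturation: float      # saturation: fraction of capacity in use (0.0-1.0)

def should_page(s: GoldenSignals) -> bool:
    """True only for urgent, user-visible symptoms."""
    return (s.latency_p99_ms > 500.0
            or s.error_rate > 0.01
            or s.saturation > 0.9)

healthy = GoldenSignals(120.0, 850.0, 0.001, 0.55)
degraded = GoldenSignals(950.0, 850.0, 0.04, 0.97)
```

Note that traffic on its own doesn't page anyone here - high demand is only a problem once it shows up in latency, errors or saturation.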
So what can you do? Look for the degradation first. Performance degradation is a crucial indicator that something may later go offline. Or costs may go up. Performance is a great barometer of stability - if the system maintains performance irrespective of volume or scale, then the architecture is more than likely sound. And if it’s not? Then it’s time to get to work.
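“Look for the degradation first” can be automated in a very simple form: compare recent performance against a longer-running baseline and flag sustained drift before anything actually falls over. A minimal sketch - the window sizes and the 1.5x drift threshold are assumptions for illustration:

```python
# Sketch of degradation detection: recent latency is compared against
# a baseline learned from healthy periods. Window sizes and the drift
# ratio are illustrative, not tuned values.
from collections import deque
from statistics import mean

class DegradationDetector:
    def __init__(self, baseline_size: int = 100,
                 recent_size: int = 10, ratio: float = 1.5):
        self.baseline = deque(maxlen=baseline_size)  # long-term history
        self.recent = deque(maxlen=recent_size)      # short-term window
        self.ratio = ratio

    def observe(self, latency_ms: float) -> bool:
        """Record a sample; return True once recent latency has
        drifted well above the established baseline."""
        self.recent.append(latency_ms)
        degraded = (len(self.baseline) == self.baseline.maxlen
                    and mean(self.recent) > self.ratio * mean(self.baseline))
        if not degraded:
            # Only learn from healthy periods, so the baseline
            # doesn't quietly absorb the degradation.
            self.baseline.append(latency_ms)
        return degraded
```

The same pattern works for cost metrics: a steady climb against the baseline is the early warning, long before the invoice or the outage arrives.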
To summarise, SRE has some core principles that set it apart from traditional IT operations, as alluded to above:
- Capping manual, repetitive work so engineers spend at least half their time on engineering
- Managing risk with error budgets rather than chasing 100% reliability
- Monitoring the four golden signals - latency, traffic, errors and saturation - and paging only for urgent issues
- Having the people who build a system share responsibility for running it
So, how does CloudOps help when it comes to running your IT estate and ultimately, the business?
Well, it brings several benefits:
- Faster, safer deployments, with Dev and Ops pulling in the same direction
- Greater reliability and uptime, at a price the business can afford
- Better visibility of, and control over, operational costs
- Faster root cause analysis and recovery when things do go wrong
That all sounds great in theory, but how difficult is it to implement in practice? The answer is to start small, but move quickly. The reality is that there are myriad quick wins to be had, even just by optimising operational processes. From there you can really start to delve deep into the art of the possible (but watch out for hidden costs like log storage and retention, which are often overlooked).
You might be wondering what inspired me to write this article in the first place. I’m not technical anymore (some would say I never was), so why am I evangelising about what are ultimately deemed to be low-level tech and/or organisational functions? There are a couple of reasons for the inspiration.
The first has been hearing about multiple outages suffered by multiple organisations recently, and the “needle in a haystack” way in which root cause analysis was performed. Not only did the initial outages cost the businesses financially and reputationally, but the isolation and recovery approach further worsened the impact. It didn’t have to be that way.
The other reason I’m writing about this topic is because I love it. The quest for affordable reliability has never really gone away, despite ever-increasing magnitudes of complexity. Over the years I’ve been fascinated to watch how continually expanding computing systems are designed, implemented and operated, and this shows no sign of slowing down, especially given the increasing reliance on cloud for AI.
In 1994, 17-year-old me would be simply amazed.