CloudOps, SRE and Uptime
“It is a capital mistake to theorise before one has data.”


No-one likes paying for insurance because no-one ever thinks it’ll happen to them.

We humans definitely have an optimism bias when it comes to imagining potential future negative events, which might go some way to explaining why IT Operations has traditionally been the poor relation in an organisation... the “we begrudge paying for this but might need it someday” commodity. The plumbing. The tin and wires.

But with the shift to cloud and the coming together of the IT Support and Dev worlds, things started to change. Organisations realised that they could speed up deployment if they merged these two functions - despite how suspicious the two teams were of one another. After all, infrastructure was just code now too, right?

So what is CloudOps anyway? Well, in our view it can be defined as bringing together the best tooling, the best people and the best techniques to deliver and support the business objectives and outcomes. But isn't that just DevOps? Sort of - but CloudOps is a little different in that DevOps is primarily focused on shipping code, incorporating Continuous Integration/Continuous Delivery (CI/CD), whereas CloudOps as we see it here at SPG is broader; DevOps is part of CloudOps.

When I started my career in IT, my first “real” job was in support. It wasn’t just about resetting passwords back then - nope, my many tasks included cabling, changing 10BASE-T connectors and mostly hovering over someone’s shoulder asking them to “click there”.

But even way back in the heady days of 1994 we still had to worry about uptime - although in those days, outages weren’t just tolerated, they were sort of expected. Over the years we, as an “industry” (although I didn’t see it as that - I was just fixing computers and writing code), got better at keeping things running. First we had replicated data, then clusters and high availability, then virtualisation - and of course underpinning all of this was network infrastructure designed to be resilient throughout, with huge pairs of multi-rack core switches (one network in particular named “Colossus” - if you know, you know).

In the years since, we’ve had many false dawns in the search for that holy grail: continual, uninterrupted uptime at a price the business can afford. When virtualisation became popular, the sages told us that IT support would die. “AIOps” wasn’t really a thing back then, but the concepts were the same, with terms such as “self-healing” being bandied around. Next up came the move to cloud, whose mass adoption again, we were assured, meant IT Operations was dead. Why have IT support people around when “it just works”? More recently we got to the subject of scaling and, of course, everyone’s favourite buzzword (in 2019), Kubernetes - or K8s as the cool kids like to call it. Surely these hugely scalable mass computing nodes wouldn’t need much in the way of IT Ops administration?

Turning now to Site Reliability Engineering (SRE): despite not being new (erm, how about 2003?), it’s a discipline that's gained a lot of traction in the industry recently, especially in cloud-based or cloud-native organisations. At its heart, SRE is all about that holy grail - making systems more reliable and efficient by bridging the gap between development and operations (but wait, isn’t that also DevOps?). Well again, for us SRE is part of CloudOps.

The Godfather of SRE, Ben Treynor Sloss (the mastermind at Google, where the discipline was first adopted), says that “SRE is what happens when you ask a software engineer to design an operations team.” I can’t help thinking there’s a bit more to it than that, but who am I to argue?

Google's SRE teams have some key practices that help them manage massive systems. They cap manual work at 50% of their time, spending the rest on engineering activities. If the manual work goes beyond 50%, they hand it over to the product teams. This motivates those teams to build systems that don't rely on manual operations, and to run those systems themselves - or “eating your own dog food” as it became known.

Google also uses something called an error budget to manage risk. They understand that striving for 100% reliability is impractical and costly… in fact, we all do - though that hasn’t stopped us being sold the dream of zero downtime since my early days on the job in 1994. Realising this was too big an ambition in terms of time and cost, Google instead sets a reliability target for each service within a system being operated. The gap between that target and 100% is the service’s error budget, and any downtime consumes part of it. The budget can then also be spent on experimenting and innovating, like a slush fund.
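
To make the arithmetic concrete, here's a minimal sketch in Python of how a reliability target turns into an error budget. The 99.9% target and 30-day window are illustrative assumptions, not anyone's real numbers.

```python
# Minimal sketch: turning a reliability target into an error budget.
# The 99.9% target and 30-day window are illustrative assumptions.

def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime (in minutes) for a given target over a window."""
    window_minutes = window_days * 24 * 60
    allowed_failure_fraction = 1 - (slo_percent / 100)
    return window_minutes * allowed_failure_fraction

budget = error_budget_minutes(99.9)  # ~43.2 minutes per 30 days
print(f"99.9% over 30 days leaves roughly {budget:.1f} minutes of error budget")
```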

Of course, monitoring is a big deal in SRE, and this is where we go way back to basics. In the olden days, some of the team would spend their time looking at a TV on a desk (a CRT, of course) showing various lines and charts with either mostly red or mostly green items. If they were green? Great - no problems. If they were red? Well, everyone was running around trying to fix the root cause. Or rather, if there were too many reds, the alert thresholds were simply raised above the trigger point, just to get rid of the annoying red alerts. All entirely ethical, of course.

Google's SRE teams, on the other hand, focus on four golden signals: latency, traffic, errors, and saturation. Their monitoring and alerting is built to identify real problems and only page the on-call engineer for urgent issues. This makes troubleshooting and debugging more efficient, which in the real world is the key to operating well. It’s rare nowadays for things to simply go pop - most of our “stuff” is in the cloud, running on highly available kit built from solid-state components. But things do go wrong, and given the vast scale and the world’s dependency on technology, these outages can - in some cases - be a matter of life or death.
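
To illustrate the “only page for urgent issues” idea, here's a hedged sketch in Python. The signal names are the four golden signals, but the field names, thresholds and the paging rule itself are assumptions for illustration - real teams derive these from their own reliability targets and alerting philosophy, not from a hard-coded function.

```python
from dataclasses import dataclass

# Sketch only: the thresholds and paging rule below are illustrative
# assumptions, not any team's actual alerting configuration.

@dataclass
class GoldenSignals:
    latency_p99_ms: float  # how long requests are taking (tail latency)
    traffic_rps: float     # demand on the system (requests per second)
    error_rate: float      # fraction of requests failing (0.0 - 1.0)
    saturation: float      # how "full" the service is (0.0 - 1.0)

def should_page_oncall(s: GoldenSignals) -> bool:
    """Page a human only when a signal suggests urgent, user-visible impact."""
    return (
        s.error_rate > 0.05          # more than 5% of requests failing
        or s.latency_p99_ms > 2000   # tail latency worse than 2 seconds
        or s.saturation > 0.95       # effectively no headroom left
    )

# Example: errors and latency look fine, but the service is nearly saturated,
# so the on-call engineer gets paged before users start to notice.
sample = GoldenSignals(latency_p99_ms=850, traffic_rps=1200, error_rate=0.01, saturation=0.97)
print("Page on-call:", should_page_oncall(sample))
```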

(Image: DevOps, Platform Engineering, SRE = CloudOps)

So what can you do? Look for the degradation first. Performance degradation is a crucial indicator that something may later go offline. Or costs may go up. Performance is a great barometer of stability - if the system maintains performance irrespective of volume or scale, then the architecture is more than likely sound. And if it’s not? Then it’s time to get to work.
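
One simple way to “look for the degradation first” is to compare recent latency against a longer-running baseline and flag the drift well before anything actually falls over. The sketch below is a minimal illustration: it assumes you already have latency samples to hand, and the 1.5x threshold is an arbitrary assumption to be tuned to your own service.

```python
from statistics import mean

# Sketch: flag performance degradation by comparing a recent window of latency
# samples against the longer baseline. The 1.5x factor is an illustrative
# assumption; tune it to your service's normal variance.

def is_degrading(latency_ms: list[float], recent_window: int = 20, factor: float = 1.5) -> bool:
    """Return True if recent average latency has drifted well above the baseline."""
    if len(latency_ms) <= recent_window:
        return False  # not enough history to judge
    baseline = mean(latency_ms[:-recent_window])
    recent = mean(latency_ms[-recent_window:])
    return recent > baseline * factor

# Example: a service that averaged ~100 ms starts creeping towards 190 ms.
history = [100.0] * 200 + [190.0] * 20
print("Degrading:", is_degrading(history))  # True - investigate before it goes offline
```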

To summarise, SRE has some core principles that set it apart from traditional IT operations, as alluded to above:

  1. Error Budgets and SLOs: SRE teams set service-level objectives (SLOs) to define how reliable a system needs to be. This informs the error budget, which is the maximum acceptable level of errors and outages. The development team can "spend" this error budget as they see fit. If the product is running smoothly, they can launch new features. If they exceed the error budget, all launches are put on hold until they reduce the errors (there's a minimal sketch of this gate after the list).
  2. SREs Can Code: SRE teams are a mix of developers and sysadmins who can find and fix problems. They spend most of their time writing code and building systems to improve performance and efficiency.
  3. Developers Get Involved: The development team takes on some of the operations workload. This helps them stay connected to their product, understand its performance, and make better coding and release decisions.
  4. SREs Are Free Agents: SREs have the freedom to move to different projects as they please, which keeps teams healthy and happy.
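
Principle 1 turns into a very simple release gate in practice. Here's a hedged sketch of that decision, reusing the budget arithmetic from earlier; the figures and the all-or-nothing hold are illustrative assumptions rather than a description of any particular team's process.

```python
# Sketch: an error-budget release gate. The downtime figures are illustrative.

def can_launch(budget_minutes: float, downtime_spent_minutes: float) -> bool:
    """Allow new feature launches only while error budget remains."""
    return downtime_spent_minutes < budget_minutes

# Example: a 43.2-minute monthly budget with 50 minutes of outages already
# consumed means launches go on hold until reliability work brings errors down.
print("Launch allowed:", can_launch(43.2, 50.0))  # False
```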

So, how does CloudOps help when it comes to running your IT estate and, ultimately, the business?

Well, it brings several benefits:

  • Improved Reliability: CloudOps - and by extension SRE - sets clear reliability targets and error budgets, ensuring that cloud services are reliable.
  • Increased Efficiency: CloudOps teams focus on coding and improving systems, which boosts the efficiency of cloud operations.
  • Better Collaboration: The conjoined team puts an end to the battle between developers and operations, leading to improved collaboration and more stable cloud services. And that battle can be a real thing - ask any infrastructure engineer or dev.
  • Faster Innovation: With well-defined error budgets, developers can innovate faster in a CloudOps context, knowing they have room to introduce new features as long as they stay within the error budget.

That all sounds great in theory, but how difficult is it to implement in practice? The answer is to start small, but move quickly. The reality is that there are myriad quick wins to be had, even just by optimising operational processes. From there you can really start to delve into the art of the possible (but watch out for hidden costs like log storage and retention, which are often overlooked).

You might be wondering what inspired me to write this article in the first place. I’m not technical anymore (some would say I never was), so why am I evangelising about what are ultimately deemed to be low-level tech and/or organisational functions? There are a couple of reasons.

The first has been hearing about multiple outages suffered by multiple organisations recently, and the “needle in a haystack” way in which root cause analysis was performed. Not only did the initial outages cost the businesses financially and reputationally, but the isolation and recovery approach further worsened the impact. It didn’t have to be that way.

The other reason I’m writing about this topic is that I love it. The quest for affordable reliability has never really gone away, despite the ever-increasing magnitude of complexity. Over the years I’ve been fascinated to watch how continually expanding computing systems are designed, implemented and operated, and this shows no sign of slowing down, especially given the increasing reliance on cloud for AI.

In 1994, 17-year-old me would be simply amazed.
