Reliability Lessons from the End of the World - Part II
Photos by Rod Anami

In my last post, I talked about the lessons I learned at the "end of the world" (Ushuaia, Patagonia) through the lens of a site reliability engineer (SRE). Today I conclude the series with the three remaining SRE responsibilities, applied to this fantastic, breathtaking place. Without further ado, let's get down to business.

SREs identify, measure, and reduce toil arising from operational and engineering work

When we need to survive on icy plains where you cannot see any trace of civilization for kilometers, we don't have much room for error. We keep manual tasks to a minimum to save calories and stamina. Even though we are just tourists in such cold regions, no one likes doing repetitive chores, such as cleaning shoes that are not suited to walking on snow.

Toil is the word SREs use to describe "evil" manual and repetitive tasks. But by no means are we mere automation developers. Eliminating toil is front and center for any healthy IT endeavor, because toil devours precious time and depletes a team's energy.
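To make "measure" concrete, here is a minimal sketch of how a team might track the share of its time lost to toil. The `WorkItem` type, the task names, and the hours are all hypothetical and exist only for illustration; in practice the data would come from whatever ticketing or time-tracking system the team already uses.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    name: str
    hours: float
    is_toil: bool  # True for manual, repetitive work that could be automated


def toil_ratio(items):
    """Return the fraction of total hours spent on toil (0.0 for no work)."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    toil = sum(item.hours for item in items if item.is_toil)
    return toil / total


# Hypothetical week of work for one engineer.
week = [
    WorkItem("restart stuck batch job by hand", 6.0, True),
    WorkItem("rotate certificates manually", 4.0, True),
    WorkItem("write automation for restarts", 10.0, False),
    WorkItem("design review", 20.0, False),
]

print(f"toil share: {toil_ratio(week):.0%}")
```

Tracking this number over time shows whether automation work is actually paying off or whether toil is creeping back in.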

SREs implement test cases, execute software delivery tests, and stay ahead with capacity planning

Keeping equipment fully operational can prevent unnecessary injuries, even fatal accidents, in a sub-polar landscape. Testing gear before using it, ensuring security locks work, and checking for rust or fissures are everyday activities.

SREs help develop and implement test cases to ensure systems work as designed from the earliest lifecycle stages. That includes the period when the application is available to its users. SREs know testing well. If you give it some thought, you can see monitoring as a type of on-the-fly testing, but it should never replace software development testing. Beyond that, SREs need to pave the road ahead by projecting future system resource consumption. They rely on capacity planning to avoid disastrous crashes caused by genuine resource starvation.
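As a rough illustration of projecting resource consumption, the sketch below fits a straight line to daily usage samples and estimates how many days remain before a capacity limit is reached. The function name, the sample values, and the assumption of linear growth are mine, not a prescribed method; real workloads often need seasonality-aware models.

```python
def days_until_exhaustion(samples, capacity):
    """Estimate days left before usage reaches capacity.

    samples: usage readings taken once per day, oldest first.
    Returns None when usage is flat or shrinking (no exhaustion projected).
    Assumes linear growth, fitted with ordinary least squares.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    # Measure from the most recent sample; never report a negative number.
    return max(0.0, day_full - (n - 1))


# Hypothetical disk usage in GB over five days, on a 100 GB volume.
usage_gb = [50, 52, 54, 56, 58]
print(days_until_exhaustion(usage_gb, capacity=100))  # 21.0
```

Here usage grows by 2 GB per day from 58 GB, so the remaining 42 GB lasts 21 days under the linear assumption.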

SREs employ chaos engineering to unveil systemic weaknesses in production

Although we don't apply chaos engineering practices to ice equipment, we monitor how it holds up in actual use. Chaos does exist when an unforecast heavy storm hits, but the idea here is the opposite: we want a planned and controlled way of introducing disorder.

While unit testing is the first defense against defects, chaos experiments are the last. SREs never underestimate the injection of chaos into their managed systems to uncover and reinforce the weakest links.
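The idea of planned, controlled disorder can be sketched in a few lines. The toy experiment below wraps a function so that it fails at a chosen rate, then checks how often a simple retry policy still succeeds. Every name here (`with_chaos`, `call_with_retries`, `run_experiment`, `fetch_price`) is hypothetical, and a real chaos experiment would target infrastructure through dedicated tooling rather than an in-process wrapper.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos experiment")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts=3):
    """Call func up to `attempts` times, re-raising the last injected fault."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as err:
            last_error = err
    raise last_error


def run_experiment(func, failure_rate, requests=1000, seed=42):
    """Return the fraction of requests that succeed despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_chaos(func, failure_rate, rng)
    succeeded = 0
    for _ in range(requests):
        try:
            call_with_retries(flaky)
            succeeded += 1
        except InjectedFault:
            pass
    return succeeded / requests


def fetch_price():
    """Stand-in for a call to a real dependency."""
    return 42


success = run_experiment(fetch_price, failure_rate=0.2)
print(f"success rate under injected faults: {success:.1%}")
```

The failure rate and the fixed seed are what make the disorder planned and controlled: the experiment can be repeated, scaled up gradually, and compared against a steady-state expectation such as a target success rate.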

The SRE Manifesto

If you like these site reliability engineering primitives, we made them public for adoption in this GitHub repository. If you love them, any SRE can sign the SRE manifesto by opening a pull request.

I hope you enjoyed this second (and last) part of the blog post. If you liked it, please connect with me on LinkedIn or follow me on Twitter @ranami. Also, my book "Becoming a Rockstar SRE" is out; check it out at this link.

