Reliability Lessons from the End of the World - Part I
Photos by Rod Anami

Ushuaia (pronounced oo·swai·uh), in Argentina, is the southernmost city in the world, just 1,100 km from Antarctica. Also called the "end of the world," it sits on an icy island in southern Patagonia. But what can a site reliability engineer (SRE) learn from spending a winter vacation there?

I'm going to divide this post into two parts. Today I'm discussing the first four primary responsibilities, the fundamental pillars for all SREs. I compare real-life situations in cold temperatures to site reliability engineering work. For the sake of a fun anecdote, imagine that the tourists are our system end-users and the park rangers are our SREs.

SREs make sure their systems are reliable and not just available and resilient.

When walking over snow or ice, it's not enough to have snowshoes or crampons. Will they work as designed? Can you trust them to keep you from slipping when you cross a reasonable distance over a glacier, for instance? Just having good equipment is insufficient. First, there should be enough of it for the expected number of tourists, with some spare for unplanned visitors up to a certain point; it must perform well on frost fields and be rugged enough to endure the subpolar climate.

Moreover, such gear must be safe and tested for this type of usage. And in the worst-case scenario, any piece of equipment should be easy to replace with a new one. Reliability in systems combines availability, resiliency, robustness, scalability, performance, and security more than anything else.

SREs guarantee all systems and applications are observable and under monitoring.

How do we make sure snow equipment is ready for use? We monitor its signals by inspecting it from time to time. Is there any rust on the crampons? If we twist the snowshoes, will they break or tear apart? Were there any complaints from the last tourist who used them? There are plenty of signals to check often. Park rangers are experienced professionals who know how to read the signs of aging and deterioration in ice gear. They also have the expertise to set the rust-area threshold beyond which a piece of equipment gets replaced.
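To make the threshold idea concrete, here is a minimal sketch of a periodic gear inspection in Python. Everything in it is an assumption for illustration: the `Inspection` record, the field names, and the threshold values are hypothetical, not figures any park ranger actually uses.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
RUST_THRESHOLD_CM2 = 4.0   # replace gear once the rusted area exceeds this
MAX_COMPLAINTS = 2         # replace gear once complaints exceed this


@dataclass
class Inspection:
    """One inspection record for a piece of snow or ice gear."""
    item_id: str
    rust_area_cm2: float
    complaints: int


def needs_replacement(inspection: Inspection) -> bool:
    """Return True when any monitored signal crosses its threshold."""
    return (
        inspection.rust_area_cm2 > RUST_THRESHOLD_CM2
        or inspection.complaints > MAX_COMPLAINTS
    )


def flag_for_replacement(inspections: list[Inspection]) -> list[str]:
    """Return the ids of all items that should be taken out of service."""
    return [i.item_id for i in inspections if needs_replacement(i)]


inspections = [
    Inspection("crampon-01", rust_area_cm2=1.5, complaints=0),
    Inspection("crampon-02", rust_area_cm2=6.0, complaints=1),
    Inspection("snowshoe-07", rust_area_cm2=0.0, complaints=3),
]
flagged = flag_for_replacement(inspections)
```

With this sample data, `flagged` contains `crampon-02` (too much rust) and `snowshoe-07` (too many complaints). In a system, the same pattern is an alert rule: a signal, a threshold set by someone who knows the domain, and an action.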

Furthermore, they monitor the climate and forecasts constantly; they need to know when external conditions make the environment hazardous for tourists. The rangers also run surveys to check on tourists' mood and delight.

Observability is a multilayer endeavor, from watching each equipment component to the climate forecasts to tourist satisfaction. The same applies to a system's observability; it's a full-stack, holistic approach.

SREs manage systems, services, and infrastructure to learn how to automate toil.

I don't want to sound dramatic, but a reliable snow or ice tool may be the difference between life and death if you're waiting for rescue after an incident. Park rangers must ensure their gear is always ready for action by managing its lifecycle, from acquisition to maintenance to disposal. They have checklists covering the minimum requirements for reliable tourism operations, and procedures to ensure that all tourists are safe and, beyond that, receive good service.

Those rangers can only devise ways of automating manual activities by understanding and standardizing operational practices. Good automation is impossible without a deep understanding of the tasks and processes that operations require.

SREs use data science and statistical methods to understand observability data.

Park rangers may have the best monitoring platform for lakes, trails, and glaciers, with many data points, including temperature, barometric pressure, humidity, and ice presence. For instance, they can tell whether a particular path is frozen, or, through simple polls, whether tourists liked one attraction more than another. However, park guides should also be able to analyze all the data sets, and even correlate data points, to come up with an interpretation. They have a rudimentary sense of data science and apply simple statistical methods to extract insightful conclusions from the data. They learn this methodology by experience, and if something is still unclear, they install a new meteorological sensor array or get additional reports from other data sources.
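As one example of a simple statistical method, the sketch below computes the Pearson correlation between two series of readings in Python. The readings are made-up sample values, and the names `temps_c` and `ice_cover` are hypothetical; the point is only to show how two monitored signals can be correlated.

```python
from math import sqrt
from statistics import mean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x, mean_y = mean(xs), mean(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Made-up daily readings for one trail: air temperature in Celsius and
# the fraction of the path covered by ice.
temps_c = [-6.0, -3.0, -1.0, 2.0, 5.0]
ice_cover = [0.9, 0.8, 0.6, 0.3, 0.1]

r = pearson(temps_c, ice_cover)
```

For this sample, `r` comes out strongly negative: the warmer the day, the less ice on the path. A value near zero would instead suggest that temperature alone does not explain ice presence, which is the kind of result that justifies adding another sensor.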

SREs have data science skills to guide where to increase instrumentation and how to modify the monitoring platform to provide more interesting data.

If you like these four of the seven site reliability engineering primitives, we made them public for adoption on this GitHub repository. If you loved them, any SRE is welcome to sign the SRE manifesto by making a pull request. Next time I'll talk about the remaining three pillars.

I hope you enjoyed this first part, and if you liked it, please connect with me on LinkedIn or follow me on Twitter @ranami. Also, my book Becoming a Rockstar SRE is out; check it out at this link.

See you soon, Rod.

