Reliability Lessons from the End of the World - Part II

Rod Anami

Published: August 8, 2023

In my last post, I talked about the lessons I learned at the "end of the world" - Ushuaia, Patagonia - through the lens of a site reliability engineer (SRE). Today I conclude the series with the three remaining SRE responsibilities, applied to this breathtaking place. Without further ado, let's get down to business.

SREs identify, measure, and reduce toil arising from operational and engineering work

When we need to survive on icy plains where you cannot see any trace of civilization for kilometers, there is little room for error. We keep manual tasks to a minimum to save calories and stamina. Even as mere tourists in such cold regions, no one likes repetitive chores such as constantly cleaning shoes that are not suited for snow walks.

Toil is the word SREs use to describe "evil" manual and repetitive tasks. But by no means are we mere automation developers. Eliminating toil is front and center for any healthy IT endeavor, because toil devours precious time and depletes a team's energy.
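
To make "measure" a bit more concrete, here is a minimal sketch of how a team might estimate its toil share from a week of logged work. The `Task` fields, the sample entries, and the rule that toil means "both manual and repetitive" are assumptions made for illustration, not a standard formula.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One logged piece of work (hypothetical schema for this sketch)."""
    name: str
    hours: float
    manual: bool
    repetitive: bool


def toil_fraction(tasks):
    """Return the share of total hours spent on manual AND repetitive tasks."""
    total = sum(t.hours for t in tasks)
    if total == 0:
        return 0.0
    toil = sum(t.hours for t in tasks if t.manual and t.repetitive)
    return toil / total


# Illustrative week of work; the numbers are made up.
week = [
    Task("restart stuck batch job", 6.0, manual=True, repetitive=True),
    Task("rotate certificates by hand", 4.0, manual=True, repetitive=True),
    Task("design review", 10.0, manual=True, repetitive=False),
    Task("build deploy pipeline", 20.0, manual=False, repetitive=False),
]

print(f"Toil share: {toil_fraction(week):.0%}")
```

Tracking this number over time is what shows whether automation work is actually paying off.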

SREs implement test cases, execute software delivery tests, and stay ahead with capacity planning

Keeping equipment fully operational can prevent unnecessary injuries, and even fatal accidents, in a sub-polar landscape. Testing gear before use, ensuring safety locks work, and checking for rust or fissures are everyday activities.

SREs help develop and implement test cases to ensure systems work as designed from the earliest lifecycle stages through to when the application is available to its users. SREs know testing well. If you give it deep thought, you can see monitoring as a type of on-the-fly testing, but it should never replace software development testing. Beyond that, they need to pave the road ahead by projecting future system resource consumption. They rely on capacity planning to avoid disastrous crashes caused by genuine resource starvation.
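
As a rough illustration of projecting resource consumption, the sketch below fits a straight line to daily usage samples and estimates how many days remain before a capacity limit is reached. The function names, the sample data, and the assumption of linear growth are all hypothetical; real capacity planning usually has to account for seasonality and bursts as well.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for samples taken at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_exhausted(daily_usage, capacity):
    """Days after the last sample until the fitted trend reaches capacity.

    Returns None when usage is flat or shrinking (no exhaustion projected).
    """
    slope, intercept = linear_fit(daily_usage)
    if slope <= 0:
        return None
    last_day = len(daily_usage) - 1
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - last_day)


# Illustrative disk usage in GB over five days, against a 100 GB volume.
usage_gb = [50, 52, 54, 56, 58]
print(days_until_exhausted(usage_gb, capacity=100))
```

With these made-up samples, usage grows by 2 GB a day, so the trend line reaches 100 GB 21 days after the last measurement.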

SREs employ chaos engineering to unveil systemic weaknesses in production

Although we don't apply chaos engineering practices to ice equipment, we do monitor how it behaves in actual use. Chaos does exist when an unforecast heavy storm hits, but the idea here is the opposite: we want a planned and controlled way of introducing disorder.

While unit testing is the first line of defense against defects, chaos experiments are the last. SREs never underestimate the value of injecting chaos into the systems they manage to uncover and reinforce the weakest links.
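
To show what "planned and controlled" can look like in the small, here is a toy, in-process sketch of a chaos experiment. It injects failures into a dependency at a chosen rate and measures whether a retrying caller still succeeds. Every name here (`with_chaos`, `call_with_retries`, `run_experiment`) is hypothetical, and real chaos experiments target running infrastructure with dedicated tooling rather than a wrapped Python function.

```python
import random


class InjectedFault(Exception):
    """Raised when the experiment deliberately fails a call."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("chaos: injected failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times, re-raising the last injected fault."""
    for attempt in range(attempts):
        try:
            return func()
        except InjectedFault:
            if attempt == attempts - 1:
                raise


def run_experiment(failure_rate, attempts, requests, seed=42):
    """Return the fraction of requests that succeed under injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_dependency = with_chaos(lambda: "ok", failure_rate, rng)
    successes = 0
    for _ in range(requests):
        try:
            call_with_retries(flaky_dependency, attempts)
            successes += 1
        except InjectedFault:
            pass
    return successes / requests


# Hypothesis: with 3 attempts, the caller tolerates a 30% fault rate.
print(run_experiment(failure_rate=0.3, attempts=3, requests=1000))
```

The key design choice is that the disorder is explicit: the fault rate, the blast radius (one wrapped function), and the seed are all set in advance, so the experiment can be repeated and stopped at will.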

The SRE Manifesto

If you like these site reliability engineering primitives, we have made them public for adoption in this GitHub repository. If you love them, you can sign the SRE manifesto by opening a pull request.

I hope you enjoyed this second (and last) part of the blog post. If you liked it, please connect with me on LinkedIn or follow me on Twitter @ranami. Also, my book "Becoming a Rockstar SRE" is out; check it out at this link.
