Reliability Lessons from the End of the World - Part II
In my last post, I talked about the lessons I learned at the "end of the world" - Ushuaia, Patagonia - through the lens of a site reliability engineer (SRE). Today I conclude this article with the three remaining SRE responsibilities applied to this fantastic, breathtaking place. Without further ado, let's get down to business.
SREs identify, measure, and reduce toil arising from operational and engineering work
When we need to survive on icy plains where you can't see any trace of civilization for kilometers, we don't have much room for error. We keep manual tasks to a minimum to save calories and stamina. Even though we are just tourists in such cold regions, no one likes to do a lot of repetitive chores, such as cleaning shoes that are not suited to walking in snow.
Toil is a magical word SREs use to describe the "evil" manual and repetitive tasks. But by no means are we mere automation developers. Eliminating toil is front and center for any healthy IT endeavor. Toil devours precious time and depletes any team's energy.
SREs implement test cases, execute software delivery tests, and stay ahead with capacity planning
Keeping equipment fully operational can prevent unnecessary injuries, even fatal accidents, in a sub-polar landscape. Testing it before use, ensuring security locks are working, and checking for rust or fissures are everyday activities.
SREs help develop and implement test cases to ensure systems work as designed from the earliest lifecycle stages, and that continues once the application is available to its users. SREs know testing inside out. If you give it deep thought, you can see monitoring as a type of on-the-fly testing, but it should never replace software development testing. Beyond that, SREs need to pave the road ahead by projecting future system resource consumption. They rely on capacity planning to avoid disastrous crashes caused by genuine resource starvation.
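To make "projecting future system resource consumption" concrete, here is a minimal sketch in Python. It assumes one usage sample per day (disk, memory, or any other resource) and assumes growth is roughly linear, which real systems often are not. The function names `fit_line` and `days_until_full` are hypothetical, made up for this illustration.

```python
from typing import Optional, Sequence, Tuple


def fit_line(samples: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of usage = slope * day + intercept, for day = 0, 1, 2, ..."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(samples: Sequence[float], capacity: float) -> Optional[float]:
    """Days after the last sample until the fitted trend reaches capacity.

    Returns None when usage is flat or shrinking, since the trend never
    reaches capacity in that case.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_day = len(samples) - 1
    day_full = (capacity - intercept) / slope
    return max(0.0, day_full - last_day)


if __name__ == "__main__":
    # Hypothetical daily disk usage in GB on a 200 GB volume.
    usage = [100, 110, 120, 130, 140]
    print(days_until_full(usage, capacity=200))
```

A straight line is the simplest possible model. In practice you would likely account for seasonality and add a safety margin, but even this rough estimate tells you whether starvation is days or months away.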
SREs employ chaos engineering to unveil systemic weaknesses in production
Although we don't apply any chaos engineering practices to ice equipment, we do monitor how it holds up in actual use. Chaos does exist when an unforecasted heavy storm hits, but the idea here is the opposite; we want a planned and controlled way of introducing disorder.
While unit testing is the first defense against defects, chaos experiments are the last. SREs never underestimate the injection of chaos into their managed systems to uncover and reinforce the weakest links.
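As a rough illustration of "planned and controlled" disorder, the Python sketch below injects failures into a call at a chosen rate and then checks whether a steady-state hypothesis (a minimum success ratio) still holds once a simple retry is in place. Everything here is an assumption made for the example: the names `with_chaos`, `call_with_retry`, and `run_experiment` are hypothetical, and real chaos experiments target running infrastructure rather than an in-process function.

```python
import random
from typing import Callable, Tuple


class InjectedFault(Exception):
    """Raised on purpose by the experiment, never by the real dependency."""


def with_chaos(call: Callable[[], str], failure_rate: float,
               rng: random.Random) -> Callable[[], str]:
    """Wrap a call so that it fails with probability failure_rate."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("injected by chaos experiment")
        return call()
    return wrapped


def call_with_retry(call: Callable[[], str], attempts: int = 3) -> str:
    """The resilience mechanism under test: retry a few times, then give up."""
    for attempt in range(attempts):
        try:
            return call()
        except InjectedFault:
            if attempt == attempts - 1:
                raise
    raise RuntimeError("unreachable when attempts >= 1")


def run_experiment(call: Callable[[], str], failure_rate: float,
                   requests: int, min_success_ratio: float,
                   seed: int = 42) -> Tuple[bool, float]:
    """Return (hypothesis_held, observed_success_ratio)."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_chaos(call, failure_rate, rng)
    succeeded = 0
    for _ in range(requests):
        try:
            call_with_retry(flaky)
            succeeded += 1
        except InjectedFault:
            pass
    ratio = succeeded / requests
    return ratio >= min_success_ratio, ratio


if __name__ == "__main__":
    held, ratio = run_experiment(lambda: "ok", failure_rate=0.3,
                                 requests=1000, min_success_ratio=0.95)
    print(held, ratio)
```

The point of the fixed seed and the explicit failure rate is control: unlike the storm, the disorder is bounded and repeatable, and you can stop raising it as soon as the hypothesis breaks.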
The SRE Manifesto
If you like these site reliability engineering primitives, we made them public for adoption on this GitHub repository. If you love them, any SRE can sign the SRE manifesto by making a pull request.
I hope you enjoyed this second (and last) part of the blog post. If you liked it, please connect with me on LinkedIn or follow me on Twitter @ranami. Also, my book "Becoming a Rockstar SRE" is out; check it out at this link.