The DevOps Digest: 2022-03-18


This week, we cover Inclusion and Working Together, Mood Booster Visuals, Learning from Outages, /dev/null, Breaking Changes, and COVID-19 and Wastewater.

Enjoy!

Quote: Inclusion and Working Together

“We need to understand that if we all work on inclusion together, it’s going to be faster, broader, better, and more thorough than anything we can do on our own.”
Ellen Pao

15 Quotes From Women in Tech That Will Inspire You | by KaylaMatthews | Code Like A Girl


Tweet: Mood Booster Visuals

álex - Visual illustrator on Twitter: "10 mood booster visuals. 1. It's all a matter of perspective. https://t.co/d6QeQZs3XV" / Twitter


Technical Article/Presentation: How We Turned Our Company’s Worst Outage into a Powerful Learning Opportunity (London 2020)

How We Turned Our Company’s Worst Outage into a Powerful Learning Opportunity - CSG | Devops Enterprise Summit London 2020 (itrevolution.com)

This was a great presentation by CSG's Erica Morrison about how we took one of our worst incidents ever and used it to get better.


LinkedIn: Note that this video is only available by subscribing to the DevOps Enterprise Summit Video Library. A free membership (10 videos/month) is available, as well as individual and corporate memberships.

FYI: IT Revolution announced its 2022 conference dates. I'm happy to say that the flagship event will be back in Las Vegas this year and in person! Additionally, registration and CFPs for the May Europe event are now open!

2022 Conference Dates

DevOps Enterprise Summit Virtual - Europe

10-12 May 2022 | Registration Open | CFP Open

DevOps Enterprise Summit Virtual - US

August 2-4, 2022

DevOps Enterprise Summit US Flagship Event

The Cosmopolitan of Las Vegas

October 18-20, 2022


Podcast: /dev/null

I'm still catching up from last week's offsite, so my podcast listening was on hold.


Books: Kill It with Fire / 9: BREAKING CHANGES

We build our computer systems the way we build our cities: over time, without a plan, on top of ruins. —Ellen Ullman

Amazon.com: Kill It with Fire: Manage Aging Computer Systems (and Future Proof Modern Ones) (Audible Audio Edition): Marianne Bellotti, Katie Koster, Random House Audio: Books

In this chapter, Marianne discusses designing breaking changes and selling those changes while being honest about the risks. I find this topic very pertinent, and it is one I'm passionate about. I always feel more comfortable "running towards the risk/problem" than waiting for "the problem to run over you." I also look at these problems as opportunities to make dramatic improvements and release untapped potential. I often quote my dear friend Mauricio Zamora: "You can't possibly make it worse, right?"

In this chapter, Marianne hits on the following:

  • Inertia is real and prevents organizations from moving forward.
  • It is impossible to improve legacy systems without breaking them.
  • "Air cover" from leaders and psychological safety are critical, but to be successful, you need to alter the organization's perception of risk.
  • Understanding "how people get seen" and behaviors that get noticed.
  • Positive reinforcement in the form of social recognition tends to be a more effective motivator than traditional methods (bonuses, rewards, promotions).
  • Creating incremental social rewards that show progress can be a great motivator. Use incremental "kudos" to recognize small wins.
  • Celebrating failures is a great way to build just cultures. Blameless postmortems are a good place to start.
  • The closer you can push accountability to the people maintaining systems, the greater the resilience. Allow operators to exercise discretion to modify procedures.
  • "The highest probability of success comes from having as many people engaged and empowered to execute as possible."
  • Breaking something proactively is generally uncomfortable, but should be embraced in modernization and other operational contexts (my emphasis).
  • Systems that are too reliable can be taken for granted and fail to rack up "observations of resilience."
  • Perfectly running systems create false senses of security that lead to lack of continued improvement.
  • "Occasional system problems that are resolved quickly can actually boost the user’s trust and confidence. The technical term for this effect is the service recovery paradox."
  • Being fast, professional and transparent about outages and the resolution improves relationships with stakeholders.
  • Having a system no one understands is a weakness, and breaking a system to understand behavior is a powerful mechanism to learn and build resilience.
  • Waiting for something to fail is applying "hope," but planning for a timed failure allows you to bring the right resources, planning and timing to a failure.
  • To investigate failures, look to system logs. If there aren't any, add telemetry to understand behavior.
  • For planned failures, have a quick rollback or "kill switch" to revert to the previous state. Communicate and level-set this plan with stakeholders.
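
To make the last two bullets concrete, here is a minimal Python sketch of a kill switch guarding a planned cutover. It is only an illustration under stated assumptions, not how the book or CSG implements it: `KillSwitch`, `handle`, `legacy_lookup` and `modern_lookup` are hypothetical names, and the flag lives in memory, whereas a real switch would sit in shared configuration that operators can flip.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("cutover")


class KillSwitch:
    """Hypothetical in-memory flag; assume real systems keep this in shared config."""

    def __init__(self) -> None:
        self.new_path_enabled = False

    def enable(self) -> None:
        self.new_path_enabled = True
        log.info("kill switch: new path enabled")

    def disable(self) -> None:
        self.new_path_enabled = False
        log.info("kill switch: new path disabled, back on previous state")


def handle(request: str, switch: KillSwitch,
           legacy: Callable[[str], str], modern: Callable[[str], str]) -> str:
    """Use the new path while enabled; on any failure, log it, roll back, use legacy."""
    if not switch.new_path_enabled:
        return legacy(request)
    try:
        return modern(request)
    except Exception:
        # Telemetry first: record what broke so the failure can be studied later.
        log.exception("new path failed for %r; rolling back", request)
        switch.disable()
        return legacy(request)


def legacy_lookup(request: str) -> str:
    """Stand-in for the old system (illustrative only)."""
    return f"legacy:{request}"


def modern_lookup(request: str) -> str:
    """Stand-in for the new system, with one edge case it cannot handle."""
    if request == "edge-case":
        raise ValueError("unhandled edge case")
    return f"modern:{request}"


if __name__ == "__main__":
    switch = KillSwitch()
    switch.enable()  # the planned, communicated switchover
    for req in ["a", "edge-case", "b"]:
        print(handle(req, switch, legacy_lookup, modern_lookup))
```

The design choice worth noting is that the rollback is automatic and logged: the first failure flips traffic back to the previous state and leaves a record to learn from, so the team can fix the edge case and try the switchover again.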

I thought this was a great chapter for bringing forward some key ideas around DevOps, specifically Psychological Safety, Resilience Engineering, Failures as Opportunities, and Planning and Practicing Failures. These ideas are useful and powerful not only for modernization, but also for improving software systems and the socio-technical environments that surround them. At CSG, we implemented several of these techniques through:

  • Incident Swarming, Team Incident Retros (Local Learning) and Group Retros (Global Learning). Swarming brings the right knowledge and expertise to the problem as quickly as possible (run towards the problem). Retros at multiple levels change the culture of how we view failure and build both learning and resilience.
  • Implementing the Incident Management System (IMS). After a large failure in 2019, we dug in, embraced the failure and came out stronger. See Erica's great video above: How We Turned Our Company’s Worst Outage into a Powerful Learning Opportunity (London 2020). We also wrote a paper about improving Incident Response: A Framework for Incident Response (itrevolution.com)


  • Feature/kill-switches and planned "outages". We learned several years ago that "big batch migrations/modernizations" were dangerous, so we started treating many software and operational activities as incremental steps that were likely to fail in some way. During many of our ports, we planned and communicated switchovers during the day, when folks were fresh and we could monitor and roll back quickly. Given the complexity of our systems and integrations, it was not possible to design or code away all edge cases. We needed safe ways to fail quickly, roll back, fix and do it again. This practice built great system understanding, resilience and credibility with stakeholders.



Something Else: COVID-19 levels detected in Illinois Wastewater Plants

https://www.axios.com/newsletters/axios-chicago-e6b1b1e8-7529-40ca-9b38-d969539997c1.html

This week's Axios Local highlighted potential trouble ahead. COVID-19 levels in wastewater have dramatically increased (1000%)… Yikes. I'm wary of panicking over these numbers, as other factors could be at play, such as better testing, immunity, etc. Still, this trend will be important and interesting to watch.

Also, see the US tracker here: CDC COVID Data Tracker: SARS-CoV-2 RNA Levels in Wastewater in the United States
