Code, Interrupted: Building your "fault-tolerant" flow

How do you equip yourself, and more importantly your team or entire company, for disruption?

Over my years working in climate tech and growing engineering teams, I've come to take disruption as a given. Recognizing this allows teams to develop the tools and culture to work with disruption as a matter of course.

Recently, I was invited to facilitate a workshop on this very topic by Zak L. and Ehsan Mokhtari (CEO and CTO, ChargeLab) for their May 2023 company offsite. Throughout the workshop, the ChargeLab team and I embarked on an exploration, digging into their key workflows, their susceptibility to disruptions, and how they could become more "fault-tolerant".

To learn more about what I mean, check out the 30-minute edited version of the workshop recording, or continue reading for a summary.

The ChargeLab team, working on new ways to become fault tolerant.

The cost of disruption

No matter how well we plan, the world often behaves in unexpected ways. In the engineering world, we speak of "interrupts" and "faults" in our programs, but we find ourselves surprised at all scales, from our personal thought processes, to our team's flow, to our company's strategy.

In our workshop, the ChargeLab team surfaced examples from across the company, from bug reports and production issues to miscommunications around specifications or design assets killing team productivity.

The examples above are very engineering-heavy, but the issue is far from engineering-specific. For example, a ChargeLab marketing team noted how disruptive it can be to have major priorities shift mid-project, creating wasted work and stalling the team's momentum.

"Fault Tolerance"

Without the right tools, we often struggle through these disruptions, painfully rebuilding our momentum each time. But new opportunities open up when we recognize that these failures will occur, and design for them in advance. Enter "fault tolerance":

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure... [Wikipedia]

The concept of fault tolerance dates to the earliest days of computing, but it applies across all of the scales of human process described above. With this mindset, teams not only anticipate disruptions, but also put robust recovery systems in place. Teams can be fault tolerant at all levels, from individual tasks to team projects, right up to company-wide initiatives.

At the ChargeLab offsite, we broke into small-group brainstorming sessions. Each group's facilitator then shared key findings with the whole room (see 17:38 in the workshop video). I've summarized many of the ChargeLab findings and incorporated them, along with my own notes, in the following section. These have also been edited into the workshop video itself as slides.

At the individual level, I find that a key indicator of seniority in engineers is the ability to develop and apply fault-tolerant strategies. Great engineers know that any metaphorical cliff worth climbing will challenge them, so they learn how to use climbing gear (and climbing partners) to limit how far they fall when they inevitably slip.

The ChargeLab team and I discussed a range of such strategies. For instance, in-the-moment practices like making quick notes can preserve a train of thought. Good testing practices have the side benefit of leaving the "next step" in focus, even when you walk away from your desk. And conducting "technical spikes" can keep us from moving too far in the wrong direction.

Workshop notes on how individuals can become more fault-tolerant.

At the team level, there are also many tools that organizations develop as they mature. Much of what the ChargeLab teams discussed revolved around how many asks can come into a team that aren't related to their main project flow -- e.g. bug reports, feature requests, production issues, and questions about that team's part of the product.

Just as we can expect disruptions in individual work processes, we should also anticipate them in team processes. The ChargeLab folks had a number of ideas for handling disruptions as a matter of course. For example, teams can isolate certain types of disruptions to established queues or on-call processes. They can block off dedicated time for certain types of inbound requests ("Office Hours"). And they can create clear accountability for parts of their work, so that triaged issues can be directed quickly, while also highlighting where the team may need to address knowledge bottlenecks.

Workshop notes on how teams can become more fault-tolerant.

From the wealth of real-world examples we discussed, it was clear that ChargeLab folks were willing to have vulnerable conversations in their small groups. Their candor shed light on how interruptions actually affect daily workflows, and it led us to many best practices that point the way to much greater fault tolerance across the company.

Thank you

I want to express my deep gratitude to Zak, Ehsan, and the rest of the ChargeLab team for the opportunity to open this topic at their company offsite.

It's exciting to help build a resilient culture, since it's such a major lever for most organizations. But beyond this, ChargeLab sits squarely in my focus: working with organizations at the forefront of our climate transition. With ChargeLab, I get to further the mission of a company that's building the software to drive a necessary 5x growth in EV charging capacity by 2030.

I'm eager to see how ChargeLab continues to develop fault-tolerant strategies, and I look forward to the team's ongoing growth!


Eric Nguyen

Engineering & Product Leader. Focused on Climate.

1 year ago

Revisiting this after some conversations with engineers... I gave earlier versions of this talk because engineers were asking me, "How do I make more impact?" Yes, fault tolerance is a valuable cultural shift that benefits the entire organization. But it's also a major factor that distinguishes more effective, senior engineers. A nuts-and-bolts example: Let's say you're a software engineer who keeps getting interrupted (by Slack messages, fires, new priorities, etc.). You recognize that this is killing your productivity, but also that most of these interruptions are out of your control. To mitigate this, you adopt test-driven development (TDD). Now your flow can be disrupted, but you'll always have your next step ready in the form of a red test. The beauty of the best fault-tolerant practices (and why the talk got broader over time) is that what starts by benefiting you ends up benefiting the whole team. You're shipping more, of course, but you're also better able to explain your approach to others. Plus, the code is now tested and itself much more resilient to changes the team will need to make to it in the future.
