The Hubla CTO Diaries #3 - The one about the war room
In the two weeks since my last report a lot has happened. First, Carnaval, when all of Brazil stops for five days to party or relax. The old joke goes that Brazilians only start working after Carnaval, so I guess we have no excuse any longer :-)
Second, I had the chance to put a war room together.
The quintessential war room, from the classic nuclear end-of-the-world comedy Dr. Strangelove. Unfortunately, it's a more relevant movie now than at any time in the last few decades.
What is a war room anyway?
A war room is a tool we use to deal with a serious situation, one that the company's normal processes are not enough to handle. Another name for the same thing is code yellow. When a war room is created, a small team is mobilized to work full time on the burning issue until the situation is under control.
Hubla's Reliability War Room
The war room we created was to deal with Hubla's reliability problem. Today we release code daily, but our end to end test coverage is not enough to ensure serious bugs don't hit production. When a bug like that happens, our users can have a really hard time until we fix the issue.
The war room's strategy is to push code to production weekly rather than daily, and only after it has been thoroughly tested by hand. This scheme goes against engineering best practices, which are to release often and automatically, but we are doing it deliberately. By paying a large price in cycle time we get a large and immediate gain in production reliability.
While the war room will have a few people working on it (writing test scripts, automating them into end to end tests, configuring releases, etc.), we appointed a single person to be responsible for the health of releases pushed to production. That person can then feel empowered to be the guardian of reliability. It's always important to have cool names for these sorts of roles, so we went with release sheriff.
The sheriff will manually validate the weekly release and push it to production. They also have to handle requests from people who want to create ad-hoc releases to fix a bug or to get a time-sensitive feature out sooner. We picked a Hubla long-timer who knows a lot about our systems and knows everyone in the company. That way they will be able to judge which fixes are important enough to justify an ad-hoc release, and be comfortable enough saying no.
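To make the release rule concrete, here is a minimal Python sketch. It is purely illustrative and not Hubla's actual tooling: the release weekday, the `ReleaseRequest` fields, and the `may_release` function are all hypothetical, and it assumes that ad-hoc releases also need manual validation.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: the post says releases are weekly but names no day.
WEEKLY_RELEASE_WEEKDAY = 1  # Tuesday (Monday is 0)


@dataclass
class ReleaseRequest:
    description: str
    manually_validated: bool              # has the sheriff tested it by hand?
    sheriff_approved_adhoc: bool = False  # sheriff agreed it cannot wait


def may_release(request: ReleaseRequest, today: date) -> bool:
    """Return True if the request may go to production today."""
    if not request.manually_validated:
        return False  # nothing ships without manual validation
    if today.weekday() == WEEKLY_RELEASE_WEEKDAY:
        return True   # the regular weekly release
    # Any other day, only an ad-hoc release the sheriff approved goes out.
    return request.sheriff_approved_adhoc
```

The point of the sketch is that there is a single gate with a single owner: everything outside the weekly slot needs an explicit yes from the sheriff.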
Winding Down the War Room
Another important aspect of war rooms is that they need to end. We could end ours by declaring that we will forever do weekly manually gated releases, but that's simply not good enough. Healthy software orgs need to release fast and safely.
The way I see it, the war room is creating tech debt by changing the release strategy, so for the sake of our long term health it also needs to pay back that debt. The way we made sure that would happen was to clearly specify the war room exit criteria at the outset: it will end once we have end to end tests for all of Hubla's core flows and we are able to safely release daily or even hourly. That was made clear to the folks mobilized to work on the war room as well as to their managers.
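Exit criteria like these are easiest to hold people to when they can be checked mechanically. A minimal sketch, assuming a made-up list of core flows (the post does not name them) and a hypothetical `war_room_can_end` helper:

```python
# Assumption: hypothetical flow names, not Hubla's real core flows.
CORE_FLOWS = {"signup", "checkout", "content_access"}


def war_room_can_end(flows_with_e2e_tests: set[str], safe_daily_release: bool) -> bool:
    """Both exit criteria must hold: full core-flow coverage and safe daily releases."""
    missing = CORE_FLOWS - flows_with_e2e_tests
    return not missing and safe_daily_release
```

Listing the missing flows, rather than only returning a yes or no, would also give a simple progress report for the team and their managers.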
In conclusion
It's not fun to declare a war room. I'd much prefer the boring routine of project after project. But startups move fast, and sometimes you build yourself into a situation that requires targeted action to get out of. If you notice that fast enough and can put together a quick response, you can unwind a bad situation and get back to the well-lit path.
Knowledge Interface Manager | Staff Data Engineer | I put an emoji in my name to screen bots | MSc Comp Sci | Ex-Google, ex-Amazon
3 years ago: Great write-up! I liked the exit criteria with a clear, achievable goal, and I'm wondering: was there any request for a time limit? Or, at least, an estimate of how long the war room would last? I believe that even with a goal in mind, other people may get nervous not knowing whether it will last 1 week or 1 month.