Chaos in the war room
War rooms. If they are empty, everything is good. If they have a few people in them, there's an issue. And if there isn't enough room for everyone in one war room and it overflows into another one... You get the idea.
War rooms can be fun. In a way. When it's all done and the root cause is found and patched and the operation is back to normal and the moneys/second chart is going up and the CPU usage chart is going flat. You feel like you did a thing. You've all survived.
But it can be very stressful when you get the message: "Please come to the war room immediately".
Stress is not your friend
Stress can be useful. When a guy is chasing you and he has an ax and is wearing a leather apron and is covered in blood, you want stress. You want all the stress you can get to make sure your body is operating at an unhealthy level of awesomeness.
But when you need to use your big brain and think, stress will not help you. Brains don't help with running away. Much.
So we can improve the war rooms' efficiency by lowering the stress. And since we don't control what's going to happen in the next moment, we can only control how we react to it.
Chaos is a super food for stress
Chaos can turn a normal day into a stressful day. And a stressful day into a "why was I even born?!" kind of day.
I've been in a bunch of war rooms and every time it was pretty chaotic. Not total chaos, but something like 90% chaos, 10% keyboard noise.
It doesn't have to be that way though.
The MC
Someone has to run the show. So every war room needs a coordinator. No, not a manager, but a coordinator. Someone who's responsible for communication and for following the steps described below. Someone to answer the question "is it fixed yet?" from managers popping in and out of the war room.
It should probably be someone who can keep calm and who is good with people.
Or it can just be this week's release manager if you have such a role.
The Suspects
Second, we need a list of suspects. If the war room is in a physical space, a whiteboard or a shared big screen/projector could help. The coordinator creates the list, and the rest of the people in the war room provide the suspects: "it might be the caching layer again", "it might be our new experiment, we enabled it this morning", "it's Chinese New Year today, maybe the Asian traffic is killing us", etc.
The Investigators
Once the list is done, or while it's still being created, choose a person for each suspect, from inside the war room or from outside if you need to. That person's name goes next to the suspect, and they take care of checking it.
Too many times I've seen two or three people in a war room looking at the same set of charts, checking the same idea, because they don't know that someone else is already checking it. Having a list visible to everyone will help with that. People will probably volunteer to check the suspect they know about or that they've suggested, so you won't have a problem finding enough investigators.
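If your war room is virtual, the whiteboard can be as simple as a tiny script or a shared doc. Here's a minimal sketch in Python of the suspects list with one investigator per suspect. The names `Suspect` and `SuspectBoard`, the method names, and the investigator "Alice" are all made up for illustration; this is one possible shape, not a prescribed tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Suspect:
    description: str
    investigator: Optional[str] = None  # None means nobody is checking it yet


@dataclass
class SuspectBoard:
    """Hypothetical stand-in for the whiteboard the coordinator maintains."""
    suspects: List[Suspect] = field(default_factory=list)

    def add(self, description: str) -> Suspect:
        suspect = Suspect(description)
        self.suspects.append(suspect)
        return suspect

    def assign(self, description: str, investigator: str) -> Suspect:
        for suspect in self.suspects:
            if suspect.description == description:
                if suspect.investigator is not None:
                    # The whole point: no two people on the same suspect.
                    raise ValueError(
                        f"{suspect.investigator} is already checking this suspect"
                    )
                suspect.investigator = investigator
                return suspect
        raise KeyError(description)

    def unassigned(self) -> List[str]:
        return [s.description for s in self.suspects if s.investigator is None]


board = SuspectBoard()
board.add("caching layer")
board.add("new experiment enabled this morning")
board.add("Chinese New Year traffic")
board.assign("caching layer", "Alice")

# Suspects that still need an investigator
print(board.unassigned())
```

The only rule the sketch enforces is the one that matters here: a suspect that already has a name next to it can't be picked up by a second person without the coordinator noticing.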
The Records
There should... there must be a place and a way to record all the war rooms.
Some of the data points (the more data the better):
Why collect the data for war rooms? To make sure we have fewer war rooms in the future, or to make them end faster.
Imagine that from this data you notice that a lot of war rooms happen on Fridays. You can start looking into what happens on Fridays. Could it be that people are super tired and are pushed by management to deploy before the weekend, so they cut corners and make mistakes? Or is there a technical issue? Maybe a cron job that runs on Fridays and adds just enough CPU load to tip something over?
Imagine that the same name keeps appearing in the "who found the root cause" field. Why is that? Is the person smarter, or more experienced? Is the person better at using the monitoring tools? Can we work with that person to create some tutorials, lessons, or knowledge sharing sessions that will be a part of the onboarding for new joiners?
Is it taking too long to find the right information? Can we build some internal tools to quickly find all the running experiments? Can we create better graphs and provide direct links to them? Are we missing some alerts on some graphs?
Is it the caching layer that causes production issues most of the time? Can we assign more resources to improve the caching layer? Do we even need a caching layer?
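To make these questions a bit more concrete, here's a rough sketch in Python of what a war room record and a few of these checks could look like. The field names (`started`, `ended`, `root_cause`, `found_by`) and the four sample records are invented for illustration; real records will probably hold a lot more.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


@dataclass
class WarRoomRecord:
    started: datetime
    ended: datetime
    root_cause: str  # the component that turned out to be guilty
    found_by: str    # the "who found the root cause" field


# Made-up sample data, just to have something to count.
records = [
    WarRoomRecord(datetime(2024, 3, 1, 14, 0), datetime(2024, 3, 1, 16, 30),
                  "caching layer", "Alice"),
    WarRoomRecord(datetime(2024, 3, 8, 10, 0), datetime(2024, 3, 8, 11, 0),
                  "new experiment", "Bob"),
    WarRoomRecord(datetime(2024, 3, 12, 9, 0), datetime(2024, 3, 12, 9, 45),
                  "caching layer", "Alice"),
    WarRoomRecord(datetime(2024, 3, 15, 15, 0), datetime(2024, 3, 15, 18, 0),
                  "cron job", "Alice"),
]

# Do war rooms cluster on a particular day of the week?
by_weekday = Counter(DAYS[r.started.weekday()] for r in records)

# Does the same name keep finding the root cause?
by_finder = Counter(r.found_by for r in records)

# Is one component guilty most of the time?
by_cause = Counter(r.root_cause for r in records)

# How long does a war room take on average?
total = sum((r.ended - r.started for r in records), timedelta())
mean_minutes = (total / len(records)).total_seconds() / 60

print(by_weekday.most_common(1))  # [('Friday', 3)]
print(by_finder.most_common(1))   # [('Alice', 3)]
print(by_cause.most_common(1))    # [('caching layer', 2)]
print(mean_minutes)               # 108.75
```

None of this is sophisticated, and that's the point: once the records exist, a few counters are enough to surface the Friday pattern or the one person everybody depends on.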
Probably the most important and the easiest metric is how many war rooms there are per week/month/year. The number should be going down, not up.
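Counting war rooms per month takes a few lines once you have the dates written down somewhere. A minimal sketch in Python, assuming all you have is a list of dates on which war rooms happened (the dates below are invented, and `going_down` is just a name picked for this example):

```python
from collections import Counter
from datetime import date

# Made-up dates on which a war room was opened.
war_room_dates = [
    date(2024, 1, 5), date(2024, 1, 12), date(2024, 1, 26),
    date(2024, 2, 9), date(2024, 2, 23),
    date(2024, 3, 15),
]

# Group by "YYYY-MM". Months with zero war rooms won't show up here,
# so a real report should fill those in explicitly.
per_month = Counter(d.strftime("%Y-%m") for d in war_room_dates)

months = sorted(per_month)
counts = [per_month[m] for m in months]

# True when no month has more war rooms than the month before it.
going_down = all(later <= earlier for earlier, later in zip(counts, counts[1:]))

for month, count in zip(months, counts):
    print(month, count)
print("going down:", going_down)
```

With the sample dates this gives 3, 2 and 1 war rooms for January, February and March, so the trend check comes out true.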
The data is so so valuable. Don't underestimate the data.
Who's going to clean up all this mess?
There's no sense in collecting all the data if there's no one whose responsibility it is to improve the war room efficiency. Maybe it's someone from the ops team or from engineering. It depends on your company and the culture. Maybe it's even a whole dedicated team. But there should be someone who'll be able to spend time and thought on improving the war rooms.
Guess what, improving the war rooms will translate into improving the quality of the software, the engineering productivity and happiness, the user satisfaction and, of course, the bottom line. So this is not a place for summer interns or junior devs. This is for experienced folks and for the folks who care. Let's be honest, some people are there only for the paycheck, and that's fine (but I would also track this, because it might be a signal that bad things are happening).
Continuous improvement
Remember, war rooms, like failed experiments, are opportunities to learn and improve. The worst thing you can do is to not do anything about war rooms. Scratch that, the ultra worst thing you can do is to blame someone for war rooms and then not do anything.
It's not about who did it. It's about how we can become more resilient to future issues.