Chaos in the war room
Photo by Petr Magera on Unsplash

War rooms. If they are empty, everything is good. If they have a few people in them, there's an issue. And if there is not enough room for all the people in one war room and it overflows into another one... You get the idea.

War rooms can be fun. In a way. When it's all done and the root cause is found and patched and the operation is back to normal and the moneys/second chart is going up and the CPU usage chart is going flat. You feel like you did a thing. You've all survived.

But it can be very stressful when you get the message: "Please come to the war room immediately".

Stress is not your friend

Stress can be useful. When a guy is chasing you and he has an ax and is wearing a leather apron and is covered in blood, you want stress. You want all the stress you can get to make sure your body is operating at an unhealthy level of awesomeness.

But when you need to use your big brain and think, stress will not help you. Brains don't help with running away. Much.

So we can improve the war rooms' efficiency by lowering the stress. And since we can't control what's going to happen in the next moment, we can only control how we react to it.

Chaos is a super food for stress

Chaos can turn a normal day into a stressful day. And a stressful day into a "why was I even born?!" kind of day.

I've been in a bunch of war rooms and every time it was pretty chaotic. Not total chaos, but something like 90% chaos, 10% keyboard noise.

It doesn't have to be that way though.

The MC

Someone has to run the show. So every war room needs a coordinator. No, not a manager, but a coordinator. Someone who's responsible for communication and for following the steps described below. Someone to answer the question "is it fixed yet?" from managers popping in and out of the war room.

It should probably be someone who can keep calm and who is good with people.

Or it can just be this week's release manager if you have such a role.

The Suspects

Second, we need a list of suspects. If the war room is in a physical space, a whiteboard or a big shared screen/projector could help. The coordinator creates the list, and the rest of the people in the war room provide the suspects: "it might be the caching layer again", "it might be our new experiment, we enabled it this morning", "it's Chinese New Year today, maybe the Asian traffic is killing us", etc.

The Investigators

After the list is done, or while it's being created, choose a person for each suspect, from the war room or from outside if you need to. That person gets his or her name put next to the suspect and takes care of checking it.

Too many times I've seen two or three people in a war room looking at the same set of charts, checking the same idea, because they didn't know that someone else was already checking it. Having a list visible to everyone will help with that. People will probably volunteer to check the suspect they know about or the one they've suggested, so you won't have a problem finding enough investigators.
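If the war room is remote and there is no whiteboard, the same list can live in a few lines of code or a shared doc. Here is a rough sketch of the idea in Python. The class name, the method names and the people are all made up for illustration; the only point is that claiming a suspect that already has an owner tells you who is on it instead of silently doubling the work.

```python
class SuspectBoard:
    """A visible list of suspects with at most one investigator each (hypothetical helper)."""

    def __init__(self):
        # suspect -> investigator name, or None while nobody is checking it
        self._owners = {}

    def add_suspect(self, suspect):
        """Put a suspect on the board without changing an existing owner."""
        self._owners.setdefault(suspect, None)

    def claim(self, suspect, person):
        """Assign person to suspect unless it is already taken.

        A suspect that is not on the board yet is added and assigned.
        Returns the name of whoever owns the suspect after the call.
        """
        current = self._owners.get(suspect)
        if current is None:
            self._owners[suspect] = person
            return person
        return current

    def unassigned(self):
        """Suspects that nobody is checking yet."""
        return [s for s, owner in self._owners.items() if owner is None]


board = SuspectBoard()
board.add_suspect("caching layer")
board.add_suspect("new experiment")

first = board.claim("caching layer", "Ana")   # Ana takes it
second = board.claim("caching layer", "Ben")  # already taken, Ben learns Ana has it
print(first, second, board.unassigned())
```

The coordinator can read out `unassigned()` every few minutes to see which suspects still need an investigator.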

The Records

There should... there must be a place and a way to record all the war rooms.

Some of the data points (the more data the better):

  • when it started;
  • who was in the war room;
  • when the resolution was found and applied;
  • if possible, who found the root cause;
  • what was the root cause;
  • the component(s) that broke;
  • the effect for the users;
  • the effect for the business;
  • tags;
  • any artifacts, links, dashboards, etc. that were useful;
  • anything and everything you can think of, record it.
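
To make the list above a bit more concrete, here is one possible shape for a single record, sketched in Python. All the field names are my own guesses at reasonable names, not a standard, and the sample values are invented. A spreadsheet with the same columns would do the job just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class WarRoomRecord:
    """One war room, using the data points listed above (field names are hypothetical)."""
    started_at: datetime
    resolved_at: Optional[datetime] = None   # when the resolution was found and applied
    participants: list[str] = field(default_factory=list)
    found_by: Optional[str] = None           # who found the root cause, if known
    root_cause: str = ""
    components: list[str] = field(default_factory=list)
    user_impact: str = ""
    business_impact: str = ""
    tags: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)  # artifacts, dashboards, etc.
    notes: str = ""                          # anything else worth recording

    def duration(self) -> Optional[timedelta]:
        """Time from start to resolution, or None while the war room is still open."""
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.started_at


# Invented example values, only to show how a record might be filled in.
record = WarRoomRecord(
    started_at=datetime(2024, 2, 9, 14, 5),
    resolved_at=datetime(2024, 2, 9, 15, 35),
    participants=["alice", "bob"],
    found_by="alice",
    root_cause="cache eviction storm",
    components=["caching layer"],
    tags=["cache"],
)
print(record.duration())
```

Keeping the start and resolution times as real timestamps (not free text) is what makes the later questions, like "how long do our war rooms last?", cheap to answer.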

Why collect the data for war rooms? To make sure we have fewer war rooms in the future, or to make them end faster.

Imagine that from this data you notice that a lot of war rooms happen on Fridays. You can start looking into what happens on Fridays. Could it be that people are super tired and are pushed by management to deploy before the weekend, so they cut corners and make mistakes? Or is there a technical issue? A cron job that runs on Fridays and adds just enough CPU load to trip something over?
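
Spotting a pattern like that takes very little code once the start times are recorded. A minimal sketch in Python, assuming you can pull the start times out of wherever the records live; the timestamps below are invented sample data and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Invented start times of past war rooms, for illustration only.
started = [
    datetime(2024, 2, 2, 16, 40),   # a Friday
    datetime(2024, 2, 6, 10, 15),   # a Tuesday
    datetime(2024, 2, 9, 14, 5),    # a Friday
    datetime(2024, 2, 16, 17, 30),  # a Friday
]


def war_rooms_by_weekday(start_times):
    """Count how many war rooms started on each day of the week."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return Counter(DAYS[t.weekday()] for t in start_times)


by_day = war_rooms_by_weekday(started)
for day, count in by_day.most_common():
    print(f"{day}: {count}")
```

The same `Counter` trick works for any other field: who found the root cause, which component broke, which tag shows up most.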

Imagine that the same name keeps appearing in the "who found the root cause" field. Why is that? Is the person smarter, or do they have more experience? Is the person better at using the monitoring tools? Can we work with that person to create some tutorials, lessons, or knowledge sharing sessions that will become part of the onboarding for new joiners?

Is it taking too long to find the right information? Can we build some internal tools to quickly find all the running experiments? Can we create better graphs and provide direct links to them? Are we missing some alerts on some graphs?

Is it the caching layer that causes production issues most of the time? Can we assign more resources to improve the caching layer? Do we even need a caching layer?

Probably the most important and the easiest metric is how many war rooms there are per week/month/year. The number should be going down, not up.
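
Here is a small Python sketch of that metric, counting war rooms per month and checking that the number never goes up. The dates are invented and the function names are hypothetical. One caveat: a month with zero war rooms simply doesn't appear in the result, so a real report would want to fill those gaps in.

```python
from collections import Counter
from datetime import date

# Invented dates of past war rooms, for illustration only.
war_room_dates = [
    date(2024, 1, 5), date(2024, 1, 12), date(2024, 1, 26),
    date(2024, 2, 9), date(2024, 2, 23),
    date(2024, 3, 15),
]


def war_rooms_per_month(dates):
    """Return {(year, month): count}, ordered by month."""
    counts = Counter((d.year, d.month) for d in dates)
    return dict(sorted(counts.items()))


def never_goes_up(counts):
    """True if the count never rises from one recorded month to the next."""
    values = list(counts.values())
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


per_month = war_rooms_per_month(war_room_dates)
print(per_month)
print(never_goes_up(per_month))
```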

The data is so so valuable. Don't underestimate the data.

Who's going to clean all this mess?

There's no sense in collecting all the data if there's no one whose responsibility it is to improve the war room efficiency. Maybe it's someone from the ops team or from engineering. It depends on your company and the culture. Maybe it's even a whole dedicated team. But there should be someone who'll be able to spend time and thought on improving the war rooms.

Guess what, improving the war rooms will translate into improving the quality of the software, the engineering productivity and happiness, the user satisfaction and of course the bottom line. So this is not a place for summer interns or junior devs. This is for experienced folks and for the folks who care. Let's be honest, some people are there only for the paycheck and that's fine (but I would also track this, because it might be a signal that bad things are happening).

Continuous improvement

Remember, war rooms - like failed experiments - are opportunities to learn and improve. The worst thing you can do is to not do anything about war rooms. Scratch that, the ultra worst thing you can do is to blame someone for war rooms and then not do anything.

It's not about who did it. It's about how we can become more resilient to future issues.


Written by Evgheni Kondratenko