Keeping it alive: Your company and handling interruptions in times of crisis.
Gert Taeymans
Resilience expert, Founder and Chief Executive Officer at Tymans Group, CISM, DORA Certified Compliance Specialist
Monday, March 16 saw the first real test of the European infrastructure of many companies.
That is how my previous article started. It covered the People part of keeping your People, Processes and Technology in place and working in times of crisis. Today, I want to take you into the Process part of the equation.
On April 13th, 2020, millions of Europeans are still working from home, trying to get their regular job done as best they can.
For that, without a doubt, a number of processes had to change in your company. Most likely the computer infrastructure did not allow for everyone to work from home. The term “incident” took on a whole new meaning.
Incidents are interruptions to the normal way of working
For staff in a company, the term “incident” typically means that the technology has a problem: “The printers do not work”, “my computer does not boot up” or “I cannot reach application ABC”. But an “incident” can be anything that puts the brakes on your normal day-to-day, like a falling-out between two co-workers, the mail not arriving, documents that are not where they are supposed to be, or a client who calls with an unexpected situation. Or a virus like Corona striking.
When a company suddenly must turn on a dime and re-invent large parts of the workflow, you find that there are many assumptions that no longer hold.
Take Pierre, for example, an employee of Michael (see the first article, link further down), who sifts daily through the 20 sets of lenses that are returned for unknown reasons. The 75 other ones fall neatly into the predefined process flow, but those 20, well, they need manual intervention. And nobody ever wrote down that process because, well, they are all one-offs, so you “can’t describe it.”
Another company has a rule that traders are not allowed to work from home. There are many “obvious” reasons for this. Most of these have to do with risk management, and others with latency. I will not become technical, but you can imagine that your home network may be just a bit slower than the office network.
When the governments of several countries decided on lockdown procedures, light or otherwise, it suddenly meant that our normal way of working no longer held. We see the closure of many departments and companies because they have no means to continue operating.
But things do not need to be that dire for interruptions to occur. While you may keep on working, you may find that things keep breaking because they are under greater stress, and that you have many more “interruptions to your normal day-to-day” than usual.
That means you had better get really good at handling “incidents”.
Bring order to chaos
As the old adage goes, “Shit Happens”. While you may not be able to predict every individual thing that will go wrong, you can be sure that something, somewhere will go wrong at some point in the future.
You may be surprised at how many things you can accurately predict will go wrong when you think about your operations holistically and structurally. But that is food for another article.
When you face a lot of incidents, you need a way to quickly bring all required information together so that you have a good overview and can take appropriate action.
When you do not have a lot of incidents, you can still benefit from this system, because you can then simply follow a number of steps to complete the recovery. When a company only occasionally has an interruption, it can become a stressful time for the people that are tasked to solve it, if they do not have a framework or set of guidelines to follow.
Now, whether you have a lot of incidents, or only a few bad ones, the people assigned to lead the incident need to be cut from the right cloth and have the required support. For that, I refer you back to my first article.
The Incident Life-cycle
An incident never comes out of a clear blue sky. That may sound counter-intuitive, and some causes are so far outside your control that they may as well have. It is important to keep this in mind, though.
This means that after the incident you will need to look for a root cause, the “60 seconds to disaster” so to speak. This is also why it is so important to understand that incident response does not start when you have an incident. It starts way before that: during the preparation phase, as we'll see further down.
I like to visualize this in the below timeline.
The time leading up to an incident (the shaded white to grey zone) is where the seeds are sown. It may be a change to the systems or supply chain. In that case it is typically easy to find out what happened, provided you have some sort of change management process in your organization.
BTW - Do not let “modern” gurus tell you that change management is dead! It is not. It has just taken on a different form, but the basic premise is still there: know the “before” situation, know who changed what, where and when, and know the “after” situation. Whether you do that with meetings and pen and paper, or with a fully automated log and deployment pipeline with auditing capabilities, is up to you.
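To make the “before, who, what, where, when, after” premise concrete, here is a minimal sketch of a change log in Python. The field names, the sample entries and the seven-day look-back window are all assumptions made for illustration; a paper logbook with the same columns serves the same purpose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """One entry in a change log (all field names are illustrative)."""
    who: str          # person or automated job that made the change
    what: str         # short description of the change
    where: str        # system, supplier or process step that was touched
    when: datetime    # moment the change went live
    before: str       # the known "before" situation
    after: str        # the known "after" situation


def changes_before(log, incident_start, window=timedelta(days=7)):
    """Return changes made in the window leading up to an incident, newest first."""
    earliest = incident_start - window
    recent = [c for c in log if earliest <= c.when <= incident_start]
    return sorted(recent, key=lambda c: c.when, reverse=True)


# Made-up sample entries, for illustration only.
log = [
    ChangeRecord("Ann", "New page layout", "client portal",
                 datetime(2020, 4, 10, 9, 0), "layout v1", "layout v2"),
    ChangeRecord("Bob", "Replaced printer", "front office",
                 datetime(2020, 3, 1, 14, 0), "printer A", "printer B"),
]
suspects = changes_before(log, datetime(2020, 4, 13, 8, 0))
print([c.what for c in suspects])
```

When something breaks, the most recent changes are usually the first place to look, which is why the function returns them newest first.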
There are of course other causes of an incident, such as a breakage, someone becoming ill or no longer doing their job as they should, a successful marketing campaign after which the demands on your organization have grown too much too quickly, or external factors over which you have no control.
The point is, none of these come falling out of a clear blue sky. Breakage can be anticipated via regular maintenance; people getting ill or losing interest in their job can be spotted and helped through follow-up and care; a marketing campaign can be planned together with all stakeholders, not just the boss and sales; and external factors can be risk managed.
Prepare
To be effective when an incident actually happens, you have to know 3 things:
- How the process works normally
- What the components of the process are
- What steps you will follow to get to a resolution
Preparation 1: Know how the process works normally
While this feels self-evident, pick any process in your company and see if you know exactly how it flows from beginning to end, and who deals with what and when. Chances are that you may not know some steps or are unclear where some information is kept.
And if you do know, check if you are the only one who knows. As the boss of a small or mid-sized operation, you may still have that handle on things. But what if you are the affected party? Who can resolve the situation besides you?
Not having the right knowledge at incident time will significantly lower your efficiency and potentially even your effectiveness.
That means that in the preparation phase you must document how your company gets things done:
- Who does what and what does what (think automated processes)
- In what order
- Using what tools (automated processes will also be using other tools like middleware, or resources on the internet)
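The three points above (who or what, in what order, using what tools) can be captured in something as small as the following Python sketch. The order-handling process, its actors and its tools are hypothetical, as are the names `Step` and `steps_using`; a spreadsheet with the same columns would do the same job.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    order: int                      # position in the flow
    actor: str                      # who (person) or what (automated job) does it
    action: str                     # what gets done
    tools: list = field(default_factory=list)  # tools the step relies on


# Hypothetical order-handling process, for illustration only.
process = [
    Step(1, "Sales clerk", "Enter order", ["order system"]),
    Step(2, "Nightly job", "Send order to warehouse", ["order system", "middleware"]),
    Step(3, "Warehouse staff", "Pick and ship", ["warehouse scanner"]),
]


def steps_using(steps, tool):
    """List the steps, in order, that stop working when a tool is unavailable."""
    return [s for s in sorted(steps, key=lambda s: s.order) if tool in s.tools]


for s in steps_using(process, "middleware"):
    print(s.order, s.actor, s.action)
```

At incident time, a question like “the middleware is down, what is affected?” then has an answer in seconds instead of after a round of phone calls.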
Larger organizations use things like a database full of details about their systems, people and links. But flow charts also work wonders when resolving problems.
Whatever form you choose, make sure it is easily accessible to the people that will need to work to resolve the problem. At the same time, do not put this stuff out on the street either. Apply common sense on how to protect that information from outsiders and people that do not need to know.
Preparation 2: Know what the components of the process are
It is also important that this documentation is kept up to date. Therefore, while in “normal operation” have a process or directive that this documentation is reviewed regularly. Whenever something changes, have the people that do the change look at this document, or database, or drawing and check if it still reflects the process or flow or way of working accurately. If not, have them update it. It will help you tremendously at incident time.
For example, say you are a real estate agent and you use a central client management system to manage properties and clients. You probably also promote your properties on several different platforms like Immoscout or Zimmo or the MLS service in the US. Those platforms offer a way for your management system to “talk” to their system. That helps you to only manage the information in your own system and have it sent to these other platforms at the push of a button.
In this process a lot can suddenly stop working, even if to you there are only 2 components:
- your system
- the target web page
Simple, right? Not quite...
And this is just high level. Even though we like to think that “I’m directly connected to this web site”, that is not really the case. Just like the letter you send to the client does not "automagically" appear on their doorstep. There are many components in between you and the result.
This is just a very simple example. And these processes are certainly not limited to information technology.
This means that a change or breakage in any of these steps can mean that your update does not make it onto Zimmo. Diagrams or documents that describe these links do not need to be complex. As a matter of fact, they need to be as simple as possible. Typically, you’d use an onion approach.
Think back to the drawings of your own house: you have an electrical plan, a plumbing plan, an elevation etc. If all these elements were on the same piece of paper, things would get very confusing very quickly. So, they are on different sheets that are layered on top of each other and that you can peel off to reveal only what you need to see.
The same with this: start with the overall picture like the one above and work your way into deeper detail where needed.
Preparation 3: Know what steps you will follow to get to a resolution
This is where it helps to have a standard way of working.
“Determine criticality” is in grey, because this typically happens automatically through your Event Management system, based on rules you set while preparing your application or flow or process. You may not have a formal event system, but you know that if you stand to lose a lot of money or suffer reputational damage, you need to act. As a reminder, the incidents we are discussing here are the larger incidents that may have serious impact.
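Such a criticality rule does not need to be sophisticated. The Python sketch below is one possible shape for it, not how any particular event management product works; the function name, the labels and the 10,000 euro threshold are assumptions you would replace with your own.

```python
def determine_criticality(estimated_loss_eur, reputational_damage, loss_threshold_eur=10_000):
    """Classify an event as 'major' or 'minor' using two simple rules.

    The threshold is an assumed placeholder; set it to whatever
    "a lot of money" means for your organization.
    """
    if reputational_damage or estimated_loss_eur >= loss_threshold_eur:
        return "major"   # kick off the incident management sequence
    return "minor"       # handle through the normal day-to-day channels


print(determine_criticality(500, reputational_damage=False))
print(determine_criticality(500, reputational_damage=True))
print(determine_criticality(25_000, reputational_damage=False))
```

The value lies in agreeing on the rule beforehand, so that nobody has to debate at incident time whether something is serious enough to act on.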
It is important to see that you need to have a standard team or person that kicks off the incident management sequence. Otherwise, everyone will start running once they notice something is wrong. Typically, in all directions, with chaos as a result.
Depending on the type of incident, you may need to choose a different lead for the incident. The lead to start with should be a member of the incident management team. That person should have a helicopter view of the impacted area. If you already have a good idea of the department or area where the fault may have originated, also include someone from there, but pick one single lead.
If you do not have an incident management team, pick the head of the impacted department to start with, although the lead should then switch to the senior person of the department that can resolve it. Do not pick the person who actually has to act to resolve it! He or she will be busy fixing whatever needs fixing and will not have the time for proper reporting to the stakeholders.
All of this must be decided before an incident actually happens. That way, when something does occur, all you need to do is pick up your incident management run book and follow it.
Resolution Mode
In resolution mode, I find it helpful to use a resolution flow:
I always follow this when resolving an incident.
- Write down the facts, and ONLY the facts. Resist, for a minute, the urge to immediately start thinking about what the cause may be.
- Write down the theories. You may need several potential theories here. Again, resist the urge to jump to conclusions. If there are several likely candidates, write them all down.
- Define actions to test each theory.
- Assign a person to act for each action.
- Collect the results.
If the results are negative, meaning the theory did not pan out, leave it and move on. If the results confirm the theory, then the outcome of the action becomes a new fact that can be added to the list. (See example lower in the article.)
Note that this process can go very quickly. Some theories are very easily dismissed or confirmed.
A confirmation though does not mean that you found the cause of the issue. Whatever you found may still have deeper underlying causes.
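For those who prefer to keep the board in a tool instead of on a wall, the flow of facts, theories, actions, owners and results can be sketched in a few lines of Python. The class names, the statuses and the sample incident are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Theory:
    statement: str
    action: str = ""        # test that would confirm or dismiss the theory
    owner: str = ""         # the one person assigned to run the test
    status: str = "open"    # "open", "confirmed" or "dismissed"


@dataclass
class Board:
    facts: list = field(default_factory=list)
    theories: list = field(default_factory=list)

    def add_theory(self, statement, action, owner):
        theory = Theory(statement, action, owner)
        self.theories.append(theory)
        return theory

    def record_result(self, theory, confirmed, finding=""):
        """Dismiss the theory, or confirm it and promote the finding to a fact."""
        if confirmed:
            theory.status = "confirmed"
            self.facts.append(finding or theory.statement)
        else:
            theory.status = "dismissed"

    def open_theories(self):
        return [t for t in self.theories if t.status == "open"]


# Made-up incident, for illustration only.
board = Board(facts=["Updates have not reached the listing site since 09:00"])
t1 = board.add_theory("The listing site is down", "Open the site in a browser", "Ann")
t2 = board.add_theory("Our export job is failing", "Check the export log", "Bob")
board.record_result(t1, confirmed=False)
board.record_result(t2, confirmed=True, finding="Export job has failed since 09:00")
print(board.facts)
print([t.statement for t in board.open_theories()])
```

Note that the confirmed finding lands in the list of facts and not in a list of causes: it is a new starting point for the next round of theories.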
But there are really 2 phases in resolution mode:
- Pre-Analyze and Contain
- Analyze, Repair and Recover
Pre-Analyze and Contain
This phase is meant to get a very quick view of what is happening and to determine if there is critical “bleeding”. This means that the issue is acute, happening as we speak and needs to be stopped immediately.
Examples of critical “bleeding” are a hacker stealing your data, a process sending out duplicate payments, a bad actor actively posting to your social media account, or the alarm system of your shop going off.
In the first instance you will want to stop whatever is happening, but you need to try and do it in a way that will not make the situation worse. In a case like that, take the least disruptive action that will stop the activity from continuing.
At some point I got a call from a client whose main computer system was looping through millions of records of a central database located on another machine. That brought all other processes at the site to a stop, with visible effects to the outside world. We had 2 options: shut down the machine (by pulling the plug) or pull the network cable from the offending machine. Obviously, we chose the latter. The database got a chance to catch its breath and the front-end systems could resume. Of course, the situation was a bit more complex than that, but you get the idea.
This means that in this phase I tend to employ a shortened version of the resolution flow:
Containment is also very important if you have a security type incident. If there is no “Stop The Bleeding” needed, then we can proceed with the normal resolution flow.
Analyze, Repair and Recover
Below is an example of the normal resolution flow, which is easily used with a standard whiteboard.
It is important to have some periodicity built into this: every 90 minutes or whatever is appropriate, we report to the stakeholders.
This brings an Analyze->Decide->Perform->Report rhythm into the process, which helps on a number of fronts:
- It focuses everyone on the task at hand. What you want is that people are dedicated to resolving the issue.
- It ensures the fastest possible resolution time. Either the issue gets solved completely or a workaround is implemented. The time to get back to as close to normal as possible is minimized.
- It helps to calm everyone in the organization down.
- It allows the other governance bodies in your company to review the evolving situation and decide on company wide measures if needed.
- It clearly establishes the credentials of those managing the incident.
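The reporting rhythm is easy to fix up front. As a small sketch, assuming the 90-minute interval mentioned above and a hypothetical helper name, the report times for the first hours of an incident can be worked out like this:

```python
from datetime import datetime, timedelta


def report_times(start, count, interval=timedelta(minutes=90)):
    """Return the next `count` stakeholder report times after `start`."""
    return [start + interval * n for n in range(1, count + 1)]


# Incident declared at 09:00 (made-up time, for illustration only).
for moment in report_times(datetime(2020, 4, 13, 9, 0), count=3):
    print(moment.strftime("%H:%M"))
```

For an incident declared at 09:00 this gives 10:30, 12:00 and 13:30. Announcing these times at the start tells stakeholders when to expect news, which helps keep them from calling the team in between.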
The team stays together until the situation is back to normal.
But then the question becomes: do you need to keep the whole team together? Remember that you started with a standard set of people: your incident management team, or the one person that is always called in, who then calls in the people he or she trusts.
During the incident you will have determined that some areas are really not involved with the incident. Those you let go as quickly as possible. What remains is a core set of people that are needed to resolve the incident or have a function towards stakeholders.
The situation is back to normal when there is no more impact on the organization. This does not necessarily mean that it is back to exactly the situation it was before. What it does mean is that the company can again function normally.
In times of Corona, many companies are definitely not back to the way they were. But they are functioning normally: meaning their service output is the same as before or close enough to it. And yet, most people at the time of writing are not working in the office, but at home. A situation deemed unthinkable even just two months ago.
Observe and Learn
Observe and Learn is the process whereby we observe how the incident unfolded and how it was resolved. Once the incident is resolved, we need to learn lessons from it:
- Were we efficient in resolving the incident?
- Was the right attitude present?
- How was the decision-making process?
- How did the leadership work?
- Were there bottlenecks we could have avoided?
- Did we have all the right data, could we create good information out of it?
No doubt you will find many more questions to ask during this phase. Most importantly, this needs to be blameless. Do not point fingers, remain factual and polite to all involved.
If there were problems with attitude, take them up separately. This process is about learning, not blaming.
Root Cause Analysis
While RCA is not a function of incident management (it is part of problem management), it does deserve mention here.
Root Cause Analysis is the process of delving deeper into what happened and really figuring out how the incident could occur in the first place. It is very important that this is done. The aviation industry would not be able to present its current safety record if not for the forensic activities of, say, the transportation safety boards all over the world.
It is insufficient to say that a plane went down because the wing tore off. You need to be able to say why the wing tore off, then why the bolt that should have kept the wing on broke, and so on.
In our example above a small design update to make the page look better ended up hiding a simple error message. That caused many people not to be able to update the central system, potentially causing lost sales.
That however is not the root cause: how was it possible that this was not detected in testing? Or was it, and it was not deemed serious enough? Or was the development team under too much pressure by the marketing department? Or were only good entries used to test?
In other words, what is preventing us from making the same mistake again next week? Or next year when we have forgotten this episode?
In Root Cause Analysis you go deeper, and you examine all angles to this problem (hence it is part of problem management). You expose all contributing factors and determine whether they need fixing.
This is also why it is so important to keep these processes blameless. Unless you have a terrorist or industrial spy working at your company, placed there as a mole to undermine your operations, chances are your staff are not out to get you. They are there to do their best, so that they can go home with a sense of fulfillment and in the knowledge they helped you and that they have shown their value to you.
The results of the Root Cause Analysis are again fed back into the prepare cycle of normal operations to make the environment better prepared for the next incident. Because there WILL be a “next incident”. Trust me.