Rapidly Develop, Debug and Resolve Incidents using OODA Loop

As part of my planning for the "Mic Drop" sessions we run here at TransUnion UK, I am working on a number of presentations under the theme of "Building products like a Rocket Scientist".

One of the interesting elements that came up was the application of the OODA Loop to incident management and code development, something SpaceX has been noted to use, as well as the US military.

What is the OODA Loop?

The Observe, Orient, Decide, Act Loop, or OODA Loop for short, was developed in the US military by Colonel John Boyd to improve decision-making: making the best decision possible in a short space of time. Applications have ranged from the development of US military hardware to the control of ongoing military conflicts and air-to-air combat for fighter pilots. From a military perspective, the OODA Loop is designed to let you out-think your opponent by iterating quick decisions.

Implementation of the OODA Loop

The OODA loop is designed to be used in an iterative and highly repeatable fashion with each pass of the loop feeding back into the process.

Observe

In the Observe step the Person, Team or Organisation identifies the issue they are having. This could be something like an outage on a Production system, or a developer having problems with a release in System Test that just won't work.

Once the problem is understood, gather the information and data that is ready to hand. This data is fresh only for the current pass of the loop; when the loop restarts, the data should be refreshed and understood again. A great way to do this is via test automation and a decent monitoring solution.

Orient

Orient is often one of the hardest parts to describe and implement, but in short it means investigating what we have found and then considering what to do about it. The Person, Team or Organisation should have a solid understanding of the data and information presented to them in the Observe phase, and of what it means. Having a concise, summarised grasp of that data allows the Orient phase to make better recommendations on what to do in the next phase. I strongly recommend the use of Occam's Razor, which in its simplest form is "the simplest explanation is most likely to be the right one".

Decide

Decide is the decision phase of OODA. It's where the Person, Team or Organisation reviews the alternatives, considers what the outcomes of those choices will be, and commits to a plan. It is worth noting that the intent is to keep the loop as short and quick as possible, so that another pass can follow straight away. As such, the choices made should reflect small iterative improvements rather than wholesale changes. Remember, the OODA loop is not designed to help you change the world, just to fix one small part of it.

Act

Acting is the part of the OODA loop that actually does the doing: it takes all the observation, orientation and decision-making done previously and implements it. This is where the Person, Team or Organisation does initial testing and then implements a change to affect the scenario they are dealing with. An important thing to think about here is how you will know when your change took effect and what it did. Consider what testing needs to be included and what extra information you can garner at this point. Remember that at the end of Act the next phase is Observe, where the outcome of your action is added to the information and data, which now needs refreshing.
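To make the shape of the loop concrete, here is a minimal Python sketch of the four phases feeding into each other. It is only an illustration: the run_ooda function, its parameters and the toy fault below are hypothetical names made up for this example, not part of any library or of the OODA model itself.

```python
from typing import Callable, Optional


def run_ooda(
    observe: Callable[[list], dict],
    orient: Callable[[dict], Optional[str]],
    decide: Callable[[str], str],
    act: Callable[[str], str],
    max_loops: int = 10,
) -> list:
    """Repeat Observe -> Orient -> Decide -> Act until orient() has nothing left to explain."""
    history = []
    for loop_number in range(1, max_loops + 1):
        data = observe(history)      # fresh data, including earlier outcomes
        hypothesis = orient(data)    # simplest explanation that fits the data
        if hypothesis is None:       # nothing left to explain, so stop looping
            break
        action = decide(hypothesis)  # one small step, not a wholesale change
        outcome = act(action)        # carry it out and capture the result
        history.append({
            "loop": loop_number,
            "hypothesis": hypothesis,
            "action": action,
            "outcome": outcome,
        })
    return history


# Toy usage (hypothetical): two candidate causes, checked one per loop.
causes = ["network", "service"]


def observe(history):
    return {"ruled_out": [entry["hypothesis"] for entry in history]}


def orient(data):
    remaining = [c for c in causes if c not in data["ruled_out"]]
    return remaining[0] if remaining else None


def decide(hypothesis):
    return f"check {hypothesis}"


def act(action):
    return f"{action}: ok"


history = run_ooda(observe, orient, decide, act)
```

Each pass hands its outcome back to observe through history, which is the feedback described above, and max_loops is just a safety net so that a loop which never converges still stops.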

Examples and implementations of the OODA Loop

A recent scenario I came across was a failed development application: the tool was an Apache front end which talked to a back-end SQL Server.

  • Loop 1 - Observe - Application is hanging and timing out; users are unable to access the UI and APIs are failing.
  • Loop 1 - Orient - What is causing the application to hang? Is it the network to the server itself? Is the port open? Is the service running?
  • Loop 1 - Decide - Run Test-NetConnection against the application box and port to check it is open.
  • Loop 1 - Act - The test returns an open status, so we know the application box is accessible and the port is open.
  • Loop 2 - Observe - Application is hanging and timing out; users are unable to access the UI and APIs are failing. The network is open and the port works.
  • Loop 2 - Orient - What is causing the application to hang? Is it the application on the server itself? We know the port is open, so the server must at least be running.
  • Loop 2 - Decide - Log on to the box and check the application service is running; if not, start it.
  • Loop 2 - Act - Logging on to the box revealed the service wasn't started, and starting it returned a SQL error.
  • Loop 3 - Observe - Application is hanging and timing out; users are unable to access the UI and APIs are failing. The network is open and the port works, but the service won't start and returns a SQL error.
  • Loop 3 - Orient - What in the SQL layer is stopping the service from starting? We know the port is open, so the server must at least be running, and the application server does attempt to start the application.
  • Loop 3 - Decide - Log into SQL Server Management Studio, check the logs and health-check the instance.
  • Loop 3 - Act - Logging into SQL Server revealed that the log directory for the database was full.

At this point the loop continued for a few more cycles to work out why the log directory was full, but you can see that each loop moved the process closer to resolution and to the root cause.
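Two of the checks in this example are easy to script so they can be rerun on every pass. The sketch below is a Python stand-in for the port test from Loop 1 and for the "is the disk full?" question from Loop 3. It uses only the standard library; the function names are my own, and any host, port or path you pass in is a placeholder for your own environment.

```python
import shutil
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


def percent_used(path: str) -> float:
    """Return how full the filesystem holding path is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# Example calls (placeholder values, not from the real incident):
#   port_is_open("app-box.example", 443)
#   percent_used("/path/to/sql/logs")
```

Note that an open port only proves something is listening. As Loop 2 showed, the application service behind it can still be down, which is why each result goes back into Observe rather than being treated as the answer.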

I have been using the OODA Loop and Occam's Razor for a few years now to help me problem-solve, both in development and in the resolution of faults. It's helped me up my game and increase how quickly I resolve things. There are a few things worth taking into consideration that need to be done or understood for better results:

  • Autonomy - for the OODA Loop to work well, the Person, Team or Organisation needs to be allowed to act with as few barriers in the way as possible. The more hurdles the team has to jump over to Act, the more time passes between loops and the greater the chance that the Act phase is based on out-of-date Observe data.
  • Keep it simple and linear. Don't try to do too much in each Act phase, or you will end up with concurrent loops which may conflict or trip each other up.
  • The better the data in Observe, the better the outcome in Act. With better monitoring in your business, the data provided in the first few loops will lead to better Orient and Decide steps, and in turn to more beneficial Act steps.
  • Repeatability is key. If you have a pre-written script which provides a lot of data and analysis, this could be your first Act, saving you the many loops it would otherwise take to gather that data. At the end of the problem it is worth scripting some of the tests and checks you did in Act to speed up the process next time round.
  • OODA should lead to self-healing. The process path you took is important because it can be automated: the actions you take in Act can be scripted and turned into tests, allowing for automatic self-healing so that, should the problem occur again, the same path can be followed without you.
  • OODA writes its own documentation if you document each step quickly. I now use Markdown to write my OODA process out if something's a bit complex.
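As an illustration of the last point, here is a small Python sketch that turns the notes from each loop into a Markdown log. The ooda_markdown function and the dictionary keys are hypothetical, chosen for this example; the sample text is taken from Loop 1 of the scenario above.

```python
def ooda_markdown(title: str, loops: list) -> str:
    """Render a list of loop records as a Markdown incident log."""
    lines = [f"# {title}", ""]
    for number, loop in enumerate(loops, start=1):
        lines.append(f"## Loop {number}")
        # One bullet per phase, always in OODA order.
        for phase in ("observe", "orient", "decide", "act"):
            lines.append(f"- **{phase.capitalize()}**: {loop[phase]}")
        lines.append("")
    return "\n".join(lines)


loops = [
    {
        "observe": "Application is hanging and timing out",
        "orient": "Is it the network to the server? Is the port open?",
        "decide": "Test the connection to the application box and port",
        "act": "Port is open, so the box is reachable",
    },
]
log = ooda_markdown("Dev application outage", loops)
print(log)
```

Appending one record per pass as you go keeps the cost of documenting low, and the finished log doubles as a checklist of the checks worth scripting for next time.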


The OODA Loop at its simplest level can be an effective tool for any line of work that has potentially highly stressful, impactful scenarios, and with more study and complexity it can alter the way entire organisations approach problems. Give it a go next time you have a problem you want to resolve; you'll be amazed at the results.

Kurt

PS - Great link for easy to use Loop Diagrams

https://online.visual-paradigm.com/diagrams/templates/ooda-loop/ooda-model/



