Modern IT Ops, Incident Mgmt Workflow based on complexity requires focus on only 3 disciplines
Mario Schlangenotto
IT Executive, CIO-Level Leader and astute business driver, delivering exceptional customer experience
What if issues were fixed automatically without any negative impact on the IT consumer? What if IT Operations could solve issues quickly, without an often overly lengthy Tier Support Model? I would call that fantastic Modern IT Operations!
To make this dream come true, a combination of different approaches may be necessary. The following describes a high-level concept based on Shift-Left and Swarming. It uses Cynefin to classify different categories of work based on the level of complexity.
In short, what are Shift-Left, Swarming and Cynefin about?
Shift-Left: Basically, the aim is to ensure that the work that is currently being done by experts can be done by less skilled people or even through automation. So you “shift the work” from experts to non-experts and automation. The great advantage is that the experts can focus on what matters (improving the usage experience and product quality), while the issue of the IT consumer can be avoided or solved automatically without negative impact.
Swarming: Defines an approach in which one person picks up a topic, takes end-to-end responsibility, and leads it to a solution, pulling in technical experts directly where necessary. It's a collaborative approach that encourages co-solutioning and requires a disciplined way of working, possibly with an agile mindset. As a result, it also helps to break down IT silos. In our context, the advantage of this approach is that the incident is resolved as quickly as possible.
Cynefin: Describes 5 different levels of complexity called domains (Obvious, Complicated, Complex, Chaotic and Disorder) and explains how to approach them. It is a conceptual framework that was originally designed to aid decision making.
All this theory is fine, but let’s combine it to optimize the Incident Management Workflow with the aim of avoiding or resolving issues quickly and reducing the Service Support workload.
Have you noticed that it only takes 3 disciplines to achieve a Modern IT Operations Incident Management Workflow? Automation, Knowledge and Swarming. Shift-Left is to be considered a general core principle. (Please note that I am not encouraging you to implement all 3 disciplines at once. Focus on the most urgent one first, according to your context and needs.)
Speaking of automation, there are two different areas that we need to consider. The first is proactive avoidance: look for ways to prevent issues from having a negative impact in the first place. This often requires AI and machine learning capabilities that can search for patterns in detected events and trigger an automatic correction. But hey, start small: automatically deploying operating system patches, for example, also avoids negative effects. The second area is called self-healing. This means that when an error occurs, it is automatically detected and fixed (cleaning up profiles, freeing disk space, CPU or RAM, … start small).
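To make the self-healing idea a bit more concrete, here is a minimal sketch in Python. It assumes that a self-healing rule is simply a detection check paired with an automatic fix. The names `HealingRule`, `run_self_healing`, `disk_usage_ratio` and `clean_temp_files`, as well as the 90 % threshold, are illustrative assumptions and not taken from any specific tool; only `shutil.disk_usage` comes from Python's standard library.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HealingRule:
    """One self-healing rule: a detection check plus an automatic fix."""
    name: str
    is_unhealthy: Callable[[], bool]  # detection step
    remediate: Callable[[], None]     # automatic correction step


def disk_usage_ratio(path: str = "/") -> float:
    """Return the used fraction of the disk that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def run_self_healing(rules: List[HealingRule]) -> List[str]:
    """Run every rule once and return the names of the rules that fired."""
    healed = []
    for rule in rules:
        if rule.is_unhealthy():
            rule.remediate()
            healed.append(rule.name)
    return healed


def clean_temp_files() -> None:
    # Placeholder (assumption): a real fix would delete caches, old profiles, etc.
    print("cleaning temporary files (placeholder)")


# Hypothetical rule: free disk space once more than 90 % of the disk is used.
disk_rule = HealingRule(
    name="disk-almost-full",
    is_unhealthy=lambda: disk_usage_ratio("/") > 0.90,
    remediate=clean_temp_files,
)
```

Calling `run_self_healing([disk_rule])` on a schedule would be the "start small" version; pattern detection based on machine learning could later replace the simple threshold check.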
Speaking of Knowledge, this should already be one of the core disciplines of IT. However, knowledge must be made available to various stakeholders (Self-Help articles for IT consumers and KBAs for the Service Desk) in a language that they can understand and act on. The most difficult part is likely improving the findability of relevant knowledge articles. Remember, knowledge only helps if the right article, one that describes in a simple and actionable manner how to resolve the incident, is found quickly.
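As a rough illustration of what "findability" means in practice, the following Python sketch ranks knowledge articles by how many words they share with the incident description. This is a deliberately naive assumption for illustration only; real knowledge bases typically rely on a proper search engine. The function names `tokenize` and `rank_articles` and the sample articles are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_articles(query: str, articles: Dict[str, str], limit: int = 3) -> List[str]:
    """Return up to `limit` article titles, best keyword overlap first."""
    query_terms = tokenize(query)
    scored = []
    for title, body in articles.items():
        overlap = len(query_terms & tokenize(title + " " + body))
        if overlap:  # skip articles that share no word with the query
            scored.append((overlap, title))
    # Highest overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored[:limit]]


# Hypothetical knowledge base: title -> article text.
articles = {
    "Reset your VPN password": "Steps to reset the VPN password in the self-service portal",
    "Free up disk space": "Delete temporary files to free disk space",
}
print(rank_articles("vpn password expired", articles))
```

Even a simple ranking like this shows why article titles and wording matter: an article written in the IT consumer's own vocabulary is far more likely to be found.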
Speaking of Swarming, the biggest problem is scaling. It sounds great that one person takes up the incident and ends up being responsible for fixing it. However, if you have too many incidents at the same time, you will not be able to resolve them quickly with limited resources. One trick is to assign the work based on the level of complexity. Incidents where the cause-and-effect relationship is clear and well documented should be shifted to self-help or resolved directly by the Service Desk (obvious work). All other incidents are recorded in a backlog. Swarms can take up work directly from this backlog. There are many different types of swarms described in various articles on the internet, but I will only focus on the following three.
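The triage rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Domain` enum mirrors the Cynefin domains named earlier, while `Incident`, `triage`, the `has_self_help_article` flag and the returned labels are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Domain(Enum):
    OBVIOUS = "obvious"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"
    DISORDER = "disorder"


@dataclass
class Incident:
    summary: str
    domain: Domain
    has_self_help_article: bool = False  # assumption: known from the knowledge base


def triage(incident: Incident, backlog: List[Incident]) -> str:
    """Route obvious work away from the swarms; record everything else in the backlog."""
    if incident.domain is Domain.OBVIOUS:
        # Clear, well-documented cause and effect: shift left.
        return "self-help" if incident.has_self_help_article else "service-desk"
    backlog.append(incident)
    return "backlog"


backlog: List[Incident] = []
print(triage(Incident("Password expired", Domain.OBVIOUS, True), backlog))
print(triage(Incident("Intermittent VPN drops", Domain.COMPLEX), backlog))
```

The point of the sketch is the split itself: only incidents that are not obvious ever reach the backlog, which keeps the swarms focused on work that actually needs them.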
A Backlog Swarm is basically a group of experienced, knowledgeable people who each take on different incidents and try to resolve them as quickly as possible. Sometimes a swarm member needs help from a technical expert, whom they can pull in directly (make sure that technical experts have enough time). While a Backlog Swarm focuses on complicated issues, a Dispatch Swarm takes complex incidents. Dispatch Swarms meet frequently to review work that has not yet been completed. A Swarm Leader is appointed and has access to several cross-functional technical experts. The last type of Swarm I want to briefly describe is a Drop-in Swarm. This Swarm reviews the backlog frequently and “drops in” if incidents with a high complexity are discovered or when requested by the Product Owner (e.g. in case of Major Outages). A Swarm Leader is appointed, usually a Support Analyst who has access to Sub-Swarms of domain technical experts.
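The mapping between complexity and swarm type can be summarised as a small Python sketch. The function name `pick_swarm`, its parameters and the returned labels are hypothetical. Treating "chaotic" incidents as the high-complexity case for the Drop-in Swarm is my reading of the description above and should be taken as an assumption, not a rule.

```python
from typing import Optional


def pick_swarm(domain: str, major_outage: bool = False) -> Optional[str]:
    """Suggest a swarm type for a backlog item based on its complexity domain."""
    if major_outage or domain == "chaotic":
        # High complexity, or requested by the Product Owner (e.g. Major Outage).
        return "drop-in-swarm"
    if domain == "complex":
        return "dispatch-swarm"
    if domain == "complicated":
        return "backlog-swarm"
    # Obvious or not yet classified work: no swarm is suggested here.
    return None


for domain in ("complicated", "complex", "chaotic", "obvious"):
    print(domain, "->", pick_swarm(domain))
```

Keeping the decision in one small function makes it easy to adjust the mapping once you have gained experience with your first swarm.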
A Swarm always strives to “shift work left”. Once the Swarm has resolved the issue, it needs to think about how to prevent the same issue from happening again and which additional knowledge is required. Of course, swarming cannot replace the Tier Support Model overnight. But you can start small, e.g. with a Backlog Swarm, and scale it up and/or introduce different swarm types once you can demonstrate success (so don’t forget to think about how you measure and define success).
One final, independent note: you should focus on reducing waste regularly. Proper Automation, Knowledge Management and Swarming first require getting rid of complicated processes, IT silo thinking, and outdated or poorly written knowledge articles.
Head of End User Experience
3 years ago: Interesting read... Thank you!
Customer Experience | Relationship Management | Business Development | Change Management
3 years ago: Really interesting read. It'd be interesting to have a view on how resources / budget should be allocated between avoiding incidents and handling those that have occurred.