Applying Flow Metrics to ITSM

When I recently published my article about my Monte Carlo Simulation demo app in ServiceNow, it felt like crossing a finish line. Long-term goal achieved, book closed, done. I knew the article was aimed at a very narrow audience, so I was not too discouraged when many reactions amounted to "Well... congrats, whatever that is."

I've got the feeling that the name of the method alone is a distraction, as it sounds too much like gambling. Nevertheless, I won't stop talking about it, and so I was lucky enough to meet someone who had no choice but to listen to my enthusiastic sermon during a long car drive. It didn't take long until we were caught up in a deep discussion about this method, and how to apply it to fields other than software development.

Flow Metrics allow us to look at our work from at least three different angles.

  • Past: How fast did we get new incoming work done (->Cycle Time), and what can we learn from it to derive realistic expectations about our performance in the future? That's where we need the Cycle Time scatterplot diagram.
  • Present: How old is the work on our plate right now, and where is our focus needed to keep it flowing smoothly? Speaking in charts, that's a scatterplot with percentile lines again.
  • Future: Based on historic data, we can predict future work volumes and throughputs like a weather forecast. For that, we've got the Monte Carlo method.

If you think about typical challenges in ITSM, the topics above might sound very familiar to you. What if... we could predict the number of new Incidents coming in over the next few weeks? Or what if we simply applied the term "throughput" to tickets (like Incidents) instead of Stories, which would allow us to draw conclusions about the ticket resolution rate of a Service Desk team? Can't we take the Service Level Expectation and use it to design realistic, data-driven SLAs which both sides can truly agree on?

After all these discussions and idea exchanges, I finally understood that I might have passed a milestone, but I was far from closing that book. It was just the first chapter, with much more to come. Of course it's not the answer to the universe and everything (that's still 42), but if applied correctly, flow metrics are a mighty and very flexible tool where only your own imagination is the limit.

Just a few changes here and there...

Of course my first thoughts were about how to adjust the app I had built before. So I rebuilt it, spent a night on it, and squeezed some new code snippets out of ChatGPT, because I had found a little flaw in my previous logic (I had forgotten the days with zero throughput). In the end I had something which was much more flexible than before:

Predicted volume of Incidents within 7 days

It had additional input parameters for the table, an encoded query, and the date field.

Result: I can still calculate what I had in the first app, but now I can apply the same logic to any other table, with any historic data filter I want, and for a date field of my choice. In the example above, I filtered for a month in the past (several thousand dummy Incidents, hundreds per day), and I wanted it to look at the Opened date field.

Result: I can predict (with the usual confidence level bar) how many Incidents might come in throughout the next 7 days. And if I switch opened_at with closed_at... you get it.
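To make the idea a bit more tangible, here is a minimal Python sketch of such a forecast. It is not the code of the ServiceNow app; the function names, the sample dates, and the nearest-rank percentile are assumptions chosen for illustration. It expects the values of the chosen date field already truncated to calendar days, counts records per day (keeping the days with zero records), and then resamples those daily counts to simulate the next 7 days.

```python
import math
import random
from collections import Counter
from datetime import date, timedelta


def daily_counts(event_dates, start, end):
    """Count events per calendar day from start to end (inclusive).

    Days without any event are kept as 0, so quiet days are part of the sample.
    """
    counts = Counter(event_dates)
    n_days = (end - start).days + 1
    return [counts[start + timedelta(days=i)] for i in range(n_days)]


def simulate_totals(history, horizon_days=7, runs=10_000, seed=None):
    """Monte Carlo: for each run, draw one random historic day per future day
    and add the counts up. Returns the simulated totals in ascending order."""
    rng = random.Random(seed)
    return sorted(
        sum(rng.choice(history) for _ in range(horizon_days))
        for _ in range(runs)
    )


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct as an integer 1..100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Made-up sample: one entry per record, taken from the chosen date field.
opened = [date(2024, 1, 1)] * 3 + [date(2024, 1, 2)] * 5 + [date(2024, 1, 4)] * 2
history = daily_counts(opened, date(2024, 1, 1), date(2024, 1, 4))  # [3, 5, 0, 2]

totals = simulate_totals(history, horizon_days=7, runs=5_000, seed=42)
for pct in (50, 85, 95):
    print(f"{pct}% of runs: at most {percentile(totals, pct)} records in 7 days")
```

For incoming volume, "at most N" is the useful reading. For a throughput forecast based on the closed date, you would rather read the distribution from the other end ("at least N").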

Incident Throughput

Since we're looking at the throughput (to put it simply, the number of tickets solved per day), we can definitely use that point of view to improve our performance. Instead of only looking at the next incoming tickets and their proximity to the SLA breach point, we can simply observe how much we get done in a day.

I can tell you this first-hand: Back when I was part of a 2nd level support team, we got a dashboard on the wall, and the most interesting widget was a little gauge showing today's incoming vs. resolved tickets. It worked a bit like gamification, and we did whatever we could to move the needle towards the "resolved" area. I can proudly say that our team performance was amazing.

Cycle Times

The discussion about ticket SLA metrics is really old. Should we measure reaction times? Resolution times? With or without suspensions? Only time within business hours? At the end of the day, the end user doesn't care about all your process flaws and blockers in between, or about all the excuses why you couldn't deliver faster. What really counts is the cycle time, i.e. the time from creation to closure.

In "Flow Metrics for Scrum Teams", we learned how to read and use a Cycle Time scatterplot for backlog items. But who said it only makes sense for Stories? We can use it in exactly the same way for Incidents: If you want a truly realistic SLA definition based on real data, just generate a scatterplot diagram showing the cycle time of closed Incidents per day, and add some percentile lines. Now you can tell which percentage of Incidents got solved within x days. Provided that you applied a sensible date filter (not too far back in the past, but still enough time to get a good number of rows), it allows you to make assumptions about the future team performance.

This time, I used Power BI to generate a basic example (ignore the x axis, the weekday grouping was something random for the demo):

Incident Cycle Time scatterplot

Assume the diagram is based on real Incident data from the last month for one team. We can see that 95% of all Incidents got resolved within 40.6 days, and if we go by the magic 85th percentile, we still end up at 14.2 days.
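The percentile lines in such a chart are easy to reproduce outside of a BI tool. The following Python sketch computes cycle times from creation and closure timestamps and reads off percentiles with the nearest-rank method. The sample tickets are invented and are not the data behind the chart above, and a BI tool may interpolate percentiles slightly differently.

```python
import math
from datetime import datetime


def cycle_time_days(opened_at, closed_at):
    """Elapsed time from creation to closure, in days."""
    return (closed_at - opened_at).total_seconds() / 86400


def percentile(values, pct):
    """Nearest-rank percentile (pct as an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample of closed Incidents: (opened_at, closed_at)
tickets = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 15, 0)),   # 0.25 days
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 3, 9, 0)),    # 2 days
    (datetime(2024, 3, 2, 9, 0), datetime(2024, 3, 7, 9, 0)),    # 5 days
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 18, 9, 0)),   # 14 days
    (datetime(2024, 3, 5, 9, 0), datetime(2024, 4, 14, 9, 0)),   # 40 days
]
times = [cycle_time_days(opened, closed) for opened, closed in tickets]

for pct in (50, 85, 95):
    print(f"{pct}% of Incidents were closed within {percentile(times, pct):.1f} days")
```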

Service Level Expectation

What I strongly dislike about SLAs is that they are far too often based on gut feeling instead of data. It seems like many people forget that the "A" in "SLA" stands for "Agreement". In reality, neither the customer nor the delivery team has actually agreed to it. What happens? Customers get wrong expectations which are sometimes far too high, and get disappointed after a short while. Agents and Service Owners come under pressure because the SLAs could not be kept as intended, meaning penalties, contract renegotiations and, in the worst case, cancellations.

If you now use the Cycle Time metrics from above (given that you're lucky enough to have such historic data which are comparable to what you need), you can easily derive a Service Level Expectation telling what the team can deliver with a confidence of XYZ.

So if a very ambitious sales guy promised the customer that the team above could solve their tickets within 5 days, you could tell that guy that he had better join the team, because only around 70% of the tickets got resolved in such a short time (or earlier), and there's a high chance of disappointing the customer with up to 30% of the ticket volume.
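This check is the percentile lookup turned around: instead of asking "how many days for 85%?", you ask "what share for 5 days?". A small Python sketch, with a hypothetical helper name and made-up cycle times that were picked to land at 70%:

```python
def share_within(cycle_times_days, target_days):
    """Fraction of closed tickets whose cycle time was at most target_days."""
    if not cycle_times_days:
        raise ValueError("need at least one closed ticket")
    hits = sum(1 for t in cycle_times_days if t <= target_days)
    return hits / len(cycle_times_days)


# Made-up cycle times in days for ten closed Incidents
cycle_times = [0.5, 1, 2, 3, 4, 4.5, 5, 9, 14, 40]

share = share_within(cycle_times, target_days=5)
print(f"{share:.0%} of tickets were resolved within 5 days")
```

Whatever the promised target is, the complement of that share is the part of the ticket volume that is likely to miss it.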

Maybe I should point out that the SLE is NOT the SLA, and it's not a replacement at all. But you can definitely use it to design the agreement in a way that both sides remain happy for a longer time.

Aging Work

If you've been working in ITSM for a while, you know the phenomenon of aging tickets. It happens with any kind of work item, and of course it applies to tickets like Incidents as well. The problem with a purely SLA-driven management approach is that tickets which have already escalated quickly move out of sight because they can't be un-escalated. Breach is breach. So they stay in the queue, the customer opens a new ticket because the old one never got solved, and more and more of them are forgotten. One day, someone looks closer at the queue and asks you why you've got hundreds of tickets which are more than a year old. Sounds familiar, right?

Now let's assume we've got another scatterplot diagram with tickets that are not solved yet. The x axis goes by ticket state (and/or priority, if you want), and the y axis shows the current age. Add some percentile lines, and you've got the tickets you should absolutely look at right now. For example, the tickets above the 90% line could be the ones you want to get rid of next, and you might even put some people together in a task force to get them closed. Try to get the extreme outliers removed, and to get the dots "flattened" a bit. Of course you can get the same result by simply ordering the whole list by oldest first, but this way, you can see much better how far the oldest tickets are from the norm. You can also compare it with the historic cycle time (see above) to understand the impact of any countermeasure you apply.
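As a rough sketch of that idea in Python: compute the current age of every open ticket, take a percentile of those ages as the line, and list whatever sits above it. The ticket numbers, ages, and function names below are invented, and grouping by state or priority is left out to keep it short.

```python
import math
from datetime import datetime, timedelta


def ages_in_days(open_tickets, now):
    """Map ticket number -> current age in days."""
    return {
        number: (now - opened_at).total_seconds() / 86400
        for number, opened_at in open_tickets.items()
    }


def percentile(values, pct):
    """Nearest-rank percentile (pct as an integer 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tickets_above(ages, pct):
    """Return the percentile line and the tickets above it, oldest first."""
    line = percentile(ages.values(), pct)
    above = [number for number, age in ages.items() if age > line]
    return line, sorted(above, key=ages.get, reverse=True)


# Invented queue: ten open tickets with these ages in days
now = datetime(2024, 6, 1)
sample_ages = [1, 2, 3, 4, 5, 6, 8, 10, 30, 400]
open_tickets = {
    f"INC{i:03d}": now - timedelta(days=d)
    for i, d in enumerate(sample_ages, start=1)
}

ages = ages_in_days(open_tickets, now)
line, outliers = tickets_above(ages, 90)
print(f"90% line at {line:.0f} days, look at these first: {outliers}")
```

Comparing the line with the historic cycle time percentiles shows whether the open queue is already older than what the team normally delivers.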

Your turn

Now you can take these thoughts further, and derive actions based on the individual challenges you are working on. I'm sure this article gives you some inspiration to tackle them from a different angle than before, and maybe it helps you to turn your workflows into something predictable, reliable, and scalable.

If you need any assistance or advice to take your Service Management environment to the next level, or just some fresh new answers to your old questions, feel free to reach out to me (or anyone else at Flow-IT).

Stefan Reichelt

Senior IT Consultant ServiceNow at Flow-IT

I'm curious: In addition to the ideas I gave in the article, how else do YOU think an ITSM/ESM professional could utilize the principles of Flow Metrics and Kanban?
