How did we reduce the number of production incidents by 60%?

Dani Tweig

Program/Project Manager, Incident Management Expert & Problem Solver

Published: February 19, 2023

To sum it up in two words: Communication & Visibility.

If you are interested in all the juicy technical details of how we built our Change Notification and Capacity Management systems, keep reading.

Three years ago, we reached a point where we were always surprised by production incidents.

Further analysis showed that the major factors behind these incidents were a lack of internal communication when executing production changes and poor allocation of operational resources.

The solution we came up with, which reduced production incidents by 60%, was to build automatic Change Notification and Capacity Management systems at Cyren, allowing us to communicate production changes better and to gain visibility into our resource consumption.

For those of you who would like to build such a system at your company, I will go over each step in detail, adding code snippets and screenshot examples that will assist you and save you many hours of googling.

Before we start, some technical information about the systems we use in Cyren:

Cyren has private datacenters spread around the globe
We use JIRA as our ticketing system
We chose an internal Statuspage page to publish production changes internally
We use Slack as our internal communication platform, and
We use Grafana as a graphical interface to view internal statistics and data

First problem: lack of internal communication about production changes

At that point, each group in the company published production change notifications as it saw fit, according to its own group standard. That was not effective at all.

The first goal was to create a company standard and an automatic notification system for internal communication of production changes.

It was accomplished with three main steps:

1. Create an internal status page that sends change notifications to a dedicated Slack channel

2. Create a dedicated JIRA ticket type for production changes, to document them and send targeted email notifications

3. Hook up JIRA and Statuspage to automate the updating of the internal status page

So first, I created an internal status page, which posts relevant messages to a dedicated Slack channel whenever there is a new production change. At this stage, these notifications were set up manually in the internal Cyren Statuspage portal. In the following steps I automated the manual work.

The challenge here was to change people’s habits and get them to start using the dedicated Slack channel, which now served as the source of truth for production change notifications. It took a while, but eventually people saw the benefits of having one channel that summarizes, and refers only to, production change notifications.

This step helped to increase both communication and visibility of production changes.

[Image: Statuspage updating the maintenance Slack channel]

The second step was to create a system that sends email notifications with the production change details only to the relevant technical teams.

I followed two important principles while performing this step: 1. Make sure the whole process is easy for Cyren employees to use, since I understood that if the process were too complex it would not succeed, and 2. Make sure each group gets only the notifications about changes to the products it is interested in, reducing unneeded noise.

To document the change, I enhanced JIRA with new dedicated Change Notification ticket types and sub-types, which I added to all the JIRA projects our various R&D groups were using. I then introduced a new process to the R&D and Operations groups, which required them to fill in the details of the change in the Change Notification ticket. I rolled out the new process gradually using a beta testing group, who shared their feedback to make the process effective and easy to use.

These Change Notification tickets had a dedicated structure designed for change management, documenting all steps and relevant details of the change and compelling everyone to plan changes in one standard way across the whole company. The feedback on this change was very positive.

To automate sending targeted emails about change notifications, I used JIRA's ability to run validation scripts when a ticket moves between states, and post-scripts with dedicated logic that run after a change in the ticket status.

The new ticket workflow was very simple with four basic states.
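To make the idea of validation scripts and post-scripts concrete, here is a minimal Python sketch of such a workflow outside of JIRA. It is not the actual Cyren configuration: only the PUBLISHED and CHANGE EXECUTION states are named in this post, so the DRAFT and DONE state names, the required field names, and the `ChangeTicket` class are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumption: DRAFT and DONE are hypothetical names; the post names only
# PUBLISHED and CHANGE EXECUTION.
DRAFT = "DRAFT"
PUBLISHED = "PUBLISHED"
CHANGE_EXECUTION = "CHANGE EXECUTION"
DONE = "DONE"

# Allowed transitions, including the postponement path from
# CHANGE EXECUTION back to PUBLISHED.
TRANSITIONS = {
    DRAFT: {PUBLISHED},
    PUBLISHED: {CHANGE_EXECUTION},
    CHANGE_EXECUTION: {PUBLISHED, DONE},
    DONE: set(),
}

# Assumption: hypothetical fields a ticket must carry before publishing.
REQUIRED_FIELDS = ("summary", "product", "start_time", "end_time")


@dataclass
class ChangeTicket:
    fields: dict
    state: str = DRAFT
    history: list = field(default_factory=list)


def validate(ticket, target):
    """Return the list of problems that block moving the ticket to target."""
    problems = []
    if target not in TRANSITIONS[ticket.state]:
        problems.append(f"{ticket.state} -> {target} is not allowed")
    if target == PUBLISHED:
        problems += [f"missing field: {name}" for name in REQUIRED_FIELDS
                     if not ticket.fields.get(name)]
    return problems


def transition(ticket, target, post_scripts=()):
    """Move the ticket if validation passes, then run each post-script."""
    problems = validate(ticket, target)
    if problems:
        raise ValueError("; ".join(problems))
    previous, ticket.state = ticket.state, target
    ticket.history.append((previous, target))
    for post_script in post_scripts:
        post_script(ticket, previous)
```

A post-script here is any callable that receives the ticket and its previous state, which is where sending an email or calling an external API would plug in.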

After achieving a standard, JIRA-based change notification process, the third step was to automate the internal notifications.

The JIRA post-script ability made it possible to incorporate a call to the Statuspage API, which updates the internal status page automatically once the change details are documented in the JIRA ticket.

We used different API calls for the different scenarios in the workflow. For example, as soon as a change notification ticket reaches the PUBLISHED state, the JIRA post-script sends a ‘Create’ API request via the Statuspage API with all the details in the JIRA ticket. If the focal point decides to postpone the execution of the change for some reason, moving the ticket from the CHANGE EXECUTION state back to the PUBLISHED state generates an ‘Update’ API call to Statuspage to update the start time and end time of the execution.

At the end of this step we had a system with a single, company-wide standard for reporting production changes, which communicates each change automatically to the relevant teams by email and to a dedicated Slack channel.
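As a rough illustration of the ‘Create’ and ‘Update’ calls, the Python sketch below builds (but does not send) a scheduled-maintenance request. It is not the post-script we ran inside JIRA. The endpoint paths, the `OAuth` authorization header and the `scheduled_for` / `scheduled_until` fields follow the public Statuspage REST API as I understand it, so check them against the current documentation. The `ticket` dictionary keys and the function name are assumptions made for this example.

```python
import json
import urllib.request

API_ROOT = "https://api.statuspage.io/v1"


def build_statuspage_request(ticket, page_id, api_key, incident_id=None):
    """Build a 'Create' (POST) or 'Update' (PATCH) scheduled-maintenance request.

    ticket is a plain dict with hypothetical keys; passing incident_id turns
    the request into an update of an existing scheduled maintenance.
    """
    payload = {
        "incident": {
            "name": ticket["summary"],
            "body": ticket["description"],
            "status": "scheduled",
            "scheduled_for": ticket["start_time"],    # ISO 8601 timestamp
            "scheduled_until": ticket["end_time"],    # ISO 8601 timestamp
        }
    }
    url = f"{API_ROOT}/pages/{page_id}/incidents"
    method = "POST"
    if incident_id is not None:
        url = f"{url}/{incident_id}"
        method = "PATCH"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The postponement case is then the same function called with the stored `incident_id` and the new start and end times.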

Building the Capacity Management system

To complete our solution for reducing the number of production incidents, we wanted to build a system that gives us visibility and lets us foresee the operational resources our production environment will require given an increase in service usage.

Grafana was a perfect candidate.

Grafana can integrate data from a wide variety of data sources, and the plan was to place the business graphs and resource graphs in one dashboard, looking at the same time interval. Grafana can also define events, called annotations, which happened at a specific time or over a time interval, and display these events on all the graphs in the dashboard simultaneously.

To accomplish this goal we performed the following steps:

1. Define relevant business metrics per product, along with their related resource metrics

2. Build the dashboards in Grafana

3. Inject annotations of production changes into the graphs

So, the first step here was to define a set of business usage metrics for each product, and to outline the resource metrics related to that product.

For business service metrics we picked the error rate, the average processing time of each request, and service usage, defining it uniquely per service.

For resource metrics we picked memory, CPU, storage and bandwidth to start with.
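One way to keep such definitions consistent across products is a small catalogue that lists, per product, the graphs a dashboard should contain. The Python sketch below is only an illustration: the metric identifiers, the `ProductMetrics` class, and the reading that it is "service usage" which gets a per-service definition are assumptions, not the structure we actually used.

```python
from dataclasses import dataclass

# The three business metrics and four resource metrics named above,
# written as hypothetical identifiers.
BUSINESS_METRICS = ("error_rate", "avg_processing_time", "service_usage")
RESOURCE_METRICS = ("memory", "cpu", "storage", "bandwidth")


@dataclass(frozen=True)
class ProductMetrics:
    product: str
    usage_definition: str   # assumption: how "service usage" is counted here
    components: tuple       # components whose resources are graphed

    def panels(self):
        """Return one (kind, target, metric) tuple per graph to build."""
        rows = [("business", self.product, metric)
                for metric in BUSINESS_METRICS]
        rows += [("resource", component, metric)
                 for component in self.components
                 for metric in RESOURCE_METRICS]
        return rows
```

For a product with two components, `panels()` lists three business graphs followed by eight resource graphs, which is the set that would sit side by side on one dashboard.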

The second step was to start building the different dashboards, placing business usage graphs together with their resource graphs for all the relevant components of a specific service.

The grand finale was to use annotations, sending production change details from JIRA to the Capacity Management graphs.

Again, I used the JIRA post-script ability defined for the Change Notification JIRA tickets to send production change details automatically to Grafana via the Grafana annotation API, syncing production change events onto the business and resource graphs and making it visible whether a production change has influenced the business or the resources.
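For readers who want to try this, here is a minimal Python sketch that builds (but does not send) a request to Grafana's annotation HTTP API. It is not our JIRA post-script. The `/api/annotations` path and the `time`, `timeEnd`, `tags` and `text` fields follow the Grafana HTTP API as I understand it; the `ticket` dictionary keys, the tag names, and the function names are assumptions made for this example.

```python
import json
import urllib.request
from datetime import datetime


def to_epoch_ms(iso_timestamp):
    """Convert an ISO 8601 timestamp with a UTC offset to epoch milliseconds."""
    moment = datetime.fromisoformat(iso_timestamp)
    return int(moment.timestamp() * 1000)


def build_annotation_request(ticket, grafana_url, token):
    """Build a POST request that records a production change as an annotation.

    ticket is a plain dict with hypothetical keys. Timestamps need an explicit
    offset such as +00:00, otherwise they are read as local time.
    """
    payload = {
        "time": to_epoch_ms(ticket["start_time"]),
        "timeEnd": to_epoch_ms(ticket["end_time"]),
        "tags": ["production-change", ticket["product"]],
        "text": f'{ticket["key"]}: {ticket["summary"]}',
    }
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Because the payload carries no dashboard identifier, the annotation is not tied to a single dashboard. Each dashboard can then show it through an annotation query filtered on the `production-change` tag, which is one way to get the same event onto all graphs at once.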

Summary

In this post you saw the steps of building two systems that helped us achieve better communication and visibility, significantly reducing our production incident rate.

The steps specified here took a long time to execute, fine-tune, formalize, and get used to. They required the assistance of many different groups in the company, several reviews, and trial and error until we came to the result presented above.

These systems helped reduce our surprise rate and let us prepare well in advance for increases in service demand and for production changes, while enjoying an automatic notification system for production changes.

I hope this information will be useful to you, and please feel free to reach out to me with questions.

Dani Tweig

Program Manager Director
