How did we reduce the number of production incidents by 60%?
To sum it up in two words: Communication & Visibility.
If you are interested in all the juicy technical details of how we built our Change Notification and Capacity Management systems, keep reading.
Three years ago, we reached a point where we were always surprised by production incidents.
Further analysis showed that the major factors behind these incidents were a lack of internal communication when executing production changes and poor allocation of operational resources.
The solution we came up with, which reduced production incidents by 60%, was to build automated Change Notification and Capacity Management systems at Cyren, allowing us to communicate production changes better and to gain visibility into our resource consumption.
For those of you who would like to build such a system at your company, I will go over each step in detail, adding code snippets and screenshot examples that should save you many hours of googling.
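Before we start, a note on the systems we use at Cyren, all of which appear in the steps below: JIRA for tickets, Statuspage for the internal status page, Slack for chat, and Grafana for dashboards.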
First problem: lack of internal communication about production changes
At that point, each group in the company published production change notifications as it saw fit, according to its own group standard. That was not effective at all.
The first goal was to create a company standard and an automatic notification system for internal communication of production changes.
It was accomplished in three main steps:
1. Create an internal status page that sends change notifications to a dedicated Slack channel
2. Create a dedicated JIRA ticket type for production changes, to document each change and send targeted email notifications
3. Hook up JIRA and Statuspage to automate the updating of the internal status page
So first, I created an internal status page, which posts relevant messages to a dedicated Slack channel whenever there is a new production change. At this stage, the notifications were set up manually in the internal Cyren Statuspage portal; the following steps automated this manual work.
The challenge here was to change people’s habits and get them to use the dedicated Slack channel, which now served as the source of truth for production change notifications. It took a while, but eventually people saw the benefits of having one channel that covers only production change notifications.
This step helped to increase both communication and visibility of production changes.
The second step was to create a system that sends email notifications with the production change details only to the relevant technical teams.
I kept two principles in mind while performing this step: 1. Make the whole process easy for Cyren employees to use, since I understood that an overly complex process would not succeed, and 2. Make sure each group gets only the notifications about changes to the products it is interested in, reducing unneeded noise.
To document the change, I enhanced JIRA with new dedicated Change Notification ticket types and sub-types, which I added to all the JIRA projects our various R&D groups were using. I then introduced a new process to the R&D and Operations groups, which required them to fill in the details of the change in the Change Notification ticket. I rolled the process out gradually using a beta testing group, whose feedback helped make the process effective and easy to use.
These Change Notification tickets had a dedicated structure designed for change management, documenting all steps and relevant details of the change and prompting everyone to plan changes in one standard way across the whole company. The feedback on this change was very positive.
To automate the sending of targeted emails about change notifications, I used JIRA's ability to run validation scripts when a ticket moves between states, together with post-scripts that run dedicated logic after a change in the ticket status.
The new ticket workflow was very simple, with four basic states.
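To make the validation and targeting logic more concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not the actual JIRA scripts we used: the field names, product names and mailing lists below are all hypothetical.

```python
# Hypothetical required fields of a Change Notification ticket.
REQUIRED_FIELDS = ("summary", "product", "start_time", "end_time", "rollback_plan")

# Hypothetical mapping from product to the mailing lists interested in it.
PRODUCT_MAILING_LISTS = {
    "email-security": ["email-security-dev@example.com", "ops@example.com"],
    "web-security": ["web-security-dev@example.com", "ops@example.com"],
}


def missing_fields(ticket: dict) -> list[str]:
    """Validation step: list the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not ticket.get(name)]


def recipients_for(ticket: dict) -> list[str]:
    """Targeting step: return only the lists subscribed to the ticket's product."""
    return PRODUCT_MAILING_LISTS.get(ticket.get("product"), [])
```

A state transition would be blocked while `missing_fields` returns anything, and the post-script would mail only the addresses returned by `recipients_for`, which is what keeps the noise down for everyone else.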
With a standard, JIRA-based change notification process in place, the third step was to automate the internal notifications.
The JIRA post-script ability made it possible to incorporate an API call to the Statuspage API, which updates the internal status page automatically once the change details are documented in the JIRA ticket.
We used different API calls for the different scenarios in the workflow. For example, as soon as a change notification ticket reaches the PUBLISHED state, the JIRA post-script sends a ‘Create’ API request via the Statuspage API with all the details in the JIRA ticket. If the focal point decides to postpone the execution of the change for some reason, moving the ticket from the CHANGE EXECUTION state back to the PUBLISHED state generates an ‘Update’ API call to Statuspage to update the start time and end time of the execution.
At the end of this step we had a system with a single, company-wide standard for reporting production changes, which communicates each change automatically to the relevant teams by email and to a dedicated Slack channel.
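The ‘Create’ and ‘Update’ calls can be sketched roughly as follows. This is a Python illustration under assumptions, not our actual post-script: the ticket field names are hypothetical, and the endpoint and payload follow my understanding of the public Statuspage REST API for scheduled incidents, so check them against the Statuspage documentation before relying on them. The function only builds the request; it does not send it.

```python
import json
import urllib.request

API_BASE = "https://api.statuspage.io/v1"


def build_statuspage_request(ticket, page_id, api_key, incident_id=None):
    """Build a 'Create' request, or an 'Update' request when incident_id is given."""
    payload = {
        "incident": {
            "name": ticket["summary"],             # hypothetical ticket field
            "body": ticket["description"],         # hypothetical ticket field
            "status": "scheduled",
            "scheduled_for": ticket["start_time"],  # ISO 8601 start of the change
            "scheduled_until": ticket["end_time"],  # ISO 8601 end of the change
        }
    }
    url = f"{API_BASE}/pages/{page_id}/incidents"
    method = "POST"  # 'Create': ticket reached PUBLISHED
    if incident_id is not None:
        url += f"/{incident_id}"
        method = "PATCH"  # 'Update': change was postponed
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
        method=method,
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping the build step separate makes the payload easy to inspect and test without touching the real status page.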
Building the Capacity Management system
To complete our solution for reducing the number of production incidents, we wanted to build a system that gives us visibility into our production environment and lets us foresee the operational resources it will require given an increase in service usage.
Grafana was a perfect candidate.
Grafana can integrate data from a wide variety of data sources, and the plan was to place the business graphs and resource graphs in one dashboard, all showing the same time interval. Grafana can also define events, called annotations, which happen at a specific time or over a time interval, and display these events on all the graphs in the dashboard simultaneously.
To accomplish this goal we performed the following steps:
1. Define the relevant business metrics per product, with their related resource metrics
2. Build the dashboards in Grafana
3. Inject annotations of production changes into the graphs
So, the first step here was to define a set of business usage metrics for each product, and to outline the resource metrics related to that product.
For business service metrics we picked the error rate, the average processing time of each request, and service usage, which is defined uniquely per service.
For resource metrics we picked memory, CPU, storage and bandwidth to start with.
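One way to write down such a definition is as a small data structure per product, from which the list of graphs for a dashboard follows directly. The Python sketch below is illustrative only; the product, usage metric and component names are hypothetical.

```python
from dataclasses import dataclass, field

# The four resource metrics we started with.
RESOURCE_METRICS = ("memory", "cpu", "storage", "bandwidth")


@dataclass
class ProductMetrics:
    product: str
    usage_metric: str  # service usage is defined uniquely per service
    components: list[str] = field(default_factory=list)

    def panel_titles(self) -> list[str]:
        """Business graphs first, then one resource graph per component and metric."""
        business = [
            f"{self.product}: error rate",
            f"{self.product}: average processing time",
            f"{self.product}: {self.usage_metric}",
        ]
        resources = [
            f"{component}: {metric}"
            for component in self.components
            for metric in RESOURCE_METRICS
        ]
        return business + resources
```

For a hypothetical product with two components, `panel_titles()` returns three business graphs followed by eight resource graphs, which is the layout we aimed for in each dashboard.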
The second step was to start building the different dashboards, placing business usage graphs together with their resource graphs for all the relevant components of a specific service.
The grand finale was to use annotations, sending production change details from JIRA to the Capacity Management graphs.
Once again I used the JIRA post-script ability defined for the Change Notification tickets, this time to send production change details automatically to Grafana via the Grafana annotation API. This syncs production change events onto the business and resource graphs, making it visible whether a production change has influenced the business or the resources.
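The annotation call can be sketched along these lines. Again, this is a Python illustration rather than our actual post-script: the Grafana URL, token, tags and ticket field names are hypothetical, and the payload follows my understanding of Grafana's HTTP annotations endpoint (`POST /api/annotations`, with times in epoch milliseconds). The function builds the request without sending it.

```python
import json
import urllib.request
from datetime import datetime


def to_epoch_ms(iso_timestamp: str) -> int:
    """Convert an ISO 8601 timestamp with a UTC offset to epoch milliseconds."""
    return int(datetime.fromisoformat(iso_timestamp).timestamp() * 1000)


def build_annotation_request(ticket, grafana_url, api_token):
    """Build a request that marks a production change as a time-range annotation."""
    payload = {
        "time": to_epoch_ms(ticket["start_time"]),   # start of the change
        "timeEnd": to_epoch_ms(ticket["end_time"]),  # end of the change
        "tags": ["production-change", ticket["product"]],  # hypothetical tags
        "text": f'{ticket["key"]}: {ticket["summary"]}',
    }
    return urllib.request.Request(
        f"{grafana_url}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Because the annotation carries tags rather than being tied to one graph, a dashboard can query annotations by tag and overlay the same change on every business and resource graph at once.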
Summary
In this post you saw the steps of building two systems that gave us better communication and visibility, dropping our production incident rate significantly.
The steps specified here took a long time to execute, fine-tune, formalize, and get used to. They required the assistance of many different groups in the company, several reviews, and much trial and error until we reached the result presented above.
These systems helped reduce our surprise rate and let us prepare well in advance for increases in service demand and for production changes, while enjoying automatic communication of production change notifications.
I hope this information will be useful to you, and please feel free to reach out to me with questions.
Dani Tweig
Program Manager Director