SRE concepts part 3 (Risk / Toil)
Marcel Koert
Innovative Platform Engineer | DevOps Engineer | Site Reliability Engineer | IT Educator | Founder of Melomar-IT
In this third article in the series about SRE concepts, I will discuss Risk and Toil.
How to Deal with Risk as an SRE?
Site Reliability Engineers explicitly account for the risks involved in a project rather than assuming them away. For instance, a developer may design an application on the assumption that it will never have downtime. A Site Reliability Engineer, on the other hand, accepts that some downtime will occur and measures it using error budgets.
Dealing with risk is one of the main responsibilities of a Site Reliability Engineer, and the way they deal with it is noteworthy. In the development phase, risk is usually not measured.
However, in the operations phase, we need to know the exact impact of a risk. Not knowing the effect of a failure can lead to disastrous situations.
So, how do you deal with risks? We've established that we need to quantify risks to understand their nature and act on them. As a Site Reliability Engineer, you may have come across two terms: SLI and SLO. SLI stands for Service Level Indicator, and SLO stands for Service Level Objective.
Service Level Objectives (SLO)
Availability is one of the most important measures of success for a Site Reliability Engineer. If a product is not available when it is needed, that counts as a failure. In SRE terms, availability is the ability of a product to perform its intended function. Risk can be described in similar terms.
In SRE, risk can be defined as the impact of an action on the product's availability. If your product is not available after you add a new feature, that's an availability incident. In other words, risk indicates how likely an action is to cause an incident.
Now that we've established the definitions, let us see what SLOs are. Since risk is often abstract, we need numerical targets to measure it against. Usually, organizations set a precise numerical target for their product's availability.
The target may vary from product to product and from organization to organization. This target, set by the engineers and developers, is called a Service Level Objective, or SLO for short.
The SLO is the first risk indicator. When you want to assess the risk associated with an action, you can check whether the product is still expected to meet this numerical target afterwards. If it would miss the target, the risk is high. If the target would still be met, you may proceed with the action.
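To make the check above concrete, here is a minimal sketch of how an availability SLO translates into an error budget, and how a planned action can be tested against what is left of it. The function names, the 99.9% target, and the 30-day window are illustrative assumptions, not values from any particular organization.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def action_fits_budget(
    slo_target: float,
    window: timedelta,
    downtime_so_far: timedelta,
    expected_downtime: timedelta,
) -> bool:
    """True if the action's expected downtime fits in the remaining budget."""
    remaining = error_budget(slo_target, window) - downtime_so_far
    return expected_downtime <= remaining


# Hypothetical example: 99.9% availability over 30 days allows
# roughly 43 minutes of downtime in total.
budget = error_budget(0.999, timedelta(days=30))
print(budget)
```

With 20 minutes of downtime already spent in the window, an action expected to cost 10 minutes would still fit under these assumptions, while one expected to cost 30 minutes would not.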
Service Level Indicators (SLI)
A Service Level Indicator is a quantitative measure of some aspect of a service's behavior, such as its availability. If the Service Level Indicator falls below the target value, the application is no longer behaving as its users expect. When assessing the risk associated with an action, a Site Reliability Engineer has to consider the SLI as well.
We can quantify risk only by using both the Service Level Objectives and Service Level Indicators.
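As a rough illustration of using both numbers together, the sketch below computes an availability SLI as the ratio of successful events to all events and compares it with an SLO target. This ratio is one common way to define an SLI, not the only one, and the function names and sample figures are assumptions made for the example.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of events (e.g. requests) that succeeded."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """True if the measured SLI is at or above the SLO target."""
    return sli >= slo_target


def budget_consumed(sli: float, slo_target: float) -> float:
    """Fraction of the error budget used up (1.0 means fully spent)."""
    if slo_target >= 1.0:
        raise ValueError("a 100% target leaves no error budget")
    return (1.0 - sli) / (1.0 - slo_target)


# Hypothetical figures: 9,995 successful requests out of 10,000,
# measured against a 99.9% SLO.
sli = availability_sli(9_995, 10_000)
print(meets_slo(sli, 0.999), round(budget_consumed(sli, 0.999), 2))
```

The closer the consumed fraction gets to 1.0, the less room there is for risky actions before the SLO is missed.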
Conclusion
Managing risk as an SRE can be challenging at times. However, it can be made easier by using automated tools and precise numerical values established separately for each application.
TOIL
Toil can be defined as a constant stream of repetitive maintenance tasks. Toil is familiar to any working team and is usually unavoidable. Many companies limit the time Site Reliability Engineers spend on operational tasks to 50%.
Usually, Site Reliability Engineers spend the other half on development tasks such as coming up with new features for the application and improving existing features. Toil can become quite time-consuming if left unchecked for a long time.
Setting an upper limit on the time spent on toil can make your SRE team more efficient.
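A cap like this is only useful if the time is actually tracked. As a minimal sketch, assuming the team records hours spent on toil and total hours worked (both hypothetical inputs), the check could look like this:

```python
def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time that went into toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_cap(toil_hours: float, total_hours: float, cap: float = 0.5) -> bool:
    """True if toil exceeds the cap (50% by default, as in the text)."""
    return toil_fraction(toil_hours, total_hours) > cap


# Hypothetical week: 22 of 40 hours spent on operational toil.
print(toil_fraction(22, 40), over_toil_cap(22, 40))
```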
Characteristics of Toil
Toil has a few primary characteristics: Manual, Repetitive, Automatable.
Manual
Manual toil refers to jobs that need human intervention. These tasks are usually small and depend on numerous parameters. They are either too critical to hand over to automation or too difficult to automate, so a knowledgeable SRE must do them.
Repetitive
Toil is usually just a set of repetitive tasks. You may have performed these tasks before and may have to perform them again. If such a task does not involve numerous considerations or critical functions, you can automate it using simple tools.
Automatable
Most toil is easily automatable. Since the tasks are repetitive, you can note down the pattern and write commands that do the work automatically. Beyond running a script by hand, you can even automate the detection of the problem itself, so that the fix is triggered without manual intervention.
Why Do We Want Toil Eliminated?
Toil is mostly a set of mind-numbing tasks, and engineers are rarely eager to take them on. Moreover, Site Reliability Engineers have only around 50% of their time for operations and management tasks. If we do not eliminate toil, most of that time goes to these simple tasks, leaving no time for the big-picture work.
How to Eliminate Toil?
Identify Toil
The first step to eliminating toil is to identify it. You can do so by observing the tasks the engineers are doing. When a task is repetitive and straightforward, it can probably be automated. Automating such tasks is essential for the growth of the product as a whole.
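One simple way to observe what engineers are doing is to count how often the same task shows up in a work log. The sketch below assumes a plain list of task descriptions (a hypothetical format; a real ticketing system would need its own export step) and flags anything that recurs at least a given number of times.

```python
from collections import Counter


def repeated_tasks(task_log: list[str], min_count: int = 3) -> list[tuple[str, int]]:
    """Return (task, count) pairs for tasks seen at least min_count times.

    Task names are normalized (trimmed, lower-cased) so that small
    differences in how they were typed do not hide repetition.
    """
    counts = Counter(task.strip().lower() for task in task_log)
    return [(task, n) for task, n in counts.most_common() if n >= min_count]


# Hypothetical log of tasks handled by the team during one week.
log = [
    "restart worker", "rotate logs", "Restart worker", "restart worker ",
    "rotate logs", "resize disk", "rotate logs", "restart worker",
]
for task, n in repeated_tasks(log):
    print(f"{task}: {n} times")
```

Tasks that top this list are candidates for automation. Whether they are also straightforward enough to automate still needs human judgment.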
Use Open Source Tools
There are many open-source and third-party tools available to automate toil. If you have neither the time nor the resources to build your own automation, it is best to use the available tools. Using third-party tools will reduce development costs.
Improve with Feedback
Since the engineers who use a solution know how effective it is, it is crucial to consider their feedback to improve the product. By collecting this feedback, you can understand the precise requirements and develop better tools to ease the engineers' tasks.
Conclusion
Reducing toil is vital for any Site Reliability Engineer. If we do not eliminate toil from a process, it may ultimately consume all the time an SRE has for management and operations.