SRE concepts part 3 (Risk / Toil)

Marcel Koert

Innovative Platform Engineer | DevOps Engineer | Site Reliability Engineer | IT Educator | Founder of Melomar-IT

Published: March 18, 2021

This is the third article in my series on SRE concepts. In it, I discuss Risk and Toil.

How to deal with Risk as an SRE?

Site Reliability Engineers treat risk as something to be measured rather than ignored. A developer, for instance, may design an application on the assumption that it will never go down. A Site Reliability Engineer accepts that some downtime will occur and uses an error budget to state how much of it is tolerable.

Dealing with risk is one of a Site Reliability Engineer's major responsibilities, and the way they do it is noteworthy. During development, risk is usually not measured at all.

In operations, however, we need to know how large the impact of a risk can be. Not knowing the effect of a failure can be disastrous.

So, how do you deal with risks? We've established that we need to quantify risks to understand their nature and act on them. As a Site Reliability Engineer, you have probably come across two terms used for this: SLIs and SLOs. SLI stands for Service Level Indicator, and SLO stands for Service Level Objective.

Service Level Objectives (SLO)

Availability is one of the key measures of success for a Site Reliability Engineer. If a product is unavailable when it is needed, that counts as a failure. In SRE terms, availability is the ability of a product to perform its intended function. Risk can be described in similar terms.

In SRE, risk can be defined as the impact of an action on the product's availability. If your product becomes unavailable after you add a new feature, that is an availability incident. In other words, risk indicates how likely an action is to cause an incident.

Now that we've established the definitions, let us look at SLOs. Because risk is abstract in many cases, we need numerical values to measure it against. Organizations usually set a precise numerical target for their product's availability.

The target varies from product to product and from organization to organization. This target, set by the engineers and developers, is called a Service Level Objective, or SLO for short.

Measuring against the SLO is the first indicator of risk. When you want to assess the risk of an action, you can use this numerical target to check whether the product is expected to stay available enough. If the action would cause the target to be missed, the risk is high. If the target would still be met, you may proceed with the action.
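The check described above can be made concrete with a little arithmetic. The sketch below is only an illustration: the 99.9% target, the 30-day window, and the function names are assumptions, not values from any particular organization. It converts an availability SLO into an error budget (the downtime the target still permits) and tests whether a planned action fits into what remains.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO permits over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def can_proceed(slo_target: float, downtime_so_far: float,
                expected_downtime: float, window_days: int = 30) -> bool:
    """True if the action's expected downtime fits in the remaining budget."""
    remaining = error_budget_minutes(slo_target, window_days) - downtime_so_far
    return expected_downtime <= remaining


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over 30 days, 30 minutes already lost.
    budget = error_budget_minutes(0.999)
    print(f"Error budget: {budget:.1f} minutes")  # Error budget: 43.2 minutes
    print(can_proceed(0.999, 30.0, 10.0))          # True: 10 minutes fit in the ~13.2 left
    print(can_proceed(0.999, 30.0, 20.0))          # False: 20 minutes exceed the ~13.2 left
```

The estimate of an action's expected downtime is the hard part in practice. The sketch takes it as an input and does not try to derive it.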

Service Level Indicators (SLI)

A Service Level Indicator is a quantitative measure of a service's behavior, such as the proportion of requests that succeed. If an SLI falls below a specific value, the application is no longer behaving acceptably for its users. When assessing the risk associated with an action, a Site Reliability Engineer has to consider the SLI as well.

Risk can be quantified only by using the two together: the SLI measures what the service is actually doing, and the SLO states what it should be doing.
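As a minimal sketch of how the two fit together, the example below computes an availability SLI from request counts and compares it with an SLO target. The request counts, the 99.9% target, and the function names are all hypothetical. Treating a period with no traffic as fully available is also an assumption.

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests served successfully (the measured SLI)."""
    if total < 0 or not 0 <= successful <= total:
        raise ValueError("need 0 <= successful <= total")
    if total == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return successful / total


def budget_consumed(successful: int, total: int, slo_target: float) -> float:
    """Share of the error budget used: failed requests / allowed failures."""
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 0.0 if successful == total else float("inf")
    return (total - successful) / allowed_failures


if __name__ == "__main__":
    SLO_TARGET = 0.999                        # hypothetical objective
    successful, total = 999_200, 1_000_000    # hypothetical request counts

    sli = availability_sli(successful, total)
    print(f"SLI: {sli:.4f}, SLO met: {sli >= SLO_TARGET}")
    print(f"Error budget consumed: {budget_consumed(successful, total, SLO_TARGET):.0%}")
```

With these numbers, 800 of the 1,000 permitted failures have occurred, so the objective is still met but most of the budget is gone.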

Conclusion

Managing risk as an SRE can be challenging at times. However, it can be made easier by using automated tools and precise numerical values established separately for each application.

TOIL

Toil can be defined as a constant stream of repetitive maintenance tasks. It is familiar to any working team and usually cannot be avoided entirely. Many companies cap the time Site Reliability Engineers spend on operational tasks at 50%.

Usually, Site Reliability Engineers spend the other half on development work, such as building new features for the application and improving existing ones. Left unchecked, toil can consume a great deal of time.

Setting an upper limit on the time spent on toil can make your SRE team more efficient.
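One way to keep an eye on such a limit is to track hours per category and compare the toil share with the cap. The sketch below assumes a simple weekly time log. The category names and hours are invented for illustration, and which categories count as toil is a judgment each team has to make.

```python
TOIL_CAP = 0.5  # the 50% limit mentioned above


def toil_fraction(hours: dict[str, float], toil_categories: set[str]) -> float:
    """Share of logged hours that fall into categories marked as toil."""
    total = sum(hours.values())
    if total <= 0:
        return 0.0
    toil = sum(h for category, h in hours.items() if category in toil_categories)
    return toil / total


if __name__ == "__main__":
    # Hypothetical week for one engineer (hours).
    week = {
        "tickets": 12.0,
        "manual deploys": 6.0,
        "on-call pages": 5.0,
        "feature work": 17.0,
    }
    toil = {"tickets", "manual deploys", "on-call pages"}

    fraction = toil_fraction(week, toil)  # 23 of 40 hours
    print(f"Toil: {fraction:.1%}, over cap: {fraction > TOIL_CAP}")
```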

Characteristics of Toil

Toil has three primary characteristics: it is manual, repetitive, and automatable.

Manual

Manual toil refers to jobs that need human intervention. These tasks are usually small and depend on numerous parameters. They are done by hand either because they are too critical to hand over to automation or because they are too difficult to automate, so a knowledgeable SRE must do them.

Repetitive

Toil usually consists of repetitive tasks: you have performed them before and will have to perform them again. If such a task does not involve numerous considerations or vital functions, you can automate it using simple tools.

Automatable

Most toil is easily automatable. Since the tasks are repetitive, you can note the pattern and write commands that trigger the required work automatically. Beyond running a script by hand, you can even automate the detection of the problem itself using simple tools and software programs.

Why do we Want Toil Eliminated?

Toil is mostly a set of mind-numbing tasks, and engineers are rarely eager to take on such repetitive work. Moreover, Site Reliability Engineers have only about 50% of their time for operations and management tasks. If we do not eliminate toil, most of that time goes to these simple tasks, leaving none for big-picture work.

How to Eliminate Toil?

Identify Toil

The first step in eliminating toil is to identify it. You can do so by observing the tasks the engineers perform. When a task is repetitive and straightforward, it can probably be automated. Automating such tasks is essential for the growth of the product as a whole.
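If the team already keeps a record of what it does, a first pass at this observation can be automated too. The sketch below assumes a plain list of task descriptions (the entries shown are made up) and simply counts how often each one recurs. Tasks at or above a chosen threshold become candidates for automation. The threshold of three is an arbitrary assumption.

```python
from collections import Counter


def toil_candidates(task_log: list[str], min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (task, count) pairs for tasks repeated at least min_repeats times."""
    # Normalize case and surrounding whitespace so trivial variants match.
    counts = Counter(task.strip().lower() for task in task_log)
    return [(task, n) for task, n in counts.most_common() if n >= min_repeats]


if __name__ == "__main__":
    # Hypothetical task log for one week.
    log = [
        "Restart cache", "Rotate logs", "restart cache", "Design review",
        "rotate logs", "Restart cache ", "Rotate logs", "Restart cache",
    ]
    for task, n in toil_candidates(log):
        print(f"{n}x {task}")
```

Counting exact descriptions is crude. It will miss the same task written in different words, so the result is a starting point for a conversation with the engineers, not a verdict.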

Use Open Source Tools

Many open-source and third-party tools are available for automating toil. If you have neither the time nor the resources to build your own automation, it is best to use them. Using third-party tools also reduces development costs.

Improve with Feedback

The engineers who use a solution know how effective it is, so their feedback is crucial for improving it. By collecting this feedback, you can understand the precise requirements and build better tools to ease the engineers' work.

Conclusion

Reducing toil is vital for any Site Reliability Engineer. If we do not eliminate toil from a process, it may ultimately consume all the time an SRE has for management and operations.
