Avoiding Toiling

Ben Walpole

Software Architect at Aviva

Published: 26 August 2024

Site Reliability Engineering (SRE) is the practice of applying software engineering principles to the management of infrastructure and operations.

SRE originated at Google in the early 2000s. The sorts of things an SRE team might work on include system availability, latency and performance, efficiency, monitoring, and the ability to deliver change.

Optimising these aspects of a system covers many different topics, one of which is the management of toil.

Toil in this context is not simply work we don't particularly enjoy doing or don't find stimulating. It has a specific meaning, defined by aspects other than our enjoyment of the tasks it involves.

What is Toil?

Toil is work that exhibits some or all of the following attributes.

It is Manual in nature. Even if a human isn't directly doing the work, it requires human initiation, monitoring or some other involvement that means a team member has to oversee its operation.

Toil is Repetitive. The timing of the work may vary and need not follow regular intervals, but the task has to be performed multiple times and is never truly finished.

It is Tactical, meaning it is generally reactive. It is undertaken in response to something happening within the system, for example when monitoring highlights that something is failing or is sub-optimal.

It has No Enduring Value, meaning it leaves the system in the same state as before the work happened. It hasn't improved any aspect of the system or eliminated the need for the work to happen again in the future.

It Scales with Service Growth. Some work has to happen regardless of how much a system is used. This tends to be viewed as overhead and is simply the cost of having the system in the first place. Toil, by contrast, scales with system use, meaning the more users you attract the greater its impact on your team.

Finally, toil can be Automated. Some tasks will always require human involvement, but for a task to be toil it must be possible to automate it.
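The six attributes above can be treated as a simple checklist. What follows is a minimal sketch rather than an established method: the `TaskProfile` type, the `toil_score` and `looks_like_toil` functions and the threshold of four are all assumptions made for illustration. The only rules taken from the text are that toil shows "some or all" of the attributes and that it must be possible to automate it.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical checklist: one flag per toil attribute described above."""
    manual: bool
    repetitive: bool
    tactical: bool
    no_enduring_value: bool
    scales_with_growth: bool
    automatable: bool


def toil_score(task: TaskProfile) -> int:
    """Count how many of the six attributes the task exhibits."""
    return sum(getattr(task, f.name) for f in fields(task))


def looks_like_toil(task: TaskProfile, threshold: int = 4) -> bool:
    """Treat 'automatable' as required; the threshold is an arbitrary assumption."""
    return task.automatable and toil_score(task) >= threshold


# Two illustrative tasks (invented for this example).
restart_stuck_worker = TaskProfile(True, True, True, True, True, True)
architecture_review = TaskProfile(True, False, False, False, False, False)

print(toil_score(restart_stuck_worker), looks_like_toil(restart_stuck_worker))
print(toil_score(architecture_review), looks_like_toil(architecture_review))
```

A team would likely tune the threshold, or weight the attributes differently, to suit its own context.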

What is Toil's Impact?

It would be wrong to suggest that toil can be totally eliminated. A production system used by large numbers of people is always going to incur a certain amount of toil, and it is unlikely that the whole engineering effort of your organisation can be dedicated to removing it.

Also, much like technical debt, even if you do reach a point where you feel it's eliminated, a future change in the system will likely re-introduce it.

But, also like technical debt, the first step is to acknowledge that toil exists, develop ways to detect it, and have a strategy for managing it and keeping it to a reasonable minimum.

Toil's impact is that it engages your engineering resource on tasks that don't add to or improve your system. It may keep the system up and running, but that is a low ambition to have for any system.

It's also important to recognise that large amounts of toil are likely to affect a team's morale. Very few engineers embark on their career looking to spend large amounts of time on repetitive tasks that lead to no overall value.

The Alternative to Toil

The alternative to spending time on toil is to spend time on engineering. Engineering is a broad concept, but in this context it means work that improves the system itself or enables it to be managed in a more efficient way.

As we said previously, completely eliminating toil is probably an unrealistic aim. But it is possible to measure how much time your team is spending on toil-related tasks. Once you are able to estimate this, it is possible both to set a sensible limit on how much time is spent on these tasks and to measure the effectiveness of any engineering activities designed to reduce it.
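As a rough illustration of measuring toil, the sketch below computes the share of logged hours spent on toil and compares it with a limit. The `TimeEntry` type, the sample entries and the `TOIL_LIMIT` value of 50% are all assumptions made for the example. In practice the data would come from whatever time-tracking or ticketing system your team already uses, and the limit is for the team to choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeEntry:
    """Hypothetical record of time logged against one piece of work."""
    description: str
    hours: float
    is_toil: bool


def toil_fraction(entries: list[TimeEntry]) -> float:
    """Return the share of logged hours spent on toil (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    return sum(e.hours for e in entries if e.is_toil) / total


# Assumed limit for illustration only; choose one that suits your team.
TOIL_LIMIT = 0.5

# Invented sample data for one week.
week = [
    TimeEntry("Manually restart stuck workers", 6.0, True),
    TimeEntry("Rotate credentials by hand", 4.0, True),
    TimeEntry("Automate worker restarts", 20.0, False),
    TimeEntry("Refactor deployment pipeline", 10.0, False),
]

fraction = toil_fraction(week)
print(f"Toil: {fraction:.0%} of logged time")
print("Over limit" if fraction > TOIL_LIMIT else "Within limit")
```

Tracking the same figure week by week gives a simple way to see whether engineering work, such as the automation task in the sample data, is actually reducing toil.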

This engineering activity might relate to software engineering: refactoring code for performance or reliability, automating testing, or automating certain aspects of the build and deployment pipeline. It might also be aimed more at system engineering: analysing the correctness of the infrastructure the system is running on, analysing the nature of system failures, or automating the management of infrastructure.

As previously stated, we can view toil as a form of technical debt. In the early days of a system we may take certain shortcuts that are manageable at the time but, as the system grows, come with a bigger and bigger impact. Time spent fixing this debt will set you on a path of gradual system improvement, both for your users and for the teams that work on the system.

