Avoiding Toiling

Avoiding Toiling

Site Reliability Engineering (SRE) is the practice of applying software engineering principles to the management of infrastructure and operations.

Originating at Google in the early 2000s the sorts of things an SRE team might work on include system availability, latency and performance, efficiency, monitoring and the ability to deliver change.

Optimising these kinds of system aspects covers many different topics and areas, one of which is the management of toil.

Toil in this context is not work we don't particularly enjoy doing or don't find stimulating, it has a specific meaning defined by aspects other than our enjoyment of the tasks it involves.

What is Toil?

Toil is work that exhibits some or all of the following attributes.

It is Manual in nature, even if a human isn't necessarily doing the work it requires human initiation, monitoring or any other aspect that means a team member has to oversee its operation.

Toil is Repetitive, the times work has to be done may vary and may not necessarily be at regular intervals, but the task needs to be performed multiple times and will never necessarily be deemed finished.

It is Tactical meaning it is generally reactive, it has to be undertaken either in relation to something happening within the system for example when monitoring highlights something is failing or is sub-optimal.

It has No Enduring Value, this means it leaves the system in the same state as before the work happened. It hasn't improved any aspect of the system or eliminated the need for the work to happen again in the future.

It Scales with Service Growth. Some work items need to happen regardless of how much a system is used. This tends to be viewed as overhead and is simply the cost of having the system in the first place. Toil scales with system use meaning the more users you attract the greater the impact of the toil on your team.

Finally toil can be Automated, some tasks will always require human involvement, but for a task to be toil it must be possible for it to be automated.

What is Toils Impact?

It would be wrong to suggest that toil can be totally eliminated, having a production system being used by large numbers of people is always going to incur a certain amount of toil, and it is unlikely that the whole engineering effort of your organisation can be dedicated to removing it.

Also, much like technical debt, even if you do reach a point where you feel its eliminated the chances are a future change in the system will likely re-introduce it.

But also like technical debt the first step is to acknowledge toil exists, develop ways to be able to detect it and have a strategy for managing it and trying to keep it to a reasonable minimum.

Toils impact is that it engages your engineering resource on tasks that don't add to or improve your system. It may keep it up and running but that is a low ambition to have for any system

It's also important to recognise that large amounts of toil is likely to impact a teams morale, very few engineers will embark on their career looking to spend large amounts of time on repetitive tasks that lead to no overall value.

The Alternative to Toil

The alternative to spending time on toil is to spend time on engineering. Engineering is a broad concept but in this context it means work that improves the system itself or enables to to be managed in a more efficient way.

As we said previously completely eliminating toil is probably an unrealistic aim. But it is possible to measure how much time your team is spending on toil related tasks. Once you are able to estimate this then it is possible both to set a sensible limit on how much time is spent on these tasks but also measure the effectiveness of any engineering activities designed to reduce it.

This engineering activity might relate to software engineering, refactoring code for performance or reliability, automating testing or certain aspects of the build and deployment pipeline. It might also be more aimed at system engineering, analysing the correctness of the infrastructure the system is running on, analysing the nature of system failures or automating the management of infrastructure.

As previously stated we can view toil as a form of technical debt. In the early days of a system we may take certain shortcuts that at the time are manageable but as the system grows come with a bigger and bigger impact. Time spent trying to fix this debt will set you on a path for gradual system improvement, both for your users and the teams that work on the system.

Thanks for sharing Ben. Commenting so our SRE community sees this.

回复

要查看或添加评论,请登录

Ben Walpole的更多文章

  • Network of Networks

    Network of Networks

    When we're searching for analogies to describe the operation of the internet we often fall back on that of posting a…

  • Underpinning Kubernetes

    Underpinning Kubernetes

    Kubernetes is the de facto choice for deploying containerized applications at scale. Because of that we are all now…

  • Compiling Knowledge

    Compiling Knowledge

    Any software engineer who works with a compiled language will know the almost religious concept of the build. Whether…

  • Terraforming Your World

    Terraforming Your World

    Software Engineers are very good at managing source code. We have developed effective strategies and tools to allow us…

  • Being at the Helm

    Being at the Helm

    The majority of containerized applications that are being deployed at any reasonable scale will likely be using some…

  • The Language of Love

    The Language of Love

    Software engineers are often polyglots who will learn or be exposed to multiple programming languages over the course…

  • Vulnerable From All Sides

    Vulnerable From All Sides

    Bugs in software engineering are a fact of life, no engineer no matter what he perceives his or her skill level to be…

  • The World Wide Internet

    The World Wide Internet

    Surfing the web, getting online and hitting the net are terms that ubiquitous among the verbs that describe how we live…

  • What's In a Name?

    What's In a Name?

    Surfing the web seems like straightforward undertaking, you type the website you want to go to into your browsers…

  • The Web of Trust

    The Web of Trust

    Everyday all of us type a web address into a browser or click on a link provided by a search engine and interact with…

社区洞察

其他会员也浏览了