SRE Concepts series Part 1
Marcel Koert
Innovative Platform Engineer | DevOps Engineer | Site Reliability Engineer | IT Educator | Founder of Melomar-IT
I have often been asked about certain SRE concepts, so I am writing a series on 15 topics that feature in the Google SRE book. These concepts are not just for SRE; you can apply them in any IT environment. In this first part, I briefly discuss what SRE is. After that, in no particular order, I will cover the following topics and concepts.
- Risk
- Time-based Versus Aggregated Availability
- Error budget
- Service Level Indicators
- Service Level Objectives
- Toil
- Black box monitoring
- White box monitoring
- The value of automation
- Continuous build and deployment
- Stability versus Agility
- Root cause analysis
- Capacity planning
- Break your systems
- Testing in production
What is Site Reliability Engineering?
Site Reliability Engineering, or SRE for short, is essentially a discipline that applies software engineering practices to infrastructure and operations problems. Its aim is to deliver more reliable and scalable applications and environments.
Although it was first developed before the DevOps movement, Site Reliability Engineering is still one of the most in-demand fields in the Information Technology sector. Site Reliability Engineers are a vital part of modern cloud infrastructures, and depending on the infrastructure, their work can cover quite a few things.
A Site Reliability Engineer spends up to half of their time on operations work, such as handling issues, manual interventions, deployments, and other ad-hoc tasks. The rest of their time, at least 50%, goes to development.
Site Reliability Engineers generally automate whatever work lends itself to automation. This leaves them more time to develop new features and to improve the application while maintaining it. Because the role requires knowledge of both development and operations, skilled Site Reliability Engineers are usually hard to find.
SRE and DevOps rest on the same basic principle: the same engineers are responsible for both development and operations. Site Reliability Engineering is often described as a specific implementation of DevOps.
Site Reliability Engineer Responsibilities
Share Responsibility
Many organizations are adopting a shared responsibility model to speed up development and ensure application security. A shared responsibility model also removes single points of delay.
Site Reliability Engineers use the same tools and software as developers. They share responsibility for building the product with the development team, which is a significant part of the shared responsibility model.
Accept Failure and Prepare for it
Unlike traditional developers or operations engineers, Site Reliability Engineers accept that failure is normal and plan for failure scenarios. They track the product's downtime against an error budget and take corrective measures when needed. Embracing risk in this way, rather than pretending failures will not happen, is what keeps the project as a whole running smoothly.
Site Reliability Engineers quantify failure and availability in terms of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
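To make the relationship between an SLI, an SLO, and an error budget a little more concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (successful requests divided by total requests) and treats the error budget as the share of requests the SLO allows to fail. The names Window, availability_sli, and error_budget_remaining, as well as the numbers, are made up for illustration; they do not come from the SRE book or from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts for one measurement window (hypothetical structure)."""
    total_requests: int
    failed_requests: int


def availability_sli(window: Window) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    if window.total_requests == 0:
        # No traffic means nothing failed; treat this as fully available.
        return 1.0
    good = window.total_requests - window.failed_requests
    return good / window.total_requests


def error_budget_remaining(window: Window, slo: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means the budget is untouched, 0.0 means it is fully used,
    and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 1.0
    # The budget is the number of requests the SLO allows to fail.
    allowed_failures = (1.0 - slo) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


if __name__ == "__main__":
    # Assumed example: one million requests, 250 failures, a 99.9% SLO.
    window = Window(total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {availability_sli(window):.5f}")
    print(f"Error budget remaining: {error_budget_remaining(window, slo=0.999):.0%}")
```

In this sketch the SLI is what is measured, the SLO is the target it is compared against, and the error budget is the room for failure that the SLO leaves. A team could use the remaining budget to decide whether to keep shipping changes or to focus on reliability first.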
Use Automation for Menial Tasks
A core function of a Site Reliability Engineer is to automate simple, repetitive tasks. Automating menial tasks that would otherwise require considerable manual effort significantly reduces the time spent on operations. Once the automation is in place, Site Reliability Engineers have more time to work on new features or improve existing ones.
Implement and Adapt to Gradual Changes
Site Reliability Engineers encourage developers to move quickly by ensuring that the cost of failure is low. As a result, changes can be rolled out more rapidly than with traditional approaches.
Quantify Parameters
Site Reliability Engineers try to quantify as much as possible, such as availability and downtime, so that decisions aimed at minimizing losses and maximizing gains rest on data.
Conclusion
Site Reliability Engineers are a core part of modern development and operations teams, especially when the cloud is involved. Combining a good cloud environment with a solid software development plan and a capable SRE can deliver the results you are looking for.