Site Reliability Engineering (SRE)
The What and Why Behind SRE
SRE is a way to build and run reliable production systems in increasingly complex technical environments. SRE acknowledges that running successful production systems is a specific skill, distinct from other engineering disciplines. Ben Treynor, the founder of the SRE team at Google, describes SRE responsibilities in an interview for the SRE book:
[the] SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Site Reliability Engineers require both software development and operations skills. They’re expected to write software that assists with deployment and production operations, and also to debug software in production environments. A cursory look at SRE job posts shows new hires are expected to be fluent in a programming language (such as Go or Node.js), configuration management and automation tools (such as Ansible, Chef, or Puppet), and cloud infrastructure (such as AWS, Azure, or GCP). Experience with containers and container orchestration tools like Mesos or Kubernetes is also a common job requirement.
The interdisciplinary skill set is useful throughout the SDLC and overlaps with that of other technical team members. It may also cause SRE to become an organization’s junk drawer for work that doesn’t map clearly onto existing teams. It also means these skills will be less effective if they’re not focused on clear goals and defined responsibilities.
WHAT IS SITE RELIABILITY ENGINEERING (SRE)?
A new approach to running production systems!
Google introduced a different approach: bringing in software engineers to automate the operational processes that system admins had previously performed manually to run their products and services. According to Google, SRE is what happens when you ask a software engineer to design an operations team.
Knowledge of Unix system internals and networking, along with an aptitude for developing complex software systems, are the key skills required of any SRE. We can therefore consider SRE a specific implementation of DevOps with some extensions.
SREs, being software engineers, possess the skills and knowledge needed to design and implement automation for processes that a traditional operations team performs manually. Traditional operations teams have to scale linearly (by hiring more people) as the load and size of their services grow, which adds to overall project costs. SREs, by contrast, rely on constant engineering and automation to keep team size independent of the size of the services they run. SREs usually spend 50% of their time on engineering/development and 50% on the operations side of running services.
SRE PROS & CONS
Advantages of the SRE Approach
SRE Challenges
RESPONSIBILITIES OF SRE
SRE teams interact with the environments, development teams, testing teams, users, etc., to understand the work practices and business requirements, while focusing on engineering the changes. An SRE is responsible for the following with respect to each of the services running in production:
One of the critical responsibilities of any member of the operations team is monitoring. SREs monitor the system 24/7 to keep track of its health and availability. In a traditional environment, email alerts are generated and reviewed by an operations team member, who then takes the necessary actions. In the SRE world, software interprets the alerts and tries to resolve them by itself, notifying the SREs only when human intervention is required. Based on severity, there are three types of monitoring outputs (a minimal routing sketch follows the list):
Alerts: High severity where human intervention and action are required immediately.
Tickets: Medium severity where a ticket is created for the operations team to take action, but not necessarily immediate action.
Logs: Low severity. In such cases, the information is recorded for audit and forensic purposes and is to be used whenever required.
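To make the alert/ticket/log split concrete, here is a minimal sketch in Python of how monitoring output might be routed by severity. The event shape, thresholds, and the paging/ticketing/logging helpers are hypothetical placeholders, not any specific tool’s API.

```python
from enum import Enum

class Severity(Enum):
    ALERT = 1   # high severity: requires immediate human intervention
    TICKET = 2  # medium severity: needs action, but not immediately
    LOG = 3     # low severity: recorded for audit and forensics only

def route_monitoring_event(event):
    """Route a monitoring event (hypothetical dict) based on its severity."""
    severity = event["severity"]
    if severity is Severity.ALERT:
        page_on_call(event)      # e.g. push to a paging service
    elif severity is Severity.TICKET:
        create_ticket(event)     # e.g. open an issue for the team to handle later
    else:
        write_audit_log(event)   # keep for later analysis, no notification

# Placeholders for whatever paging, ticketing, and logging systems are in use.
def page_on_call(event):
    print(f"PAGE: {event['summary']}")

def create_ticket(event):
    print(f"TICKET: {event['summary']}")

def write_audit_log(event):
    print(f"LOG: {event['summary']}")

if __name__ == "__main__":
    route_monitoring_event({"severity": Severity.ALERT, "summary": "error rate above SLO"})
```

The point of the sketch is the decision, not the integrations: the software decides where an event goes, and a human is only pulled in for the highest-severity class.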
Failures and system emergencies can happen at any time. What’s important is how fast the response team can bring the system back to its normal state. In a manual setup, this recovery takes more time.
Mean Time To Repair (MTTR) is a measure of how effective the emergency response is. Automation helps increase system availability by reducing MTTR by a factor of three or more.
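As a rough illustration of how MTTR might be tracked, the sketch below computes it from a list of per-incident recovery times; the numbers are made up purely to show the comparison between manual and automated response.

```python
def mean_time_to_repair(repair_minutes):
    """MTTR = total time spent restoring service / number of incidents."""
    if not repair_minutes:
        return 0.0
    return sum(repair_minutes) / len(repair_minutes)

# Hypothetical example: the same incidents handled manually vs. with automation.
manual    = [90, 120, 60, 150]  # minutes to recover per incident, manual runbooks
automated = [25, 40, 20, 45]    # minutes to recover with automated remediation

print(f"Manual MTTR:    {mean_time_to_repair(manual):.0f} min")
print(f"Automated MTTR: {mean_time_to_repair(automated):.0f} min")
```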
One of the critical responsibilities of an SRE is applying changes to the system without causing downtime. Around 70% of system failures and outages occur while changes are being made to the live system. SRE employs automation and best practices such as progressive rollouts, along with mechanisms to detect issues quickly and roll back changes safely if a problem occurs. Automation increases both the safety and the velocity of change management.
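The loop below is a minimal sketch of a progressive rollout with automated rollback. The traffic steps, the error-rate threshold, and the deploy/metrics/rollback helpers are all assumptions for illustration, not a reference to any particular deployment tool.

```python
import time

ERROR_RATE_THRESHOLD = 0.01          # assumed acceptable error rate (1%)
TRAFFIC_STEPS = [1, 5, 25, 50, 100]  # percent of traffic on the new version

def progressive_rollout(new_version):
    """Ramp traffic to the new version, rolling back if errors spike."""
    for percent in TRAFFIC_STEPS:
        shift_traffic(new_version, percent)       # hypothetical deployment call
        time.sleep(300)                           # let metrics settle (5 minutes)
        if observed_error_rate(new_version) > ERROR_RATE_THRESHOLD:
            rollback(new_version)                 # undo the change safely
            return False
    return True                                   # rollout completed at 100%

# Placeholders for whatever deployment and monitoring systems are actually in use.
def shift_traffic(version, percent): ...
def observed_error_rate(version): return 0.0
def rollback(version): ...
```

The design choice worth noting is that rollback is triggered by an observed metric rather than by a human decision, which is what makes the rollout both faster and safer than a manual change.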
SREs monitor and tune services, or provision more capacity, to meet expected loads and maintain the required performance and efficiency levels. Efficient use of resources also reduces overall costs.
An SRE constantly monitors system resources and performance to identify expected future demands, and plans the system’s capacity accordingly. SRE ensures sufficient capacity and redundancy to meet such demands (organic and inorganic) and to run the services with the expected efficiency and availability. SREs can also use load tests to determine the available capacity and correlate it with the required capacity.
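As one illustration of the kind of capacity math involved, the sketch below estimates how many instances a projected peak load would need, with redundancy headroom. The numbers are assumptions; in practice, per-instance capacity would come from load tests like those described above.

```python
import math

def required_instances(peak_rps, growth_factor, rps_per_instance, redundancy=2):
    """Estimate instance count for a projected peak load plus redundancy headroom."""
    projected_peak = peak_rps * growth_factor
    return math.ceil(projected_peak / rps_per_instance) + redundancy

# Hypothetical inputs: current peak of 4,000 requests/s, 30% expected growth,
# and load tests showing one instance handles ~250 requests/s comfortably.
print(required_instances(peak_rps=4000, growth_factor=1.3, rps_per_instance=250))
# -> 23 instances (21 for the projected load, plus 2 for redundancy)
```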
Based on change management and capacity planning, SREs provision new capacity when necessary. Because increasing capacity is expensive, it should be done only when needed. During such changes, the SRE validates the modifications and ensures they deliver the correct results and the expected performance for the services.
DEVELOPMENT & OPERATIONS
It is often challenging to design, build, and deploy large, complex software systems in production environments. Moreover, it is equally important and challenging to run and maintain these live production systems.
Traditionally, companies employed system administrators to run, operate, and respond to events in large, complex computing systems. However, the skills required of such system admins (operations teams) were different from the skills needed by software developers. This led to the creation of separate teams for development (Dev) and operations (Ops).
THE DEV-OPS CONFLICT!
Dividing the teams into Dev/Ops has many pitfalls and disadvantages. The first among them is higher operations costs, both direct and indirect. Maintaining a separate operations team, and scaling it up as load and events increase, raises direct costs. At the same time, indirect costs are incurred by the organization mostly due to the split between the teams, i.e., in terms of skill sets, interests, objectives, risks, etc.
Development teams want to push new features and releases to production as soon as possible, whereas operations teams want to keep their systems stable, without service disruptions or outages, by keeping changes to a minimum. The operations team tries to safeguard the running system by reducing changes and risks. This naturally leads to a structural conflict between the two teams regarding the pace of innovation and service stability.
Day-to-Day SRE
SREs strive to reduce toil in their day-to-day work, which continuously improves their own efficiency and that of dependent teams. (Note that continuous improvement is a fundamental DevOps principle that connects SRE to the larger DevOps movement.) The SRE book defines toil as:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
Toil is not simply work that engineers don’t want to do; it is an inhibitor that should be reduced wherever possible. SREs should apply their automation skills to eliminate manual work, enabling an SRE team to scale out while maintaining consistency across their systems. Reducing toil is a powerful idea because it expands to capture the day-to-day work of maintaining logging and metric systems, standing up new services, reporting on SLOs, and adding CI/CD pipelines to other systems. This is all important work that other teams need but that someone has to set up for them.
Google found that capping SRE time at 50% on toil and 50% on project work (such as driving improvements or supporting existing teams) was a key factor in successful SRE implementations. Capping the work sets a clear limit on how painful toil may become, and it makes it a clear priority to address toil that habitually pushes against the limit. It also reinforces the idea that SRE is more than just toil and encourages a shared responsibility model. If the SREs are overwhelmed with toil, then work can be distributed across other teams. This sheds load from the SRE team while exposing other engineers to the reality of running their own systems in production.
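A simple way to make the 50% cap visible is to track how engineering time is categorized. The sketch below, using made-up weekly numbers, flags when toil exceeds the cap; in practice the hours would come from a time-tracking or ticketing system.

```python
TOIL_CAP = 0.5  # Google's guideline: at most 50% of SRE time spent on toil

def toil_fraction(toil_hours, project_hours):
    """Fraction of total engineering time spent on toil."""
    total = toil_hours + project_hours
    return toil_hours / total if total else 0.0

# Hypothetical weekly numbers pulled from a time-tracking or ticketing system.
week = {"toil_hours": 26, "project_hours": 14}

fraction = toil_fraction(**week)
if fraction > TOIL_CAP:
    print(f"Toil at {fraction:.0%} exceeds the {TOIL_CAP:.0%} cap: "
          "shed load to owning teams or prioritize automation work.")
```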
Capping toil is the second checkpoint in getting started with SRE. Stephen Thorne reiterates this point in the talk mentioned earlier:
if you’re not capping that toil and allowing them to actually go and implement that [monitoring] work, then all they’re doing is getting overloaded with toil and then they won’t be able to do any project work. The next time they need to do some things to improve the reliability of the system, they’re too overloaded. I think any org with one or a thousand SREs must be able to apply this principle. There must be this ability for the SREs to address the toil and do the project work.
After these two checkpoints, it’s up to management and leadership to form teams and set responsibilities.
Moving Towards Site Reliability Engineering
When you have SLOs, a declared cap on toil, and a plan to handle overflow, it’s time to consider what SRE looks like for your organization. There are three common models: a dedicated SRE team, SREs embedded in existing teams, or SRE practiced as a shared behavior across all engineers rather than a dedicated role.
There is no one correct answer. The best fit varies by organization size and specific goals. Consider a simple example: an 8-person team may not require a dedicated SRE, and it certainly doesn’t mandate a dedicated SRE team. Conversely, there’s an inflection point where a dedicated SRE team, or SREs embedded in existing teams, makes sense. You must consider the trade-offs before making a decision.
VictorOps sees SRE differently. They consider SRE a behavior rather than a dedicated role. Their goal is to build a culture of reliability into their engineers instead of into a specific team. They accomplished this by building a cross-functional council. Here’s Jason Hand from VictorOps in the eBook “Build the Resilient Future Faster: Creating a Culture of Reliability”:
For VictorOps, the SRE mentality would need to be central to the culture of our entire organization. The responsibility of owning the scalability and reliability of the product (VictorOps) from a customer experience point of view doesn’t rest solely on an SRE team or individual engineer. Rather than assigning the SRE role and responsibility to a specific team or individual, we chose to assemble a cross-functional panel of engineers, support leads, and product representatives referred to as the SRE council.
VictorOps came to this conclusion by surveying SREs at other companies and determining what seemed right for them. You should do the same before getting started with SRE, since implementations of SRE ideas vary wildly between organizations. There is no gold standard, just what’s effective for your organization and yields results. Learning from other teams is a great way to avoid pitfalls.
Regardless of how SRE is structured within your organization, you’ll need buy-in from leadership and engineers. Management must enforce consequences for missed SLOs and breached toil caps, and define clear boundaries between SRE and other teams. Introducing SRE can be a major organizational change, and when it is, it will only succeed if supported at the highest levels.
Next Steps
Let’s review the checkpoints we’ve established along the way to getting started with SRE. First and foremost is to establish, monitor, and report on SLOs. SLOs provide the foundation for building and maintaining reliable systems. Second is the cap on toil, which ensures SREs are focused on continuous improvements throughout the system and not on low-value toil work. Lastly, there’s the collaborative effort of documenting responsibilities and building organizational buy-in.
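For the first checkpoint, a common way to report on an SLO is as an error budget. The sketch below shows the basic arithmetic with assumed numbers; the SLO target, request counts, and failure counts are illustrative only.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Compare observed failures against the error budget implied by the SLO."""
    budget = (1 - slo_target) * total_requests      # failures the SLO allows
    consumed = failed_requests / budget if budget else float("inf")
    return budget, consumed

# Hypothetical month: 99.9% availability SLO over 10M requests, 6,500 failures.
budget, consumed = error_budget_report(0.999, 10_000_000, 6_500)
print(f"Error budget: {budget:,.0f} requests; consumed: {consumed:.0%}")
# -> Error budget: 10,000 requests; consumed: 65%
```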
Once you’re through these gates, it’s time to consider the initial goals. Jason Hand, from VictorOps, poses a series of exercises. First, ask the team what keeps them up at night. The answers bring skeletons out of the closet. That kickstarts the process and allows new SREs to navigate their responsibilities while improving reliability.
CONCLUSION
SRE is a relatively new discipline that is now gaining traction. It applies software engineering practices to operations processes and brings many advantages to running and managing large, complex production systems. To a great extent, it bridges the gap and reduces the split between developers and the operations team.