Site Reliability Engineering: Fundamental Concepts And How To Put Them In Practice

KWAN

We're for #TechTalentDoneRight | Career Coaching | Tech Recruitment

Published: December 28, 2023

What is Site Reliability Engineering (SRE)? What are the fundamental principles of this discipline? How is continuous improvement applied?

In simple terms, and drawing on the book published by its creators, SRE is an approach to systems operations that originated at Google and combines software engineering principles with traditional IT operations practices.

Essentially, SRE’s major objective is to create reliable and resilient systems, ensuring a positive experience for users. The SRE team is responsible for managing and maintaining essential systems with the purpose of ensuring the functionality and availability of critical business systems, aiming to minimize the impact of potential failures and safeguard the business.

The Four Core Principles of SRE

To achieve these objectives, SRE teams rely on four fundamental principles:

1. Measurement of SLIs, SLOs, and Error Budgets

a) Service Level Indicators (SLIs) are metrics that quantify the quality of a system, such as the average response time of an API.

b) Service Level Objectives (SLOs) are goals established for SLIs. For example, keeping an API’s response time below 100ms for 99% of requests over the course of a week.

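c) Error Budgets are the amount of unreliability an SLO permits: the share of requests or time, equal to 100% minus the SLO target, in which the system may miss its objective. With the 99% target above, up to 1% of the week's requests may exceed 100ms before the budget is exhausted.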
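The three measurements above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the function name `evaluate_latency_slo`, the `SloReport` fields, and the sample data are all hypothetical, and the example assumes the error budget is counted as the number of requests allowed to miss the threshold, that is, (1 - target) of all requests in the window.

```python
import math
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float             # fraction of requests faster than the threshold
    slo_met: bool          # True when the SLI is at or above the target
    budget_total: int      # requests allowed to miss the threshold
    budget_used: int       # requests that actually missed it
    budget_remaining: int  # negative when the budget is overspent


def evaluate_latency_slo(latencies_ms, threshold_ms=100.0, target=0.99):
    """Evaluate a latency SLO over one window of request latencies."""
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no requests in the window")

    # SLI: share of requests that finished under the latency threshold.
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    bad = total - good
    sli = good / total

    # Error budget (assumed definition): (1 - target) of all requests.
    # round() guards against floating-point noise before flooring.
    budget_total = math.floor(round(total * (1 - target), 9))

    return SloReport(
        sli=sli,
        slo_met=sli >= target,
        budget_total=budget_total,
        budget_used=bad,
        budget_remaining=budget_total - bad,
    )


# Hypothetical window: 1000 requests, 5 of them slower than 100 ms.
report = evaluate_latency_slo([50.0] * 995 + [250.0] * 5)
print(report)
```

With 1000 requests and a 99% target, the budget is 10 slow requests; the sample window spends 5 of them, so the SLO is met and half the budget remains.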

2. Automation

Automation is a key tool the SRE team uses to handle repetitive, routine tasks (toil). It reduces the likelihood of human error and frees the team to spend its time on more complex and meaningful work instead of resolving the same problems by hand.

3. Controlled Escalation

The SRE team implements changes gradually and in a controlled manner to mitigate risk. If a change makes the environment unavailable, the team can quickly roll it back to a stable state.
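A gradual rollout with rollback can be sketched as a loop over traffic stages. The sketch below is illustrative only: `staged_rollout`, the stage percentages, the 1% error-rate limit, and the `error_rate_at` callback (which stands in for a real metrics query) are all assumptions, not part of any specific deployment tool.

```python
def staged_rollout(stages, error_rate_at, max_error_rate=0.01):
    """Widen a change stage by stage; stop and roll back at the first unhealthy stage.

    stages: increasing percentages of traffic, e.g. [1, 10, 50, 100]
    error_rate_at: callable returning the observed error rate at a percentage
                   (hypothetical stand-in for a monitoring query)
    """
    history = []
    for percent in stages:
        observed = error_rate_at(percent)
        history.append((percent, observed))
        if observed > max_error_rate:
            # Unhealthy: stop widening and report where the rollback happened.
            return {"status": "rolled_back", "failed_at": percent, "history": history}
    return {"status": "completed", "failed_at": None, "history": history}


# Hypothetical change that only misbehaves once it reaches half of the traffic.
def faulty_change(percent):
    return 0.002 if percent < 50 else 0.08


result = staged_rollout([1, 10, 50, 100], faulty_change)
print(result["status"], result["failed_at"])
```

Because the faulty change is caught at the 50% stage, it never reaches the remaining traffic, which is the point of rolling out in steps.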

4. Culture of Learning from Mistakes

Instead of avoiding errors at all costs, the SRE team sees mistakes as opportunities to learn and improve the system. This involves incident analysis, problem resolution, and documentation to prevent similar problems in the future (Postmortem Culture Implementation).

The SRE approach has demonstrated success at both Google and other tech companies. It promotes closer collaboration between operations and development teams, creating a culture of trust and knowledge sharing. However, SRE does not follow a single process. Each company should adapt the SRE principles to its own needs and infrastructure, and it is critical to understand that reliability is a shared responsibility rather than the job of a single team.

Next, we will delve deeper into the principles mentioned above, aiming to demonstrate how these principles are applied in the routine of an SRE team.

Continuous Improvement and Reliability Engineering

An SRE team is always looking to improve system reliability through an engineering-based approach. This implies not only reacting to incidents, but also finding ways to avoid them proactively.

Incident analysis, known as a “Postmortem”, is a powerful tool for any team that practices SRE. When an incident occurs, the team conducts a detailed analysis to understand the root causes and identify opportunities for improvement. This approach allows the team to learn from past mistakes and implement changes to avoid similar problems in the future.

A culture of learning from mistakes is vital to SRE success. Instead of punishing failures, the organization should encourage an open culture, known as a “Blameless Culture”, where mistakes are seen as valuable learning opportunities. This creates an environment where team members share experiences and insights, which steadily improves system reliability and team engagement and removes the pressure of never being allowed to make a mistake.

Resilience is a central characteristic of the systems managed by SRE teams. The team assumes that failures will happen and works so that the system can recover from them consistently, minimizing their impact.

This involves designing redundant, distributed, and fault-tolerant architectures. The SRE team must identify single points of failure and implement solutions to mitigate those risks. Stress tests and failure simulations are crucial to ensure the system can handle adverse situations.
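The effect of redundancy and of a single point of failure can be estimated with two standard availability formulas. The sketch below assumes that components fail independently, which real systems often violate, and the function names and the 99% figures are hypothetical. Redundant replicas are unavailable only when all of them are down, while a chain of dependencies is available only when every link is up.

```python
def parallel_availability(replica_availability, replicas):
    """Availability of N redundant replicas, assuming independent failures.

    The group is down only when every replica is down: 1 - (1 - a) ** n.
    """
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1.0 - (1.0 - replica_availability) ** replicas


def serial_availability(component_availabilities):
    """Availability of a chain in which every component is required."""
    result = 1.0
    for availability in component_availabilities:
        result *= availability
    return result


# Hypothetical service: a replicated web tier in front of a single database.
web_tier = parallel_availability(0.99, replicas=2)   # redundant
database = parallel_availability(0.99, replicas=1)   # single point of failure
print(f"web tier: {web_tier:.4%}")
print(f"whole service: {serial_availability([web_tier, database]):.4%}")
```

Under these assumptions, two 99% replicas reach roughly 99.99% together, yet the service as a whole stays just under 99% because the unreplicated database dominates the result.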

Scalability is a constant concern for growing systems in today's dynamic and demanding market. The SRE team should use automation whenever possible so that the system can adjust dynamically to user demand and to changes in the business or infrastructure.

Scaling automation allows the system to automatically increase or decrease its capacity based on performance metrics. This ensures that the system remains available even during traffic peaks.
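One simple way to turn a performance metric into a capacity decision is a proportional rule, similar in shape to the one Kubernetes documents for its Horizontal Pod Autoscaler. The sketch below is an illustration under assumptions: the function name, the CPU-utilization metric, and the replica limits are hypothetical, and a production autoscaler would also add stabilization windows and tolerances.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to metric pressure.

    desired = ceil(current_replicas * current_metric / target_metric),
    clamped to the [min_replicas, max_replicas] range.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical readings: average CPU utilization in percent, target of 60%.
print(desired_replicas(4, current_metric=90, target_metric=60))   # traffic peak
print(desired_replicas(4, current_metric=30, target_metric=60))   # quiet period
```

The clamp matters as much as the formula: the upper bound caps cost during an extreme spike, and the lower bound keeps a minimum of capacity available when traffic is low.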

SRE teams therefore foster a culture of collaboration and transparency between operations and development teams. Instead of working separately, these teams communicate regularly and share knowledge to meet the system's reliability and availability objectives in line with business needs.

This collaborative effort involves the mutual definition of goals and priorities for the service. The development and operations teams synergize their efforts to set realistic and achievable Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Through this joint definition, the SRE team gains a clear understanding of user expectations and the essential requirements to fulfill them.

It is also worth noting that a successful SRE implementation depends directly on the evolution of organizational culture and demands a cultural transformation within the organization. Leadership must actively support the adoption of the SRE approach, fostering a culture of trust where mistakes are viewed as learning opportunities and the search for continuous improvement is encouraged.

Proper training of the SRE team and other teams involved is paramount. This ensures that everyone has the necessary skills to effectively implement the practices outlined for a successful SRE implementation.

What is Site Reliability Engineering (SRE) – Final Considerations

In short, SRE is an innovative synthesis of software engineering, operations, and IT infrastructure, aimed at the reliability and availability of scalable systems. How? By proactively measuring SLIs, SLOs, and Error Budgets; automating routine tasks and implementing changes gradually and in a controlled manner; and fostering a culture of learning from errors. This way, the SRE approach becomes highly effective, ensuring reliability and a positive user experience, which can provide a competitive advantage in our increasingly digitized environment.

The SRE discipline is comprehensive and encompasses many aspects. In upcoming articles, we will delve into its fundamental principles and give detailed examples, so you can understand how to pursue excellence in its implementation and combine it with other processes used by high-performance teams, such as DevOps and Agile methodologies.

For now, you can start exploring these topics by reading this article about the agile manifesto and how to build an agile mindset.

Article originally published at https://kwan.com/blog/site-reliability-engineering-fundamental-concepts-and-how-to-put-them-in-practice/ on December 22, 2023.
