Thoughts on SRE

Vamshi Yemula

Senior DevOps Engineer | SRE | Kubernetes, Docker, Terraform, AWS and CI/CD Specialist | Driving Reliability & Performance

Published: May 23, 2022

What exactly is SRE?

Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production. The goal is to bridge the gap between the development team that wants to ship things as fast as possible and the operations team that doesn't want anything to blow up in production.

Put simply:

SRE is a prescriptive way to do DevOps

How does site reliability engineering relate/compare to DevOps?
Traditionally, DevOps has been more about collaboration between developers and operations, with a strong focus on deployments. Site reliability engineering focuses more on operations and monitoring, with the aim of providing highly reliable and scalable software systems that run for long periods with minimal failures.

Roles and responsibilities of SRE

The main responsibilities of an SRE are:

Make deployment easier
Proactively monitor and review application performance
Ensure software has good logging and diagnostics
Set SLIs, SLOs, SLAs and error budgets
Increase speed by taking calculated risks
Eliminate toil
Reduce the cost of failure to lower new feature cycle time

SLI, SLO and SLA?

SRE teams are responsible for establishing and maintaining service level indicators (SLIs), objectives (SLOs), agreements (SLAs), and error budgets for their systems, and for making sure these are met.

Service level indicators (SLIs) are the quantitative measures defined for a system, also known as "what we are measuring".

The most important SLIs are:

Latency: how long it takes to return a response to a request.
Availability: the fraction of the time that a service is usable.
Durability: the likelihood that data will be retained over a long period of time. This is equally important for data storage systems.
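
As a rough illustration, the first two SLIs can be computed from a list of request records. This is only a minimal sketch: the Request type, its field names, and the sample numbers are made up for the example, and availability is approximated as the fraction of successful requests rather than the fraction of time the service is usable.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical request record; the field names are assumptions."""
    latency_ms: float  # how long it took to return a response
    ok: bool           # True if the request was served successfully


def availability(requests: List[Request]) -> float:
    """Fraction of requests that succeeded (a request-based proxy for availability)."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.ok for r in requests) / len(requests)


def latency_percentile(requests: List[Request], pct: float) -> float:
    """Nearest-rank latency percentile, e.g. pct=95 for p95."""
    if not requests:
        raise ValueError("no requests to measure")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Made-up sample data, for illustration only.
sample = [
    Request(latency_ms=120, ok=True),
    Request(latency_ms=80, ok=True),
    Request(latency_ms=300, ok=False),
    Request(latency_ms=95, ok=True),
]
print(availability(sample))            # 0.75
print(latency_percentile(sample, 50))  # 95
```

In practice these numbers usually come from a monitoring system rather than from hand-written code; the sketch only shows what is being measured.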

Service level objective (SLO): a target value or range of values for a service level that is measured by an SLI.

An SLO is typically an internal objective that helps guide the design, operation, and management of a service.
The structure of an SLO is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
SLOs are numerical targets for system reliability that are agreed upon between stakeholders.
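
The "SLI ≤ target, or lower bound ≤ SLI ≤ upper bound" structure can be written as a small check. This is a sketch only: the function name meets_slo and the example thresholds are assumptions, not part of any standard tool.

```python
from typing import Optional


def meets_slo(sli: float,
              lower: Optional[float] = None,
              upper: Optional[float] = None) -> bool:
    """Return True if lower <= sli <= upper, ignoring any bound that is None."""
    if lower is None and upper is None:
        raise ValueError("an SLO needs at least one bound")
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


# Hypothetical latency SLO: p95 latency must be at most 300 ms.
print(meets_slo(250.0, upper=300.0))   # True
# Hypothetical availability SLO: at least 99.9% of requests succeed.
print(meets_slo(0.9985, lower=0.999))  # False
```

A latency SLO typically uses only an upper bound, while an availability SLO typically uses only a lower bound.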

Service level agreement (SLA): the promise you make to your customers about your service's health.

With an SLA, the consumer has a clear idea of what to expect from the product or service in terms of functionality, reliability, and performance.
For example, a simple SLA might guarantee that a SaaS application will be available 99.9 percent of the time. If the application fails to meet that level of availability, the customer's payments will be reduced by 10 percent.
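
The example SLA above can be expressed as a short calculation. This is a sketch under stated assumptions: the function name monthly_payment, the billing period, and the amounts are hypothetical, and only the 99.9 percent target and the 10 percent reduction come from the example.

```python
def monthly_payment(base_payment: float,
                    measured_availability_pct: float,
                    sla_target_pct: float = 99.9,
                    penalty_pct: float = 10.0) -> float:
    """Return what the customer pays after applying the SLA penalty, if any."""
    if not 0 <= measured_availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    if measured_availability_pct >= sla_target_pct:
        return base_payment  # SLA met: no reduction
    return base_payment * (100 - penalty_pct) / 100  # SLA missed: reduce payment


print(monthly_payment(1000.0, 99.95))  # 1000.0 (SLA met)
print(monthly_payment(1000.0, 99.5))   # 900.0 (SLA missed, 10 percent off)
```

Real SLAs often define tiered credits and exclusions such as planned maintenance, which this sketch leaves out.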

However, once you dive into the details, SLAs, SLOs, and SLIs are clearly different types of entities:

An SLA is a contract.
An SLO is a specific goal that is defined in a contract.
An SLI measures the extent to which teams comply with the SLO promises they make in SLA contracts.

What is toil?

"Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." - Vivek Rau, Google

Examples of toil include anything that needs manual intervention: manual releases, connecting to infrastructure by hand to check something, manual resets, routine on-call responses, extracting data by hand, manual scaling of infrastructure, and so on. We need to eliminate toil because manual work is error-prone and reduces quality.

What is an error budget?

“100% is the wrong reliability target for basically everything” — Ben Treynor

An error budget is the amount of time within a given period during which the service is allowed to be degraded or unavailable while still meeting its SLO. This is the budget that teams spend when they bring in new features or make architectural changes. If a team spends more than its budget, there has to be a consequence. One such consequence is to stop shipping new features until the system is stable again.
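
To make the arithmetic concrete, here is a minimal sketch of a time-based error budget. The function names, the 30-day window, and the downtime figures are assumptions chosen for illustration; teams may also count budgets in failed requests instead of minutes.

```python
def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    if not 0 < slo_pct <= 100:
        raise ValueError("slo_pct must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_pct) / 100


def budget_remaining(slo_pct: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after subtracting downtime already spent; negative if overspent."""
    return error_budget_minutes(slo_pct, window_days) - downtime_minutes


def should_freeze_features(slo_pct: float, downtime_minutes: float,
                           window_days: int = 30) -> bool:
    """One possible policy: stop new features once the budget is overspent."""
    return budget_remaining(slo_pct, downtime_minutes, window_days) < 0


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(99.9), 1))
print(should_freeze_features(99.9, downtime_minutes=50))  # True: budget overspent
print(should_freeze_features(99.9, downtime_minutes=10))  # False: budget left
```

The freeze rule is just the consequence mentioned above turned into code; what actually happens when the budget runs out is a policy decision for each team.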

References: the resources below give more insight into SRE concepts.

https://dzone.com/articles/sla-vs-slo-vs-sli-understanding-the-similarities-a#
https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-slis-slas-and-slos
https://www.devopsinstitute.com/site-reliability-engineering-key-concepts-slo-error-budget-toil-and-observability/
