Thoughts on SRE

What exactly is SRE?

Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production. The goal is to bridge the gap between the development team that wants to ship things as fast as possible and the operations team that doesn't want anything to blow up in production.

  • Put simply, SRE is a prescriptive way to do DevOps.

  • How does site reliability engineering relate/compare to DevOps?
  • Traditionally, DevOps has been more about collaboration between developers and operations, with an emphasis on deployments. Site reliability engineering focuses more on operations and monitoring, with the goal of providing highly reliable and scalable software systems that run for long periods with minimal failures.

Roles and responsibilities of SRE

The main responsibilities of a site reliability engineer are:

  • Make deployment easier
  • Proactively monitor and review application performance
  • Ensure software has good logging and diagnostics
  • Set SLIs, SLOs, SLAs, and error budgets
  • Increase speed by taking calculated risks
  • Eliminate toil
  • Reduce the cost of failure to lower new feature cycle time

SLI, SLO and SLA?

  • SRE teams are responsible for establishing and maintaining service level indicators (SLIs), service level objectives (SLOs), service level agreements (SLAs), and error budgets for their systems, and for making sure these are met.

Service level indicators (SLIs) are the quantitative measures defined for a system, also known as "what we are measuring".

The most important SLIs are:

  1. Latency: how long it takes to return a response to a request.
  2. Availability: the fraction of the time that a service is usable.
  3. Durability: the likelihood that data will be retained over a long period of time. This is equally important for data storage systems.
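
As a rough illustration of how the first two indicators might be computed, the Python sketch below derives them from a list of request records. The `Request` type, its fields, and the function names are hypothetical, and availability is approximated here as the fraction of successful requests rather than measured over time.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float  # time taken to return a response
    ok: bool           # True if the request was served successfully


def availability(requests: Sequence[Request]) -> float:
    """Fraction of successful requests (an assumed proxy for 'time usable')."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(1 for r in requests if r.ok) / len(requests)


def latency_percentile(requests: Sequence[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


sample = [
    Request(100, True),
    Request(200, True),
    Request(300, False),
    Request(400, True),
]
print(availability(sample))            # 3 of 4 requests succeeded
print(latency_percentile(sample, 50))  # median latency by nearest rank
```

A percentile is used for latency instead of a mean because a handful of very slow requests can hide behind a healthy-looking average.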

Service level objective (SLO): a target value or range of values for a service level that is measured by an SLI.

  • An SLO is typically an internal objective that helps guide the design, operation, and management of a service.
  • The structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
  • It is a numerical target for system reliability that stakeholders agree upon.
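
The "lower bound ≤ SLI ≤ upper bound" structure can be sketched as a small check in Python. The `SLO` class and the example targets (300 ms, 99.9 percent) are illustrative assumptions, not values taken from any particular system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SLO:
    """Hypothetical SLO: an optional lower and/or upper bound on an SLI."""
    name: str
    lower: Optional[float] = None  # SLI must be >= lower, if set
    upper: Optional[float] = None  # SLI must be <= upper, if set

    def is_met(self, sli: float) -> bool:
        """True when lower <= sli <= upper, ignoring any bound left unset."""
        if self.lower is not None and sli < self.lower:
            return False
        if self.upper is not None and sli > self.upper:
            return False
        return True


# Assumed example targets, for illustration only.
latency_slo = SLO("p95 latency (ms)", upper=300)
availability_slo = SLO("availability", lower=0.999)

print(latency_slo.is_met(250))          # within the upper bound
print(availability_slo.is_met(0.9985))  # below the lower bound
```

Leaving one bound unset covers the common "SLI ≤ target" form; setting both covers the range form.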

A service level agreement (SLA) is the promise you make to your customers about your service's health.

  • With an SLA, the consumer would have a clear idea about the proposed product or service in terms of functionality, reliability, and performance.
  • For example, a simple SLA might guarantee that a SaaS application will be available 99.9 percent of the time. If the application fails to meet that level of availability, the customer's payments will be reduced by 10 percent.
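
The example SLA above can be worked through in a few lines of Python. This is a minimal sketch: the function names, the 30-day billing period, and the fee of 1000 are assumptions made for illustration, while the 99.9 percent guarantee and 10 percent reduction come from the example.

```python
def measured_availability(downtime_minutes: float, period_minutes: float) -> float:
    """Fraction of the period during which the service was usable."""
    return 1 - downtime_minutes / period_minutes


def amount_due(fee: float, availability: float,
               guaranteed: float = 0.999, reduction: float = 0.10) -> float:
    """Reduce the fee by `reduction` when availability misses the guarantee."""
    if availability < guaranteed:
        return fee * (1 - reduction)
    return fee


PERIOD_MINUTES = 30 * 24 * 60  # assumed 30-day billing period

# 60 minutes of downtime misses the 99.9 percent guarantee, so the fee is reduced.
print(amount_due(1000, measured_availability(60, PERIOD_MINUTES)))

# 30 minutes of downtime stays within the guarantee, so the full fee is due.
print(amount_due(1000, measured_availability(30, PERIOD_MINUTES)))
```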

However, once you dive into the details, SLAs, SLOs, and SLIs are clearly different types of entities:

  • An SLA is a contract.
  • An SLO is a specific goal that is defined in a contract.
  • An SLI measures the extent to which teams comply with the SLO promises they make in SLA contracts.

What is toil?

"Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." - Vivek Rau, Google.

Examples of toil involve manual intervention: manual releases, manually connecting to infrastructure to check something, manual resets, on-call response, extracting data, and manual scaling of infrastructure. Toil should be eliminated, typically through automation, because manual work is error-prone and reduces quality.

What is Error Budget?

“100% is the wrong reliability target for basically everything” — Ben Treynor

The error budget is the amount of unreliability, often expressed as time, that a service can accumulate before it breaches its SLO. This budget can be spent on shipping new features or making architectural changes. If the team spends more than the budget, there has to be a consequence. One such consequence is to stop releasing new features until the system is stable again.
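
To make the arithmetic concrete, here is a minimal Python sketch of a time-based error budget. It assumes an availability SLO and a 30-day window; the function names and the feature-freeze rule are illustrative, not a standard API. With a 99.9 percent SLO, the budget is 0.1 percent of 43,200 minutes, or about 43.2 minutes per window.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Downtime allowed in the period before the SLO is breached."""
    return (1 - slo) * period_days * 24 * 60


def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             period_days: int = 30) -> float:
    """Budget left after the downtime already incurred (negative if overspent)."""
    return error_budget_minutes(slo, period_days) - downtime_minutes


def should_freeze_features(slo: float, downtime_minutes: float,
                           period_days: int = 30) -> bool:
    """Assumed policy: stop new feature releases once the budget is overspent."""
    return remaining_budget_minutes(slo, downtime_minutes, period_days) < 0


print(round(error_budget_minutes(0.999), 1))  # roughly 43.2 minutes
print(should_freeze_features(0.999, 20))      # budget remains, keep shipping
print(should_freeze_features(0.999, 50))      # budget overspent, stabilize first
```

A stricter SLO shrinks the budget quickly: under the same assumptions, 99.99 percent leaves only about 4.3 minutes per 30 days.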

References: The resources below give more insight into SRE concepts.

  • https://dzone.com/articles/sla-vs-slo-vs-sli-understanding-the-similarities-a#
  • https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-slis-slas-and-slos
  • https://www.devopsinstitute.com/site-reliability-engineering-key-concepts-slo-error-budget-toil-and-observability/
