What is SRE?
Varun Kaushik
Solution Design, Consulting & Sales | Technical Pre-sales | Business Development | Commercial Architecture | Technical Account Management
In the traditional model, a sysadmin (systems administrator) fixes broken systems and keeps working through incidents and events to make the system reliable. Even so, most project teams fail to achieve the desired SLA. Hence, most companies are moving towards a different model and approach that goes hand in hand with emerging technologies and support models. The new approach causes fewer conflicts between teams and builds systems for maximum reliability and durability.
So, what is the new approach? #SRE (Site Reliability Engineering) - a systematic and automated approach to enhancing IT service delivery using standardized tools and practices.
Benjamin Treynor Sloss (VP engineering at Google Cloud) explained SRE as: “SRE is what happens when you ask a software engineer to design an operations function.” In SRE, a team of software developers builds software systems to solve complex systems problems, e.g. capacity and performance planning, disaster management and quality monitoring.
The SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). For an SRE team, there is a 50% cap on the aggregate "ops" work: tickets, on-call, manual tasks, etc. The more ops work you eliminate through code, the more time you get for serious coding.
SRE Team Responsibilities –
Tools Used by SRE Team –
Selecting the right tools is very important for managing a client's challenging environment. There are a variety of tools for each aspect of SRE: monitoring, SLOs and error budgeting, incident management, incident retrospectives, alerting, chaos engineering, and more.
Some Important Definitions –
1) SLO – Service Level Objective –
· As per Gartner - SLOs are the objectives that must be achieved, for each service activity, function and process, to provide the best opportunity for service recipient success.
· As per Wikipedia - SLOs are specific measurable characteristics of the SLA such as availability, throughput, frequency, response time, or quality. These SLOs together are meant to define the expected service between the provider and the customer and vary depending on the service's urgency, resources, and budget. SLOs provide a quantitative means to define the level of service a customer can expect from a provider.
2) Error Budget – An error budget is the amount of error that your service can accumulate over a certain period before your users start being unhappy. You can think of it as the pain tolerance for your users, but applied to a certain dimension of your service: availability, latency, and so forth. Error budgets are the tool SRE uses to balance service reliability with the pace of innovation. An error budget is 1 minus the SLO of the service. A 99.9% SLO service has a 0.1% error budget. If our service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives us a budget of 1,000 errors over that period.
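The error budget arithmetic above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `ErrorBudget` and its methods are hypothetical and not taken from any SRE tool, and it assumes the budget is counted in failed requests over a single SLO window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: request-based error budget for one SLO window."""

    slo: float           # target success ratio, e.g. 0.999 for a 99.9% SLO
    total_requests: int  # requests received during the window

    @property
    def allowed_errors(self) -> int:
        # Error budget = (1 - SLO) * requests; round() absorbs float noise.
        return round((1 - self.slo) * self.total_requests)

    def remaining(self, failed_requests: int) -> int:
        # Negative values mean the budget has been overspent.
        return self.allowed_errors - failed_requests

    def is_exhausted(self, failed_requests: int) -> bool:
        return failed_requests >= self.allowed_errors


# The example from the text: 1,000,000 requests in four weeks at 99.9%.
budget = ErrorBudget(slo=0.999, total_requests=1_000_000)
print(budget.allowed_errors)   # 1000
print(budget.remaining(400))   # 600
```

A team might use a check like `is_exhausted` to decide when to slow down releases and focus on reliability, though the exact policy is a choice each team makes for itself.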
Conclusion –
SRE establishes a healthy and productive relationship between development and operations. SRE is an enabler for maintaining massive infrastructure in an intelligent, efficient, and scalable way.