Site Reliability Engineering (SRE) – Top 35 questions answered

Site Reliability Engineering (SRE) is used across industries to deliver world-class service to end users. Here is an attempt to answer the top 35 questions about SRE. #sre #sitereliabilityengineering

1. What is Site Reliability Engineering (SRE)?

Site Reliability Engineering (SRE) is what you get when you start treating operations as a software problem. SRE consists of a set of principles and practices that apply aspects of software engineering to operational problems in order to make software systems more reliable and scalable.

Over the years, developers have tried to deliver features to production faster, while operations teams have tried to reduce volatility in production.

DevOps became popular because it offers a set of cultural philosophies, practices, and tools that increase the delivery velocity of software systems while reducing operational risk.

SRE was pioneered by Google to make products and services highly reliable for end users. It has since been picked up by other organizations and has become an industry standard.

Typically, SRE implements DevOps.

2. Who invented SRE?

Ben Treynor Sloss, a VP at Google, is widely regarded as the person who invented SRE. He is responsible for Google's infrastructure and user-facing systems. According to his LinkedIn intro, he leads over 8,000 engineers and is responsible for a budget of over $10B/year at Google.

3. What is the difference between SRE and DevOps?

Let's put it this way – SRE implements DevOps. All DevOps philosophies are implemented as part of an SRE implementation. The main DevOps concepts implemented through SRE are:

- Reduce operational silos – SRE mandates shared ownership

- Accept failure as normal – SRE mandates Error Budgets with targets below 100% and blameless postmortems

- Implement gradual change – SRE mandates small changes with fast rollbacks

- Leverage tooling and automation – SRE mandates Toil elimination through tooling and automation

- Measure everything – SRE mandates measuring reliability through SLIs, SLOs and Error Budgets

4. What is the role of SRE?

The role of SRE is to keep the focus on what matters to customers: making the systems and services that support them available and reliable.

Note – The main role of DevOps is to increase software delivery velocity.

5. Is SRE more powerful than DevOps?

Not necessarily. In simple terms, SRE implements DevOps. There is no SRE without DevOps. #devops

6. What are the key fundamentals of SRE?

An SRE team keeps systems reliable, scalable and efficient through engineering solutions while guiding system architecture. SRE is a change of mindset toward operations.

The main fundamentals of SRE include:

- Enabling observability

- SLIs, SLOs and SLAs

- Error Budgets

- Toil elimination

- Chaos engineering

- Deployment automation

- Auto capacity provisioning

- Microservice architecture and decoupled subsystems

7. What are the key questions a typical SRE team tries to answer?

A typical SRE team tries to answer the following questions:

- How much reliability does an application require?

- When is the right time to invest in features vs. resiliency?

- How many alerts are generated for the same incident?

- How quickly does an incident reach the right team?

- How many times does an incident recur?

- When downstream systems are failing, can we still manage the user experience?

- etc.

8. What are the key accelerators of an SRE implementation?

All of the enablers below accelerate an SRE implementation:

- Continuous Integration / Continuous Delivery (CI/CD) pipelines

- Infrastructure as code solutions

- Observability solutions

- Automated testing

- Cloud

- Microservices architecture

- Automation tools

- Chaos engineering

- AIOps

9. What are the different modes of SRE implementation?

- Kitchen Sink, a.k.a. "Everything SRE" – The scope of work is typically unbounded. It is often the result of organic growth.

- Infrastructure – The scope of work typically focuses on infrastructure. Common implementations include CI/CD, Infrastructure as Code, etc.

- Tools – The scope of work typically focuses on tooling. Common implementations include building software to measure, maintain and improve system reliability, or any other aspect of SRE such as capacity planning.

- Product / Application – The scope of work typically focuses on improving the reliability of an application or business area as part of feature development.

- Embedded – Typically, SRE teams are embedded with their developer counterparts.

- Consulting – The implementation is very similar to the embedded approach. The main difference is that a consulting team avoids changing customer code and configuration.

10. What are the key metrics for measuring SRE success?

- Service Level Objectives (SLOs)

- Error Budgets

- Golden Signals

    - Latency

    - Traffic

    - Errors

    - Saturation

- Lead time for changes

- Deployment frequency

- Time to restore service

- Change failure rate
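The last four metrics in the list above can be derived from a simple deployment log. The following is a minimal sketch, not a reference implementation: the `Deployment` record, its field names and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def delivery_metrics(deployments, window_days):
    """Summarize the four delivery metrics for a non-empty reporting window."""
    failures = [d for d in deployments if d.failed]
    restored = [d for d in failures if d.restored_at is not None]
    return {
        "lead_time": _mean([d.deployed_at - d.committed_at for d in deployments]),
        "deploys_per_day": len(deployments) / window_days,
        "change_failure_rate": len(failures) / len(deployments),
        "time_to_restore": (
            _mean([d.restored_at - d.deployed_at for d in restored])
            if restored else None
        ),
    }


# Hypothetical week with four deployments, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
week = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=6), failed=True,
               restored_at=t0 + timedelta(hours=7)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
metrics = delivery_metrics(week, window_days=7)
print(metrics)
```

In this made-up week the mean lead time is 5 hours, one deployment in four failed, and service was restored 1 hour after the failed deployment.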

11. What is a Service Level Indicator (SLI)?

Service Level Indicators are indicators that tell you how well your service is doing.

Example –

Error Rate = errors / total requests

SLI = [Good Events] / [Valid Events]
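As a quick illustration of the two formulas above, here is a minimal sketch that computes an error rate and an availability-style SLI. The request counts are made-up numbers and the function name `sli` is only illustrative.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a ratio in [0, 1]: good events / valid events."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return good_events / valid_events


# Hypothetical numbers: 1,000,000 valid requests, 1,200 of which failed.
total_requests = 1_000_000
errors = 1_200

error_rate = errors / total_requests
availability_sli = sli(total_requests - errors, total_requests)

print(f"error rate:   {error_rate:.4%}")        # 0.1200%
print(f"availability: {availability_sli:.4%}")  # 99.8800%
```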

#sli

12. What is a Service Level Objective (SLO)?

A Service Level Objective is a target for an SLI; it shows, aggregated over a period of time, how the service is doing against that target.

SLOs are also referred to as the Customer Happiness Test.

Meets target SLO -> happy customers

Misses target SLO -> sad customers
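The happy/sad test above boils down to comparing a measured SLI with the target. A minimal sketch, assuming a ratio-style SLI and a hypothetical 99.9% target:

```python
def meets_slo(good_events: int, valid_events: int, slo_target: float) -> bool:
    """Return True when the measured SLI is at or above the SLO target."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events / valid_events >= slo_target


SLO_TARGET = 0.999  # hypothetical target: 99.9% of requests succeed

print(meets_slo(999_500, 1_000_000, SLO_TARGET))  # True  -> happy customers
print(meets_slo(998_000, 1_000_000, SLO_TARGET))  # False -> sad customers
```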

#slo

13. What is a Service Level Agreement (SLA)?

A Service Level Agreement is the service agreement you sign with your customers.

#sla

14. What is the difference between SLO and SLA?

A typical SLO is much tighter than the corresponding SLA. Obviously, you need to identify failures before your customers complain.

15. What is a bad SLO?

A 100% SLO is the wrong target for basically everything. An SLO that goes beyond what your customers expect is typically a bad SLO too.

16. What are Error Budgets?

Error Budget = 100% - SLO target

Example -

Error Rate SLI = errors / total requests

An SLO of 99.9% successful requests can be set for a 4-week period. The Error Budget for that period is then 0.1%. In general, this means an error rate of up to 0.1% during the period is considered acceptable failure.

Error Budgets usually come with consequences. The moment you breach the Error Budget, the consequence can be a change freeze in the production environment for a period of time or until you fix the issue.
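To make the arithmetic concrete, the sketch below works the 99.9% / 4-week example through in numbers. The traffic volume is an assumption, and the downtime figure only applies if the SLI is time-based rather than request-based.

```python
def error_budget(slo_target):
    """Error budget as a fraction: 1 - SLO target (100% - SLO)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1 - slo_target


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent; negative once it is breached."""
    allowed_failures = error_budget(slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


SLO_TARGET = 0.999             # 99.9%, as in the example above
WINDOW_REQUESTS = 10_000_000   # hypothetical traffic over the 4-week window
WINDOW_MINUTES = 28 * 24 * 60  # 4 weeks expressed in minutes

allowed_failures = error_budget(SLO_TARGET) * WINDOW_REQUESTS
# If the SLI were time-based instead, the same budget as minutes of full downtime:
allowed_downtime_minutes = error_budget(SLO_TARGET) * WINDOW_MINUTES

print(f"allowed failed requests: {allowed_failures:,.0f}")           # 10,000
print(f"allowed downtime: {allowed_downtime_minutes:.2f} minutes")   # 40.32 minutes
print(f"{budget_remaining(SLO_TARGET, WINDOW_REQUESTS, 4_000):.0%}")   # 60% left
print(f"{budget_remaining(SLO_TARGET, WINDOW_REQUESTS, 12_000):.0%}")  # -20%, breached
```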

#slo #errorbudgets

17. What are the SRE guidelines for monitoring and alerting?

Enabling observability is the key concept when it comes to SRE.

The four Golden Signals are the key areas in which SRE requires production teams to enable observability. The four areas are:

- Latency

- Traffic

- Errors

- Saturation
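As an illustration, the four signals can be summarized from a window of request records. This is a rough sketch under stated assumptions: the `RequestRecord` shape, the choice of p95 for latency, and passing CPU utilization in as the saturation figure are all choices made here for the example, not something the signals themselves prescribe.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def golden_signals(records, window_seconds, cpu_utilization):
    """Summarize the four golden signals for one non-empty observation window."""
    durations = sorted(r.duration_ms for r in records)
    server_errors = sum(1 for r in records if r.status >= 500)
    return {
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(records) / window_seconds,
        "error_rate": server_errors / len(records),
        # Assumed to be measured on the host; it cannot be derived from request logs.
        "saturation": cpu_utilization,
    }


# Hypothetical window: 10 requests in 10 seconds, two of them server errors.
records = [RequestRecord(duration_ms=10.0 * i, status=200) for i in range(1, 9)]
records += [RequestRecord(90.0, 500), RequestRecord(100.0, 503)]
signals = golden_signals(records, window_seconds=10, cpu_utilization=0.65)
print(signals)
```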

18. What is Toil management?

Toil generally refers to manual, repetitive work. Toil management means documenting the toil, identifying candidates for automation and automating them to remove manual intervention.

19. Can we eliminate Toil?

Yes. And we must. Toil is typically the manual work we do. If the same manual work is attempted twice and falls into the category of repetitive work, you should start planning to automate it. At first you may not be able to automate it fully; the plan is to automate as much as possible and simplify the remaining work. There are great tools that support automation initiatives, and you should leverage them. Runbook automation, monitoring, deployment and recurring workarounds are typically the high-profile automation candidates in any operations engagement.

20. What are blameless postmortems?

Typically, humans are not the problem. It is the system that allows humans to make mistakes.

Blameless postmortems focus on identifying the actual root cause, and then on how to make systems reliable enough to withstand or avoid the failure.

Example - An administrator runs a command that removes a file system, causing an outage in the production environment. Ideally, the system should be robust enough not to allow the file system to be removed. A blameless postmortem in this instance focuses on how to enhance the system to prevent future failures rather than on blaming an individual.

21. What is Incident Command?

Incident Command is a standardized approach to the command, control and coordination of an emergency response.

Each project team needs to come up with an Incident Command workflow based on the nature of its work.

22. What is load balancing?

Load balancing means efficiently distributing incoming traffic across a group of backend services. It is important because the load balancer could be your first point of failure.
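One of the simplest distribution strategies is round robin with health awareness. The sketch below is only an illustration of the idea: the class name, the backend names and the manual `mark_down` / `mark_up` calls are assumptions, and a production load balancer would run its own health checks.

```python
class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked unhealthy."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._index = 0

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call so we never loop forever.
        for _ in range(len(self._backends)):
            backend = self._backends[self._index]
            self._index = (self._index + 1) % len(self._backends)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends available")


# Hypothetical backend names.
lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
print([lb.next_backend() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']

lb.mark_down("app-2")                         # e.g. after a failed health check
print([lb.next_backend() for _ in range(3)])  # ['app-3', 'app-1', 'app-3']
```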

23. What is auto capacity provisioning?

Automating the entire capacity provisioning workflow is known as auto capacity provisioning. Example - in a typical cloud setup, if memory is running short, the system automatically increases the memory based on a predefined trigger.
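The decision at the heart of such a trigger can be sketched as a small function. Note the assumptions: this example scales the number of replicas rather than the memory of a single instance, uses a proportional rule of the kind horizontal autoscalers commonly apply, and the 60% target and the 1 to 10 replica bounds are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_pct, target_pct,
                     min_replicas=1, max_replicas=10):
    """Size the fleet so that average utilization moves toward the target.

    Utilization values are whole percentages (for example 90 for 90%).
    """
    if current_replicas < 1 or target_pct <= 0:
        raise ValueError("current_replicas must be >= 1 and target_pct > 0")
    desired = math.ceil(current_replicas * current_pct / target_pct)
    return max(min_replicas, min(max_replicas, desired))


# Memory at 90% against a 60% target: scale out from 4 to 6 replicas.
print(desired_replicas(4, current_pct=90, target_pct=60))  # 6
# Memory at 30%: scale in from 4 to 2 replicas.
print(desired_replicas(4, current_pct=30, target_pct=60))  # 2
# The upper bound caps runaway scaling.
print(desired_replicas(8, current_pct=95, target_pct=60))  # 10
```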

24. What is the difference between monitoring and observability?

Monitoring refers to the software or hardware components used to monitor system resources and performance in a computer system. It consists of systematic data collection and analysis to keep applications and infrastructure operating optimally. The key aspect to highlight is that monitoring is based on gathering a predefined set of metrics or logs across an IT estate.

Observability refers to the ability to collect data about the IT estate, including metrics from software execution, the internal state of modules, and the communication between components. It is essentially the tooling or technical solution that allows a team to actively debug the IT estate. Observability is based on exploring properties and patterns that are not defined in advance. To facilitate observability, SRE teams typically use a wide range of logging and tracing techniques and tools.

While monitoring is based on predefined metrics, observability is based on data that may or may not be defined in advance. In summary, observability covers the three main areas below.

1) Structured logging

2) Metrics (aggregate data)

3) Traces
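To show what the first area looks like in practice, here is a minimal structured-logging sketch built on Python's standard `logging` module. The logger name `checkout`, the `context` key and the field names `trace_id` and `latency_ms` are assumptions for illustration; real setups often use a dedicated library instead.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become top-level keys,
        # so they can be filtered and aggregated later.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output via the root logger

logger.info(
    "payment authorized",
    extra={"context": {"trace_id": "abc123", "latency_ms": 182}},
)
```

Because every line is machine-readable and carries a trace identifier, the same log stream can be queried for patterns nobody defined in advance, which is the property that separates observability from predefined monitoring.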

25. What is NALSD?

NALSD – Non-Abstract Large System Design

NALSD describes a skill critical to SRE: the ability to assess, design and evaluate large systems. NALSD involves capacity planning, component isolation, and graceful system degradation, all of which are key to highly available production systems.

26. What is a canary release?

A canary release is a deployment approach that rolls out new code or features to a subset of users as an initial test. If the initial test is successful, you can roll the change out to everyone; if not, you can roll it back without impacting the larger user base.
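One common way to pick the subset of users is deterministic hash bucketing, so that a given user always sees the same version while the percentage is ramped up. The sketch below assumes that technique; the function names and the user ID format are illustrative only.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically decide whether a user belongs to the canary cohort."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < canary_percent


def route(user_id: str, canary_percent: int) -> str:
    """Return which release should serve this user."""
    return "canary" if in_canary(user_id, canary_percent) else "stable"


# Hypothetical rollout: start at 5%, then widen to 50% if the canary looks healthy.
users = [f"user-{i}" for i in range(10_000)]
for percent in (5, 50):
    share = sum(in_canary(u, percent) for u in users) / len(users)
    print(f"{percent}% target -> {share:.1%} of users on canary")
```

Rolling back is then a matter of setting the percentage to 0, and because a user's bucket is fixed, anyone in the 5% cohort stays in the canary when the rollout widens to 50%.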

27. What is the effort distribution of a typical SRE team?

Google generally mandates that at least 50% of an SRE team's effort be spent on work that actually enhances reliability, that is, work on areas or features that strengthen the reliability of the application. The SRE team can spend the remaining effort, at most 50%, on typical operational work.

28. How does SRE influence change management?

Change management activities are usually decoupled from the user experience of applications. Typically, if an issue is impacting user experience, developers may still deliver features, and change management has no explicit control to stop them (unless it is a major service-impacting issue such as a complete outage).

What SRE brings to the table is Error Budgets. If an Error Budget is breached, that has consequences for feature delivery (based on what you agree with stakeholders). In some instances, this can put feature delivery on hold until you fix the actual SLO breaches.
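Such an agreement can be encoded as a simple release gate in the delivery pipeline. The sketch below is one possible policy, not a standard: the 75% "caution" and 100% "freeze" thresholds, the function names and the traffic figures are all hypothetical and would come from what you agree with stakeholders.

```python
def budget_consumed(slo_target, total_requests, failed_requests):
    """Fraction of the window's error budget already spent (1.0 = fully spent)."""
    allowed_failures = (1 - slo_target) * total_requests
    return failed_requests / allowed_failures


def release_decision(consumed, warn_at=0.75, freeze_at=1.0):
    """Map error budget consumption to a change-management decision."""
    if consumed >= freeze_at:
        return "freeze"   # budget breached: hold feature releases
    if consumed >= warn_at:
        return "caution"  # budget nearly spent: only low-risk changes
    return "proceed"      # budget available: keep shipping


# Hypothetical window: 99.9% SLO, 1,000,000 requests so far.
for failed in (400, 800, 1_200):
    consumed = budget_consumed(0.999, 1_000_000, failed)
    print(f"{failed} failures -> {release_decision(consumed)}")
```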

29. The SRE mindset is to embrace risk; how far should you go?

Keep systems volatile until the Error Budget starts to be breached. The idea is to deliver features and accept volatility until the Error Budget is exhausted, then settle down and fix the issues. After that, start pushing change again.

The golden rule of today's business world is speed, which is the deciding factor everywhere. You simply cannot have a non-volatile system. Making systems volatile in a controlled manner while continuously making them more reliable is the way to go.

30. What is chaos engineering?

Chaos engineering is often described as the art of breaking things on purpose.

Formally, it is the process of testing a distributed computing system to ensure it can withstand unexpected disruptions. It draws on concepts from chaos theory, which focuses on random and unpredictable behavior. Netflix pioneered chaos engineering with an early tool it developed called Chaos Monkey.
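At its smallest scale, the idea can be tried inside a single process: inject failures into a dependency call and check that the caller degrades gracefully. The sketch below is a toy illustration under stated assumptions and is unrelated to how Chaos Monkey itself works; `fetch_recommendations` is a stand-in for a real downstream call, and all names are hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_recommendations(user_id):
    """Stand-in for a call to a downstream service."""
    return [f"item-for-{user_id}"]


def recommendations_with_fallback(fetch, user_id):
    """Behaviour under test: serve an empty list when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return []  # degrade gracefully instead of failing the whole page


# Experiment: fail 30% of downstream calls and confirm every request is still served.
flaky_fetch = with_chaos(fetch_recommendations, failure_rate=0.3,
                         rng=random.Random(7))
results = [recommendations_with_fallback(flaky_fetch, f"u{i}") for i in range(100)]
print(f"served {len(results)} requests, {results.count([])} used the fallback")
```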

31. Why is software engineering so important within SRE?

SRE teams are heavily involved in making systems reliable, so naturally they need in-depth knowledge of "the code".

SRE is all about making systems reliable, which involves knowing the code and building reliability enhancements into features.

Golden rule - Not all solutions are infrastructure automation. Hands-on development experience is a must for an SRE.

32. What tools do typical SRE teams use?

SRE teams use the typical DevOps tools.

- Monitoring, observability and AIOps tools – APM tools, ELK, Splunk, Dynatrace, Datadog

- Incident response tools – PagerDuty

- Real-time communication – Slack, Teams

- Infrastructure as code and configuration management – Terraform / Ansible

- Deployment automation, scaling and orchestration tools – Jenkins / Spinnaker / Kubernetes

- Cloud-native tools

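33. What is the difference between Platform Engineering and SRE?

Platform engineering typically refers to designing and building toolchains and workflows that enable self-service capabilities for software engineering teams (mainly in cloud-native environments). SRE, by contrast, focuses on keeping the systems and services that customers depend on available and reliable.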

34. What are the key pillars of an SRE implementation?

The key pillars of SRE are:

- Enabling monitoring and observability

- Alerting based on SLOs and Error Budgets

- Automating the incident response and resolution workflow

- Deployment automation through CI/CD

- Blameless postmortems

- Test automation

- Auto capacity provisioning

- Development of reliability features

35. What are the best practices for SRE documentation?

Typically, SLIs, SLOs, Error Budgets and observability metrics have to be properly baselined, documented and refined periodically.

SRE ways of working and standards have to be continuously documented, audited and refined.
