SRE Concepts series Part 1

I have been asked many times about certain SRE concepts, so I am writing a series on 15 topics that feature in the Google SRE book. These concepts are not just for SRE; you can use them in any IT environment. In this first part, I will briefly discuss what SRE is. After that, in no particular order, I will cover the following topics and concepts.

  • Risk
  • Time-based versus aggregated availability
  • Error budgets
  • Service Level Indicators
  • Service Level Objectives
  • Toil
  • Black-box monitoring
  • White-box monitoring
  • The value of automation
  • Continuous build and deployment
  • Stability versus agility
  • Root cause analysis
  • Capacity planning
  • Break your systems
  • Testing in production


What is Site Reliability Engineering?

Site Reliability Engineering, or SRE for short, is a discipline that applies software engineering practices to infrastructure and operations problems. It aims to deliver more reliable and scalable applications and environments.

Although it was developed before the DevOps movement, Site Reliability Engineering remains one of the most in-demand fields in information technology. Site Reliability Engineers are a vital part of the teams that run modern cloud infrastructure. Depending on the infrastructure, their work can cover quite a range of tasks.

A Site Reliability Engineer spends up to half of their time on operations work, such as handling issues, manual interventions, deployments, and other ad-hoc tasks. The engineer spends the rest of their time, at least 50%, on development.

Site Reliability Engineers generally look for work that can be automated. Automating it leaves them more time to develop new features and improve the application while maintaining it. Because a Site Reliability Engineer needs to know both development and operations, skilled ones are usually hard to find.

SRE and DevOps work on the same basic principle: engineers take responsibility for both development and operations. Site Reliability Engineering is often described as a specific implementation of DevOps.

Site Reliability Engineer Responsibilities

Share Responsibility

Many organizations are adopting a shared responsibility model to speed up development and keep applications secure. A shared responsibility model also removes single points of delay, where work has to wait on one team or person.

Site Reliability Engineers use the same tools and software as developers. They share the responsibility for building the product with the development team, which is a significant part of the shared responsibility model.

Accept Failure and Prepare for it

Unlike traditional developers or operations engineers, Site Reliability Engineers accept that failure is normal and plan for failure scenarios. They measure the product's downtime against an error budget and take the necessary measures when that budget runs low. Site Reliability Engineers embrace risk and manage it, rather than trying to eliminate every possible failure.

Site Reliability Engineers quantify failure and availability in terms of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
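To make the relationship between an SLI, an SLO, and an error budget a little more concrete, here is a minimal sketch in Python. It assumes a request-based availability SLI (successful requests divided by total requests); the function name, the field names, and the example numbers are all hypothetical and only there for illustration, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    sli: float               # measured availability, e.g. 0.9996
    allowed_failures: float  # failures the SLO tolerates in this window
    actual_failures: int     # failures actually observed
    budget_remaining: float  # fraction of the budget left (negative = overspent)


def error_budget_report(total_requests: int, failed_requests: int, slo: float) -> BudgetReport:
    """Compare a measured availability SLI against an SLO target.

    Assumption: availability is request-based (good requests / total requests).
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")

    # SLI: what we actually measured.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the share of requests the SLO allows to fail.
    allowed_failures = (1 - slo) * total_requests
    # How much of that budget is still unspent.
    budget_remaining = 1 - failed_requests / allowed_failures

    return BudgetReport(sli, allowed_failures, failed_requests, budget_remaining)


# Hypothetical month: 1,000,000 requests, 400 failures, SLO of 99.9%.
report = error_budget_report(1_000_000, 400, 0.999)
print(f"SLI: {report.sli:.4%}")
print(f"Budget remaining: {report.budget_remaining:.0%}")
```

With these made-up numbers, a 99.9% SLO tolerates about 1,000 failed requests, so 400 failures leave roughly 60% of the budget unspent. A negative `budget_remaining` would mean the SLO has been missed for the window.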

Use Automation for Menial Tasks

A core function of a Site Reliability Engineer is to automate simple, repetitive tasks. Automating menial tasks that would otherwise take a fair amount of manual effort considerably reduces the time spent on operations. Once the automation is in place, Site Reliability Engineers have more time to work on new features or improve existing ones.

Implement and Adapt to Gradual Changes

Site Reliability Engineers encourage developers to move quickly by keeping the cost of failure low, which in practice means making changes small and gradual. As a result, changes can be rolled out more rapidly than with traditional approaches.

Quantify Parameters

Site Reliability Engineers try to quantify as much as possible, so that decisions about minimizing losses and maximizing gains rest on measurements rather than guesses.

Conclusion

Site Reliability Engineers are a core part of modern development and operations teams, especially when the cloud is involved. Combining a good cloud environment with a solid software development plan and a good SRE can bring the results you are looking for.

