SRE Concepts series Part 1

Marcel Koert

Innovative Platform Engineer | DevOps Engineer | Site Reliability Engineer | IT Educator | Founder of Melomar-IT

Published: March 4, 2021

I have often been asked about certain SRE concepts, so I am writing a series on 15 topics that feature in the Google SRE book. These concepts are not just for SRE; you can use them in any IT environment. In this first part, I will briefly discuss what SRE is. After that, in no particular order, I will discuss the following topics/concepts.

Risk
Time-based Versus Aggregated Availability
Error budget
Service Level Indicators
Service Level Objectives
Toil
Black box monitoring
White box monitoring
The value of automation
Continuous build and deployment
Stability versus Agility
Root Cause Analysis
Capacity planning
Break your systems
Testing in production

What is Site Reliability?

Site Reliability Engineering, or SRE for short, is essentially a discipline that applies software engineering practices to infrastructure and other operations problems. It aims to deliver more reliable and scalable applications and environments.

Although it was first developed before the DevOps movement, Site Reliability Engineering is still one of the hot fields in the information technology sector. Site Reliability Engineers are a vital part of modern cloud infrastructures, and what they do day to day depends on the infrastructure they support.

In the Google model, a Site Reliability Engineer spends at most half of their time on operations work. For instance, the engineer may handle issues, manual interventions, deployments, and other ad hoc tasks. The engineer spends the remaining time, at least 50%, on development.

Site Reliability Engineers generally look for operational work that is easy to automate. This leaves them more time to develop new features and improve the application while maintaining it. Because a Site Reliability Engineer needs to know both development and operations, skilled ones are usually hard to find.

SRE and DevOps usually work on the same basic principle: engineers take responsibility for both development and operations. Site Reliability Engineering is often described as a specific implementation of DevOps.

Site Reliability Engineer Responsibilities

Share responsibility

Many organizations are adopting a shared responsibility model to speed up development and ensure application security. A shared responsibility model also removes single points of delay, because no one team has to wait on another to move work forward.

Site Reliability Engineers use the same tools and software as developers. They share responsibility for the product with the development team, which is a significant part of the shared responsibility model.

Accept Failure and Prepare for it

Unlike traditional developers or operations engineers, Site Reliability Engineers accept that failure is normal and plan for failure scenarios. They track the product's downtime against an error budget and take corrective measures when needed. Site Reliability Engineers embrace risk and manage it deliberately so that the project as a whole keeps working.

Site Reliability Engineers quantify failure and availability in terms of Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
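
To make the error budget idea concrete, here is a minimal sketch in Python. It assumes a time-based availability SLO, and the function names (error_budget, budget_remaining) and the example numbers are hypothetical illustrations rather than anything prescribed by the Google SRE book.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows over a window.

    slo is a fraction such as 0.999 (99.9% availability).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    # Whatever the SLO does not promise is the budget for failure.
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Return the unspent error budget; a negative value means the SLO was missed."""
    return error_budget(slo, window) - downtime


if __name__ == "__main__":
    # Hypothetical service: 99.9% SLO over 30 days, 20 minutes of downtime so far.
    window = timedelta(days=30)
    print("budget:   ", error_budget(0.999, window))
    print("remaining:", budget_remaining(0.999, window, timedelta(minutes=20)))
```

With a 99.9% SLO over 30 days the budget works out to roughly 43 minutes of downtime. While budget remains, the team can keep shipping changes; once it is spent, reliability work takes priority.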

Use Automation for Menial Tasks

A core function of a Site Reliability Engineer is to automate simple, repetitive tasks. Automating menial tasks that would otherwise require considerable manual effort significantly reduces the time spent on operations. Once the automation is in place, Site Reliability Engineers have more time to work on new features or improve existing ones.

Implement and Adapt to Gradual Changes

Site Reliability Engineers encourage developers to move quickly by ensuring that the cost of failure stays low, for example by rolling out changes in small, gradual steps. As a result, changes can be implemented more rapidly than with traditional approaches.

Quantify Parameters

Site Reliability Engineers try to quantify as much information as possible to minimize the losses and maximize the gains.
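
As one illustration of quantifying, the sketch below computes a request-based availability figure and the share of the allowed failures already used. It is a minimal example under stated assumptions: the function names and the request counts are made up, and a real system would pull these numbers from its monitoring data.

```python
def availability_sli(good: int, total: int) -> float:
    """Return the fraction of requests that were served successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    return good / total


def budget_consumed(good: int, total: int, slo: float) -> float:
    """Return failed requests divided by the failures the SLO allows.

    A value above 1.0 means more requests failed than the SLO permits.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo) * total
    return (total - good) / allowed_failures


if __name__ == "__main__":
    # Hypothetical counts: 1,000,000 requests, 500 of them failed, 99.9% SLO.
    good, total = 999_500, 1_000_000
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"Allowed failures used: {budget_consumed(good, total, 0.999):.0%}")
```

Putting a number on both the measurement and the target turns "is the service reliable enough?" into a question that can be answered from data.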

Conclusion

Site Reliability Engineers are a core part of modern development and operations teams, especially when the cloud is involved. Combining a good cloud environment with a decent software development plan and a good SRE can bring in the results you are looking for.

Caroline Rademaker

Speaker | DEI Advocate | Women in Tech | Psychological safety | Driving Innovation and Inclusion: Head of DEI & Head of Benelux at Templeton & Partners

4 years ago

Emile V.

Alex Omosa

Site Reliability Engineer | Security & Infrastructure

4 years ago

Great information, thanks for sharing

Krzysztof Biernat

Site Reliability Engineer / AWS Cloud

4 years ago

That series could shed some light on the subtle differences, with regards to focus for example, as the line between devops and SRE is quite thin.

Akhil Baburaj

Principal Site Reliability Engineer at Oracle

4 years ago

Good read. Waiting for the next series of articles.
