- Everything should be completely automated.
- If an existing process cannot be automated, it will be replaced.
- If a proposed process cannot be automated, it will be rejected.
- The SRE’s job is to automate themselves out of a job. In practice this means constantly automating menial tasks and moving on to solve more interesting problems.
- Servers are ephemeral. They can and will go away at any time.
- Servers live in auto-scaling groups that self-heal.
- Servers have health checks that assert the health of their process(es).
- Servers boot from images that are fully equipped and operational.
- Configuration management is not run against existing servers. It is used only to create images.
- Application servers are stateless.
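The process-level health assertion above can be sketched in a few lines. This is a minimal Python sketch, not a production check: real servers typically expose this as an HTTP endpoint polled by the load balancer or auto-scaling group, and the function names here are illustrative.

```python
import os


def process_alive(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, existence is checked."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False              # no such process
    except PermissionError:
        return True               # process exists but is owned by another user
    return True


def health_check(pids) -> dict:
    """A server is healthy only if every process it is responsible for is alive."""
    statuses = {pid: process_alive(pid) for pid in pids}
    return {"healthy": all(statuses.values()), "processes": statuses}
```

An auto-scaling group wired to this signal can terminate and replace the instance when the check fails, which is the self-healing described above.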
- Engineers are ephemeral. They can and will go away at any time.
- Engineering workloads are shared. There are no individual silos.
- Engineering practices are documented. Documentation is up to date.
- All engineers, including architects, have access to all codebases.
- All code changes are made via pull requests, verified, and approved.
- All code is functionally tested, unit tested, and linted.
- Linters are extremely opinionated. Engineers should feel empowered to propose changes to the rules in isolated discussions and pull requests.
- Unit tests and linters run on every pull request, preventing merges when the build fails.
- Functional tests run on every deploy, preventing (or rolling back) deploys when the build fails.
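The merge gate described above reduces to one rule: every configured check must exit 0 or the pull request cannot merge. A hedged sketch, assuming checks are plain subprocess commands (the specific commands a team runs are configuration, not code):

```python
import subprocess
import sys


def merge_allowed(checks) -> bool:
    """Run each check command; a single non-zero exit blocks the merge."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            return False          # failing build prevents the merge
    return True
```

In practice this logic lives in the CI system rather than in application code; the sketch only shows the gating behavior.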
- Deploys are easy, fast, safe, and frequent.
- Changes are deployed on every merge.
- Deploys do not require any human interaction or approval.
- Deploy time matters and engineers should strive to make it faster.
- Deploys can be started manually with a single button. As many engineers as possible should have access to the button.
- Rollbacks happen automatically when a failed deploy is detected.
- Rollbacks are held to all the same standards as deploys.
- The master branch is the only branch that gets deployed. All git branching is for the benefit of the engineer prior to merging the changes into master.
- It is easy to tell which commit is deployed.
- There is no such thing as a code freeze.
- Features are released via feature flags. Flipping a flag does not require a deploy. A “flip freeze” is acceptable.
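The flag principle above means flags are data, not code: flipping one is a store update, never a deploy. A minimal sketch, assuming a JSON file as the store (a real system would use a database or a managed flag service, and the flag name below is hypothetical):

```python
import json
from pathlib import Path


class FlagStore:
    """Feature flags read from a data store at check time, so flipping
    a flag takes effect without deploying any code."""

    def __init__(self, path):
        self.path = Path(path)

    def is_enabled(self, name, default=False):
        try:
            flags = json.loads(self.path.read_text())   # re-read on every check
        except (FileNotFoundError, json.JSONDecodeError):
            return default                              # fail safe to the default
        return bool(flags.get(name, default))

    def flip(self, name, value):
        try:
            flags = json.loads(self.path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            flags = {}
        flags[name] = value
        self.path.write_text(json.dumps(flags))         # the deploy-free change
```

Failing safe to a default when the store is unreachable is what makes a “flip freeze” a low-risk operation.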
- SREs operate as software engineers, not system administrators.
- Everything is managed in code. Any change to a system is a code change.
- Code is written to be read by other engineers. It is self-documenting.
- All processes are automated with software.
- CI/CD principles apply to all SRE code.
- The entire engineering team has access to all SRE code.
- Services are small, well defined, and isolated.
- Services are reasonably small and single purpose. If a service cannot be summarized succinctly, it is too big.
- Services run in isolation. Excessive resource usage in one service does not affect other services.
- They are independently deployable to any environment.
- A service going down affects other services minimally or not at all.
- They do not share data stores.
- Their infrastructure is homogeneous.
- All services are deployed the same way, from the same interface.
- Services communicate with each other through APIs or well-defined pub/sub mechanisms.
- Implementing and deploying a new service is trivial.
- Service discovery is highly available and held to the same standards as any other microservice.
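The pub/sub decoupling above can be sketched in-process. Real services would use a broker (Kafka, SNS/SQS, etc.), but the property is the same: publishers and subscribers share only topic names, never each other's internals or data stores. The topic and payload below are illustrative.

```python
from collections import defaultdict


class Bus:
    """In-process sketch of a pub/sub mechanism: services are coupled
    only by topic name, not by direct references to each other."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subs[topic]:   # only handlers for this topic fire
            handler(message)
```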
- All systems are monitored for critical metrics.
- Metrics are easily available and consumable in a single interface.
- Critical metrics are displayed on dashboards for each system.
- The system that does the monitoring is monitored by a separate system.
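Monitoring the monitor is commonly done with a dead man's switch: the primary monitoring system emits heartbeats, and a separate system alerts when they stop. A minimal sketch of that check (the 60-second threshold is an illustrative assumption):

```python
def watchdog_ok(last_heartbeat: float, now: float, max_silence: float = 60.0) -> bool:
    """Dead man's switch: the watchdog considers the primary monitor
    unhealthy once its heartbeat has been silent for too long."""
    return (now - last_heartbeat) <= max_silence
```

The watchdog must run on infrastructure independent of the system it watches, otherwise both can fail together.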
- When self-healing fails, engineers are intelligently notified.
- Alerts summarize the problem succinctly and include suggested actions.
- Engineers are only paged off-hours for production. Other environments may alert engineers during business hours.
- After resolving the alert as quickly as possible, the next step (during business hours) is to ensure the same alert never fires again.
- Excessive alerting is unacceptable. It is addressed immediately.
- On-call engineers (both SREs and SEs) feel empowered to respond in a timely manner.
- SEs are on-call for the systems they create and own.
- SREs are on-call for low-level systems and to assist developers.
- All escalation policies have backups or fallbacks.
- All escalation policies have rotations. No engineer is on-call for a system full time.
- Escalating is acceptable if needed. Escalation generates a follow-up task to understand why the on-call engineer could not solve the problem.
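The alert shape and paging policy above can be sketched as a small data structure. This is a hedged illustration, not a real paging integration; the field names and the example summary are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """An alert carries a succinct summary and suggested actions, so the
    paged engineer starts with context rather than a raw metric."""
    summary: str
    severity: str
    suggested_actions: list = field(default_factory=list)

    def pages_off_hours(self, environment: str) -> bool:
        # Only production pages outside business hours; other
        # environments wait for the working day.
        return environment == "production"
```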
- All user-facing incidents require a postmortem.
- Postmortems are blameless.
- The process for a postmortem is easy to conduct and has very little overhead. A few sentences are sometimes sufficient; a meeting is not always required.
- Postmortems are conducted reasonably soon after the incident is resolved.
- A repository of postmortems is easily accessible.
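The low-overhead record described above can be as small as this sketch: a few sentences plus action items, rendered into the shared repository. The field names and rendering format are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """A lightweight, blameless postmortem: a short summary and the
    action items meant to keep the incident from recurring."""
    incident: str
    summary: str                                   # a few sentences is often enough
    action_items: list = field(default_factory=list)

    def render(self) -> str:
        """Render a short markdown entry for the postmortem repository."""
        lines = [f"# {self.incident}", self.summary, "## Action items"]
        lines += [f"- {item}" for item in self.action_items]
        return "\n".join(lines)
```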
- Security is automated and baked into everything.
- Security checks run as part of CI/CD.
- Intrusion detection systems are in place.
- Identity and access management is used to gate all actions.
- As few infrastructure components as possible are publicly accessible, ideally zero.
- Client applications only use public APIs.
- Engineers are trusted but verified.
- Credentials are not stored in plain text, especially not in code.
- Credentials can be easily rotated.
- Access is revoked in a single place, which propagates to all systems.
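The rotation and single-place-revocation principles above can be sketched together: code holds only a credential *name*, the value lives in a store, so rotating is one overwrite and revoking in that one place propagates to every consumer on its next fetch. An in-memory sketch (a real store would be a secrets manager; the credential name below is hypothetical):

```python
class CredentialStore:
    """Credentials by reference: no value ever lives in code, rotation is
    a single put(), and revocation here propagates to all consumers."""

    def __init__(self):
        self._creds = {}

    def put(self, name, value):
        self._creds[name] = value            # rotate by overwriting in place

    def revoke(self, name):
        self._creds.pop(name, None)          # revoked in a single place

    def get(self, name):
        if name not in self._creds:
            # fail closed: a revoked or unknown credential grants nothing
            raise PermissionError(f"credential {name!r} revoked or unknown")
        return self._creds[name]
```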
Offload security to managed services:
- Servers receive requests through managed load balancers.
- All data stores receive requests from inside the network only.
- Static content is delivered through a CDN. Buckets are private.
- SREs are financially conscious in all aspects of their work.
- Cost measurements include engineering time and effort.
- Tooling is used to monitor all engineering costs an SRE can affect.
- An externally managed cloud is the default place to run services. Running services by any other means requires justification.
- Multi-region is appropriate when the cost of downtime has been measured against the cost of redundancy.
- Multi-cloud (for redundancy) is almost never worth the effort and loss of features.
On-premise solutions are appropriate when:
- A modern cloud front-end is in place (OpenStack, etc.).
- IT, capacity planning, and system administration are all top-notch.
- The increased overhead is clearly cost-effective when engineering time is considered, and is projected to remain so for the foreseeable future.
- SREs are not expected to physically interact with the data center.
Containerized orchestration is appropriate when:
- Services are shown to successfully run in containers.
- Services are in a healthy state and sufficiently modularized.
- The increased overhead is deemed acceptable.
- The company is willing to invest heavily in tooling.
Serverless solutions are appropriate when:
- Tooling and automation are used to manage serverless functions.
- Service owners are willing to accept the limitations of serverless.
- The default option for supporting services (logging, monitoring, alerting, etc.) is externally managed and hosted. Running these services internally requires justification.
- SREs are constantly evaluating supporting service options, new and old. The ability to consolidate is a factor.
- Supporting services are secure, cost effective, and useful to engineers.
- SREs and SEs are on the same team. They are all engineers.
- SREs are not blockers and allow access to as many systems as possible.
- SEs own their services and do not “throw code over the wall.”
- SREs are willing and able to contribute to and debug application code.
- SREs use and contribute to open source, if possible.
- SEs and SREs work together to plan new services and architectures.
- SREs strive to make the lives of all engineers better through automation.