- Everything should be completely automated.
- If an existing process cannot be automated, it will be replaced.
- If a proposed process cannot be automated, it will be rejected.
- The SRE’s job is to automate themselves out of a job. In practice this means constantly automating menial tasks and moving on to solve more interesting problems.
- Servers are ephemeral. They can and will go away at any time.
- Servers live in auto-scaling groups that self-heal.
- Servers have health checks that assert the health of their process(es).
- Servers boot from images that are fully equipped and operational.
- Configuration management is not run against existing servers. It is used only to create images.
- Application servers are stateless.
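The process-level health assertion above can be sketched in a few lines. This is a minimal Python sketch, not a production check: real servers typically expose this as an HTTP endpoint polled by the load balancer or auto-scaling group, and the function names here are illustrative.

```python
import os


def process_alive(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, existence is checked."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False              # no such process
    except PermissionError:
        return True               # process exists but is owned by another user
    return True


def health_check(pids) -> dict:
    """A server is healthy only if every process it is responsible for is alive."""
    statuses = {pid: process_alive(pid) for pid in pids}
    return {"healthy": all(statuses.values()), "processes": statuses}
```

An auto-scaling group wired to this signal can terminate and replace the instance when the check fails, which is the self-healing described above.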
- Engineers are ephemeral. They can and will go away at any time.
- Engineering workloads are shared. There are no individual silos.
- Engineering practices are documented. Documentation is up to date.
- All engineers, including architects, have access to all codebases.
- All code changes are made via pull requests, verified, and approved.
- All code is functionally tested, unit tested, and linted.
- Linters are extremely opinionated. Engineers should feel empowered to propose changes to the rules in isolated discussions and pull requests.
- Unit tests and linters run on every pull request, preventing merges when the build fails.
- Functional tests run on every deploy, preventing (or rolling back) deploys when the build fails.
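The merge gate described above reduces to one rule: every configured check must exit 0 or the pull request cannot merge. A hedged sketch, assuming checks are plain subprocess commands (the specific commands a team runs are configuration, not code):

```python
import subprocess
import sys


def merge_allowed(checks) -> bool:
    """Run each check command; a single non-zero exit blocks the merge."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            return False          # failing build prevents the merge
    return True
```

In practice this logic lives in the CI system rather than in application code; the sketch only shows the gating behavior.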
- Deploys are easy, fast, safe, and frequent.
- Changes are deployed on every merge.
- Deploys do not require any human interaction or approval.
- Deploy time matters and engineers should strive to make it faster.
- Deploys can be started manually with a single button. As many engineers as possible should have access to the button.
- Rollbacks happen automatically when a failed deploy is detected.
- Rollbacks are held to all the same standards as deploys.
- The master branch is the only branch that gets deployed. All git branching is for the benefit of the engineer prior to merging the changes into master.
- It is easy to tell which commit is deployed.
- There is no such thing as a code freeze.
- Features are released via feature flags. Flipping a flag does not require a deploy. A “flip freeze” is acceptable.
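The flag principle above means flags are data, not code: flipping one is a store update, never a deploy. A minimal sketch, assuming a JSON file as the store (a real system would use a database or a managed flag service, and the flag name below is hypothetical):

```python
import json
from pathlib import Path


class FlagStore:
    """Feature flags read from a data store at check time, so flipping
    a flag takes effect without deploying any code."""

    def __init__(self, path):
        self.path = Path(path)

    def is_enabled(self, name, default=False):
        try:
            flags = json.loads(self.path.read_text())   # re-read on every check
        except (FileNotFoundError, json.JSONDecodeError):
            return default                              # fail safe to the default
        return bool(flags.get(name, default))

    def flip(self, name, value):
        try:
            flags = json.loads(self.path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            flags = {}
        flags[name] = value
        self.path.write_text(json.dumps(flags))         # the deploy-free change
```

Failing safe to a default when the store is unreachable is what makes a “flip freeze” a low-risk operation.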
- SREs operate as software engineers, not system administrators.
- Everything is managed in code. Any change to a system is a code change.
- Code is written to be read by other engineers. It is self-documenting.
- All processes are automated with software.
- CI/CD principles apply to all SRE code.
- The entire engineering team has access to all SRE code.
- Services are small, well defined, and isolated.
- Services are reasonably small and single purpose. If a service cannot be summarized succinctly, it is too big.
- Services run in isolation. Excessive resource usage in one service does not affect other services.
- They are independently deployable to any environment.
- A service going down affects other services minimally or not at all.
- They do not share data stores.
- Their infrastructure is homogeneous.
- All services are deployed the same way, from the same interface.
- Services communicate with each other through APIs or well-defined pub/sub mechanisms.
- Implementing and deploying a new service is trivial.
- Service discovery is highly available and held to the same standards as any other microservice.
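The pub/sub decoupling above can be sketched in-process. Real services would use a broker (Kafka, SNS/SQS, etc.), but the property is the same: publishers and subscribers share only topic names, never each other's internals or data stores. The topic and payload below are illustrative.

```python
from collections import defaultdict


class Bus:
    """In-process sketch of a pub/sub mechanism: services are coupled
    only by topic name, not by direct references to each other."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subs[topic]:   # only handlers for this topic fire
            handler(message)
```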
- All systems are monitored for critical metrics.
- Metrics are easily available and consumable in a single interface.
- Critical metrics are displayed on dashboards for each system.
- The system that does the monitoring is monitored by a separate system.
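Monitoring the monitor is commonly done with a dead man's switch: the primary monitoring system emits heartbeats, and a separate system alerts when they stop. A minimal sketch of that check (the 60-second threshold is an illustrative assumption):

```python
def watchdog_ok(last_heartbeat: float, now: float, max_silence: float = 60.0) -> bool:
    """Dead man's switch: the watchdog considers the primary monitor
    unhealthy once its heartbeat has been silent for too long."""
    return (now - last_heartbeat) <= max_silence
```

The watchdog must run on infrastructure independent of the system it watches, otherwise both can fail together.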
- When self-healing fails, engineers are intelligently notified.
- Alerts summarize the problem succinctly and include suggested actions.
- Engineers are only paged off-hours for production. Other environments may alert engineers during business hours.
- After resolving the alert as quickly as possible, the next step (during business hours) is to ensure the same alert never fires again.
- Excessive alerting is unacceptable. It is addressed immediately.
- On-call engineers (both SREs and SEs) feel empowered to respond in a timely manner.
- SEs are on-call for the systems they create and own.
- SREs are on-call for low-level systems and to assist developers.
- All escalation policies have backups or fallbacks.
- All escalation policies have rotations. No engineer is on-call for a system full time.
- Escalating is acceptable if needed. Escalation generates a follow-up task to understand why the on-call engineer could not solve the problem.
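The alert shape and paging policy above can be sketched as a small data structure. This is a hedged illustration, not a real paging integration; the field names and the example summary are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """An alert carries a succinct summary and suggested actions, so the
    paged engineer starts with context rather than a raw metric."""
    summary: str
    severity: str
    suggested_actions: list = field(default_factory=list)

    def pages_off_hours(self, environment: str) -> bool:
        # Only production pages outside business hours; other
        # environments wait for the working day.
        return environment == "production"
```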
- All user-facing incidents require a postmortem.
- Postmortems are blameless.
- The process for a postmortem is easy to conduct and has very little overhead. A few sentences are sometimes sufficient; a meeting is not always required.
- Postmortems are conducted reasonably soon after the incident is resolved.
- A repository of postmortems is easily accessible.
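The low-overhead record described above can be as small as this sketch: a few sentences plus action items, rendered into the shared repository. The field names and rendering format are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """A lightweight, blameless postmortem: a short summary and the
    action items meant to keep the incident from recurring."""
    incident: str
    summary: str                                   # a few sentences is often enough
    action_items: list = field(default_factory=list)

    def render(self) -> str:
        """Render a short markdown entry for the postmortem repository."""
        lines = [f"# {self.incident}", self.summary, "## Action items"]
        lines += [f"- {item}" for item in self.action_items]
        return "\n".join(lines)
```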
- Security is automated and baked into everything.
- Security checks run as part of CI/CD.
- Intrusion detection systems are in place.
- Identity and access management is used to gate all actions.
- As few infrastructure components as possible are publicly accessible, ideally zero.
- Client applications only use public APIs.
- Engineers are trusted but verified.
- Credentials are not stored in plain text, especially not in code.
- Credentials can be easily rotated.
- Access is revoked in a single place, which propagates to all systems.
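The rotation and single-place-revocation principles above can be sketched together: code holds only a credential *name*, the value lives in a store, so rotating is one overwrite and revoking in that one place propagates to every consumer on its next fetch. An in-memory sketch (a real store would be a secrets manager; the credential name below is hypothetical):

```python
class CredentialStore:
    """Credentials by reference: no value ever lives in code, rotation is
    a single put(), and revocation here propagates to all consumers."""

    def __init__(self):
        self._creds = {}

    def put(self, name, value):
        self._creds[name] = value            # rotate by overwriting in place

    def revoke(self, name):
        self._creds.pop(name, None)          # revoked in a single place

    def get(self, name):
        if name not in self._creds:
            # fail closed: a revoked or unknown credential grants nothing
            raise PermissionError(f"credential {name!r} revoked or unknown")
        return self._creds[name]
```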
Offload security to managed services:
- Servers receive requests through managed load balancers.
- All data stores receive requests from inside the network only.
- Static content is delivered through a CDN. Buckets are private.
- SREs are financially conscious in all aspects of their work.
- Cost measurements include engineering time and effort.
- Tooling is used to monitor all engineering costs an SRE can affect.
- An externally managed cloud is the default place to run services. Running services by any other means requires justification.
- Multi-region is appropriate when the cost of downtime has been measured against the cost of redundancy.
- Multi-cloud (for redundancy) is almost never worth the effort and loss of features.
On-premise solutions are appropriate when:
- A modern cloud front-end is in place (OpenStack, etc.).
- IT, capacity planning, and system administration are all top-notch.
- The increased overhead is clearly cost-effective when engineering time is considered, and is projected to remain so for the foreseeable future.
- SREs are not expected to physically interact with the data center.
Containerized orchestration is appropriate when:
- Services are shown to successfully run in containers.
- Services are in a healthy state and sufficiently modularized.
- The increased overhead is deemed acceptable.
- The company is willing to invest heavily in tooling.
Serverless solutions are appropriate when:
- Tooling and automation are used to manage serverless functions.
- Service owners are willing to accept the limitations of serverless.
- The default option for supporting services (logging, monitoring, alerting, etc.) is externally managed and hosted. Running these services internally requires justification.
- SREs are constantly evaluating supporting service options, new and old. The ability to consolidate is a factor.
- Supporting services are secure, cost effective, and useful to engineers.
- SREs and SEs are on the same team. They are all engineers.
- SREs are not blockers and allow access to as many systems as possible.
- SEs own their services and do not “throw code over the wall.”
- SREs are willing and able to contribute to and debug application code.
- SREs use and contribute to open source, if possible.
- SEs and SREs work together to plan new services and architectures.
- SREs strive to make the lives of all engineers better through automation.