Course: DevOps Foundations: Site Reliability Engineering

Reliability engineering basics

- There's that 2:00 AM moment when you realize you've made some bad life decisions.
- You roll over and you hear this ringing in your ears.
- After questioning how it got to this point in the relationship, with dread, you look at your phone.
- Oh no, not again. Two nodes of the cluster are down, and the others look to be failing as well.
- You're not surprised, though. This is the fourth night in a row this has been going on.
- Now, if this story sounds too real, or you want to make sure your life doesn't end up like this, then this is the course for you.
- Howdy, I'm Ernest Mueller.
- Hi, and I'm James Wickett. Welcome to our course on another DevOps foundation, site reliability engineering.
- We met while implementing DevOps in a large enterprise. Together, we've run the DevOpsDays Austin conference and blog at theagileadmin.com.
- I'm the Head of Research at Signal Sciences, which provides application security defense solutions for APIs, microservices, and web APIs. At Signal Sciences, we implemented DevOps and SRE practices from the very beginning.
- And I'm Director of Engineering Operations at AlienVault, a maker of cybersecurity management and threat intelligence solutions, where I optimize our infrastructure and software delivery pipeline.
- Site reliability engineering, or SRE, is central to delivering software.
- Since the term SRE was coined by Google, it's grown in popularity. While SRE and DevOps aren't exactly the same, they fit together as complementary approaches.
- In this course, you'll learn the basics of reliability engineering, including self-service automation and dealing with releases.
- And handling crisis situations through incident response.
- We also cover how to perform post-incident evaluations.
- SRE's core tenet is reliability, and we dissect how to define SLAs and SLOs, as well as how to handle performance engineering and troubleshooting.
- We discuss adding adversity and chaos to your system, as well as how to design for distributed systems.
- And finally, we'll explore concepts on scaling systems and your team.
- All right, let's get started.

Contents