The CTO Playbook
Accelerate pt. II - Institute a Metrics Driven Culture
This is part two of Accelerate, in which I'm covering what I regard to be one of the most important areas in which you as CTO should be focussing your effort.
These particular metrics drive the correct behaviours throughout your organisation, and those behaviours are pre-requisite to building a high-performing technology capability.
In the last article (https://www.dhirubhai.net/pulse/cto-playbook-rob-hill-qcnpc) I covered the first two of the four metrics (Cycle Time and Deployment Frequency), and below I'll discuss the remaining two: Change Failure Rate (CFR) and Mean Time to Restore (MTTR).
Then I'll talk about how you can go about establishing these in your teams.
(Note - Google have adopted these metrics as part of their DORA framework, and added a fifth - Reliability - which I won't cover here, although it's absolutely something that should be implemented.)
The first two metrics are largely about establishing and monitoring 'flow', i.e. the flow of work from ticket creation through to production.
The next two focus primarily on quality, but also have an impact on capability.
CFR
CFR is an ongoing measure of how many of your production changes (releases, deployments, modifications) had a negative impact. In general these would require a rollback, roll-forward or other mitigating action in order to correct the fault. These almost always relate to a quality gap somewhere upstream, potentially as far back as the original requirements.
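To make the definition concrete, here is a minimal sketch of the calculation in Python. The Deployment record, its field names and the sample data are all assumptions made up for illustration; in practice you'd pull this from your own deployment and incident tooling.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical record; the field names are assumptions for illustration.
    release_id: str
    needed_remediation: bool  # True if a rollback, roll-forward or other fix was required


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Return the fraction of deployments that needed remediation (0.0 if there are none)."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.needed_remediation)
    return failed / len(deployments)


# Made-up sample data: four production changes, one of which had to be rolled back.
deployments = [
    Deployment("r1", False),
    Deployment("r2", True),
    Deployment("r3", False),
    Deployment("r4", False),
]
cfr = change_failure_rate(deployments)
print(f"CFR: {cfr:.0%}")  # prints "CFR: 25%"
```

What counts as a 'failure' is the real design decision here. Whatever definition you pick, apply it consistently so that the trend over time is meaningful.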
As an example, let's say a change was rolled out to production, and from a technical perspective it was 'flawless' however the change was such that one of your key customers can no longer see a particular piece of data. You subsequently discover that they were relying on this data to conduct their business. The Product Owner wasn't aware that the customer was using this, so naturally, removing it wasn't expected to cause an issue.
Whose fault is this? Well, it's yours, ultimately, as CTO. This is your system to fix, right back to the requirements stage. In this instance ongoing demos, a closer relationship with the customer and other methods of determining the impact of changes like this are clearly required. It's probably never going to be possible to drive CFR to zero for issues like this, but it is possible to drive it right down so that these types of issues are vanishingly rare.
Let's talk about other areas where monitoring CFR starts to have an impact. We talked about automation of pipelines in order to reduce Cycle Time; automated deployments are faster and they free up staff time to work on more important things (like automating everything else!!).
Importantly, from a CFR perspective, this automation reduces (or even eliminates) administrative errors relating to deployments. These automations should also be extended to notifications - they should push information out to the business when a deployment occurs, with links to detailed release notes etc. Do you have staging or pre-production deployments? Same deal. This is where stakeholders can review and test upcoming changes so that you can catch issues early, and further reduce your CFR.
Naturally if you're monitoring CFR, the number of issues that make it through the QA stage comes under review; do you have the right metrics in place from a QA perspective that show where there may be issues? Can we automate tests for all of this? These are the types of questions and conversations that you and your team will start to have once you start monitoring the more high level 'golden signals' such as CFR.
If your CFR is trending downwards, it means your teams are having the right conversations, and implementing the right behaviours - changing processes, tooling, coding practices in a way that's having a positive impact.
It's a good idea to think of these Accelerate metrics as indicators on your car dashboard - for example the engine temperature gauge. If the gauge remains steady, you're probably good. If you're watching the gauge creep up into the red, it doesn't tell you exactly what's wrong (do you have a fan failure? A broken radiator? Low coolant levels?), but it does tell you that something's wrong, and allows you to start to diagnose the issue.
MTTR
MTTR is a measure of how quickly you resolve or mitigate the issues that are counted in the CFR metric.
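As a rough sketch, MTTR is just the average time between an issue being detected and service being restored. The function name and the timestamps below are made up for illustration; your incident tooling will have its own way of recording these.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - detected) across incidents; zero if there are none."""
    if not incidents:
        return timedelta(0)
    total = sum((restored - detected for detected, restored in incidents), timedelta(0))
    return total / len(incidents)


# Made-up sample data: (detected, restored) pairs for two incidents.
incidents = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 30)),    # 30 minutes
    (datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 15, 30)),  # 90 minutes
]
mttr = mean_time_to_restore(incidents)
print(mttr)  # prints "1:00:00"
```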
For some teams this might involve physically logging onto a server, and carrying out whatever steps are required to remove the current version of the software, and then install the previous version, and restart the service. There may be configuration and/or database changes required. Hopefully your database changes weren't so extensive that additional remedial work is required on the actual database schema or the data in order to restore service.
Naturally, all of this manual work is time consuming and has a significant impact (in the negative) on the team's MTTR. It points to the fact that these deployments - and the subsequent roll-back of these deployments (when required) - must be automated.
When this is automated, teams start to learn how to push changes to production in a manner that ensures that the vast majority of releases are consistently-sized, rote changes that ideally have no negative impact. This needs to be like clockwork.
As a result, teams understand the additional caution and process required for non-standard releases (e.g. large or architecturally-significant changes) that can't be easily rolled back (ideally these are rare), because their benchmark for a 'normal release' has changed so dramatically.
From a product perspective, these conversations should lead to smaller stories being created, evening out the overall 'flow' of work through the system (consistency here is reliability and predictability - and we all want that).
Feature-switching (the ability to 'hide' a feature or functionality in production, and then enable or disable it easily using a software 'switch') is a tremendous boon here - and one of the first things that teams start to talk about once they've automated the actual deployment processes.
Your teams will find that being able to feature-switch makes work easier to test and has positive impacts on both CFR and MTTR, not to mention Deployment Frequency and Cycle time, as you're able to consistently produce smaller batches of work.
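For anyone who hasn't seen one, a feature switch can be as simple as the sketch below. This is a minimal in-memory illustration, not a recommendation of any particular tool; the FeatureSwitches class, the render_dashboard function and the switch name are all hypothetical.

```python
class FeatureSwitches:
    """A minimal in-memory switch store (an assumption for illustration).

    Real systems usually back this with configuration or a dedicated service
    so that switches can be flipped without a deployment.
    """

    def __init__(self, enabled=None):
        self._enabled = set(enabled or [])

    def is_enabled(self, name: str) -> bool:
        return name in self._enabled

    def enable(self, name: str) -> None:
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)


def render_dashboard(switches: FeatureSwitches) -> list[str]:
    """Return the widgets to show; the new widget ships 'dark' behind a switch."""
    widgets = ["orders", "invoices"]
    if switches.is_enabled("new_reporting_widget"):
        widgets.append("reporting")
    return widgets


switches = FeatureSwitches()
print(render_dashboard(switches))  # prints "['orders', 'invoices']"

switches.enable("new_reporting_widget")
print(render_dashboard(switches))  # prints "['orders', 'invoices', 'reporting']"

# If the new widget causes a problem, 'restoring service' is a switch flip, not a rollback.
switches.disable("new_reporting_widget")
```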
Incident Management
I've skipped something that adds to MTTR: the time between an issue being detected in production - either by your organisation or by your customer - and remediation work actually starting.
Ideally, if an issue does make it to production, you have appropriate monitoring and alerting in place to the extent that you're aware of issues before your customer is.
This - effective incident management - is a key factor and often doesn't get appropriate focus when discussing MTTR. MTTR isn't just about how quickly you can roll back a change once you're aware of it.
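One way to see this is to split each incident's time to restore into the time it took to respond and the time it took to remediate. The sketch below assumes you record three timestamps per incident; the Incident class, its field names and the sample times are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the three timestamps are assumptions for illustration.
    detected: datetime
    remediation_started: datetime
    restored: datetime

    @property
    def time_to_respond(self) -> timedelta:
        """Detection to the start of remediation work (the incident management part)."""
        return self.remediation_started - self.detected

    @property
    def time_to_remediate(self) -> timedelta:
        """Start of remediation work to service restored (e.g. the rollback itself)."""
        return self.restored - self.remediation_started

    @property
    def time_to_restore(self) -> timedelta:
        """The full span that feeds into MTTR."""
        return self.restored - self.detected


# Made-up example: a 10-minute automated rollback that took 40 minutes to get started.
incident = Incident(
    detected=datetime(2024, 3, 5, 10, 0),
    remediation_started=datetime(2024, 3, 5, 10, 40),
    restored=datetime(2024, 3, 5, 10, 50),
)
print(incident.time_to_respond)    # prints "0:40:00"
print(incident.time_to_remediate)  # prints "0:10:00"
print(incident.time_to_restore)    # prints "0:50:00"
```

In this made-up case the rollback is already fast; the bulk of the time to restore sits in getting the right people working on the problem.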
At its core, an effective incident management process is whatever gets the right people communicating and working together to resolve the incident, as quickly as possible. This requires the right tools (for communication and working on the problem), the right data (tracing, logging, performance metrics etc) and appropriate access to systems.
But that's not enough - your team then needs frequent PRACTICE tackling incidents in order to be effective.
There's a lot more to incident management as a discipline than can be covered here - perhaps that will be another post.
You'll note that we're not talking much about the product that you're actually building, or any details relating to what you're actually coding as a software development team - this is because the focus should be on 'building the machine that builds the machine' as Elon Musk says. Almost all of your DevOps/SRE/Infrastructure team's efforts should be dedicated to getting the basic deployment pipeline automation in place, and then corresponding effort from Product, QA and Development teams as required.
I think I'll have to leave the discussion on how to establish these metrics in teams to my next post.... (another cliffhanger!!! I know!! I bet you can't wait!).
Well now you don't have to - here it is: