DevOps for the rest of us - Almanac

Purpose

This is a reference to the most commonly used metrics for a DevOps transformation. It will be most useful to those who have decided they are moving to DevOps but don’t know how to measure progress. 


Engineering Team Lead-time 

Why it matters

“From Concept to Cash” is the subtitle of one of Mary Poppendieck's classic books on lean software development. It succinctly captures why minimizing the time to go from an idea to customers using it (lead time) is essential. Features in various design and development stages have cost money but aren’t yet generating income. The larger the investment needed before discovering if a product or feature generates value, the fewer opportunities an organization has to “get it right” before running out of money. A further issue with long lead times is that the value of a concept often diminishes while everyone waits. Boardroom favorites such as “first to market”, “differentiator”, and “responsive to customers” all boil down to minimizing lead time.

What to measure

Finding an absolute measure can be surprisingly complex. There are many stages a concept goes through, from first occurring in someone's mind to a customer taking advantage of it. Including all stages in the measure encourages a holistic view, but it is best to focus initially on the steps the team has direct control over. A good measure needs to provide relatively rapid feedback and be consistent enough that only a limited number of samples are required to establish that a change is real and not just noise. Including a lot of stages will slow down the feedback and add to the noise, requiring more samples and so increasing the time to get feedback even further.

The standard used in DORA’s State of DevOps Report is to only focus on the time from when a commit takes place to it being deployed. They then break this into four buckets: less than a day, less than a week, less than a month, and more than a month. 
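
As a rough sketch of how this can be calculated, here are a few lines of Python that bucket changes in this way, assuming commit and deployment timestamps have already been pulled from version control and the deployment log (the sample data is invented):

    from collections import Counter
    from datetime import datetime, timedelta

    # Invented sample data: (commit timestamp, deployment timestamp) pairs.
    changes = [
        (datetime(2024, 3, 4, 9, 30), datetime(2024, 3, 4, 15, 10)),
        (datetime(2024, 3, 5, 11, 0), datetime(2024, 3, 8, 16, 45)),
        (datetime(2024, 2, 20, 14, 0), datetime(2024, 3, 28, 10, 0)),
    ]

    def lead_time_bucket(committed_at, deployed_at):
        """Place a single change into one of the four buckets above."""
        lead_time = deployed_at - committed_at
        if lead_time < timedelta(days=1):
            return "less than a day"
        if lead_time < timedelta(weeks=1):
            return "less than a week"
        if lead_time < timedelta(days=30):
            return "less than a month"
        return "more than a month"

    print(Counter(lead_time_bucket(c, d) for c, d in changes))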

In the spirit of DevOps, I prefer to start the clock when features are accepted onto a sprint or Kanban board. This should align with the domain that the team has control over. If Operations is not embedded, then monitoring only the deployment pipeline is more appropriate. If a team has product management embedded, it may wish to expand the scope to start when the product manager commits to a feature.

Faster delivery can change the economics of experimentation and enable whoever is responsible for the life of a feature or product before it reaches active development to shorten cycle times. If they can try twice as many approaches, they may be able to save on some of the upfront analysis or increase their odds of success.

Dangers

It is possible to minimize the lead time in a few undesirable ways (in order of likelihood):

  • Generating tech-debt
  • Delivering unfinished features, including buggy features
  • Avoiding significant features and only accepting minor changes 

Tech debt is the hardest to measure as it is very closely related to measuring productivity; how to do this is still a very open question (see Martin Fowler’s Cannot Measure Productivity post for a quick rant on this). Despite this, there are proxies, and some estimation is possible. The key signal for detecting the build-up of technical debt is that delivering features becomes slower over time. Obviously, as tech debt slows delivery, any reduction in lead time is temporary. There may be points where technical debt makes sense, but my experience has been that it is very hard to predict these. If the team has difficulty making reliable predictions for sprints, I’d advise getting good at that and maintaining a high code standard before considering which corners to cut.
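
One simple proxy, sketched below in Python, is to watch whether lead time trends upward across successive deliveries. The samples here are invented; in practice they would come from the lead-time measurement described earlier.

    # Invented lead-time samples in days, ordered oldest to newest.
    lead_times_days = [2.0, 1.5, 2.5, 3.0, 2.5, 3.5, 4.0, 4.5]

    def trend_slope(samples):
        """Least-squares slope of the samples against their position in the list.

        A persistently positive slope suggests deliveries are getting slower,
        which is the warning sign for accumulating technical debt described above.
        """
        n = len(samples)
        mean_x = (n - 1) / 2
        mean_y = sum(samples) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
        var = sum((x - mean_x) ** 2 for x in range(n))
        return cov / var

    print(f"Lead time is changing by {trend_slope(lead_times_days):+.2f} days per delivery")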

Unfinished and buggy features are best caught by general quality metrics of usability and bugs filed. In reality, the measure should be the time until customers get value: an unusable feature doesn’t produce value, so the clock should continue ticking until the feature is usable.

Avoiding significant features only improves lead time if the feature isn’t being developed in small deployable chunks. For this reason, it is the least worrying behavior. Product owners should be consulted regularly to make sure that the most valuable features are going through the pipeline, and both they and developers will need support to develop skills in breaking down problems and managing dark launches. Dark launches make it possible to deliver small, testable chunks regularly but delay making a feature visible until it is meaningfully usable by a customer.

Balancing measures

  • Bug rates
  • Product owner satisfaction
  • The value produced


Deployment frequency

Why it matters

Continuous Delivery doesn’t necessarily require deploying continuously, so why measure deployment frequency? The risk per release is proportional to the time since the last release. While the business may not see a strong need to release every merge or every day, there are certainly reasons to strive to make the gap between releases as short as possible. Maintaining a high frequency of deployments is also the best way to ensure that urgent hotfixes and patches go through the same level of rigor as every other release.

It also has the advantage of being a very honest measure. When practicing Continuous Delivery rather than Continuous Deployment, it is easy to measure the frequency of deployable artifacts, but I would still recommend measuring actual releases. Getting value to the customers is the real outcome.

Attempting to increase the frequency of releases is also a good way to discover pain points. Problems that were tolerable when they only occurred once a month are taken more seriously when they might happen every day. Typical issues that have to be addressed are:

  • Disruption caused by downtime (if zero-downtime releases haven’t been implemented)
  • Customer trust hasn’t yet been earned
  • Communication about releases and changes is taking significant time

If Continuous Delivery is producing the promised improvements, it is common to find that businesses that initially only planned to release weekly move to daily or even deploy every commit.

DORA’s State of DevOps Report breaks this into 4 buckets: on-demand, once a day to once a week, once a week to once per month, and less than once a month.

Note on “deployment” with non-SaaS software

For products where customers or another 3rd party carries out the deployment, measuring "deployed" as available for download is common. It may even seem odd to attempt to get customers to update more often. However, if we focus on getting value into customers' hands and reducing risk, it should become apparent to the customers that updating more frequently is in their interest. Like SaaS, it will expose parts of the process that are too manual or painful so that they can be worked on. Companies such as Microsoft have been leading the charge in increasing the deployment rate in these environments. They have had to deal with the same problems as SaaS services: minimizing disruption to the user and building trust. The release rate of software updates has risen dramatically in the past few years, as have users’ expectations. Some have managed to transition to this higher rate successfully, while others have left their customers in fear every time a new patch comes out, huddling in forums to share old versions that work.

Externally deployed software has an added challenge regarding how many versions should be supported. Does a customer have to upgrade to today’s release before receiving support for their bug? A truly repeatable build process is essential if support is going to be offered to multiple versions. 

If 3rd parties deploy your software into cloud environments, raising the deployment rate may involve reconsidering how the product is delivered. Some products now carry out the provisioning and management of their environment rather than just releasing an executable with pages of setup instructions. This makes deployment and updates far simpler for the user and reduces bugs caused by differences between the environment in which the product was tested and the one in which it is used.

What to measure?

Record when releases go out, including if they bypassed any standard processes.
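
As a minimal sketch, assuming a hand-maintained or pipeline-generated release log (the field names and data below are invented, not a real schema), both the frequency and the number of out-of-process releases fall out easily:

    from datetime import date

    # Invented release log; in practice the deployment pipeline would append to it.
    releases = [
        {"date": date(2024, 3, 1), "bypassed_process": False},
        {"date": date(2024, 3, 4), "bypassed_process": False},
        {"date": date(2024, 3, 6), "bypassed_process": True},   # emergency hotfix
        {"date": date(2024, 3, 12), "bypassed_process": False},
    ]

    span_days = (releases[-1]["date"] - releases[0]["date"]).days or 1
    per_week = len(releases) / span_days * 7
    bypassed = sum(r["bypassed_process"] for r in releases)

    print(f"{per_week:.1f} deployments per week; {bypassed} bypassed the standard process")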

Dangers

Raising the deployment rate should improve quality and reliability, but it is crucial to ensure that these aren’t being compromised.

Balancing measures

  • Bug rates
  • Downtime
  • Customer satisfaction


Change failure rate

Why it matters 

This gets at the heart of the transformation that DevOps and Continuous Delivery are attempting to deliver. Increasing deployment frequency should not result in the failure rate going up. It is essential to ensure that you aren’t trading risk for speed. We could all release every merge if we just dropped testing and didn’t worry about bugs.

What to measure

Many businesses only measure the number of deployments that cause outages or serious service impairments. I’d hope this is a rare occurrence, but I know that many companies with slower release cadences rush out hotfixes for most of their releases. Counted this narrowly, the measure is unlikely to be useful unless tracked over the length of a quarter or even a year.

The DORA State of DevOps Report defines it as:

“For the primary application or service you work on, what percentage of changes to production or released to users result in degraded service (e.g., lead to service impairment or service outage) and subsequently require remediation (e.g., require a hotfix, rollback, fix forward, patch)?” 

The data they received from this question didn’t show a clear difference between Elite, High, and Medium performing companies. However, only low performers had much higher failure rates. There could be several reasons for this: 

  • What counts as a “service impairment” depends on a company's quality standards. As standards rise and rapid fixes become easier, the severity of an issue that falls into this category falls, but the one-dimensional number is unable to reflect this.
  • Introducing automated testing is often done early in building out CI/CD. This dramatically drops the bug rate, and the later increase in release frequency doesn’t make a significant further difference.

There is another situation that I would consider a “failed release” that isn’t covered in the definition. How often is a release delayed or canceled because a feature isn’t ready or a bug was discovered at the last minute? As releases become more common and tests are better automated, this should rapidly drop to near zero, but when starting out, it can be an important indicator of issues.

Initially, measuring the number of releases that failed to occur, required a rollback, necessitated a hotfix, or caused downtime as separate statistics will cause the least confusion and best expose the form the issues take.
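
A sketch of keeping those as separate counts, assuming each attempted release has been tagged with its outcome (the tags and data below are invented):

    from collections import Counter

    # Invented outcome tags recorded against each attempted release.
    release_outcomes = [
        "ok", "ok", "hotfix", "ok", "cancelled",
        "ok", "rollback", "ok", "downtime", "ok",
    ]

    counts = Counter(release_outcomes)
    attempted = len(release_outcomes)
    for outcome in ("cancelled", "rollback", "hotfix", "downtime"):
        print(f"{outcome:>9}: {counts[outcome]} of {attempted} releases "
              f"({counts[outcome] / attempted:.0%})")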

Alternatives

  • Bug rates
  • Downtime
  • Rejection rate at various points in the pipeline


Time to restore 

Why it matters

In all but a few cases, a down system doesn’t make money, and even in those cases it certainly doesn’t improve customer satisfaction.

Limitations

Unless you have a lot of products or are averaging over a long period, rare events will dominate this statistic.

The DORA State of DevOps Report classifies these into three groups: less than an hour, less than a day, and more than a day.
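
A small sketch of classifying incidents into those groups, assuming detection and restoration timestamps are being logged (the incidents below are invented):

    from datetime import datetime, timedelta

    # Invented incident log: (detected, service restored) timestamp pairs.
    incidents = [
        (datetime(2024, 2, 2, 10, 0), datetime(2024, 2, 2, 10, 40)),
        (datetime(2024, 3, 10, 23, 15), datetime(2024, 3, 11, 7, 5)),
    ]

    def restore_bucket(detected, restored):
        """Place an incident into one of the three groups above."""
        downtime = restored - detected
        if downtime < timedelta(hours=1):
            return "less than an hour"
        if downtime < timedelta(days=1):
            return "less than a day"
        return "more than a day"

    for detected, restored in incidents:
        print(f"{detected:%Y-%m-%d}: {restore_bucket(detected, restored)}")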

Alternatives

  • Number of “game days”
  • Recovery time during practice across a range of scenarios

Shifting to measuring preparation makes it much easier to get more measurements, but it must be remembered that these are only proxies. There is a danger that they could cause complacency. Game days should always contain new scenarios. It is also essential that those carrying out the ‘recovery’ do not know what scenario they are facing to ensure that the crucial step of identifying the failure is covered. It is easy to waste time going down the wrong path in a real emergency.

In many areas, there is the concept of “near misses”. By measuring near misses it is possible to gather data about risk without waiting for actual disasters. I don’t advise creating a statistic from near misses, but it is valuable data to gather and analyze. Ideally, it should take multiple mistakes to bring down the production environment.


How many times does a feature visit QA before it is released?

Why it matters

Having work handed off from development to QA should be a step that eventually disappears in favor of integrated QA and team ownership of quality. At the start of the journey, while QA is still a separate group, measuring how often QA “rejects” work can be valuable. Features being passed back and forth are a good sign of waste and of a “throw it over the wall” attitude to development.

What to measure

The number of times a feature is rejected by QA. This can often be derived from the issue tracking system. For instance, JIRA has a plugin called “time in status” which can generate a report showing how many times the issue entered the QA status.
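
If such a report isn't available, the count is easy to derive once the status history has been exported from the tracker. The sketch below assumes the changelog has already been pulled out as a list of (date, from status, to status) transitions; the status names and data are invented:

    # Invented status history for a single feature, exported from the tracker.
    transitions = [
        ("2024-03-01", "In Progress", "QA"),
        ("2024-03-02", "QA", "In Progress"),   # rejected by QA
        ("2024-03-04", "In Progress", "QA"),
        ("2024-03-05", "QA", "Done"),
    ]

    qa_visits = sum(1 for _, _, to_status in transitions if to_status == "QA")
    qa_rejections = sum(
        1 for _, from_status, to_status in transitions
        if from_status == "QA" and to_status not in ("Done", "Released")
    )
    print(f"QA visits: {qa_visits}, rejections by QA: {qa_rejections}")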

Limitations

When “Modern Testing” practices are implemented, the hard boundary between QA and Development becomes much less clear. There may no longer be a step in the process called QA, so it won’t be possible to measure this. At the same time, the waste this is a proxy for should also have diminished dramatically.

Dangers

Blame is the largest risk: if there are fights over who’s at fault, that is a strong sign that quality hasn’t yet shifted left. It also indicates that the teams feel the metric will be used to punish them. While neither situation is good, attempting the measurement has at least provided feedback as to what needs addressing.

Alternatives

  • Bug rates
  • Rejection rate at various points in the pipeline


Bug rates

Why it matters

Bugs range from minor annoyances to blocking customers from getting value from the product. The number of bugs found in the software is a straightforward measure of quality. Measuring quantity alone may miss important signals owing to the number of possible confounding factors, such as how easy it is to file bugs, how many features customers commonly use, and the severity of the bugs.

It isn’t necessary to rely solely on users to report bugs. The software can include automated warnings and send debug data back. This is often neglected as a measure, but it can provide useful information. Before using this information, it is important that teams and developers feel safe. This is covered more in “Dangers”.

What to measure

Most organizations will already have bug trackers and some policies on how bugs are categorized. If your organization doesn’t yet have these processes, it should take at most a day or two to set up a bug tracker and copy someone else's categorization policy. There are a vast number of bug trackers on the market; I would suggest selecting one that someone you know is using. This isn’t going to find the ideal system for you, but it will allow you to get going more quickly.

The number of customers impacted can initially be estimated roughly, but tracking feature usage should eventually be implemented. Knowing how frequently a feature is used is the first step to determining feature value.

Automated failure metrics should include de-duplication for a single user experiencing the same fault multiple times. This loses some information, but it is preferable to a problem hit three times by one user being ranked as more critical than a problem hit once each by two users.
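
A sketch of that de-duplication, assuming each automated report carries a user identifier and an error signature such as a hashed stack trace (the records below are invented):

    from collections import defaultdict

    # Invented automated failure reports sent back by the client software.
    reports = [
        {"user": "u1", "signature": "NullRef@Checkout"},
        {"user": "u1", "signature": "NullRef@Checkout"},  # same user retrying
        {"user": "u1", "signature": "NullRef@Checkout"},
        {"user": "u2", "signature": "NullRef@Checkout"},
        {"user": "u3", "signature": "Timeout@Search"},
    ]

    # Count distinct users per failure signature rather than raw report volume.
    affected_users = defaultdict(set)
    for report in reports:
        affected_users[report["signature"]].add(report["user"])

    for signature, users in sorted(affected_users.items()):
        print(f"{signature}: {len(users)} distinct user(s) affected")

With counts like these, the problem two users each hit once correctly outranks the one a single user retried three times.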

Limitations

When there is a large backlog of bugs, it can be hard to tell if a bug is new or a duplicate. Equally, some new bugs will be written off as duplicates when they break, in a new way, something that already had a bug filed against it.

Dangers

It is trivial for developers to reduce the number of warning messages and the amount of feedback from client-side errors. This is precisely the opposite of what is wanted. The developers must believe in the importance of improving quality, and these statistics should never be used to bully anyone. As the quality of the warnings and client-side reporting improves, there will be frequent changes where the stats should be reset, as it would be meaningless to compare old results to those after increased logging.

Alternatives

This is a basic measure that should always be used but can be augmented by other more direct business outcome measures.

  • Customer satisfaction
  • Revenue / Conversion rates / Customer activity


Regression test run restarts per release

Why it matters

It is common to hear from managers that it takes one or two weeks to run the full suite of regression tests, while QA claims it only takes three days. This disparity is often caused by each party discussing a different process but using the same name. The manager is talking about the time from starting regression testing until it is complete. QA is talking about the time needed to carry out one full run of the suite.

With a sizable manual test suite, it is very rare for all tests to pass after a feature change. The failed tests are reported back to developers, who create fixes. After the fixes are merged, the failed tests are re-run. Once all of the fixes are confirmed to resolve the bugs found, the full test suite is re-run. This normally uncovers new bugs that the fixes introduced. These are sent back to development, and the cycle continues. In an ideal world, there would eventually be a run with no new bugs found. In reality, the typical outcomes are:

  • Under time pressure, the final run only re-checks the bugs found in the previous run, and then it is released with everyone crossing their fingers. 
  • A full regression is run before release, but the bugs found are considered small enough that the release goes ahead and the bugs get added to the backlog.

What to measure

A basic measure is how many times the full regression run is restarted. Slightly more nuanced is how many manual tests are carried out per release vs. how many are in the suite.
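
As a sketch of both measures, assuming each regression cycle for a release is logged with the number of manual tests actually executed (the suite size and numbers below are invented):

    # Invented log of regression cycles for one release.
    suite_size = 450
    cycles = [
        {"tests_executed": 450},  # first full run
        {"tests_executed": 38},   # re-check of the bugs found
        {"tests_executed": 450},  # restarted full run
        {"tests_executed": 12},   # final spot check before release
    ]

    full_runs = sum(1 for c in cycles if c["tests_executed"] == suite_size)
    restarts = max(full_runs - 1, 0)
    executed = sum(c["tests_executed"] for c in cycles)

    print(f"Full-suite restarts: {restarts}")
    print(f"Manual tests executed vs. suite size: {executed} / {suite_size} "
          f"({executed / suite_size:.1f}x)")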

Limitations

This is only a meaningful measure when a large manual test suite is available. 

Dangers

It is likely that many in the business don’t know how the sausage is made and could react poorly to discovering that releases take place with known bugs, or that the final release version of the code may well not have gone through the entire test suite.

Alternatives

  • Bug rates
  • The number of times a feature visits QA before it is released



