Production Readiness Reviews
Abstract
Software development organizations often have processes to ensure software services are ready for production, commonly called operational readiness reviews (ORRs) or production readiness reviews (PRRs). This article describes what such reviews should cover, why they are worth the investment, and how to conduct them.
Overview and Motivation
Software development organizations often have processes to ensure systems are ready to operate reliably in production. These processes may include a security review, or security may be handled through a separate process.
Terminology varies, but the processes share many concepts. At Amazon, we called these operational readiness reviews (ORRs), and at Google SRE, we had a related process for when we took over production support from developer teams that we called production readiness reviews (PRR).
Conducting such a review is pivotal for identifying and mitigating risks before they affect users.
These reviews are also applicable to data pipelines, machine learning systems, and mobile or desktop applications that depend on external, online services, though the appropriate specifics vary between types of system.
Some groups prepare these as documents, while others share them as presentations or as a tracked set of work items in a tool like JIRA. If you use slides, be sure to link to more detailed references for others to dig into if desired, and share the slides with the audience.
Operational readiness reviews serve several purposes.
Readiness reviews are also used to ensure uniformity in production-related standards such as logging formats, metric naming, and infrastructure-as-code (IaC) coding standards. Uniform control surfaces make it much easier to reuse automation and monitoring across teams.
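As a concrete illustration of checking such standards automatically, here is a minimal sketch that validates metric names against a naming convention. The convention itself (dot-delimited lowercase segments) and the example names are my own illustrative assumptions, not an established standard:

```python
import re

# Hypothetical convention: dot-delimited lowercase segments,
# e.g. "checkout.payment.latency_ms".
METRIC_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$")

def check_metric_names(names):
    """Return the subset of metric names that violate the convention."""
    return [name for name in names if not METRIC_NAME_PATTERN.match(name)]

violations = check_metric_names([
    "checkout.payment.latency_ms",   # conforms
    "CheckoutPaymentLatency",        # wrong case, no namespace
    "checkout..latency",             # empty segment
])
```

A check like this can run in CI so that nonconforming metric names are caught before they reach production dashboards.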
These reviews also produce several other desirable outcomes beyond their immediate goals.
While these processes and the artifacts and learning produced are valuable, they are not free.
Preparing for these reviews requires a significant amount of work, and there are follow-ups you commit to after the review itself has been conducted. The work is reduced when a prior review has content that can be updated and reused.
These reviews should be viewed as an iterative process that is started during development. As you refine your understanding of where your gaps are, you should reassess risks and priorities, adjust your project roadmap to balance operational readiness against development velocity, and tackle the highest-priority opportunities first. Prioritize the reliability work with the biggest return, in reliability improvement and risk reduction, relative to its cost and time to implement.
This also provides valuable input into system design for observability, deployment safety, scalability, and performance, so changes can be made earlier, when they are less costly to implement.
Areas and Topics Covered
I would recommend that an ORR or PRR cover a number of specific areas of concern.
System Purpose, Use Cases, and Requirements
System and Product Roadmap
Major changes to the system, its usage, or the product(s) it forms part of are important context for teams developing and operating software systems. For example, if you know that a major marketing event or an expected usage spike is coming, that is important input for capacity planning.
System Monitoring and Observability
Monitoring a system for availability and performance is critically important to providing a reliable product experience.
This part of the review should include a breakdown of the dashboards, instrumentation, metrics, monitors, alarms, and alerts that are in place for the system or are planned to be put in place before launch. This should also cover any instrumentation, such as logs, emitted metrics, traces, or crash collection, for troubleshooting and providing observability into the system.
For a concise introduction to monitoring concepts and considerations, I recommend “Chapter 6 - Monitoring Distributed Systems” from the SRE Book (Beyer, et al., 2016). I also recommend reading through the OpenTelemetry site’s Observability Primer, and exploring the AWS Observability Best Practices site.
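To make the instrumentation discussion concrete, here is a minimal sketch of emitting structured metrics around a request handler. The metric names, the `/orders` endpoint, and the JSON-lines-to-stdout transport are illustrative assumptions; in a real system the records would go to your logging or metrics pipeline:

```python
import json
import sys
import time

def emit_metric(name, value, unit, **dimensions):
    """Emit one structured metric record as a JSON line (stdout here;
    in practice this would feed your log/metrics pipeline)."""
    record = {"metric": name, "value": value, "unit": unit, **dimensions}
    sys.stdout.write(json.dumps(record) + "\n")

def handle_request(payload, process):
    """Wrap a (hypothetical) request handler with success, error,
    and latency instrumentation."""
    start = time.monotonic()
    try:
        result = process(payload)
        emit_metric("requests.success", 1, "count", endpoint="/orders")
        return result
    except Exception:
        emit_metric("requests.error", 1, "count", endpoint="/orders")
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        emit_metric("requests.latency", elapsed_ms, "ms", endpoint="/orders")
```

Emitting one structured record per event like this is what makes the dashboards, monitors, and alarms discussed above possible to build consistently.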
Oncall Response and Rotation
Include a discussion of the oncall, pager, business-hours, or other rotation or staffing that is in place, or being put in place, for dealing with production issues. This should cover the process, the rotation itself, and the mechanics of delivering pages, bugs, or notifications.
Reliability and Availability
Availability is usually used to mean whether the system is reachable, but I prefer the broader term reliability. The short version: a reliable system is available, returns the correct result, and delivers it in a timely manner.
The review should evaluate the system’s reliability, including its ability to recover from failures and maintain data integrity.
The system and its operators must also safeguard any data the user entrusts to the system in terms of integrity, durability, security, privacy, and compliance.
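Reliability targets are often expressed as service level objectives (SLOs) with an associated error budget. As an illustrative sketch (the SLO target and request counts are made-up numbers), the arithmetic looks like this:

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Remaining error budget for a request-based availability SLO.

    slo_target: e.g. 0.999 for "99.9% of requests succeed".
    Returns (allowed_failures, fraction_of_budget_remaining).
    """
    allowed = total_requests * (1.0 - slo_target)
    if allowed == 0:
        return 0.0, 0.0
    return allowed, (allowed - failed_requests) / allowed

# A 99.9% SLO over 1,000,000 requests allows ~1,000 failures;
# 250 failures so far leaves roughly 75% of the budget.
allowed, remaining_fraction = error_budget(0.999, 1_000_000, 250)
```

Tracking the remaining budget gives the team an objective signal for when to slow feature work and invest in reliability instead.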
Capacity Management
Provide an evaluation of the system’s current capacity in terms of how much traffic, throughput, or other appropriate measure of scale it can handle. Couple this with whatever performance or scalability benchmarking or load testing has been done, or is planned before launch, to measure the system’s scalability. Where possible, compare this against available historical usage or traffic data and future projections to validate the results.
This also includes considerations around the cost of the system in terms of efficiency and trade-offs between availability goals and cost efficiency.
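When reviewing load-test results, percentile summaries are more informative than averages. Here is a minimal sketch of summarizing latency samples with nearest-rank percentiles (the percentile choices are illustrative):

```python
import math

def latency_percentiles(samples_ms, percentiles=(50, 90, 99)):
    """Summarize load-test latency samples using the nearest-rank method:
    the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples_ms)
    n = len(ordered)
    result = {}
    for p in percentiles:
        rank = max(1, math.ceil(p / 100 * n))
        result[f"p{p}"] = ordered[rank - 1]
    return result
```

Comparing these percentiles across load levels, and against historical traffic data, shows where latency starts to degrade as the system approaches its capacity limits.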
System Dependencies
Provide a complete list of dependencies, whether first or third party, that are used by the system. In software services, these are most often other services called over the network. In data processing pipelines or workflows, these are often jobs that provide their output as input to a later job in the workflow. Important dependencies are not necessarily limited to things you call over the network. Anything with a realistic risk of breaking due to the hardware, network, environment, bug, scale, or operator error should be included.
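One way to keep such a list reviewable is to record each dependency with its criticality, timeout, and failure behavior in a structured inventory. This is a sketch; the dependency names and values are hypothetical examples, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Dependency:
    name: str
    kind: str        # e.g. "service", "datastore", "batch_job"
    critical: bool   # does its failure directly break the user experience?
    timeout_ms: int
    fallback: str    # behavior when the dependency is unavailable

DEPENDENCIES = [
    Dependency("auth-service", "service", True, 500,
               "fail closed: reject sign-ins"),
    Dependency("recommendations", "service", False, 200,
               "serve default recommendations"),
    Dependency("orders-db", "datastore", True, 1000,
               "none: hard dependency"),
]

def hard_dependencies(deps):
    """Names of dependencies whose failure takes the system down."""
    return [d.name for d in deps if d.critical]
```

Reviewing the `critical` and `fallback` fields makes it obvious which dependencies need graceful degradation work before launch.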
A Review of Operational History
For systems with an operational history, include a review of prior issues, all recent postmortems, and an overall analysis of the alerts generated and handled for the system.
Known Issues
Often, we do not have the luxury of addressing every issue with the current system while also planning new functionality or even ongoing support. Everything has an opportunity cost. It is fine to take on technical debt that carries known risks, operational toil, reduced observability, or other issues, as long as we do so with our eyes open, prioritizing that work against efforts to address other issues or to add new functionality or features.
This is something that should be addressed in a review, but it works better kept as a running tally in your backlog, updated periodically as a specific category of work and summarized at the time of a relevant review.
Risk Analysis
This should consist of a detailed risk analysis discussing all significant risks to the product, its users, and the company developing, maintaining, and supporting it. This should cover the likelihood of each risk, the impact if it occurs, possible mitigations to reduce the likelihood or impact of occurrence, and what mitigations are done, planned, or considered but not planned.
The system’s uptime history and incident response plans should also be scrutinized to ensure that any issues can be quickly addressed and rectified.
In many ways, this is the primary point of the exercise: decide which risks are the highest priority by auditing the system and its supporting artifacts and processes, then determine what actions to take now, in the near future, and later in the roadmap to reduce those risks to a level acceptable to the business.
By proactively identifying and addressing risks, organizations can reduce the likelihood and impact of potential disruptions, helping to ensure the smooth operation and success of their business.
Risk Management Matrix
The risk management matrix is a tool used to identify, assess, and prioritize risks. It is valuable for organizations of all sizes because it helps them proactively manage risks and reduce the potential for losses.
The risk management matrix typically includes columns for the risk, its impact, its severity, its likelihood, and the mitigations in place or planned.
Here is a short example of a risk management matrix:
| Risk | Impact | Severity | Likelihood | Mitigations |
| --- | --- | --- | --- | --- |
| Auth system dependency outage | Inability for users to sign in to the product | High | Possible | Legal counsel review, clear contracting, insurance |
| Service overload | Increased latency, potential reduced availability | Medium | Likely | Load testing, capacity planning, autoscaling for services, monitoring dependency performance and error rate |
| Data breach | Loss of user trust, legal liability | High | Possible | Data encryption, security audits, staff training, penetration testing |
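A common way to turn such a matrix into a priority order is to map the severity and likelihood labels to ordinal scores and rank risks by their product. The mapping and example risks below are illustrative assumptions, not a standard scale:

```python
SEVERITY = {"Low": 1, "Medium": 2, "High": 3}
LIKELIHOOD = {"Unlikely": 1, "Possible": 2, "Likely": 3}

def prioritize(risks):
    """Sort risks by severity x likelihood score, highest first."""
    return sorted(
        risks,
        key=lambda r: SEVERITY[r["severity"]] * LIKELIHOOD[r["likelihood"]],
        reverse=True,
    )

risks = [
    {"risk": "Auth system dependency outage",
     "severity": "High", "likelihood": "Possible"},
    {"risk": "Service overload",
     "severity": "Medium", "likelihood": "Likely"},
    {"risk": "Stale cache served to users",
     "severity": "Low", "likelihood": "Likely"},
]
```

The resulting ranking is a starting point for discussion, not a substitute for judgment; two risks with equal scores may still deserve very different treatment.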
Disaster Recovery and Business Continuity
Testing, CI/CD, and change management
I talked about why CI/CD is important in one of my earlier articles, Strategy for Effective and Efficient Developer Teams, so here we will focus on how to review where a team and their systems stand and how to address gaps.
Test-driven development (TDD) emphasizes the creation of automated tests before writing the actual code. This approach helps ensure that the code is correct, designed for testability, and meets the requirements. TDD has several important benefits, including catching regressions early, encouraging modular design, and producing tests that serve as executable documentation of intended behavior.
Reviewing Test Coverage
By reviewing or auditing a software system’s automated testing, you can identify gaps in coverage and prioritize new tests for the code paths that carry the most risk.
Continuous integration and continuous deployment (CI/CD) are critical aspects of modern software development. The integration of CI/CD practices into software development has a significant impact on operational readiness and production reliability by reducing human error, enabling automated testing, and promoting small, frequent changes that are easier to test and troubleshoot.
High-Level Checklist for CI/CD Practices
Deployment Strategies
It is important to avoid a “big-bang” or all-at-once approach to updating software. Phased or incremental deployment approaches have several benefits that can improve availability.
Phased deployment, also known as incremental or staged rollout, is a deployment strategy where new features, updates, or entire applications are gradually released to a subset of users or production environments. This allows teams to test the impact of changes in a controlled manner, monitor performance, and quickly address any issues before full-scale deployment, which reduces the impact on users.
Canary deployment is a technique where a new version is rolled out to a small subset of users or servers to validate the reliability and performance of the change before being deployed to the rest of production.
Blue-green deployments are another strategy: you maintain two environments, deploy the new version to the idle environment while it is drained of live traffic, and then use load-balancer configuration to shift traffic to it progressively. This approach allows very rapid rollback by simply redirecting traffic back. The downside is maintaining two environments that can each support production traffic, but if you combine this with auto-scaling for the service(s) involved, you can scale the inactive environment down when it is not being used for a deployment.
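The decision at the heart of a canary rollout, whether the canary's error rate justifies proceeding, can be sketched as a simple threshold gate. This is a deliberately naive illustration with made-up thresholds; real canary analysis (see Davidovič & Beyer, 2018, in the references) uses proper statistical comparison:

```python
def canary_healthy(baseline_errors, baseline_total,
                   canary_errors, canary_total,
                   max_relative_increase=0.5, min_requests=100):
    """Naive canary gate: pass only if the canary has enough traffic to
    judge and its error rate is not more than 50% worse than baseline."""
    if canary_total < min_requests:
        return False  # not enough data to make a call
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return canary_rate <= baseline_rate * (1 + max_relative_increase)
```

A gate like this would run after each rollout stage, halting (or rolling back) the deployment automatically when the canary looks unhealthy.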
Backwards Compatibility: There are special considerations around user sessions with web apps, compatibility between different layers in your service stack, and also schema changes to data stores. For example, if you roll out a new version of your frontend that depends on a new feature in your API layer that requires a new column in your data store’s schema to be present, then you will likely have problems. You also have to be able to roll back changes safely, which is why having rollback testing in some pre-production phase of deployment is useful.
Feature Flags and Dark Launches: You should have at least a simple mechanism for deploying changes to your system behind feature flags, which you can change more quickly than a full deployment. This lets you rollback more easily in case of problems. It decouples launch from deployment, and it gives you a mechanism to work around otherwise problematic backward compatibility issues (expected or otherwise). Feature flags can also be coupled with the measurement of user analytics data and performance metrics to perform A/B experiments to evaluate the effect on user experience and behavior of new features and other changes. Ideally, your feature flag mechanism should itself support gradual rollout so you can test a change on a subset of users and allow for internal users to force a flag setting to test behavior in production before enabling it for actual end users.
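The gradual-rollout behavior described above is commonly implemented by hashing each user into a stable percentage bucket. Here is a minimal sketch; the flag name and user IDs are hypothetical, and real flag systems add overrides for internal users, targeting rules, and so on:

```python
import hashlib

def flag_enabled(flag_name, user_id, rollout_percent):
    """Deterministically bucket a user into [0, 100), so a flag's audience
    is stable across requests and grows monotonically as rollout_percent
    is raised from 0 to 100."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent

# Hypothetical usage: the same user sees the same answer every time,
# and raising rollout_percent only ever adds users, never removes them.
enabled = flag_enabled("checkout-redesign", "user-123", 25)
```

Hashing on the combination of flag name and user ID means different flags get independent audiences, so the same early cohort is not subjected to every experiment.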
Team training and incident management
The importance of the human element of operations should not be underestimated. How communications are managed during an incident, how team members are trained, and how follow-ups such as postmortems are conducted play a significant role in the reliability of a system.
Training for the staff who will operate and maintain the system should be reviewed to ensure they are adequately prepared. Operational procedures, including incident response, deployment processes, change management, and support protocols, should be well-defined and tested. Training should cover several areas, including runbooks and incident management.
For an introduction to runbooks, see “What is a Runbook” by PagerDuty.
For more information on incident management in practice, I recommend starting with the“Incident Response” chapter of the SRE Book(Beyer, et al., 2016).
Security
Security is paramount in today’s digital landscape. If there is not a separate process being followed for security, including threat modeling and security risk management, then you should perform one as part of the ORR process.
The security assessment should cover the system’s security posture, ensuring that all data is protected against unauthorized access and breaches. This involves reviewing authentication mechanisms, access controls, encryption standards, and security protocols. Compliance with relevant regulations and standards, such as GDPR or HIPAA, must also be verified to avoid legal and financial repercussions.
Human review should be combined with automated scanning and, preferably, outside auditing and penetration testing when feasible.
We are not going to cover security or threat modeling in any detail here, not because it is unimportant, but because doing it justice would significantly increase the scope and length of this article. If you need a starting point for threat modeling and mitigation, I recommend the Threat Modeling Process provided by OWASP. Another good resource is "Threat Modeling: 12 Available Methods" (SEI, 2018) from the Software Engineering Institute blog.
Customer Support
The ORR should assess the readiness of the support team to handle customer inquiries and technical issues. Service level agreements (SLAs) and support response times should be evaluated to ensure they meet business requirements.
Legal and financial considerations
The ORR should not overlook legal and financial aspects, such as licensing agreements, intellectual property rights, and budget allocations. It is crucial to ensure that the system’s launch does not expose the organization to legal vulnerabilities or unexpected costs.
Sample Questions
If you structure your PRR/ORR as a document, one approach is to create a survey-style list of questions for the service team to answer, covering the important topics.
Here is a non-exhaustive list of sample questions to cover during a readiness review. You should tailor the list to your situation, organization, and needs. A template of questions like this can serve as a checklist while being much more concise than a document (such as this article) covering the entire process.
Conclusion
An operational readiness review allows organizations to identify and address potential risks before launch, and is an essential step in the software development lifecycle. Proactive review helps ensure the system performs as expected and provides a positive customer experience.
References
Books
Beck, K. (2002). Test-Driven Development: By Example. Boston: Addison-Wesley.
Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media.
Beyer, B., Murphy, N. R., Rensin, D. K., Kawahara, K., & Thorne, S. (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O’Reilly Media.
Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution Press.
Humble, J., & Farley, D. (2010). Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley.
Kim, G., Humble, J., Debois, P., Willis, J., & Forsgren, N. (2021). The DevOps Handbook: How to Create World-Class Agility, Reliability, & Security in Technology Organizations. IT Revolution Press.
Martraire, C. (2019). Living documentation: Continuous knowledge sharing by design. Addison-Wesley
Articles
The 6 Pillars of the AWS Well-Architected Framework | Amazon Web Services. (2022, March 1). Amazon Web Services. https://aws.amazon.com/blogs/apn/the-6-pillars-of-the-aws-well-architected-framework/.
Amazon Web Services, (n.d.). “Using synthetic monitoring”, Amazon CloudWatch User Guide, https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatchSyntheticsCanaries.html. Accessed 25 Dec 2022.
Amazon Web Services, (n.d.). AWS Observability Best Practices. Retrieved February 4, 2024, from https://aws-observability.github.io/observability-best-practices/
Cocca, G. (2023, April 7). What is CI/CD? Learn Continuous Integration/Continuous Deployment by Building a Project. freeCodeCamp.org. https://www.freecodecamp.org/news/what-is-ci-cd/
Davidovič, Š., & Beyer, B. (2018). Canary analysis service. Communications of the ACM, 61(5), 54-62. https://dl.acm.org/doi/10.1145/3190566
DORA | DevOps Research and Assessment. (n.d.). https://dora.dev/
Dodd, R. (2023, January 30). Four common deployment strategies. LaunchDarkly. https://launchdarkly.com/blog/four-common-deployment-strategies/
Liguori, C., “My CI/CD pipeline is my release captain”, Amazon Builder’s Library, https://aws.amazon.com/builders-library/cicd-pipeline/. Accessed 23 Dec 2022.
Observability Primer. (2024, January 30). OpenTelemetry. https://opentelemetry.io/docs/concepts/observability-primer
Production Readiness Review. (2023, December 6). The GitLab Handbook. https://handbook.gitlab.com/handbook/engineering/infrastructure/production/readiness/
Datadog. (2021, June 7). Synthetic Testing: What It Is & How It Works. Retrieved February 13, 2024, from https://www.datadoghq.com/knowledge-center/synthetic-testing
Threat Modeling: 12 Available Methods. (2018, December 3). SEI Blog. https://insights.sei.cmu.edu/blog/threat-modeling-12-available-methods/
Threat Modeling Process | OWASP Foundation. (n.d.). https://owasp.org/www-community/Threat_Modeling_Process
What is a Runbook? | PagerDuty. (2023, March 21). PagerDuty. Retrieved February 4, 2024, from https://www.pagerduty.com/resources/learn/what-is-a-runbook/