Strategy for Effective and Efficient Developer Teams
All I want for Christmas is my full CD
Disclaimer
The opinions expressed here are my own, not my employer’s, and anything I have wrong is entirely my own error, not that of my sources.
Introduction
I have spent a lot of time interacting with different developer teams helping design, plan, operate, and maintain various software engineering efforts.
One of the things I have observed is that investing in the quality of code early in the development process pays dividends later. Some specific ways I believe this is true include: 1) test-driven development (TDD), 2) continuous integration/continuous delivery (CI/CD), and 3) investing in a team culture of rigor and transparency.
In this article, I am going to touch on the TDD and team culture aspects but will focus most on CI/CD. There are a lot of related operational aspects of software development for online systems beyond the final deployment to production that I will not really discuss here to keep things focused.
Test-Driven Development
Test-driven development relies on developing the test cases before the code to be tested. In practice, I often do this in small iterations simultaneously. The creation or at least popularization of TDD is generally attributed to Kent Beck.
Defects that are caught earlier in development are cheaper to fix and affect your customer less, if at all.
Writing unit tests before, or at least alongside, the code they test lets you find bugs immediately, while the change list is still small and the problem is easy to identify. It also gives you an immediate setup for stepping through your code in the debugger. This is especially important for web services and distributed systems, which rarely let you do this easily with the product in its final form.
Test coverage should be measured, because if you don’t measure what matters, upholding the standard will be harder to prioritize and developers will be more likely to let things slip until later. Like most metrics, though, code coverage is a sign of where to look deeper rather than an absolute goal. You can still achieve high test coverage metrics with brittle, badly designed tests that do not provide much value.
Unit tests are not the only type of tests that are important, but they are the ones that cover the most cases cheaply and catch defects the soonest, before you even send code out for review or wait on slower integration tests to run.
Writing the tests earlier means that you find design decisions that make the code difficult to test soon enough to change them easily. For example, did you have interfaces that allow you to use dependency injection rather than calling something over the network or file system directly?
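To make the dependency injection point concrete, here is a minimal sketch in Python. The names RateSource, PriceConverter, and FakeRateSource are hypothetical and chosen only for illustration; the idea is that the code under test receives its dependency through an interface, so a unit test can substitute an in-memory fake for the real network call.

```python
from typing import Dict, Protocol, Tuple


class RateSource(Protocol):
    """Hypothetical interface: anything that can return an exchange rate."""

    def get_rate(self, base: str, quote: str) -> float: ...


class PriceConverter:
    """Converts prices using an injected rate source instead of calling the network itself."""

    def __init__(self, rates: RateSource) -> None:
        self._rates = rates

    def convert(self, amount: float, base: str, quote: str) -> float:
        if amount < 0:
            raise ValueError("amount must be non-negative")
        return round(amount * self._rates.get_rate(base, quote), 2)


class FakeRateSource:
    """In-memory stand-in for unit tests; no network or file system access."""

    def __init__(self, rates: Dict[Tuple[str, str], float]) -> None:
        self._rates = rates

    def get_rate(self, base: str, quote: str) -> float:
        return self._rates[(base, quote)]


def test_convert_uses_injected_rate() -> None:
    # Written first in a test-driven flow; it pins down the interface we want.
    converter = PriceConverter(FakeRateSource({("USD", "EUR"): 0.5}))
    assert converter.convert(10.0, "USD", "EUR") == 5.0
```

In production the same PriceConverter would be constructed with a real implementation of the interface. Because the test needs only a small fake, it runs quickly and can be stepped through in a debugger.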
Creating the tests also creates a sort of living documentation of how things are expected to behave that is always up to date because it breaks when it becomes outdated.
Writing tests at the start also forces you to think more carefully about the interfaces and APIs you are providing. If things are decomposed more cleanly you will end up with fewer challenges creating the tests, such as not needing to have as many mocks or spies in the tests.
CI/CD
I have been part of various efforts to improve the level of testing and deployment automation in different organizations over the years. I have learned that investing in effective CI/CD early in a product’s development is critical. Because of this, I am willing to push back against product feature scope in early versions so the team can build the capability to deliver rapidly later on. There are lots of trade-offs between short-term feature delivery and long-term capabilities in software development, but getting CI/CD right early on will pay back the time invested through improved velocity and a reduced likelihood of defects in production. Many of the other trade-offs are about problems you think you may have later, like more general design, additional features you are not sure customers will use, and prevention of operational issues you may not encounter as soon as you think. However, setting up your team to deliver system changes faster is virtually guaranteed to save you time when you need to address product changes from customer feedback or technical changes needed due to unforeseen problems.
My conviction in CI/CD’s value is based on a combination of my own experiences, the experiences of teams I have advised and observed, existing publications by various experts, and published research. This investment does need to be appropriately sized in proportion to your other efforts. It is possible to over-index, since delivering rapid changes for non-existent customers will not help the big picture very much.
One of the publications I have cited a number of times while advocating for CI/CD investments is the State of DevOps report, begun in 2014 by DORA (DevOps Research and Assessment). The report was previously produced jointly with Google and Puppet, but has branched into multiple versions in recent years. The one I have cited most often is the 2019 Accelerate State of DevOps report. It lists four metrics that the authors’ prior six years of research found to be related to software delivery performance: deployment frequency, lead time for changes, time to restore service, and change failure rate.
There is a fifth metric in addition to the original four, Availability (what percentage of the time availability targets were hit).
These same five metrics were again discussed in the findings of the Puppet 2020 State of DevOps report.
These are the metrics the researchers found to be most important to high performing organizations across a multi-year study with survey participants from engineering across many organizations and industries.
The first two, deployment frequency and lead time for changes, are both directly related to getting to full CD for your software system. The other two metrics are indirectly related to CI/CD. The time to restore service in an outage is reduced if you have automated testing and deployment for hot-fixes and the ability to roll back easily (ideally automatically). For the fourth metric, the change failure rate, it’s easy to argue that it would be reduced by the testing you need in place to get to full CD or to meaningfully engage in test-driven development.
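As a rough illustration of how a team might track these four metrics from its own deployment records, here is a minimal Python sketch. The Deployment record and its field names are assumptions made for this example, not the report's definitions. In particular, time to restore is measured here from the failing deployment, which is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""

    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def delivery_metrics(deployments: List[Deployment], window_days: int) -> dict:
    """Summarize the four delivery metrics over a window of `window_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Simplification: restore time is counted from the failing deployment.
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used rather than means so that a single unusually slow change does not dominate the picture. That is a design choice for this sketch, not something the report prescribes.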
In addition to the metrics from the State of DevOps report findings, another important aspect of continuous delivery in practice and at scale is phased deployment.
Phased deployment means deploying progressively to different portions of your fleet, to find problems sooner and limit the collateral damage of an issue that is not discovered before production.
Deployment phases typically include:
Your test pyramid and monitoring systems should integrate well with your phased deployment pipelines. For example:
Treat configuration as code. By this I mean configuration typed somewhere by an engineer, not end user settings.
Phased deployment should be baked in not just for code deployments, but for how you manage configuration and infrastructure automation as well. Phased deployment for code changes is not much consolation when someone knife-switches a production configuration change that triggers an issue across your entire fleet. The same applies to automated infrastructure changes, which have even more leverage than the code within the service. For example, if you break the automated production configuration for load balancers and black hole all the client requests to your system, it will not make much difference if the service code itself was bug-free.
If you are working with machine learning models or other types of generated datasets used in production then you also want to support phased deployment, some form of validation, and rollback for those models and datasets.
Configuration and policy changes should have an audit history and change control just as code in Git does. This can be done using GitOps style by checking in the configuration to a repo. Alternatively, you can do this by adding things like review of draft changes, audit history, rollback support, and phased deployment to configuration management systems used by your software.
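The second approach can be sketched as follows. This is a minimal, in-memory illustration. The ConfigStore and ConfigVersion names are hypothetical, and a real system would persist the history and integrate with its own review tooling. It shows three of the properties mentioned above: a required reviewer, an append-only audit history, and rollback recorded as a new audited change.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ConfigVersion:
    """One audited configuration change (hypothetical record layout)."""

    number: int
    values: Dict[str, str]
    author: str
    reviewer: str
    note: str


@dataclass
class ConfigStore:
    """Append-only history of configuration versions."""

    history: List[ConfigVersion] = field(default_factory=list)

    def current(self) -> Optional[ConfigVersion]:
        return self.history[-1] if self.history else None

    def apply(self, values: Dict[str, str], author: str, reviewer: str, note: str) -> ConfigVersion:
        # Change control: every change needs a reviewer who is not the author.
        if not reviewer or reviewer == author:
            raise PermissionError("a change needs a reviewer other than its author")
        version = ConfigVersion(len(self.history) + 1, dict(values), author, reviewer, note)
        self.history.append(version)
        return version

    def rollback(self, author: str, reviewer: str) -> ConfigVersion:
        # A rollback is itself a new version, so the audit trail is never rewritten.
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        previous = self.history[-2]
        return self.apply(previous.values, author, reviewer, f"rollback to v{previous.number}")
```

Phased deployment of the configuration itself is left out to keep the sketch short. It could be layered on top by applying each new version to one portion of the fleet at a time.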
During phased deployment, it is important to have observability and automated alerting that will detect a problem before it reaches all of production. For example, canary monitoring or service health checks can examine a service endpoint specific to the first portion of the production fleet you deploy to (and similarly for preproduction), so you can alert and automatically stop further deployment if the test fails.
Similarly, you want to be able to separate metrics, logs, and traces by the sub-fleet or software version so that you can alert on problems that are caused by the new version but might not be affecting most of the fleet yet.
These types of tests and alerts let you detect a problem before most customers are affected in many situations even if a defect makes it all the way to production.
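The combination of phased deployment, per-version metrics, and automatic stopping described above can be sketched in a few lines of Python. The phase names, fleet fractions, and error-rate threshold below are all assumptions chosen for illustration. The deploy and check callables stand in for whatever your deployment and monitoring systems actually provide.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical phases: (name, fraction of the production fleet on the new version).
PHASES: Sequence[Tuple[str, float]] = (
    ("one-box", 0.01),
    ("first-wave", 0.10),
    ("half", 0.50),
    ("full", 1.00),
)


def is_healthy(new_errors: int, new_requests: int,
               old_errors: int, old_requests: int,
               max_increase: float = 0.01) -> bool:
    """Compare the new version's error rate with the old version's (metrics split by version)."""
    if new_requests == 0:
        return False  # no traffic on the new version yet, so there is nothing to judge
    new_rate = new_errors / new_requests
    old_rate = old_errors / old_requests if old_requests else 0.0
    return new_rate <= old_rate + max_increase


def phased_rollout(deploy: Callable[[float], None],
                   check: Callable[[str], bool],
                   phases: Sequence[Tuple[str, float]] = PHASES) -> Tuple[bool, List[str]]:
    """Deploy phase by phase; stop and roll back as soon as a phase looks unhealthy."""
    completed: List[str] = []
    for name, fraction in phases:
        deploy(fraction)
        if not check(name):
            deploy(0.0)  # automatic rollback: take the new version out of the fleet
            return False, completed
        completed.append(name)
    return True, completed
```

A real pipeline would also wait for a bake time in each phase and consider more signals than error rate, but the control flow is the same: each phase is a gate, and a failed gate limits the damage to the portion of the fleet already updated.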
A Culture of Rigor and Transparency
Rigor includes establishing practices of review and knowledge sharing. I try to purposefully limit the confidence I place in any code, document, idea, or design I have authored until someone else with relevant expertise has reviewed it. This is part of disconfirming your own beliefs; in terms of Amazon leadership principles this is part of “Are Right, A Lot”. Being right requires seeking out evidence likely to expose any confirmation bias or knowledge gaps that may have affected your thinking.
Getting value out of a review requires being able to have a meaningful conversation that contains constructive criticism. You must leave as much of your ego at the door as you can, and remember that you are not your work. You are definitely not a specific artifact produced by your work. If you let yourself be too prideful you will either resist legitimate feedback by being defensive or engage in perfectionism.
When people are prideful or anxious about feedback on their work, they delay sharing it with others. Sharing the 90% work-in-progress version of your code or document with your colleagues sooner is more likely to get to the best answer more often and more efficiently. This does not mean you should not do your best, but when you reach a point where you need others’ knowledge, an objective viewpoint, or lack any obvious way to verify or improve your work further, it’s time to send it out and see what comes back.
To reduce the effects of anxiety and imposter syndrome on productivity, you have to be civil when providing feedback and be willing to let people make their own mistakes. Let folks experiment and learn from their experience when the stakes are not too high. If they are doing it differently than you would, share your thoughts, but let it go if it is not a critical decision. Experience is a more effective teacher than most of us. If things are too adversarial, especially for newcomers to an organization, then team members will not have enough psychological safety to perform at their best.
Transparency reinforces knowledge sharing, enables collaboration, and helps cut down on communication overhead and meetings. Discoverability of information is nearly as important as transparency. It does not benefit anyone if information is available to people who have no idea if it exists or where to find it.
Rigor and transparency are two aspects of creating a learning culture. Another important mechanism for organizations to learn effectively is to have blameless postmortems for issues and retrospectives for sprints and launches.
There is some truth to the idea that you only learn from failure, but to succeed you need to learn efficiently from failure as an organization. This means if you learned something, share it, make it discoverable to those you did not share it with directly, and describe how you are going to avoid similar problems in the future. As Socrates said, "The unexamined life is not worth living."
The corollary required to get value from this is to actually look at the prior system design art and previous postmortems and retrospectives to learn from them before you repeat the past.
Conclusion
The key takeaways from this article are:
The information here is only a small part of how to efficiently develop and deploy quality software, but it represents a picture of a few of the strategic priorities that I think are important to make software development and DevOps teams successful.
Credits
Photo by Christopher Gower on Unsplash
For More Information
Clare Liguori recently published a new CI/CD article in the Amazon Builder’s Library which I highly recommend for further reading. It also links to several related articles in the library.
A favorite of mine on phased deployment and automated rollback, because I managed the team that owned it for about a year, is Google’s paper on the Canary Analysis Service.
References
Amazon Web Services, “Using synthetic monitoring”, Amazon CloudWatch User Guide, https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html. Accessed 25 Dec 2022.
“Guide To GitOps”. WeaveWorks. https://www.weave.works/technologies/gitops/. Accessed 23 Dec 2022.
Humble, J., “Evidence and Case Studies”, continuousdelivery.com, https://continuousdelivery.com/evidence-case-studies/. Accessed 23 Dec 2022.
Liguori, C., “My CI/CD pipeline is my release captain”, Amazon Builder’s Library, https://aws.amazon.com/builders-library/cicd-pipeline/. Accessed 23 Dec 2022.
Wells, D. “Code the Unit Test First”, ExtremeProgramming.org, https://www.extremeprogramming.org/rules/testfirst.html. Accessed 23 Dec 2022.
Yanacek, D. “Implementing Health Checks”, Amazon Builder’s Library, https://aws.amazon.com/builders-library/implementing-health-checks/. Accessed 25 Dec 2022.
State of DevOps Reports
“2020 State of DevOps Report: Presented by Puppet and CircleCI”, https://circleci.com/resources/state-of-devops-report-2020/. Accessed 25 Dec 2022.
“Accelerate State of DevOps 2019”, https://services.google.com/fh/files/misc/state-of-devops-2019.pdf. Accessed 23 Dec 2022.
“Accelerate State of DevOps 2021”, https://services.google.com/fh/files/misc/state-of-devops-2021.pdf. Accessed 23 Dec 2022.
“Announcing the 2022 Accelerate State of DevOps Report: A deep dive into security”, https://cloud.google.com/blog/products/devops-sre/dora-2022-accelerate-state-of-devops-report-now-out. Accessed 23 Dec 2022.
“Download the 2021 State of DevOps Report | Puppet by Perforce”. Puppet.com. https://www.puppet.com/resources/state-of-devops-report. Accessed 23 Dec 2022.
Papers
Beetz, F., and Harrer, S., "GitOps: The Evolution of DevOps?," in IEEE Software, vol. 39, no. 4, pp. 70-75, July-Aug. 2022. DOI: 10.1109/MS.2021.3119106
Cheng, L., et al. "What improves developer productivity at google? code quality." Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2022. https://dl.acm.org/doi/abs/10.1145/3540250.3558940
Davidovič, Š., & Beyer, B. (2018). Canary analysis service. Communications of the ACM, 61(5), 54-62. https://dl.acm.org/doi/10.1145/3190566
Forsgren, N., and Humble, J. "The Role of Continuous Delivery in IT and Organizational Performance." Proceedings of the Western Decision Sciences Institute (WDSI). 2016. https://dx.doi.org/10.2139/ssrn.2681909
Jaspan, C., et al. "Advantages and disadvantages of a monolithic repository: a case study at Google." Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. 2018. https://research.google/pubs/pub45424/
Lenberg, P., and Feldt, R. "Psychological safety and norm clarity in software engineering teams." Proceedings of the 11th international workshop on cooperative and human aspects of software engineering. 2018. https://dl.acm.org/doi/abs/10.1145/3195836.3195847
Parnin, C., et al., "The Top 10 Adages in Continuous Deployment," in IEEE Software, vol. 34, no. 3, pp. 86-95, May-Jun. 2017, DOI: 10.1109/MS.2017.86. https://ieeexplore.ieee.org/abstract/document/7927896
Rahman, A. A. U., Helms E., Williams L., and Parnin, C., "Synthesizing Continuous Deployment Practices Used in Software Development," 2015 Agile Conference, 2015, pp. 1-10, DOI: 10.1109/Agile.2015.12. https://ieeexplore.ieee.org/abstract/document/7284592
Tosun, A., et al. "An industry experiment on the effects of test-driven development on external quality and productivity." Empirical Software Engineering 22.6 (2017): 2763-2805. https://doi.org/10.1007/s10664-016-9490-0
Books
Beck, K., Test-Driven Development by Example. Boston: Addison-Wesley, 2014.
Forsgren, N., Humble, J., & Kim, G. (n.d.). Accelerate: the science behind DevOps: building and scaling high performing technology organizations.
Humble, J., Farley, D. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley; 1st edition, 27 July 2010.
Smith, S. (2017). Measuring Continuous Delivery. Retrieved from https://leanpub.com/measuringcontinuousdelivery