Strategy for Effective and Efficient Developer Teams

Richard Anton

Principal Engineer @ Snowflake | Streamlit, Reliability, ML/AI

Published: December 26, 2022

All I want for Christmas is my full CD

Disclaimer

The opinions expressed here are my own, not my employer’s, and anything I have wrong is entirely my own, not that of my sources.

Introduction

I have spent a lot of time interacting with different developer teams helping design, plan, operate, and maintain various software engineering efforts.

One of the things I have observed is that investing in the quality of code early in the development process pays back dividends later. Some specific ways I believe this is true include: 1) test-driven development (TDD), 2) continuous integration/continuous delivery (CI/CD), and 3) investing in a team culture of rigor and transparency.

In this article, I am going to touch on the TDD and team culture aspects but will focus most on CI/CD. There are a lot of related operational aspects of software development for online systems beyond the final deployment to production that I will not really discuss here to keep things focused.

Test-Driven Development

Test-driven development relies on developing the test cases before the code to be tested. In practice, I often do this in small iterations simultaneously. The creation or at least popularization of TDD is generally attributed to Kent Beck.

Defects that are caught earlier in development are cheaper to fix and affect your customers less, if at all.

Writing unit tests before, or at least alongside, the code they test lets you find bugs immediately, while you are still working on a small change list and the problem is easy to identify. It also gives you an immediate setup for stepping through your code in a debugger. This is especially important for web services and distributed systems, which rarely let you do this easily with the product in its final form.

Test coverage should be measured, because if you don’t measure what matters, upholding the standard will be harder to prioritize and developers will be more likely to let things slip. Like most metrics, though, code coverage is a signal of where to look deeper rather than a goal in itself. You can still achieve high coverage numbers with brittle, badly designed tests that provide little value.
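
As a minimal sketch of this test-first loop, assume a hypothetical slugify helper (the name and behavior are illustrative, not from any real project). In a test-first flow the two test cases would exist before the function, and the function body is the smallest implementation that makes them pass.

```python
import re
import unittest


def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase, hyphen-separated slug."""
    # Replace each run of non-alphanumeric characters with a single hyphen,
    # then trim any hyphen left at either end.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


class SlugifyTest(unittest.TestCase):
    # These cases describe the expected behavior and fail until slugify()
    # is implemented correctly.
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("Full CD Now"), "full-cd-now")

    def test_punctuation_is_collapsed(self):
        self.assertEqual(slugify("CI/CD, please!"), "ci-cd-please")
```

Saved as a module, this can be run with python -m unittest, which also gives you a ready entry point for stepping through slugify in a debugger.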

Unit tests are not the only type of tests that are important, but they are the ones that cover the most cases cheaply and catch defects the soonest, before you even send code out for review or wait on slower integration tests to run.

Writing the tests earlier means that you find design decisions that make the code difficult to test soon enough to change them easily. For example, do you have interfaces that allow you to use dependency injection rather than calling something over the network or file system directly?
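
To make the dependency injection point concrete, here is a small sketch. The RateSource interface, PriceConverter class, and FakeRates stand-in are all hypothetical names chosen for illustration. The converter receives its rate source instead of creating a network client itself, so a test can pass in an in-memory fake.

```python
from typing import Protocol


class RateSource(Protocol):
    """Hypothetical interface for anything that can supply an exchange rate."""

    def get_rate(self, currency: str) -> float: ...


class PriceConverter:
    def __init__(self, rates: RateSource) -> None:
        # The dependency is injected rather than constructed here, so tests
        # can substitute a fake for a client that would call the network.
        self._rates = rates

    def to_usd(self, amount: float, currency: str) -> float:
        return round(amount * self._rates.get_rate(currency), 2)


class FakeRates:
    """In-memory stand-in used only by tests (the rate is made up)."""

    def get_rate(self, currency: str) -> float:
        return {"EUR": 1.25}[currency]


converter = PriceConverter(FakeRates())
print(converter.to_usd(20.0, "EUR"))  # prints 25.0
```

In production the same PriceConverter would be handed a real client that satisfies RateSource; the class itself does not change.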

Creating the tests also creates a sort of living documentation of how things are expected to behave that is always up to date because it breaks when it becomes outdated.

Writing tests at the start also forces you to think more carefully about the interfaces and APIs you are providing. If things are decomposed more cleanly you will end up with fewer challenges creating the tests, such as not needing to have as many mocks or spies in the tests.

CI/CD

I have been part of various efforts to improve the level of testing and deployment automation in different organizations over the years. I have learned that investing in effective CI/CD early in a product’s development is critical. Because of this, I am willing to push back against product feature scope in early versions so the team can build the capabilities to deliver rapidly later on. There are many tradeoffs between short-term feature delivery and long-term capabilities in software development, but getting CI/CD right early on pays back the time invested through improved velocity and a reduced likelihood of defects in production. Many of the other tradeoffs concern problems you think you may have later, such as a more general design, additional features you are not sure customers will use, and prevention of operational issues you may not encounter as soon as you think. However, setting up your team to deliver system changes faster is virtually guaranteed to save you time when you need to address product changes from customer feedback or technical changes needed due to unforeseen problems.

My conviction in CI/CD’s value is based on a combination of my own experiences, the experiences of teams I have advised and observed, existing publications by various experts, and published research. This investment does need to be appropriately sized in proportion to your other efforts. It is possible to over-index, since delivering rapid changes for non-existent customers will not help the big picture very much.

One of the publications I have cited a number of times while advocating for CI/CD investments is the State of DevOps report, begun in 2014 by DORA (DevOps Research and Assessment). The report was previously produced jointly with Google and Puppet, but has branched into multiple versions in recent years. The one I have cited most often is the 2019 Accelerate State of DevOps report. It lists four metrics, drawn from the prior six years of the authors’ research, that they found to be related to software delivery performance. These four metrics were:

Deployment frequency (how often is new code deployed to production/provided to users)
Lead time for changes (how long for committed code to run successfully in production)
Time to restore service (basically the same as MTTM, mean time to mitigation: how long it takes to stop the customer impact of an outage)
Change failure rate (what % of changes required a rollback, hotfix, or patch).

There is a fifth metric in addition to the original four, Availability (what percentage of the time availability targets were hit).
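
To show how the first four metrics might be computed from your own records, here is a minimal sketch. The Deployment record, its field names, and the sample data are assumptions made for illustration; the reports define the metrics but do not prescribe this code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it was running in production
    failed: bool            # needed a rollback, hotfix, or patch


def deployment_frequency(deploys: list[Deployment], window_days: int) -> float:
    """Average number of production deployments per day in the window."""
    return len(deploys) / window_days


def median_lead_time(deploys: list[Deployment]) -> timedelta:
    """Median time from commit to running in production."""
    return median(d.deployed_at - d.committed_at for d in deploys)


def change_failure_rate(deploys: list[Deployment]) -> float:
    """Fraction of deployments that required remediation."""
    return sum(d.failed for d in deploys) / len(deploys)


def mean_time_to_restore(outages: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean of (mitigated - started) over all outages."""
    total = sum((end - start for start, end in outages), timedelta())
    return total / len(outages)


# Made-up sample data: four deployments over an eight-day window.
t0 = datetime(2022, 12, 1, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(hours=2), False),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4), True),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=3), False),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=1), False),
]
outages = [(t0, t0 + timedelta(minutes=30)), (t0, t0 + timedelta(minutes=90))]

print(deployment_frequency(deploys, window_days=8))  # 0.5
print(median_lead_time(deploys))                     # 2:30:00
print(change_failure_rate(deploys))                  # 0.25
print(mean_time_to_restore(outages))                 # 1:00:00
```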

These same five metrics were again discussed in the findings of the Puppet 2020 State of DevOps report.

These are the metrics the researchers found to be most important to high performing organizations across a multi-year study with survey participants from engineering across many organizations and industries.

The first two, deployment frequency and lead time for changes, are both directly related to getting to full CD for your software system. The other two metrics are indirectly related to CI/CD. The time to restore service in an outage is reduced if you have automated testing and deployment for hotfixes and the ability to roll back easily (ideally automatically). For the fourth metric, the change failure rate, it’s easy to argue that it would be reduced by the testing you need in place to get to full CD, or by meaningfully engaging in test-driven development.

In addition to the metrics from the State of DevOps report findings, another important aspect of continuous delivery in practice and at scale is phased deployment.

Phased deployment means deploying to different portions of your fleet, progressively to find problems sooner and limit the collateral damage of an issue that is not discovered before production.

Deployment phases typically include:

Local testing on the developer’s machine or an environment dedicated to the use of a single developer.
Continuous integration environments: where automated tests are run whenever a new change is committed.
Shared beta environments (these often go by various other names): where no production systems are involved and beta systems talk to other beta versions of their service/software dependencies.
Preproduction environments: where production data and some production services are used for the dependencies of a new version of a software system as the last phase before production. These are often used for manual or automated user acceptance testing since they are typically set up to be as close to what will happen in production as teams are able to achieve.
Limited scope of production: this takes different forms and may include blue/green deployments, one-box deployments, single cell or single data center, or different kinds of canary deployments.
Progressive production rollout: this is a progressive rollout across all production regions and data centers involved in your service. How this works depends on what kind of compute infrastructure and cloud provider you are using. It can be as simple as one data center or cell at a time with a fixed interval of time between or as advanced as increasing sizes of waves of different cells in different data centers or regions.
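
The last two phases can be sketched as a loop over waves of targets. This is only an illustration under stated assumptions: the rollout function, the wave names, and the deploy and healthy callbacks are hypothetical, and a real pipeline would wait for a bake time between waves rather than checking health immediately.

```python
from typing import Callable, Sequence


def rollout(
    waves: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy wave by wave and stop after the first wave that is unhealthy.

    Returns every target that received the new version, so the caller
    knows what to roll back if the rollout stopped early.
    """
    deployed: list[str] = []
    for wave in waves:
        for target in wave:
            deploy(target)
            deployed.append(target)
        # Gate the next (larger) wave on the health of the current one.
        if not all(healthy(target) for target in wave):
            break
    return deployed


# Waves grow in size: one box, one cell, then several cells at a time.
waves = [["one-box"], ["cell-1"], ["cell-2", "cell-3"], ["cell-4", "cell-5", "cell-6"]]
touched = rollout(
    waves,
    deploy=lambda target: None,               # stand-in for a real deploy call
    healthy=lambda target: target != "cell-3",  # simulate a failure in wave 3
)
print(touched)  # ['one-box', 'cell-1', 'cell-2', 'cell-3']
```

The simulated failure in the third wave stops the rollout before the largest wave is touched, which is the point of ordering waves from small to large.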

Your test pyramid and monitoring systems should integrate well with your phased deployment pipelines. For example:

Your code review system should run unit (and preferably integration) tests automatically and alert the author and reviewer of any failures.
Your continuous integration system should run unit and integration tests as part of the build process.
The first appropriate shared phase of your phased deployment, i.e., beta or whatever it gets called where you work, should run any performance benchmarks, load testing, automated UI tests, any automated acceptance tests, and all end-to-end or canary tests you have for your system.
Each phase of your phased deployment should continue to run tests that are safe (i.e., will not affect real customer data), especially those that could break due to incompatible versions of your upstream clients or downstream dependencies.
You should utilize some sort of feature flag or experiment dialup system that lets you launch new code in a disabled state (a dark launch). Besides giving you an extra lever to remediate issues, this also makes it safer to deploy code supporting incomplete or unannounced features. Be careful to include a process to safely track and retire feature flags from prior launches because they can become future problems waiting to happen when some well-meaning developer deletes a feature flag still in use by a launched feature.
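
A feature flag with a percentage dial-up can be sketched as follows. The FeatureFlags class and the flag names are hypothetical, and a real system would load the percentages from managed configuration rather than a constructor argument.

```python
import hashlib


class FeatureFlags:
    """Hypothetical in-memory flag store with per-flag rollout percentages."""

    def __init__(self, rollout_percent: dict[str, int]) -> None:
        self._rollout_percent = rollout_percent

    def is_enabled(self, flag: str, user_id: str) -> bool:
        # Unknown flags default to 0%, so newly deployed code launches dark.
        percent = self._rollout_percent.get(flag, 0)
        # Hash flag + user so each user lands in a stable bucket in [0, 100).
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percent


flags = FeatureFlags({"new-checkout": 0, "saved-carts": 100})
print(flags.is_enabled("new-checkout", "user-42"))  # False: deployed but dark
print(flags.is_enabled("saved-carts", "user-42"))   # True: fully dialed up
```

Note that the default-to-off behavior is exactly what makes careless flag cleanup dangerous: in this sketch, deleting the saved-carts entry while the code still checks it would silently turn a launched feature off.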

Treat configuration as code. By this I mean configuration typed somewhere by an engineer, not end user settings.

Phased deployment should be baked in not just for code deployments, but for how you manage configuration and infrastructure automation as well. Phased deployment for code changes is not much consolation when someone knife-switches a production configuration change that triggers a production issue across your entire fleet. The same applies to automating infrastructure changes, which have even more leverage than the code within the service. For example, if you break the automated production configuration for load balancers and black hole all the client requests to your system, it will not make much difference if the service code itself was bug-free.

If you are working with machine learning models or other types of generated datasets used in production then you also want to support phased deployment, some form of validation, and rollback for those models and datasets.

Configuration and policy changes should have an audit history and change control just as code in Git does. This can be done using GitOps style by checking in the configuration to a repo. Alternatively, you can do this by adding things like review of draft changes, audit history, rollback support, and phased deployment to configuration management systems used by your software.
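
For the second option, the properties listed above (review, audit history, rollback) can be sketched in a few lines. ConfigChange and ConfigHistory are hypothetical names, and this in-memory version only illustrates the shape of the idea; it is not how any particular configuration system works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigChange:
    version: int
    author: str
    reviewer: str
    values: dict


class ConfigHistory:
    """Hypothetical append-only history of reviewed configuration changes."""

    def __init__(self, initial: dict) -> None:
        self._changes = [ConfigChange(1, "bootstrap", "bootstrap", dict(initial))]

    @property
    def current(self) -> ConfigChange:
        return self._changes[-1]

    def apply(self, author: str, reviewer: str, values: dict) -> ConfigChange:
        # Change control: nobody approves their own change.
        if author == reviewer:
            raise ValueError("a change needs a reviewer other than its author")
        change = ConfigChange(self.current.version + 1, author, reviewer, dict(values))
        self._changes.append(change)
        return change

    def rollback(self, author: str, reviewer: str) -> ConfigChange:
        # A rollback is recorded as a new change so the audit trail stays complete.
        if len(self._changes) < 2:
            raise ValueError("nothing to roll back to")
        return self.apply(author, reviewer, self._changes[-2].values)


history = ConfigHistory({"timeout_s": 30})
history.apply("alice", "bob", {"timeout_s": 5})
history.rollback("bob", "alice")
print(history.current.version, history.current.values)  # 3 {'timeout_s': 30}
```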

During phased deployment, it is important to have observability and automated alerting that will detect a problem before it reaches all of production. For example, canary monitoring or service health checks can probe a service endpoint specific to the first portion of the production fleet you deploy to (and similarly for preproduction), so that you can alert and automatically stop further deployment if the check fails.

Similarly, you want to be able to separate metrics, logs, and traces by the sub-fleet or software version so that you can alert on problems that are caused by the new version but might not be affecting most of the fleet yet.

These types of tests and alerts let you detect a problem before most customers are affected in many situations even if a defect makes it all the way to production.
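
Separating metrics by software version and gating the rollout on them can be sketched as below. The function names, the record format, and the 5% threshold are assumptions for illustration; a production canary analysis would use proper statistical comparison rather than a fixed difference.

```python
from collections import Counter


def error_rates(requests: list[tuple[str, bool]]) -> dict[str, float]:
    """Error rate per software version from (version, succeeded) records."""
    total = Counter()
    errors = Counter()
    for version, succeeded in requests:
        total[version] += 1
        if not succeeded:
            errors[version] += 1
    return {version: errors[version] / total[version] for version in total}


def should_halt(
    rates: dict[str, float], baseline: str, canary: str, max_increase: float = 0.05
) -> bool:
    """Halt the rollout if the canary's error rate exceeds the baseline's
    by more than max_increase (an arbitrary illustrative threshold)."""
    return rates[canary] - rates[baseline] > max_increase


# Made-up traffic: v2 serves only a small slice of the fleet so far.
requests = (
    [("v1", True)] * 95 + [("v1", False)] * 5
    + [("v2", True)] * 8 + [("v2", False)] * 2
)
rates = error_rates(requests)
print(rates)                                          # {'v1': 0.05, 'v2': 0.2}
print(should_halt(rates, baseline="v1", canary="v2"))  # True
```

In this made-up data the fleet-wide error rate is only about 6%, which an aggregate alert could easily miss; splitting by version exposes the 20% error rate on the new version.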

A Culture of Rigor and Transparency

Rigor includes establishing practices of review and knowledge sharing. I try to purposefully limit the confidence I place in any code, document, idea, or design I have authored until someone else with relevant expertise has reviewed it. This is part of disconfirming your own beliefs; in terms of Amazon leadership principles this is part of “Are Right, A Lot”. Being right requires seeking out evidence likely to expose any confirmation bias or knowledge gaps that may have affected your thinking.

Getting value out of a review requires being able to have a meaningful conversation that contains constructive criticism. You must leave as much of your ego at the door as you can, and remember that you are not your work. You are definitely not a specific artifact produced by your work. If you let yourself be too prideful you will either resist legitimate feedback by being defensive or engage in perfectionism.

When people are prideful or anxious about feedback on their work, they delay sharing it with others. Sharing the 90%-done, work-in-progress version of your code or document with your colleagues sooner gets you to the best answer more often and more efficiently. This does not mean you should not do your best, but when you reach a point where you need others’ knowledge or an objective viewpoint, or lack any obvious way to verify or improve your work further, it’s time to send it out and see what comes back.

To reduce the effects of anxiety and imposter syndrome on productivity, you have to be civil when providing feedback and be willing to let people make their own mistakes. Let folks experiment and learn from their experience when the stakes are not too high. If they are doing it differently than you would, share your thoughts, but let it go if it is not a critical decision. Experience is a more effective teacher than most of us. If things are too adversarial, especially for newcomers to an organization, then team members will not have enough psychological safety to perform at their best.

Transparency reinforces knowledge sharing, enables collaboration, and helps cut down on communication overhead and meetings. Discoverability of information is nearly as important as transparency. It does not benefit anyone if information is available to people who have no idea if it exists or where to find it.

Rigor and transparency are two aspects of creating a learning culture. Another important mechanism for organizations to learn effectively is to have blameless postmortems for issues and retrospectives for sprints and launches.

There is some truth to the idea that you only learn from failure, but to succeed you need to learn efficiently from failure as an organization. This means if you learned something, share it, make it discoverable to those you did not share it with directly, and describe how you are going to avoid similar problems in the future. As Socrates said, "The unexamined life is not worth living."

The corollary required to get value from this is to actually look at the prior system design art and previous postmortems and retrospectives to learn from them before you repeat the past.

Conclusion

The key takeaways from this article are:

Test-driven development and deployment automation improve development quality and velocity so get them in place early.
Small batch sizes and short lead times for production changes improve the effectiveness of software development teams.
Foster a culture of transparency, rigorous review, retrospectives, and postmortems to improve organizational learning.

The information here is only a small part of how to efficiently develop and deploy quality software, but it represents a picture of a few of the strategic priorities that I think are important to make software development and DevOps teams successful.

Credits

Photo by Christopher Gower on Unsplash

For More Information

Clare Liguori recently published a new CI/CD article in the Amazon Builder’s Library which I highly recommend for further reading. It also links to several related articles in the library.

A favorite of mine related to phased deployment and automating rollback, because I managed the team that owned it for about a year, is Google’s paper on the Canary Analysis Service.

References

Amazon Web Services, “Using synthetic monitoring”, Amazon CloudWatch User Guide, https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html. Accessed 25 Dec 2022.

“Guide To GitOps”. WeaveWorks. https://www.weave.works/technologies/gitops/. Accessed 23 Dec 2022.

Humble, J., “Evidence and Case Studies”, continuousdelivery.com, https://continuousdelivery.com/evidence-case-studies/. Accessed 23 Dec 2022.

Liguori, C., “My CI/CD pipeline is my release captain”, Amazon Builder’s Library, https://aws.amazon.com/builders-library/cicd-pipeline/. Accessed 23 Dec 2022.

Wells, D. “Code the Unit Test First”, ExtremeProgramming.org, https://www.extremeprogramming.org/rules/testfirst.html. Accessed 23 Dec 2022.

Yanacek, D. “Implementing Health Checks”, Amazon Builder’s Library, https://aws.amazon.com/builders-library/implementing-health-checks/. Accessed 25 Dec 2022.

State of DevOps Reports

“2020 State of DevOps Report: Presented by Puppet and CircleCI”, https://circleci.com/resources/state-of-devops-report-2020/. Accessed 25 Dec 2022.

“Accelerate State of DevOps 2019”, https://services.google.com/fh/files/misc/state-of-devops-2019.pdf. Accessed 23 Dec 2022.

“Accelerate State of DevOps 2021”, https://services.google.com/fh/files/misc/state-of-devops-2021.pdf. Accessed 23 Dec 2022.

“Announcing the 2022 Accelerate State of DevOps Report: A deep dive into security”, https://cloud.google.com/blog/products/devops-sre/dora-2022-accelerate-state-of-devops-report-now-out. Accessed 23 Dec 2022.

“Download the 2021 State of DevOps Report | Puppet by Perforce”. Puppet.com. https://www.puppet.com/resources/state-of-devops-report. Accessed 23 Dec 2022.

Papers

Beetz, F., and Harrer, S., "GitOps: The Evolution of DevOps?," in IEEE Software, vol. 39, no. 4, pp. 70-75, July-Aug. 2022. DOI: 10.1109/MS.2021.3119106

Cheng, L., et al. "What improves developer productivity at Google? Code quality." Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2022. https://dl.acm.org/doi/abs/10.1145/3540250.3558940

Davidovič, Š., & Beyer, B. (2018). Canary analysis service. Communications of the ACM, 61(5), 54-62. https://dl.acm.org/doi/10.1145/3190566

Forsgren, N., and Humble, J. "The Role of Continuous Delivery in IT and Organizational Performance." Proceedings of the Western Decision Sciences Institute (WDSI). 2016. https://dx.doi.org/10.2139/ssrn.2681909

Jaspan, C., et al. "Advantages and disadvantages of a monolithic repository: a case study at Google." Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice. 2018. https://research.google/pubs/pub45424/

Lenberg, P., and Feldt, R. "Psychological safety and norm clarity in software engineering teams." Proceedings of the 11th international workshop on cooperative and human aspects of software engineering. 2018. https://dl.acm.org/doi/abs/10.1145/3195836.3195847

Parnin, C., et al., "The Top 10 Adages in Continuous Deployment," in IEEE Software, vol. 34, no. 3, pp. 86-95, May-Jun. 2017, DOI: 10.1109/MS.2017.86. https://ieeexplore.ieee.org/abstract/document/7927896

Rahman, A. A. U., Helms E., Williams L., and Parnin, C., "Synthesizing Continuous Deployment Practices Used in Software Development," 2015 Agile Conference, 2015, pp. 1-10, DOI: 10.1109/Agile.2015.12. https://ieeexplore.ieee.org/abstract/document/7284592

Tosun, A., et al. "An industry experiment on the effects of test-driven development on external quality and productivity." Empirical Software Engineering 22.6 (2017): 2763-2805. https://doi.org/10.1007/s10664-016-9490-0

Books

Beck, K., Test-Driven Development by Example. Boston Addison-Wesley, 2014.

Forsgren, N., Humble, J., & Kim, G. (n.d.). Accelerate: the science behind DevOps: building and scaling high performing technology organizations.

Humble, J., Farley, D. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation Addison Wesley; 1 edition, 27 July 2010

Smith, S. (2017). Measuring Continuous Delivery. Retrieved from https://leanpub.com/measuringcontinuousdelivery

Fred Wiesinger

2 years ago

Terrific list of best practices, many familiar from my time at Google.

Eve Rubin

Sr Product Manager, Amazon

2 years ago

Automating CI/CD is like creating a natural selection process. It takes time to do it the right way, but when you got it, your service and your team thrives.

Kate Tsoukalas

Senior Software Development Manager at Amazon

2 years ago

Do you find a lot of teams running perf/load tests in earlier environment stages like Beta? In practice it seems more likely to see game days or testing in later stages and even prod depending on the scenario. Wondering if you have opinions on that.

Michael A.

Director of Engineering at Pinterest.

2 years ago

This is a great article Richard. Thank you for sharing it.
