Chasing the elusive Continuous Deployment
Thomas (Tom C) Chmielewski
Vice President of Product Management - Improving Existing Portfolios, and Designing & Launching New Products & Services
How many of my Product Management colleagues still deliver releases on a quarterly (or even longer) release cycle? Most of the teams I managed did just that.
Yet a surge of companies have moved to continual releases outdistancing their competitors. Google, Macy’s, Amazon, Facebook, Etsy, Target, Nordstrom, and Netflix routinely and reliably deploy code into production hundreds, or even thousands of times per day.
What? Not possible! HOW? THAT CAN’T BE!!
The DevOps Handbook explains how to replicate these incredible outcomes by showing how to integrate Product Management, Development, QA, IT, Ops, and InfoSec to improve delivery and better win in the marketplace.
There are three phases to accomplish this:
- The principle of FLOW
- The principle of FEEDBACK
- The principle of CONTINUAL LEARNING AND EXPERIMENTATION
Here is my three part article summary of the process.
Part one is about the principle of FLOW.
Phase 1 = THE PRINCIPLE OF FLOW – the theory of constraints
Most of us start off in our status quo job of long hours, weekend work, a backlog of technical debt, never seeming to catch up, too many requirements, not enough sprints. It seems like we work towards opposing goals and feel powerless. Burnout follows, with the associated feelings of fatigue, cynicism, and even helplessness and despair.
Let’s talk about HP’s LaserJet Firmware division, which had 400 developers. They weren’t getting much new development done. Marketing/Product Management had hundreds of product ideas each year. Development said ‘we can do two – we only have capacity for two of your ideas.’ Does this sound like a familiar tune? HP went through the DevOps transformation. They moved from spending just 5% of capacity on new features to being able to allocate 40% of capacity to new features. We would all love that delivery velocity.
Let’s start.
Phase 1 - The principle of FLOW
The first step is to map out the entire sequence of events from identifying a feature or customer request all the way to delivery to the client. That means NOT simply deployment, but the client’s implementation and use as well, because deployed code that is not used is like unsold merchandise on a shelf: it is not in the customer’s hands yet. We typically follow the standard process. We get ideas from customers, maybe at the annual user conference. Sometime later we write them up and turn them into requirements (Epics), break them down into stories, put them into the backlog, prioritize them, and groom them. Then we start to code them and get asked a ton of clarifying questions, then test, then merge, then test again, then deploy. And then we have a fix-it / maintenance release. There are ways to speed this up.
Google had this scenario. They had infrequent code deployments. After the transformation they went from infrequent code changes to 40,000 code commits a day and 50,000 builds a day! And we all know Google stuff works in production every day. We all use it. If they can make the change, you can too.
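To make the mapping exercise concrete, here is a minimal sketch of how the mapped steps could be tallied to find where an idea spends most of its time waiting. The step names follow the process described above, but every number, along with the `Step`, `lead_time`, and `biggest_constraint` names, is a hypothetical illustration rather than data from the book.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    work_days: float  # hands-on time spent in this step
    wait_days: float  # time the item sits in a queue before the step starts


def lead_time(steps):
    """Total days from idea to delivery, counting both work and waiting."""
    return sum(s.work_days + s.wait_days for s in steps)


def biggest_constraint(steps):
    """The step with the longest queue in front of it."""
    return max(steps, key=lambda s: s.wait_days)


# Hypothetical numbers, for illustration only.
steps = [
    Step("write requirements", work_days=3, wait_days=30),
    Step("backlog and grooming", work_days=2, wait_days=45),
    Step("code", work_days=10, wait_days=5),
    Step("test and merge", work_days=5, wait_days=10),
    Step("deploy", work_days=1, wait_days=20),
]

total = lead_time(steps)
worst = biggest_constraint(steps)
print(f"Lead time: {total} days; longest wait is before '{worst.name}'")
```

In this made-up example most of the lead time is waiting, not working, which is usually where the first constraint to remove is found.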
Step two is to ensure your environment is consistent. In order to make the process work you will need production-like environments at every step of the way. QA servers exactly match production servers. Dev matches QA. NO excuses. This isn’t the 1970s; hardware is inexpensive compared to teams of developers and the cost of production failures. Get the team to build scripts which in turn build environments automatically. Servers should be built in 5 minutes (I worked for a company where it took two months to get a server built – insane). Version control is more important for operations than for development, because operations has an order of magnitude more configuration settings. Then take any server fixes (fix-forwards) and always move them back into trunk. It should be easier to build a new server than to fix one. This is the puppies / cattle discussion. Some companies treat a server like a puppy, and try everything they can to get the server right, and to keep it right. Other groups treat a server like cattle – just shoot it and build a new one. (No animals were harmed in this discussion, and it’s not my metaphor, so please don’t send complaints to me.) If you have the standard scripts along with version control, it is far easier, quicker, and safer to build a new server than to try to fix one. How long do your servers live? How long does it take to get a new server?
I get movies through Netflix. The average life of a Netflix AWS server is 24 days, most of them just a week. Netflix routinely kills and replaces production instances of servers, just to prevent configuration drift. That ensures that the servers are all the same (no snowflake servers, since every snowflake is different). It also ensures manually applied changes/fixes aren’t propagated forward and persisted.
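The idea of replacing rather than repairing a drifted server can be sketched in a few lines. This is only an illustration under assumptions: the configuration is modeled as a plain dictionary, and the names `find_drift` and `should_replace`, the settings, and the age limit are all hypothetical. It is not a description of Netflix’s tooling.

```python
def find_drift(baseline, actual):
    """Return settings that differ between the version-controlled baseline
    and what is actually on the server, as {key: (expected, found)}."""
    keys = baseline.keys() | actual.keys()
    return {
        key: (baseline.get(key), actual.get(key))
        for key in keys
        if baseline.get(key) != actual.get(key)
    }


def should_replace(baseline, actual, age_days, max_age_days):
    """Cattle, not puppies: rebuild the server if it has drifted or is too old."""
    return bool(find_drift(baseline, actual)) or age_days > max_age_days


# Hypothetical settings, for illustration only.
baseline = {"tls": "1.3", "max_connections": 500}
server = {"tls": "1.3", "max_connections": 800, "debug": True}  # hand-edited

print(find_drift(baseline, server))
print(should_replace(baseline, server, age_days=3, max_age_days=7))
```

A real pipeline would then rebuild the server from the scripts in version control instead of patching the settings by hand.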
Now that you have your environment set, step three is to build a fast and reliable automated test suite. That is what Google had to do. Google has over 120,000 automated test scripts – they run 75M test cases daily. None of the “log in and play around and see if anything looks funky and make check marks on a spreadsheet and we will see if it is a bug or not and try and fix it” stuff that many companies do. You need Test Driven Development (TDD).
You need to catch errors via automated testing as early as possible. Run the tests quickly, and in parallel if possible. In Test Driven Development you write the automated tests before you write the code. Automate as many of the existing manual tests as possible. You need to integrate performance testing into the test suite as well. As an example, without database indexing, page loads could grow from milliseconds to thirty seconds, and if the code makes multiple DB calls the network traffic could increase tenfold. And, contrary to popular belief, TDD coding is efficient. IBM Almaden Labs determined that TDD code was 60%-90% better in terms of defect density than non-TDD code, while taking only 15%-35% longer. At the best end of those ranges, 15% more time buys a 90% improvement in defect density. A big win. Macy’s went from executing 1,300 manual tests every ten days to ten automated tests for every code commit. Yes, Macy’s as in the department store. If their IT shop can make the transformation, your company could too.
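Here is a minimal sketch of the test-first order of work. The `apply_discount` function and its rules are hypothetical, chosen only to show the sequence: the tests are written first and describe the expected behavior, then the simplest code that passes them is added. The test functions use plain `assert` statements, so they can be collected by a runner such as pytest or simply called directly.

```python
# Step 1: write the tests first. They fail until the code below exists.
def test_zero_percent_leaves_price_unchanged():
    assert apply_discount(1000, 0) == 1000


def test_discount_is_applied():
    assert apply_discount(1000, 25) == 750


def test_invalid_percent_is_rejected():
    try:
        apply_discount(1000, 150)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a percent above 100")


# Step 2: write the simplest code that makes the tests pass.
def apply_discount(price_cents, percent):
    """Return the price in cents after taking `percent` off."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100
```

Because each test is small and independent, a suite built this way can be run on every commit and split across machines to run in parallel.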
Step four – move from a monolithic code base to a modular code base. I thought we learned this in the 1990s with object oriented programming. Tightly coupled architecture can impede everyone’s productivity and ability to make changes safely. You know the scenario: “if I make a change here, I am not sure what else it will affect, or where.” A loosely coupled architecture with well-defined APIs that enforce how modules connect with each other promotes production safety. I know that if I stay within the API for this module, and with my automated test suite, I can safely make changes without screwing up anything else. THIS is how we get to multiple deployments in a day. Etsy starts their process at 8AM using a chat room for coordination. They run 4,500 unit tests in one minute; 7,000 automated regression tests in about eleven minutes. Etsy practices continuous development/continuous deployment.
Step five – finally, when you deploy, do what Facebook does – run a canary test – deploy to a small set of live servers. If it works well for X period of time, then deploy to the rest of the thousands of servers. CSG International, one of the largest bill printing companies in the United States, runs their services hundreds of times a day with realistic data and traffic before going into production. They got their ‘development to production’ time down from two weeks to daily. Eventually, deployments became so routine that the Operations team was playing video games at the end of the day. Production incidents were down 91%, and MTTR was down 80%.
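As a small illustration of staying within the API, the sketch below has an ordering function that depends only on an interface, not on any particular inventory implementation. All of the names (`InventoryAPI`, `InMemoryInventory`, `place_order`) are hypothetical; the point is only that the module behind the interface can be changed or swapped without touching its callers.

```python
from typing import Protocol


class InventoryAPI(Protocol):
    """The only contract other modules may rely on."""

    def reserve(self, sku: str, quantity: int) -> bool:
        ...


class InMemoryInventory:
    """One implementation; its internals can change freely."""

    def __init__(self, stock):
        self._stock = dict(stock)

    def reserve(self, sku, quantity):
        if self._stock.get(sku, 0) < quantity:
            return False
        self._stock[sku] -= quantity
        return True


def place_order(inventory: InventoryAPI, sku: str, quantity: int) -> str:
    """Ordering code knows the API, not how stock is stored."""
    return "confirmed" if inventory.reserve(sku, quantity) else "rejected"


inventory = InMemoryInventory({"mug": 2})
print(place_order(inventory, "mug", 2))
print(place_order(inventory, "mug", 1))
```

The same interface also makes the automated tests cheap: a test can pass in a small in-memory implementation instead of a real database.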
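The canary idea can be sketched as a simple control flow. This is an assumption-laden illustration, not Facebook’s process: `deploy` and `is_healthy` are hypothetical callbacks standing in for real deployment and monitoring tools, and a real rollout would watch the canaries for some period of time before deciding.

```python
def canary_rollout(servers, deploy, is_healthy, canary_count=2):
    """Deploy to a few servers first; continue only if they stay healthy."""
    canaries = servers[:canary_count]
    rest = servers[canary_count:]

    for server in canaries:
        deploy(server)

    # In practice this check would run after watching the canaries for a while.
    if not all(is_healthy(server) for server in canaries):
        return {"status": "halted", "deployed": canaries}

    for server in rest:
        deploy(server)
    return {"status": "complete", "deployed": canaries + rest}


# Hypothetical fleet and callbacks, for illustration only.
fleet = ["web-1", "web-2", "web-3", "web-4", "web-5"]
deployed_log = []

result = canary_rollout(
    fleet,
    deploy=deployed_log.append,
    is_healthy=lambda server: True,
)
print(result["status"], len(result["deployed"]))
```

If a canary fails its health check, only that small set of servers ever received the new code, which limits the blast radius of a bad release.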
So, Phase 1 is THE PRINCIPLE OF FLOW – the theory of constraints. Understand your ideation to deployment flow. Remove constraints. Work towards a loosely coupled architecture, with Test Driven Development, and the automation of building of servers. Test early and often, and work towards deploying frequently. I hate to say it but this isn’t rocket science – I have been in this industry for over twenty years – what your management team needs to accomplish this is a conviction to do it, and discipline to execute it.
Phase 2 – the Second Way – The Technical Practices of Feedback is next week.
For more information, see the 2017 State of DevOps report.