Chasing the Elusive Continuous Deployment, Part 2: The Second Phase
Thomas (Tom C) Chmielewski
Vice President of Product Management - Improving Existing Portfolios, and Designing & Launching New Products & Services
Second in a series of three articles on DevOps
The Phoenix Project and the DevOps Handbook
How many of my Product Management colleagues still deliver releases on a quarterly (or even longer) release cycle? Most of the teams I managed did just that.
Yet a surge of companies has moved to continual releases, outdistancing their competitors. Google, Macy’s, Amazon, Facebook, Etsy, Target, Nordstrom, and Netflix routinely and reliably deploy code into production hundreds, or even thousands, of times per day.
What? Not possible! HOW? THAT CAN’T BE!!
The DevOps Handbook explains how to replicate these incredible outcomes by showing how to integrate Product Management, Development, QA, IT, Ops, and InfoSec to improve delivery and better win in the marketplace.
There are three Phases to accomplish this:
- The principle of FLOW
- The principle of FEEDBACK
- The principle of CONTINUAL LEARNING AND EXPERIMENTATION
Here is my three-part article summary of the process.
Part one was about the principle of FLOW: 1) identifying and eliminating constraints, 2) ensuring consistent environments across dev, test, and prod, 3) automating testing, 4) driving toward a loosely coupled architecture, and 5) running canary tests.
Phase two is about the principle of FEEDBACK.
Phase 2: THE PRINCIPLE OF FEEDBACK
I have worked in organizations where systems (development, core systems, end-user applications, etc.) were not well monitored. We just didn’t know what was going on, and delays in identifying issues magnified small early problems into giant ones.
At one company I found a job that had been running since the day before: the exception count was over 336 million, the output had already occupied 194 tape volumes and counting, and Operations had sent out a blanket email suggesting that maybe someone should look into it because it had stopped all the daily batch jobs. At another company, a job with a looping call wrote over 1.8B entries before someone noticed that all the storage had filled up. These are not good examples of the principle of feedback.
Then there is the Knight Capital failure: a fifteen-minute deployment error resulted in a $440 million trading loss, during which the engineering teams were unable to disable the production services. The financial losses jeopardized the firm’s operation and forced the company to be sold over the weekend so it could continue operating without jeopardizing the entire financial system.
As we work toward a continual development / continual deployment process, we need solid feedback in place at every stage to enable it.
You need feedback from the development process, the builds, the testing, the deployment, the configuration level (DB, network, storage), the business application itself, and from your users and customers.
Let’s start with the feedback in the development processes.
Some companies think that lots of controls and signatures are a good feedback mechanism. These are low-trust, command-and-control cultures that pile on process, control, and testing. That adds friction to the deployment process: it multiplies the number of steps and approvals, increases batch sizes, and pushes approval decisions to people located further and further away from the work. Approvals need to be tracked down, forms need to be signed, and no one thoroughly understands what the code is actually going to do (only what they were told it was supposed to do). The feedback loop drifts further and further from the work, and problems stop being caught early. Catching problems early is far less expensive than having your entire operation shut down because, say, all your storage filled up.
High-performing teams share a few key traits, and good coding technique is chief among them.
Pair programming is a good coding practice with several benefits. With pair programming you typically get better code design (two heads are better than one) and higher quality code: studies have reported that paired programming may be about 15% slower than solo coding, but the rate of error-free code increased from 70% to 85%. Companies that employ pair programming report that up to 96% of staff enjoyed their work more. Pair programming also ensures that no single person is the only one who knows a given piece of code, which should be a no-no. If one person ends up being the only one who knows certain code, that can set the stage for nefarious behavior; fraud may be rare, but when it does happen it can have far-reaching, serious consequences (see Insidious: How Trusted Employees Steal Millions and Why It’s So Hard to Stop Them). Companies don’t think this happens until it happens to them.
Some teams rely on peer review of code. That is a choice, but not a great one, and it depends on how early the peer reviews take place (and by whom). At one company, peer reviews only took place after the coding was done, after the unit testing was done, and after the full QA testing was done. The code was reviewed just before signoff, so the developer could collect the ‘peer reviewer signature’ required by the low-trust, command-and-control culture I referenced earlier.
The problem was twofold. First, the review happened so late in the process that reviewers were reluctant to recommend changes, because doing so would certainly delay the code submission. Second, reviewing a pile of changes all at once, at the last step of the process, naturally had some people skimming rather than really inspecting. One developer I know had 300 files to review in 3 hours; what do you think the quality of that peer review was? Pair programming happens in hours, in real time; peer reviews happen over days, which is not conducive to continuous deployment.
Continuing with the development process, automated code feedback is an important feedback tool. We are human, and even with peer reviews and pair programming we make mistakes. I witnessed a production job fail because a “!” was inadvertently inserted into the code; the developer, the peer reviewer, the architectural review, and the QA review all missed it. It went into production, and it didn’t work. Use code review tools such as APPscreener, Silverthread, Eclipse, RIPS, Veracode, PVS, or Gamma. Remember the earlier example of the code with the looping call that generated 1.8B entries and filled up all the storage? That would have been caught before it became a problem. And don’t forget to check for security threats as well, with tools such as Checkmarx, OWASP tools, or FindSecBugs. The last thing you want is to have to disclose to the public that your systems were breached. All of these checks should be automated and part of the routine development process. Feedback early, feedback often.
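To make that concrete, here is a minimal sketch of an automated feedback gate, for example a pre-commit hook or an early CI step, assuming a Python codebase. The tool names (flake8, bandit) are illustrative assumptions, not a recommendation; substitute whatever static-analysis and security scanners fit your stack.

```python
#!/usr/bin/env python3
"""Minimal sketch of an automated feedback gate (pre-commit hook or CI step).

Assumes the chosen scanners are installed; each command must exit 0
for the change to proceed. Feedback early, feedback often.
"""
import subprocess
import sys

# Illustrative checks: a static analyzer and a security scanner.
CHECKS = [
    ["flake8", "src/"],        # catches defects like a stray "!" in the code
    ["bandit", "-r", "src/"],  # scans for common insecure patterns
]


def main() -> int:
    for cmd in CHECKS:
        print(f"Running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            print(f"Check failed: {' '.join(cmd)} -- fix before the code moves on.")
            return 1
    print("All automated checks passed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Wired in as a pre-commit hook or the first CI stage, this gives every developer feedback within seconds of making a change rather than weeks later in QA.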
One of my pet peeves: comments. Make sure developers explain in the code what they are trying to do, how, why, for whom, and when. Without that, each time a developer opens up a module they have to take the time to investigate what the code is intended to do, reverse engineer it, understand how it works, and determine whether their changes will affect or step over existing behavior. It is a recurring, perpetual time-sink every time someone opens up any module. Nick Galbreath, VP of Engineering at Right Media, said they needed to respond to market conditions within minutes, which meant they needed to deploy code quickly. Small, frequent changes that anyone can inspect and easily understand are key, and documented code is a big part of that.
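As a small illustration, here is the kind of what/why/who/when comment that saves the reverse-engineering time. The function and the business rule are hypothetical, invented only to show the pattern.

```python
def apply_late_fee(invoice_total: float, days_overdue: int) -> float:
    """Hypothetical example of the what/why/who/when a comment should carry.

    What: adds a late fee to an overdue invoice total.
    Why:  finance policy charges a 1.5% fee after a 30-day grace period.
    Who:  called by the nightly billing job; finance owns the policy.
    When: policy in effect since the last billing overhaul.
    """
    if days_overdue <= 30:  # still inside the grace period, no fee
        return invoice_total
    return round(invoice_total * 1.015, 2)  # 1.5% late fee, rounded to cents
```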
Full stack developers are able to switch between front-end and back-end development based on the requirements of the project. This is a big saver of time and money, as complexities and problems can be solved by the same person, and it allows everyone in the value stream to share the downstream responsibility of handling operational incidents. As Patrick Lightbody, SVP of Product Management at New Relic, observed, ‘we found that when we woke up developers at 2 a.m., defects were fixed faster than ever’. In fact, the rate of defect introduction goes down.
Make sure you have a solid launch process that you follow. Google uses a Launch Readiness Review (LRR) and a Hand-off Readiness Review (HRR) to ensure its launches are successful.
Measure and manage defect counts and severity, security failures, and the deployment process itself to build good production hygiene.
Next let’s talk about feedback in the customer process.
How often do you interact with customers? Do you utilize a user group? Do you have a UX team? How often do you just sit and watch users play with your software? We think we collect all the right requirements and build the right product, but the feedback loop is important.
All too often in software projects, developers work on features for months or years, spanning multiple releases, without ever confirming whether the desired business outcomes are being met.
At one company I observed the payment reconciliation process: the users had to go to twenty-three different screens to get the information they needed for each record they had to reconcile. It was an insane waste of time for data that could be displayed on just one or two screens.
Ronny Kohavi, Distinguished Engineer and GM of the Analysis and Experimentation group at Microsoft, observed that after “evaluating well-designed and executed experiments that were designed to improve a key metric, only about one-third were successful at improving the key metric”. In other words, two-thirds of features either had a negligible impact or actually made things worse. Kohavi goes on to note that all of these features were originally thought to be reasonable, good ideas. The implication from the data was staggering: taken to an extreme, Microsoft would have been better off giving the entire development team a vacation instead of building “non-value-adding” features.
Gene Kim, one of the authors of The DevOps Handbook and former CTO at Tripwire, noted that one of the worst moments in his professional career was when he spent an entire morning watching one of his customers use their product. He was watching them perform an operation that they expected customers to do weekly and, to his horror, he discovered that it required sixty-three clicks. The customer kept apologizing, saying things like “Sorry, there’s probably a better way to do this”.
Feedback early and feedback often is key.
Use techniques such as A/B testing to get early feedback. As Scott Cook, the founder of Intuit, said, “Instead of focusing on the boss’s voice, the emphasis [should be] on getting real people to really behave in real experiments and [base] your decision on that”. In fact, Intuit runs about 165 experiments during the three months of the US tax season; they run production experiments during peak traffic seasons. Many companies that haven’t embraced DevOps can’t do this, and instead implement a ‘production freeze’ during their busy months…
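The mechanics of splitting traffic can be very simple. Below is a hedged sketch of one common A/B assignment pattern (deterministic bucketing by hashing a user ID); it is a generic illustration, not Intuit’s or anyone else’s actual tooling, and the experiment and user names are invented.

```python
import hashlib


def ab_bucket(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing user_id together with the experiment name keeps a user's
    assignment stable across visits and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    fraction = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1]
    return "treatment" if fraction < treatment_share else "control"


# Usage: route the user, then record your key metric per bucket and compare.
print(ab_bucket("user-42", "new-checkout-flow"))
```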
Jim Stoneham was the GM of the Yahoo! Communication Group that included Answers. He was running the largest Q&A site in the market but couldn’t release any faster than once every four weeks, while Twitter, Facebook, and Zynga were running experiments at least twice per week; his competitors had a feedback loop roughly ten times faster than his. After he moved to multiple deployments a week, monthly visits increased 72%. The faster you can iterate and integrate feedback into the product, the faster you can make an impact with your customers.
Feedback from the physical environment – configuration
We rely on our environments to run 24x7. But hardware fails (every component has a mean time between failures), and software can get corrupted, despite good coding techniques and good DevOps configuration control.
See and solve problems as they occur. Given a herd of cattle that should all look and act the same, which cattle look different from the rest? Or more concretely: if we have a thousand-node computer cluster, all running the same software and subject to approximately the same traffic load, our challenge is to find any nodes that don’t look like the rest. Netflix’s server outlier detection process has massively reduced the effort of finding sick servers and, more importantly, massively reduced the time required to fix them.
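To show the idea, here is a minimal sketch of outlier detection over a per-server metric using a robust statistic (median absolute deviation). This is a generic stand-in for illustration, not Netflix’s actual implementation, and the node names and error rates are made up.

```python
from statistics import median


def outlier_servers(metrics: dict, threshold: float = 3.5) -> list:
    """Flag servers whose metric sits far from the rest of the herd.

    Uses a modified z-score based on the median absolute deviation (MAD),
    which is robust to a few already-sick servers skewing the average.
    """
    values = list(metrics.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9  # avoid divide-by-zero
    return [
        host for host, v in metrics.items()
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Usage: error rate per node; node-7 clearly doesn't look like the others.
cluster = {"node-1": 0.02, "node-2": 0.03, "node-3": 0.02, "node-7": 0.41}
print(outlier_servers(cluster))  # ['node-7']
```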
You need tools to monitor the environment and provide meaningful, useful information without getting lost in the white noise. There are different levels of alerts: debug, info, warn, error, fatal. The creation of production metrics should be part of daily work; StatsD can generate timers and counters with one line of code. Other good tools include Puppet, Chef, Rudder, Ansible, SolarWinds, and AppDynamics. Use health-check tools such as Sensu, Nagios, Zabbix, Splunk, DataDog, or Riemann. Use application performance monitors such as Splunk, Zabbix, Sumo Logic, DataDog, Sensu, or RRDtool. Collect server information with Ganglia, AppDynamics, New Relic, or SolarWinds, and display it all in an open-source tool like Graphite. Etsy went from monitoring 200,000 metrics in 2011 to tracking 800,000 metrics in 2014. There is more information in The Art of Monitoring.
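Here is a short sketch of what “metrics as part of daily work” looks like, assuming the community statsd Python client (pip install statsd) and a StatsD daemon on localhost:8125 feeding Graphite; the metric names and the timed function are illustrative.

```python
import time

import statsd


def run_reconciliation() -> None:
    """Stand-in for the real business task being timed."""
    time.sleep(0.1)


stats = statsd.StatsClient("localhost", 8125, prefix="payments")

stats.incr("invoice.sent")               # one line of code: count an event
stats.gauge("queue.depth", 42)           # report a current level
with stats.timer("reconcile.duration"):  # time a block of work
    run_reconciliation()
```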
At one company I was at we had a production issue when a CyberSource Secure Acceptance key expired, causing all tokenization attempts for a client to fail. The immediate problem was fixed, but the root cause was not solved: the next month the operations team ran into another certificate that had expired. The simple question I asked was, “Don’t we have a list of ALL certificates and expiration dates that we monitor weekly?” The answer was that we had multiple lists, some maintained and some not. We put a single list of ‘truth’ together for the company, assigned an owner and team to manage it, and added it to the knowledge base.
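Once that single list of truth exists, the weekly check can be automated. Below is a minimal sketch using only the Python standard library; it covers TLS certificates reachable over the network (API keys and other credentials would need their own expiry records), and the hostnames are placeholders.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(hostname: str, port: int = 443) -> int:
    """Return the number of days before a host's TLS certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days


# The single list of truth: every certificate the business depends on.
CERTIFICATES = ["api.example.com", "payments.example.com"]  # placeholders

for host in CERTIFICATES:
    remaining = days_until_expiry(host)
    if remaining < 30:
        print(f"ALERT: certificate for {host} expires in {remaining} days")
```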
You should have a knowledge base the teams can use to look up and identify problems. I happened to be monitoring production tickets when I realized a new problem was similar to one that had popped up a couple of months back. I asked the engineer (who had been working on the problem for five hours already) if he was aware of the solution identified two months earlier; the answer was no. Each engineer worked on their own issue, and the knowledge (problem, isolation, identification, fix) was not being shared. We built a knowledge-base tool (a KEDB) to assist operations and the development teams in identifying, triaging, and solving production problems.
Business Application feedback
If your systems are all up and running, how do you measure the activity of the clients using the system? As a Product Manager how do you know your clients/users are exercising your product like you expect them to?
At one company, clients sent us files to process every day. One month, one client began sending a file with just one header record (versus the 3,000 lines typically sent), but Operations proudly processed the file every day until, about thirty days later, the client complained that none of their customers were loaded in the system. It was not our fault that they sent a blank file but, yes, it was our fault for not catching it and informing the customer. We didn’t have volume monitoring in place. If we had, we would also have known that another client’s volume was dropping at a regular pace, and we might have been able to salvage the account before they moved to our competitor.
At another company I worked at, the billing job (as in sending invoices to our customers so we could get paid, kind of a big deal) failed to send the e-invoices. Yes, they were generated, just not sent; that part of the system broke due to a recent code update. We didn’t notice until our customers called asking whether we wanted to get paid or not (embarrassing). In both cases the systems were working, just not doing what we expected.
Monitoring the activity of the business application is as important as monitoring the health of the systems. You built a system to perform activities; measure and report anomalies for those activities. For the client that typically sent 3,000 records, an alert should have been sent when a pre-determined threshold was broken: if the client’s volume suddenly drops by more than 20%, find out why. Is there a problem with the files? Is the client losing business? Is the client shifting business to a competitor? You don’t know you need to investigate unless you are monitoring all aspects of the business activity. How many users log in? How many files are processed? How many transactions are made? How many cases are opened and closed? How many are rejected? This is the dashboard for the business, for the Product Manager, to manage their business and P&L. The Product Manager needs to measure sales, but also usage of the product.
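A threshold check like the one described above can be just a few lines. This is a hedged sketch; the client names, baseline figures, and the 20% cutoff are illustrative, and in practice the baseline would come from a rolling history rather than a hard-coded dictionary.

```python
def volume_alerts(counts: dict, baselines: dict, max_drop: float = 0.20) -> list:
    """Compare today's per-client record counts against a baseline and flag
    any client whose volume dropped by more than max_drop (20% by default)."""
    alerts = []
    for client, baseline in baselines.items():
        today = counts.get(client, 0)
        if baseline > 0 and (baseline - today) / baseline > max_drop:
            alerts.append(
                f"{client}: {today} records vs. baseline {baseline:.0f} -- investigate"
            )
    return alerts


# Usage: the client that normally sends ~3,000 records sent only a header row.
today = {"client-a": 1, "client-b": 2950}
baseline = {"client-a": 3000.0, "client-b": 2900.0}
print(volume_alerts(today, baseline))  # flags client-a only
```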
Finally, make sure the information is visible. Implement an information radiator so everyone, at all times, can see the status of development, builds, and systems.
Back in the heyday of the Toyota Production System (TPS), when Toyota was rewriting the record books for car manufacturing, they implemented many unique tools and processes. One of those tools was called the Andon Cord. The Andon Cord was a rope. Just a rope. Like you’d use to tie down luggage to the roof of your car. But this rope was game-changing. When pulled, the rope would instantly stop all work on the assembly line. And the craziest thing about it? Anyone had the right to pull the cord at any time.
If the Toyota assembly line found a problem and could not solve it in 55 seconds, the cord was pulled, assembly stopped, and the team ‘swarmed’ to fix the problem. Swarming is a learning cycle: plan, do, check, act, measure. It allows smaller problems to be discovered and fixed quickly, before they become big problems. At one company we had a red flashing light, sort of an Andon cord alert, for when something broke. At first, some of the team viewed this with skepticism and a few scoffed at it. A lot of comments were made. But it worked, and within three months we had a different response: everyone immediately tuned in to, and worked to fix, the problem when the light went off. Our engineering velocity surged forward.
By implementing feedback loops you enable everyone to work together toward shared goals, see problems as they occur, and, with quick detection and recovery, ensure that features not only operate as designed in production (all systems functioning) but also achieve business goals (is the app processing the volumes we expect?) and drive organizational learning. The benefit of using these techniques to preserve employee sanity, work-life balance, and service quality cannot be overstated.
Next week’s summary will be the principle of CONTINUAL LEARNING AND EXPERIMENTATION.
For more information, go to:
2017 State of DevOps report
#productmanagement #computersoftware #management #managementconsulting