DevOps - Culture
A recent DevOps training I attended is the genesis of this post. The trainer focused on listing the names of different tools for each phase of the CI/CD pipeline. A beginner might mistake DevOps for just another set of tools. Tools are essential in inculcating good habits and behavior (i.e., culture), but DevOps is more than just tools. Continuing from my previous post, I want to highlight how DevOps cultural practices differ.
Hiring
- DevOps is not just making Dev and Ops teams sit together and asking them to collaborate. Real collaboration happens only when they have mutual respect and can speak and understand the same vocabulary. In general, companies treat Dev as strategic and Ops as something that can be outsourced. With this thinking, hiring practices do not test Ops candidates with the same level of scrutiny. DevOps requires a shift in this thinking: companies need to hire people with the capability and the intention to automate Ops activities. Hiring the best minds in Ops will invariably shatter the pecking order among Dev, Test and Ops. This is another invisible wall that DevOps breaks down.
- In this interview, Ben Treynor of Google's SRE (Site Reliability Engineering) team explains that SRE candidates are expected to pass, or come close to passing, the SWE (software engineer) hiring bar.
- Instead of always hiring specialists, companies should be ready to hire generalists who can acquire any skill depending on the need. Practices like market orientation of teams (instead of pure functional orientation) and the 2 Pizza Rule for team size are related to this in some way.
Improvement Cycles
- Product owners should always reserve at least 20% of cycles for improving and optimizing things, including paying down technical debt.
- Processes do not stay the same over time; they degrade due to entropy. Finding ways to improve daily work is more important than just doing daily work. This culture should be ingrained in every developer, test engineer and Ops engineer so that we can continuously improve and develop a culture of innovation.
- For teams and products with serious issues and more technical debt, 30% or even 40% can be reserved for improvement. This ensures that defects are found and fixed early, when they are cheap, rather than blowing up in production.
Quality, Operations and Security are everyone's job
- User stories are not 'DONE' the moment Dev checks the code in to the trunk. Stories are DONE only when they are running successfully in production, without breaking any other functionality, and the customer realizes the value. To achieve this, everyone in the value stream is equally responsible. Quality and security are part of everyone's daily job - not something that is taken care of only at the end of a release.
Empathy and Optimization for downstream teams
- In addition to the customer (external) requirements, everyone should focus on optimizing work for their immediate downstream team members (i.e., internal customers).
- When the design team does its job, it should take into account how easy the solution is to test, deploy and operate. When the development team writes code, it should take into account how easy it is to automate testing and how easy it is to configure or toggle features at run time. When work is optimized for downstream teams in this way, rework is avoided and non-functional requirements like testability, deployability, security, operability and maintainability are always taken care of, by all streams.
- Metrics like Percent Complete and Accurate (%C/A) can be obtained by asking downstream customers what percentage of the time they receive work that is usable without requiring any corrections or clarifications.
- If the upstream teams are not optimizing for downstream teams, Ops team members should be allowed to move on from an Ops disaster project, as in the case of Google SRE described in the interview mentioned under Hiring. When the Ops team falls below a certain size, Dev should be tasked with managing Ops as well, so that there is an incentive for the Dev team to design and develop something that is smooth for Ops too. Shared pain is a way to inculcate shared goals.
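The %C/A metric mentioned above comes down to simple arithmetic. Here is a minimal sketch, assuming each downstream team simply reports how many work items it received and how many of those were usable as-is. The function name and the sample numbers are made up for illustration.

```python
def percent_complete_and_accurate(usable_as_is: int, total_received: int) -> float:
    """Return %C/A: the share of received work items that the downstream
    team could use without any corrections or clarifications."""
    if total_received <= 0:
        raise ValueError("total_received must be positive")
    if not 0 <= usable_as_is <= total_received:
        raise ValueError("usable_as_is must be between 0 and total_received")
    return 100.0 * usable_as_is / total_received


# Hypothetical survey answer: Test received 20 stories from Dev in a
# sprint, and 18 of them needed no rework or follow-up questions.
dev_to_test = percent_complete_and_accurate(usable_as_is=18, total_received=20)
print(f"Dev -> Test %C/A: {dev_to_test:.0f}%")
```

Tracking this number per handoff over time shows which upstream step generates the most rework for its internal customers.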
Telemetry and Instrumentation
- If you don't measure it, you can't manage it and you can't improve it. Metrics should be decided and coded for everything that can be influenced and improved. This requires a budget for infrastructure and easy-to-use libraries, so that developers can integrate metrics and monitoring into their code effortlessly.
- Metrics should be all-encompassing: business level, application level and infrastructure level (DB, OS, network). A business-level metric might allow product owners to understand feature usage. Similarly, application and infrastructure metrics allow problems to be seen in real time as they occur or build up, so that teams can be proactive before the customer even sees the problem.
- Deciding which metrics to add and how they will be used should be part of activities like peer review. When troubleshooting production issues, teams should ask what metric could have been added to warn about the issue or to show the trend or buildup of the problem.
- Even to use automatic deployment techniques like canary deployment, we need to identify, measure and monitor the key metrics after deploying changes.
- Metrics collected should be accessible to all the stakeholders in the value stream. Information radiators can be made visible to all. This increases trust among teams and with the customer as well.
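To make the points above concrete, here is a minimal sketch of what "effortless" instrumentation and a canary check could look like. Everything in it is an assumption for illustration: the `Metrics` class, the `timed` decorator, the metric names and the 1% tolerance are hypothetical, and a real team would normally use an existing metrics library and monitoring backend instead of an in-memory store.

```python
import time
from collections import defaultdict
from functools import wraps


class Metrics:
    """Hypothetical in-memory metrics store (stand-in for a real backend)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, value=1):
        self.counters[name] += value

    def timed(self, name):
        """Decorator that records call count, error count and duration."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                except Exception:
                    self.incr(f"{name}.errors")
                    raise
                finally:
                    # Runs on success and on failure alike.
                    self.timings[name].append(time.perf_counter() - start)
                    self.incr(f"{name}.calls")
            return wrapper
        return decorator


def error_rate(metrics, name):
    """Fraction of calls to `name` that raised an exception."""
    calls = metrics.counters[f"{name}.calls"]
    return metrics.counters[f"{name}.errors"] / calls if calls else 0.0


def canary_is_healthy(baseline_rate, canary_rate, tolerance=0.01):
    """Accept the canary only if its error rate stays within `tolerance`
    (an assumed threshold) of the baseline error rate."""
    return canary_rate <= baseline_rate + tolerance
```

With something like this in place, instrumenting a function is one line (`@metrics.timed("checkout")` above its definition), and an automated deployment can compare `error_rate` for the canary against the current version before promoting the change.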
Swarm and Solve problems
- Whenever a problem occurs, it should be swarmed and solved immediately. This prevents putting things on the back burner and starting new tasks, thus reducing work in progress. Solving the problem immediately also means that memories have not faded and no context is lost. In addition, technical debt does not grow, and defects are fixed early, when they are cheaper.
- Having developers work directly on the trunk, or check in to the trunk at least once a day instead of keeping long-lived feature branches, is related to this practice, i.e., the team's productivity is prioritized over individual productivity. Borrowing from Lean practice and the Toyota Way, an Andon Cord can be agreed upon and used when work gets stuck.
Humane work conditions and happier work force
- Though this sounds like an altruistic objective, the whole of DevOps revolves around this simple point. Building safe systems that can be deployed routinely and frequently instead of during a weekend or graveyard-shift maintenance window, reducing rework by optimizing for downstream teams, automating as much as possible to avoid manual, repetitive and boring work that can cause burnout and mistakes, and encouraging a culture of innovation are all related to this single cultural practice.
Reference: The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations - Gene Kim, Jez Humble, Patrick Debois, and John Willis