Our DevOps Journey: 120 to 1 hour Deployments to Production in 1 year.
Over the last 12 months our DevOps Chapter embarked on a journey to drive our Development Squads to be more efficient. This was a clear directive from the Business. We needed to be able to react at pace, and ensure we were focusing on high value work.
The project, which we called “The Golden Path”, was based around our vision of enabling our squads to deploy from the pub on a Friday. (I’m channeling Jez Humble here; thanks for the inspiration, Jez!)
There were a number of building blocks for our Golden Path, some technical, some cultural. The following is an overview from a Squad perspective as to what we did and how we achieved it over the period of a year. It doesn’t cover the business alignment that happened previous to this, in which we were empowered and enabled to make these changes.
For platform enablement we looked to Backstage to enable fast scaffolding of service eco-systems, through templates - more on Backstage in a follow-up post. Our DevOps framework and ways of working are structured around the DORA framework. We measure our success with the DORA Metrics. We set ourselves a goal to move all of the teams we are guiding into “High Performance” and a stretch goal into “Elite”.
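As a rough illustration of how the four DORA metrics can be derived from deployment records, here is a minimal Python sketch. The `Deployment` record, its field names and the `dora_metrics` function are hypothetical and invented for this example. They are not our actual tooling. The sketch also assumes that time to restore is measured from the production deployment to the moment service is restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    merged_at: datetime                     # when the change was merged
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments, period_days):
    """Summarise the four DORA metrics over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.merged_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at
    ]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Illustrative data only: the date and the second and third records are made up.
base = datetime(2000, 3, 25, 14, 11)
sample = [
    Deployment(base, base + timedelta(minutes=50)),
    Deployment(base, base + timedelta(hours=2), failed=True,
               restored_at=base + timedelta(hours=3)),
    Deployment(base, base + timedelta(hours=120)),
]
print(dora_metrics(sample, period_days=30))
```

The median is used for lead time so that a single slow deployment does not dominate the figure. Which performance band a result falls into depends on the thresholds in the DORA report being used, so those are deliberately left out.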
On the 25th March, our Integration Modernisation squad moved into the Elite range as measured by the DORA metrics. At 2:11 pm they merged a final code cut of a middleware component to be deployed into Kubernetes. At 3:01 pm they were in production and still met governance commitments for ITSM, Testing and Code Coverage.
This 1 hour time frame compares to the 120 hour average prior to the Golden Path and is now repeatable every time they deploy this component into production.
How did we get there?
Upfront alignment:
The Golden Path Framework isn’t just about automation. We bring a philosophy and way of working into how squads deliver value, and it’s based on the DORA Framework.
A key component of success is the ability to shift left, and this applies to our planning as well. To augment Agile planning we have asked the Integration Modernisation squad to consider the following additional exercises for every new component:
Architectural/Feature flags planning
Key to decoupled Microservices was considering the solution design and architecture. Our Microservice Framework and API Framework are still in their infancy, but our general approach was to ensure that services were decoupled. This, together with clear, well-defined contracts at our boundaries, was key to enabling testing of any service we were building.
We found some limitations, and these limitations were fed back into our E.A. Framework for future improvement. This is an art-form in itself and is perhaps a topic for another day.
We also discussed how we might use Feature Flags to give us a means to test in production, or to test in isolation through mechanisms such as circuit breakers.
Three Amigos Planning:
The intention is to align the different functions in the squad with regard to developing the user story. The Business, Dev and QA lenses all need to align and agree on what is being delivered. In our planning exercise we added a 4th Amigo, an SRE, as we felt the operational aspect of delivery was important to consider as well.
We have found this exercise:
Testing Triangle Exercise:
Our Testing Triangle Exercise is loosely based on Martin Fowler’s test pyramid. Its objective is to identify all tests, achieve alignment around what we need to test, and then work out how we might push the testing down the pyramid to efficient unit tests.
We have found this exercise:
For the Integration Modernisation Squad it was tough going at first. The exercises were separate and we struggled through them as they were new. Over time, we found that combining these into 1 event is most efficient, and for the squad this is completed in around 90 minutes.
As an indication, the squad’s uptake of these new processes happened over a 2 month period and required constant reminding and reinforcing. The outcomes are now valued as part of delivering at pace.
The Outcomes of the exercises are as follows:
A Testing exercise that looked to drive tests down the pyramid. This was part of our definition of done. The service would not be ready for production until these scenarios were complete. The results of which are fed into
Outcomes are a greater cohesion between parties in the squad as to what to deliver and test. The Entire LifeCycle of the App is discussed, for example, security, observability and testing.
Jira tasks that spoke to documentation, testing, security and observability.
Alignment across squad as to the outcomes, and now a clear definition of Done. In the next section, I’ll relate an example of the testing exercise for the middleware component.
Embed Quality from the outset.
We are a big fan of this sentiment from W. Edwards Deming
We didn’t limit ourselves to just testing here.
Security, Code Coverage, Observability and Governance were factored in to provide a level of quality that was acceptable to the business. Testing, however, based on our modeling of the squads we have worked in so far, contributes the biggest efficiency gains across the dimensions we look at. Getting testing right is more than 60% of potential efficiency unlocks as we mature our squads. It should come as no surprise, then, that more than 60% of our time is spent on getting this right.
The key to this unlock for teams was investing in automating our testing frameworks and embedding these into our Golden Path Templates.
This means that our testing triangle is a little bespoke based on some of the architectural decisions we have made, but nevertheless:
The Integration Modernisation squad went through a number of these exercises, and here’s how it happened:
Similar to the 3 Amigos exercise, we relied on the dev, QA, and PO to identify everything that we needed to test.
After 5 mins we look to group these, and then move them onto the pyramid as far down the pyramid as possible.
Anything high up was challenged in terms of “Could we break this test into smaller components” or “move it down the triangle” in order to ensure there was a drive towards fast, efficient unit tests. Here’s an example of what we delivered as part of the exercise (this took about 30 mins).
These tests were then fed into the Jira story pertaining to tests. We would ensure that everything covered here was complete before moving into Production.
Gone in 60 Minutes…
Our Golden Path framework has quality at its core. With the upfront alignment we now know what we need to build and how to test.
These test results are automated at build and the results flow through the pipeline into ServiceNow and..
We also know where our logs and metrics are thanks to the linkage in Backstage and..
Code Coverage via SonarCloud is front and center in the code merge, and..
We track our workflow using automation to pull the Jira ticket numbers from our commit message and publish these into ServiceNow.
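To illustrate the last point, here is a minimal Python sketch of pulling Jira ticket numbers out of a commit message. The regular expression, the `extract_jira_keys` function and the `INT` project key are assumptions made for this example. Our actual pipeline automation is not shown here, and the step that publishes the keys into ServiceNow is left out.

```python
import re

# Assumed key format: an uppercase project key, a hyphen, then digits (e.g. INT-123).
JIRA_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_jira_keys(commit_message: str) -> list[str]:
    """Return the unique Jira keys in a commit message, in order of appearance."""
    keys = []
    for key in JIRA_KEY.findall(commit_message):
        if key not in keys:
            keys.append(key)
    return keys


# Hypothetical Conventional Commit message with ticket references in the footer.
message = "feat(middleware): add order sync\n\nRefs: INT-123, INT-456, INT-123"
print(extract_jira_keys(message))
```

A pattern this loose will also match strings such as `UTF-8`, so in practice it would be sensible to restrict it to the known list of project keys.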
Let’s track our service and what was in place to allow a sub 60 minute delivery of this service.
The path to prod started with all that previous upfront alignment, and pushing testing left. The Devs now have fast feedback as our testing suites are configured to run locally, such that they have confidence to merge.
Here is what happened last week:
2:11 pm Merge happens
The Conventional Commit Merge message has a link to the Jira ticket (this will be passed into ServiceNow via Nexus)
Our Code Coverage is now visible as part of the commit.
Jenkins builds & tests the committed code in 7 mins. Test results and the change Log are written to Nexus. Jenkins then triggers our Harness deployment pipelines.
2:18 pm Deployment starts
Harness will deploy into AKS Dev and then Test environments
Dev is done in 12 mins. Test kicks off after that and takes another 12 mins.
2:42 pm Smoke tests
Smoke tests and verification complete in 6 mins.
2:48 pm Prod deployment
Prod approval was kicked off and another 12 mins later we were in prod.
This process included the ServiceNow change, which contains the Jira information, test results, rollbacks and change log, all automatically pulled from Nexus.
3:01 pm We are in Prod!
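The stage durations above can be added up to check the timeline. The following is a small Python sketch that uses only the times quoted in this section. The year in the merge timestamp is arbitrary, and the stage names are shortened labels for this example.

```python
from datetime import datetime, timedelta

# Merge time from the timeline above; the year is a placeholder.
MERGE = datetime(2000, 3, 25, 14, 11)

# (stage, minutes) as quoted in the timeline.
STAGES = [
    ("Jenkins build and test", 7),
    ("Deploy to Dev", 12),
    ("Deploy to Test", 12),
    ("Smoke tests and verification", 6),
    ("Prod approval and deployment", 12),
]


def timeline(start, stages):
    """Return (stage, begin, end) tuples by chaining stage durations."""
    rows = []
    clock = start
    for name, minutes in stages:
        end = clock + timedelta(minutes=minutes)
        rows.append((name, clock, end))
        clock = end
    return rows


rows = timeline(MERGE, STAGES)
for name, begin, end in rows:
    print(f"{begin:%H:%M} -> {end:%H:%M}  {name}")

total = rows[-1][2] - MERGE
print(f"Merge to prod: {int(total.total_seconds() // 60)} minutes")
```

The stages sum to 49 minutes, which lands at 3:00 pm. That is within a minute of the reported 3:01 pm, and the difference is consistent with the stage durations being rounded.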
I received a comment from the Senior Dev working on this particular component, which made my day. He said he has never done this before, and was thrilled that he can now spend his time on more valuable work, and be trusted to own this.
But… We were TOO fast
Early on in our development cycle, as this team matured quicker than others, our downstream and upstream boundaries weren’t ready.
We wanted to separate our ability to deploy our service into production from releasing the feature. In this case our downstream was not ready for the service to send through updates.
This is where feature flags come in. By utilising LaunchDarkly, we can circuit break our ability to send downstream.
Why do this?
We are now decoupled from upstream or downstream dependencies. We can deploy and learn. We test our deployment pipelines, improving confidence.
We now optionally have the ability to test in production if we need to.
When the downstream is ready, it’s simply a flick of a toggle to start processing data, reducing the complexities and co-ordination around deployment of complex systems.
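The pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our implementation. `FlagStore` is a hypothetical in-memory stand-in for a feature flag service and does not use the LaunchDarkly SDK. The flag key `send-downstream-updates` and the `DownstreamPublisher` class are also invented for this example.

```python
class FlagStore:
    """Hypothetical in-memory stand-in for a feature flag service."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, key, default=False):
        return self._flags.get(key, default)

    def set(self, key, value):
        self._flags[key] = value


class DownstreamPublisher:
    """Sends updates downstream only while the flag is switched on."""

    FLAG = "send-downstream-updates"  # hypothetical flag key

    def __init__(self, flags, send):
        self._flags = flags
        self._send = send   # callable that delivers one update downstream
        self.skipped = 0    # updates held back while the flag was off

    def publish(self, update):
        # A missing flag defaults to off, so the service never sends by accident.
        if not self._flags.is_enabled(self.FLAG):
            self.skipped += 1
            return False
        self._send(update)
        return True


# The service is deployed to production with the flag off...
sent = []
flags = FlagStore()
publisher = DownstreamPublisher(flags, sent.append)
publisher.publish({"id": 1})

# ...and released later by flicking the toggle, with no new deployment.
flags.set(DownstreamPublisher.FLAG, True)
publisher.publish({"id": 2})
print(sent, publisher.skipped)
```

Checking the flag on every call, rather than once at startup, is what lets the toggle take effect without redeploying the service.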
Conclusion
What were the keys to success?
Empowerment to experiment was key to getting started for us. DevOps is a multifaceted beast, and we had to take a holistic view here. I’ve listed what I consider the keys to our success, in approximate order for our journey.
A Holistic Framework
Adopting the DORA Framework was one of the easiest choices in our journey. It was easy to talk about an approach where such an extensive framework existed. Those fact based discussions made it easier to gain buy-in, and explain the benefits.
Mindset & Quality First
Squads/Teams needed to engage. Squads needed to be empowered to change and experiment. There was a fundamental shift in mindset required at all levels of the business. The key was that our squads were empowered. With that empowerment comes ownership. We needed to ensure that the processes we put in place supported that drive for ownership by providing squads with the information and tooling to make informed decisions. The benefits are now becoming clear across the business as our Integration Modernisation squad has moved from a 120 hour path to prod after code merge to 1 hour. If we repeat this across all of the middleware components they are shipping this year, then this is 10,000 annualised hours saved.
Platform Enablement
We spent 6 months on our platforms before we went into squads to help Guide on the new ways of working. The following were Key Components:
Backstage: Orchestrates, via self service Templates, a repeatable Governed onboarding experience that on its own found immediate success in enabling squads to get up and running quickly. We put a lot of effort into ensuring that our testing Framework was embedded within these Templates and that it was easy for our dev teams to run these locally. We had to partner for success here and engaged Clearpoint to do a lot of that heavy lifting. I’ll cover off what Backstage does in the next post, but it now takes 35 seconds to orchestrate Jenkins, Harness, Bitbucket, SonarCloud and ServiceNow with those testing frameworks built in to allow our Dev Squads to get going immediately.
Jenkins: Our existing CI Tool
Harness: We chose Harness for our CD platform. We love its integration with ServiceNow.
ServiceNow: Our ITSM platform.
LaunchDarkly: Our Feature Flag platform allows us to test in Production and to decouple deployments from release and downstream dependencies.
Measure, Guide and Feedback Loops.
To measure our success we adopted the DORA metrics, and the dimensions we were addressing were measured as well — they can be seen below
We knew we had to be a part of squads to make this real. We gave ourselves six months for our chapter to embed this approach into squads. Some teams found a faster path to success, and this was due to mindset and architecture.
We also discovered a lot that was wrong with our approach, and have fed this back into our Guide Strategy. We were very upfront about ensuring that the squads knew that this was in part an experiment, and we wanted feedback about what did or didn’t work. Our Testing Triangle for example had 4 iterations.
The journey over the last year has certainly been a rollercoaster, but with real benefits now bubbling up, the investment in effort and time will be quickly repaid.
What’s Next?
We have achieved success with one of our squads. We have a dozen more to work on, so we are going to be busy for a while yet.
Next post will be about Backstage, and how it’s turned our Middleware Service bootstrap journey from 4 weeks into 35 seconds!