Time to Rethink DevOps Economics? The Path to Sustainable Success
Andrew Mallaband
Helping Tech Leaders & Innovators To Achieve Exceptional Results
As organizations transform their IT applications and adopt cloud-native architectures, scaling seamlessly while minimizing resource overheads becomes critical. DevOps teams can play a pivotal role in achieving this by embracing automation across various facets of the service delivery process.
Automation shines in areas such as infrastructure provisioning and scaling, continuous integration and delivery (CI/CD), testing, security and compliance, but the practice of automating root cause analysis remains elusive.
While automation aids the collection and correlation of observability data, understanding the relationships between cause and effect still requires the judgment and expertise of skilled personnel. This work falls on the shoulders of developers and SREs, who must manually decode the signals from metrics, traces and logs in order to get to the root cause when the performance of services degrades.
Individual incidents can take hours or even days to troubleshoot, demanding significant resources from multiple teams. The consistency of the process can also vary greatly depending on the skills available when these situations occur.
Service disruptions can also have significant financial consequences. Negative customer experiences directly impact revenue and place an additional resource burden on the business functions responsible for appeasing unhappy customers. Depending on the industry you operate in and the type of services you provide, service disruptions may also result in costly chargebacks and fines, making mitigation even more crucial.
Shining A Light On The Root Cause Analysis Problem In DevOps
While decomposing applications into microservices through the adoption of cloud-native architectures has enabled DevOps teams to increase the velocity with which they release new functionality, it has also created a new set of operational challenges that significantly impact ongoing operational expenses and service reliability.
Increased complexity: With more services comes greater complexity, more moving parts, and more potential interactions that can lead to issues. This means diagnosing the root cause of problems becomes more difficult and time-consuming.
Distributed knowledge: In cloud-native environments, knowledge about different services often resides in different teams, who have limited knowledge of the wider system architecture. As the number of services scales, finding the right experts and getting them to collaborate on troubleshooting problems becomes more challenging. This adds to the time and effort required to coordinate and carry out root cause analysis and post-incident analysis.
Service proliferation fuels troubleshooting demands: Expanding your service landscape, whether through new services or simply additional instances, inevitably amplifies troubleshooting needs, which translate into growing resource requirements for troubleshooting in DevOps teams over time.
Testing regimes cannot cover all scenarios: DevOps, with its CI/CD approach, releases frequent updates to individual services. This agility can reveal unforeseen interactions or behavioral changes in production, leading to service performance issues. While rollbacks provide temporary relief, identifying the root cause is crucial. Traditional post-rollback investigations might fall short due to unreproducible scenarios. Instead, real-time root cause analysis of these situations as they happen is important to ensure swift fixes and prevent future occurrences.
Telling The Story with Numbers
As cloud-native services scale, troubleshooting demands grow exponentially, much like compounding interest on a savings account. As service footprints expand, more DevOps cycles are consumed by troubleshooting rather than delivering new code, creating barriers to innovation. Distributed ownership and unclear escalation paths can also mask the escalating time consumed by troubleshooting.
Below is a simple model that can be customized with company-specific data to illustrate the challenge in numbers. This model helps paint a picture of the current operational costs associated with troubleshooting. It also demonstrates how these costs will escalate over time, driven by the growth in cloud-native services (more microservices, serverless functions, etc.).
The model also illustrates the impact of efficiency gains from automated root cause analysis versus the current un-optimized state. The gap highlights the size of the opportunity to create more cycles for productive development, while reducing the need for additional headcount in the future, by automating troubleshooting.
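To make the arithmetic concrete, here is a minimal sketch of such a model in Python. Every figure in it (service growth rate, incidents per service, hours per incident, fully loaded hourly cost, efficiency gain from automated RCA) is a placeholder assumption rather than data from the model above; substitute your own company-specific numbers.

# Illustrative troubleshooting-cost model (placeholder assumptions throughout).
# Replace every constant below with your own company-specific data.

def annual_troubleshooting_cost(services, incidents_per_service_per_month,
                                hours_per_incident, hourly_cost):
    """Annual cost of manual troubleshooting for a given number of services."""
    monthly_incidents = services * incidents_per_service_per_month
    return monthly_incidents * hours_per_incident * hourly_cost * 12


def project_costs(years=3, services=200, annual_growth=0.4, rca_efficiency_gain=0.7):
    """Compare the un-optimized state with an automated-RCA 'optimized mode'."""
    for year in range(1, years + 1):
        services = int(services * (1 + annual_growth))    # service footprint grows each year
        baseline = annual_troubleshooting_cost(services, 0.5, 8, 120)
        optimized = baseline * (1 - rca_efficiency_gain)  # assumed reduction in troubleshooting effort
        print(f"Year {year}: {services} services | "
              f"manual RCA ~${baseline:,.0f}/yr | automated RCA ~${optimized:,.0f}/yr")


project_costs()

Running the sketch with these placeholder values shows the same compounding effect the model describes: the gap between the manual and automated lines widens each year as the service count grows.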
Beyond the cost of human capital, a number of other factors have a direct impact on troubleshooting costs. These include the escalating costs of infrastructure and third-party SaaS services dedicated to managing observability data. These costs are well publicized and are highlighted in a recent article by Causely Founding Engineer Endre Sara, which discusses avoiding the pitfalls of escalating costs when building out Causely’s own SaaS offering.
While DevOps operations carry a cost, we must not forget that it pales in comparison to the financial consequences of service disruptions. With automated root cause analysis, DevOps teams can mitigate these risks, saving the business time, money and reputation.
Future iterations of the model will account for these additional dimensions.
If you would like to put your data to work and see the quantifiable benefits of automated Root Cause Analysis in numbers, complete the short form to get started.
Translating Theory into Reality
Did you know companies like Meta and Capital One automate root cause analysis in DevOps, achieving 50% faster troubleshooting? However, the custom solutions built by these industry giants require vast resources and deep data science expertise to build and maintain, putting automated root cause analysis out of reach for most companies.
The team at Causely is changing this dynamic. Armed with decades of experience applying AI to the management of distributed systems, networks, and application resources, they offer a powerful SaaS solution that removes the roadblocks to automated root cause analysis in DevOps environments. The Causely solution enables:
By understanding cause-and-effect relationships, the platform also allows architects and engineers to explore hypothetical scenarios. They can ask questions like, "What would happen if elements of a service degraded or failed?" This proactive approach enables continuous improvement of performance and resilience of business services.
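As a rough illustration of the idea (not the Causely product or its API), cause-and-effect relationships can be modeled as a directed graph, and a "what if" question answered by walking it. The services and symptoms below are invented for the example.

# Toy cause-and-effect graph: each key is a potential cause, each value lists
# the effects it can propagate to. Names are invented for illustration only.
causal_graph = {
    "database-latency": ["checkout-service-slow", "inventory-service-slow"],
    "checkout-service-slow": ["checkout-errors", "cart-abandonment"],
    "inventory-service-slow": ["stock-page-timeouts"],
}


def what_if(cause, graph):
    """Return every downstream effect reachable if 'cause' degrades or fails."""
    seen, stack = set(), [cause]
    while stack:
        node = stack.pop()
        for effect in graph.get(node, []):
            if effect not in seen:
                seen.add(effect)
                stack.append(effect)
    return seen


print(what_if("database-latency", causal_graph))
# e.g. {'checkout-service-slow', 'checkout-errors', 'cart-abandonment',
#       'inventory-service-slow', 'stock-page-timeouts'}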
Wrapping Up
In this article, we discussed the key challenges associated with troubleshooting and highlighted the cost implications of today’s approach. Addressing these issues is important because the business consequences are significant.
If you don’t have a clear understanding of the future resource requirements and costs associated with troubleshooting as you scale out cloud-native services, the model we’ve developed provides a simple way to capture them.
Want to see quantifiable benefits from automated Root Cause Analysis?
Turn "The Optimized Mode of Operation" in the model from vision to reality with Causely.
The Causely service enables you to measure and showcase the impact of automating RCA in your organization. Through improved service performance and availability, faster resolution times, and improved team efficiency, the team will show you the "art of the possible."
Comments

Passionate about fulfilling the promise of Continuous Application Reliability. Placing human empathy at the center. Key contributor to three successful SaaS exits
8 months ago: Really interesting. The APM/Observability industry has been mentioning 'RCA' for a long, long time, but true RCA remains a very rare occurrence. I applaud the effort to finally deliver on a promise made long ago by an entire industry that remains mostly unfulfilled. I have been part of this journey since at least 2003 (starting my work in APM, the precursor of Observability, back then) and I have a passion for this topic: a message that is repeated so much and yet rarely ever done well. Onwards and thank you!
NYU Stern MBA Candidate | Venture Capital Analyst
8 months ago: Wow, the average cost of an unplanned outage or downtime at $855k is enough right there to consider implementing the proper precautions. Aside from the stress of downtime or an outage, the financial cost adds to the pain!
Director @ Sumo Logic | AWS Certified Solutions Architect
8 months ago: Very good points Andrew!!
Infrastructure Engineer | DevOps | SRE | MLOps | AIOps | Helping companies scale their platforms to an enterprise-grade level
8 months ago: Exciting read for all DevOps professionals out there, can't wait to dive in! #DevOps #FinOps
Business Growth Specialist | Driving Your Success Through Proven Methods | Maximise Your Growth | Maximise Your Business Growth | Be The Best Business (Owner) You Can Be
8 months ago: Andrew Mallaband It’s shocking that 71% of IT ops teams say their monitoring data is not actionable; these teams need a way to sort the noise and symptoms from the root cause of the problem. I can see how the ability to quickly determine the root cause eliminates a lot of wasted troubleshooting time and allows the IT ops team to get on with resolving vs. identifying the problem.